linux_old1

Commit Graph

Author	SHA1	Message	Date
Jeff Layton	f5d7726900	ceph: make iterate_session_caps a public symbol Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:37 +02:00
Jeff Layton	40e7e2c0e8	ceph: fix NULL pointer deref when debugging is enabled Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:37 +02:00
Jeff Layton	428bb68ad9	ceph: properly handle granular statx requests cephfs can benefit from statx. We can have the client just request caps sufficient for the needed attributes and leave off the rest. Also, recognize when AT_STATX_DONT_SYNC is set, and just scrape the inode without doing any call in that case. Force a call to the MDS in the event that AT_STATX_FORCE_SYNC is set. Link: http://tracker.ceph.com/issues/39258 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:37 +02:00
Jeff Layton	ffb61c55b2	ceph: remove superfluous inode_lock in ceph_fsync Originally, filemap_write_and_wait took the i_mutex internally, but commit `02c24a8218` pushed the mutex acquisition into the individual fsync routines, leaving it up to the subsystem maintainers to remove it if it wasn't needed. For ceph, I see no reason to take the inode_lock here. All of the operations inside that lock are protected by their own locking. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:37 +02:00
Yan, Zheng	570df4e9c2	ceph: snapshot nfs re-export To support snapshot nfs re-export, we need a way to lookup snapped inode by file handle. For directory inode, snapped metadata are always stored together with head inode. Client just need to pass vinodeno_t to MDS. For non-directory inode, there can be multiple version of snapped inodes and they can be stored in different dirfrags. Besides vinodeno_t, client also need to tell mds from which dirfrag it got the snapped inode. Another problem of supporting snapshot nfs re-export is that there can be multiple paths to access a snapped inode. For example: mkdir -p d1/d2/d3 mkdir d1/.snap/s1 Paths 'd1/.snap/s1/d2/d3', 'd1/d2/.snap/_s1_<inode number of d1>/d3' and 'd1/d2/d3/.snap/_s1_<inode number of d1>' are all reference to the same snapped inode. For a given snapped inode, There is no convenient way to get the first form and the second form paths. For simplicity, ceph_get_parent() return snapdir for snapped directory inode. Furthermore, client may access snapshot of deleted directory. For example: mkdir -p d1/d2 mkdir d1/.snap/s1 open d1/.snap/s1/d2 rm -rf d1/d2 <nfs server restart> The path constucted by ceph_get_parent() and ceph_get_name() is '<inode of d2>/.snap/_s1_<inode number of d1>'. Futher lookup parent of <inode of d2> will fail. To workaround this case, this patch uses d_obtain_root() to get dentry for snapdir of deleted directory. snapdir dentry has no DCACHE_DISCONNECTED flag set, reconnect_path() stops when it reaches snapdir dentry. Link: http://tracker.ceph.com/issues/22105 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:36 +02:00
Luis Henriques	0c44a8e0fc	ceph: quota: fix quota subdir mounts The CephFS kernel client does not enforce quotas set in a directory that isn't visible from the mount point. For example, given the path '/dir1/dir2', if quotas are set in 'dir1' and the filesystem is mounted with mount -t ceph <server>:<port>:/dir1/ /mnt then the client won't be able to access 'dir1' inode, even if 'dir2' belongs to a quota realm that points to it. This patch fixes this issue by simply doing an MDS LOOKUPINO operation for unknown inodes. Any inode reference obtained this way will be added to a list in ceph_mds_client, and will only be released when the filesystem is umounted. Link: https://tracker.ceph.com/issues/38482 Reported-by: Hendrik Peyerl <hpeyerl@plusline.net> Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:36 +02:00
Luis Henriques	3886274adf	ceph: factor out ceph_lookup_inode() This function will be used by __fh_to_dentry and by the quotas code, to find quota realm inodes that are not visible in the mountpoint. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:36 +02:00
Zhi Zhang	1b52931ca9	ceph: remove duplicated filelock ref increase Inode i_filelock_ref is increased in ceph_lock or ceph_flock, but it is increased again in ceph_lock_message. This results in this ref won't become zero. If CEPH_I_ERROR_FILELOCK flag is set in remove_session_caps once, this flag can't be cleared even if client is back to normal. So further file lock will return EIO. Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2019-05-07 19:22:36 +02:00
Linus Torvalds	0968621917	Printk changes for 5.2 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY= =WZfC -----END PGP SIGNATURE----- Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Allow state reset of printk_once() calls. - Prevent crashes when dereferencing invalid pointers in vsprintf(). Only the first byte is checked for simplicity. - Make vsprintf warnings consistent and inlined. - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf modifiers. - Some clean up of vsprintf and test_printf code. * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: lib/vsprintf: Make function pointer_string static vsprintf: Limit the length of inlined error messages vsprintf: Avoid confusion between invalid address and value vsprintf: Prevent crash when dereferencing invalid pointers vsprintf: Consolidate handling of unknown pointer specifiers vsprintf: Factor out %pO handler as kobject_string() vsprintf: Factor out %pV handler as va_format() vsprintf: Factor out %p[iI] handler as ip_addr_string() vsprintf: Do not check address of well-known strings vsprintf: Consistent %pK handling for kptr_restrict == 0 vsprintf: Shuffle restricted_pointer() printk: Tie printk_once / printk_deferred_once into .data.once for reset treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively lib/test_printf: Switch to bitmap_zalloc()	2019-05-07 09:18:12 -07:00
David Howells	f5e4546347	afs: Implement YFS ACL setting Implement the setting of YFS ACLs in AFS through the interface of setting the afs.yfs.acl extended attribute on the file. Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
David Howells	ae46578b96	afs: Get YFS ACLs and information through xattrs The YFS/AuriStor variant of AFS provides more capable ACLs and provides per-volume ACLs and per-file ACLs as well as per-directory ACLs. It also provides some extra information that can be retrieved through four ACLs: (1) afs.yfs.acl The YFS file ACL (not the same format as afs.acl). (2) afs.yfs.vol_acl The YFS volume ACL. (3) afs.yfs.acl_inherited "1" if a file's ACL is inherited from its parent directory, "0" otherwise. (4) afs.yfs.acl_num_cleaned The number of of ACEs removed from the ACL by the server because the PT entries were removed from the PTS database (ie. the subject is no longer known). Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
Joe Gorse	b10494af49	afs: implement acl setting Implements the setting of ACLs in AFS by means of setting the afs.acl extended attribute on the file. Signed-off-by: Joe Gorse <jhgorse@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
David Howells	260f082bae	afs: Get an AFS3 ACL as an xattr Implement an xattr on AFS files called "afs.acl" that retrieves a file's ACL. It returns the raw AFS3 ACL from the result of calling FS.FetchACL, leaving any interpretation to userspace. Note that whilst YFS servers will respond to FS.FetchACL, this will render a more-advanced YFS ACL down. Use "afs.yfs.acl" instead for that. Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
David Howells	a2f611a3dc	afs: Fix getting the afs.fid xattr The AFS3 FID is three 32-bit unsigned numbers and is represented as three up-to-8-hex-digit numbers separated by colons to the afs.fid xattr. However, with the advent of support for YFS, the FID is now a 64-bit volume number, a 96-bit vnode/inode number and a 32-bit uniquifier (as before). Whilst the sprintf in afs_xattr_get_fid() has been partially updated (it currently ignores the upper 32 bits of the 96-bit vnode number), the size of the stack-based buffer has not been increased to match, thereby allowing stack corruption to occur. Fix this by increasing the buffer size appropriately and conditionally including the upper part of the vnode number if it is non-zero. The latter requires the lower part to be zero-padded if the upper part is non-zero. Fixes: `3b6492df41` ("afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS") Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
David Howells	c73aa4102f	afs: Fix the afs.cell and afs.volume xattr handlers Fix the ->get handlers for the afs.cell and afs.volume xattrs to pass the source data size to memcpy() rather than target buffer size. Overcopying the source data occasionally causes the kernel to oops. Fixes: `d3e3b7eac8` ("afs: Add metadata xattrs") Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
Marc Dionne	c0abbb5791	afs: Calculate i_blocks based on file size While it's not possible to give an accurate number for the blocks used on the server, populate i_blocks based on the file size so that 'du' can give a reasonable estimate. The value is rounded up to 1K granularity, for consistency with what other AFS clients report, and the servers' 1K usage quota unit. Note that the value calculated by 'du' at the root of a volume can still be slightly lower than the quota usage on the server, as 0-length files are charged 1 quota block, but are reported as occupying 0 blocks. Again, this is consistent with other AFS clients. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
David Howells	b134d687dd	afs: Log more information for "kAFS: AFS vnode with undefined type\n" Log more information when "kAFS: AFS vnode with undefined type\n" is displayed due to a vnode record being retrieved from the server that appears to have a duff file type (usually 0). This prints more information to try and help pin down the problem. Signed-off-by: David Howells <dhowells@redhat.com>	2019-05-07 16:48:44 +01:00
Shenghui Wang	7889f44dd9	io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible() This issue is found by running liburing/test/io_uring_setup test. When test run, the testcase "attempt to bind to invalid cpu" would not pass with messages like: io_uring_setup(1, 0xbfc2f7c8), \ flags: IORING_SETUP_SQPOLL\|IORING_SETUP_SQ_AFF, \ resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \ sq_thread_cpu: 2 expected -1, got 3 FAIL On my system, there is: CPU(s) possible : 0-3 CPU(s) online : 0-1 CPU(s) offline : 2-3 CPU(s) present : 0-1 The sq_thread_cpu 2 is offline on my system, so the bind should fail. But cpu_possible() will pass the check. We shouldn't be able to bind to an offline cpu. Use cpu_online() to do the check. After the change, the testcase run as expected: EINVAL will be returned for cpu offlined. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-07 08:41:26 -06:00
Roberto Bergantinos Corpas	950a578c61	NFS: make nfs_match_client killable Actually we don't do anything with return value from nfs_wait_client_init_complete in nfs_match_client, as a consequence if we get a fatal signal and client is not fully initialised, we'll loop to "again" label This has been proven to cause soft lockups on some scenarios (no-carrier but configured network interfaces) Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2019-05-07 10:38:14 -04:00
Linus Torvalds	81ff5d2cba	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Add support for AEAD in simd - Add fuzz testing to testmgr - Add panic_on_fail module parameter to testmgr - Use per-CPU struct instead multiple variables in scompress - Change verify API for akcipher Algorithms: - Convert x86 AEAD algorithms over to simd - Forbid 2-key 3DES in FIPS mode - Add EC-RDSA (GOST 34.10) algorithm Drivers: - Set output IV with ctr-aes in crypto4xx - Set output IV in rockchip - Fix potential length overflow with hashing in sun4i-ss - Fix computation error with ctr in vmx - Add SM4 protected keys support in ccree - Remove long-broken mxc-scc driver - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits) crypto: ccree - use a proper le32 type for le32 val crypto: ccree - remove set but not used variable 'du_size' crypto: ccree - Make cc_sec_disable static crypto: ccree - fix spelling mistake "protedcted" -> "protected" crypto: caam/qi2 - generate hash keys in-place crypto: caam/qi2 - fix DMA mapping of stack memory crypto: caam/qi2 - fix zero-length buffer DMA mapping crypto: stm32/cryp - update to return iv_out crypto: stm32/cryp - remove request mutex protection crypto: stm32/cryp - add weak key check for DES crypto: atmel - remove set but not used variable 'alg_name' crypto: picoxcell - Use dev_get_drvdata() crypto: crypto4xx - get rid of redundant using_sd variable crypto: crypto4xx - use sync skcipher for fallback crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues crypto: crypto4xx - fix ctr-aes missing output IV crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o crypto: ccree - handle tee fips error during power management resume crypto: ccree - add function to handle cryptocell tee fips error ...	2019-05-06 20:15:06 -07:00
Linus Torvalds	2c6a392cdd	Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stack trace updates from Ingo Molnar: "So Thomas looked at the stacktrace code recently and noticed a few weirdnesses, and we all know how such stories of crummy kernel code meeting German engineering perfection end: a 45-patch series to clean it all up! :-) Here's the changes in Thomas's words: 'Struct stack_trace is a sinkhole for input and output parameters which is largely pointless for most usage sites. In fact if embedded into other data structures it creates indirections and extra storage overhead for no benefit. Looking at all usage sites makes it clear that they just require an interface which is based on a storage array. That array is either on stack, global or embedded into some other data structure. Some of the stack depot usage sites are outright wrong, but fortunately the wrongness just causes more stack being used for nothing and does not have functional impact. Another oddity is the inconsistent termination of the stack trace with ULONG_MAX. It's pointless as the number of entries is what determines the length of the stored trace. In fact quite some call sites remove the ULONG_MAX marker afterwards with or without nasty comments about it. Not all architectures do that and those which do, do it inconsistenly either conditional on nr_entries == 0 or unconditionally. The following series cleans that up by: 1) Removing the ULONG_MAX termination in the architecture code 2) Removing the ULONG_MAX fixups at the call sites 3) Providing plain storage array based interfaces for stacktrace and stackdepot. 4) Cleaning up the mess at the callsites including some related cleanups. 5) Removing the struct stack_trace based interfaces This is not changing the struct stack_trace interfaces at the architecture level, but it removes the exposure to the generic code'" * 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) x86/stacktrace: Use common infrastructure stacktrace: Provide common infrastructure lib/stackdepot: Remove obsolete functions stacktrace: Remove obsolete functions livepatch: Simplify stack trace retrieval tracing: Remove the last struct stack_trace usage tracing: Simplify stack trace retrieval tracing: Make ftrace_trace_userstack() static and conditional tracing: Use percpu stack trace buffer more intelligently tracing: Simplify stacktrace retrieval in histograms lockdep: Simplify stack trace handling lockdep: Remove save argument from check_prev_add() lockdep: Remove unused trace argument from print_circular_bug() drm: Simplify stacktrace handling dm persistent data: Simplify stack trace handling dm bufio: Simplify stack trace retrieval btrfs: ref-verify: Simplify stack trace retrieval dma/debug: Simplify stracktrace retrieval fault-inject: Simplify stacktrace retrieval mm/page_owner: Simplify stack trace handling ...	2019-05-06 13:11:48 -07:00
Theodore Ts'o	db90f41916	ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2019-05-06 14:03:52 -04:00
Colin Ian King	efeb862bd5	io_uring: fix shadowed variable ret return code being not checked Currently variable ret is declared in a while-loop code block that shadows another variable ret. When an error occurs in the while-loop the error return in ret is not being set in the outer code block and so the error check on ret is always going to be checking on the wrong ret variable resulting in check that is always going to be true and a premature return occurs. Fix this by removing the declaration of the inner while-loop variable ret so that shadowing does not occur. Addresses-Coverity: ("'Constant' variable guards dead code") Fixes: `6b06314c47` ("io_uring: add file set registration") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-06 10:21:34 -06:00
Kirill Smelkov	438ab720c6	vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files This amends commit `10dce8af34` ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handler for stream-like files: Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write later could be changed to use ppos and even though that won't be working correctly, that bug might go unnoticed without someone doing wrong behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files as that don't give any chance for ppos usage bugs because it will oops if ppos is ever used inside .read() or .write(). Note 1: rw_verify_area, new_sync_{read,write} needs to be updated because they are called by vfs_read/vfs_write & friends before file_operations .read/.write . Note 2: if file backend uses new-style .read_iter/.write_iter, position is still passed into there as non-pointer kiocb.ki_pos . Currently stream_open.cocci (semantic patch added by `10dce8af34`) ignores files whose file_operations has *_iter methods. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>	2019-05-06 17:46:52 +03:00
Jiufei Xue	98487de318	ovl: check the capability before cred overridden We found that it return success when we set IMMUTABLE_FL flag to a file in docker even though the docker didn't have the capability CAP_LINUX_IMMUTABLE. The commit `d1d04ef857` ("ovl: stack file ops") and `dab5ca8fd9` ("ovl: add lsattr/chattr support") implemented chattr operations on a regular overlay file. ovl_real_ioctl() overridden the current process's subjective credentials with ofs->creator_cred which have the capability CAP_LINUX_IMMUTABLE so that it will return success in vfs_ioctl()->cap_capable(). Fix this by checking the capability before cred overridden. And here we only care about APPEND_FL and IMMUTABLE_FL, so get these information from inode. [SzM: move check and call to underlying fs inside inode locked region to prevent two such calls from racing with each other] Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 14:00:37 +02:00
Amir Goldstein	d989903058	ovl: do not generate duplicate fsnotify events for "fake" path Overlayfs "fake" path is used for stacked file operations on underlying files. Operations on files with "fake" path must not generate fsnotify events with path data, because those events have already been generated at overlayfs layer and because the reported event->fd for fanotify marks on underlying inode/filesystem will have the wrong path (the overlayfs path). Link: https://lore.kernel.org/linux-fsdevel/20190423065024.12695-1-jencce.kernel@gmail.com/ Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: `d1d04ef857` ("ovl: stack file ops") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:51 +02:00
Amir Goldstein	9e46b840c7	ovl: support stacked SEEK_HOLE/SEEK_DATA Overlay file f_pos is the master copy that is preserved through copy up and modified on read/write, but only real fs knows how to SEEK_HOLE/SEEK_DATA and real fs may impose limitations that are more strict than ->s_maxbytes for specific files, so we use the real file to perform seeks. We do not call real fs for SEEK_CUR:0 query and for SEEK_SET:0 requests. Fixes: `d1d04ef857` ("ovl: stack file ops") Reported-by: Eddie Horng <eddiehorng.tw@gmail.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:51 +02:00
Amir Goldstein	3428030da0	ovl: fix missing upper fs freeze protection on copy up for ioctl Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file with data before FS_IOC_SETFLAGS ioctl. The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably caused the confusion. File may be open O_RDONLY, but ioctl modifies the file. VFS does not call mnt_want_write_file() nor lock inode mutex, but fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls mnt_want_write_file() for the overlay file, and fs-specific code calls mnt_want_write_file() for upper fs file, but there was no call for ovl_want_write() for copy up duration which prevents overlayfs from copying up on a frozen upper fs. Fixes: `dab5ca8fd9` ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2019-05-06 13:54:50 +02:00
Linus Torvalds	51987affd6	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - a couple of ->i_link use-after-free fixes - regression fix for wrong errno on absent device name in mount(2) (this cycle stuff) - ancient UFS braino in large GID handling on Solaris UFS images (bogus cut'n'paste from large UID handling; wrong field checked to decide whether we should look at old (16bit) or new (32bit) field) * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour Abort file_remove_privs() for non-reg. files [fix] get rid of checking for absent device name in vfs_get_tree() apparmorfs: fix use-after-free on symlink traversal securityfs: fix use-after-free on symlink traversal	2019-05-05 09:28:45 -07:00
Martin Brandenburg	33713cd09c	orangefs: truncate before updating size Otherwise we race with orangefs_writepage/orangefs_writepages which and does not expect i_size < page_offset. Fixes xfstests generic/129. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:39:10 -04:00
Mike Marshall	dd59a6475c	orangefs: copy Orangefs-sized blocks into the pagecache if possible. ->readpage looks in file->private_data to try and find out how the userspace program set "count" in read(2) or with "dd bs=" or whatever. ->readpage uses "count" and inode->i_size to calculate how much data Orangefs should deposit in the Orangefs shared buffer, and remembers which slot the data is in. After copying data from the Orangefs shared buffer slot into "the page", readpage tries to increment through the pagecache index and fill as many pages as it can from the extra data in the shared buffer. Hopefully these extra pages will soon be needed by the vfs, and they'll be in the pagecache already. Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Martin Brandenburg <martin@omnibond.com>	2019-05-03 14:32:39 -04:00
Mike Marshall	4077a0f25b	orangefs: pass slot index back to readpage. When userspace deposits more than a page of data into the shared buffer, we'll need to know which slot it is in when we get back to readpage so that we can try to use the extra data to fill some extra pages. Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Martin Brandenburg <martin@omnibond.com>	2019-05-03 14:32:39 -04:00
Mike Marshall	c2549f8c7a	orangefs: remember count when reading. Orangefs wins when it can do IO on large (up to four meg) blocks at a time, and looses when it has to do tiny "small io" reads and writes. Accessing Orangefs through the pagecache with the kernel module helps with small io, both reading and writing, a great deal. Readpage generally tries to fetch a page (four k) at a time. We'll let users use "count" (as in read(2) or pread(2) for example) as a knob to control how much data they get from Orangefs at a time and we'll try to use the data to fill extra pagecache pages when we get to ->readpage, hopefully resulting in fewer calls to readpage and Orangefs userspace. We need a way to remember how they set count so that we can still have it available when we get to ->readpage. - We'll use file->private_data to keep track of "count". We'll wrap generic_file_open with orangefs_file_open and initialize private_data to NULL there. - In ->read_iter we have access to both "count" and file, so we'll kmalloc some space onto file->private_data and store "count" there. - We'll kfree file->private_data each time we visit ->flush and reinitialize it to NULL. Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Martin Brandenburg <martin@omnibond.com>	2019-05-03 14:32:39 -04:00
Martin Brandenburg	8f04e1be78	orangefs: add orangefs_revalidate_mapping This is modeled after NFS, except our method is different. We use a simple timer to determine whether to invalidate the page cache. This is bound to perform. This addes a sysfs parameter cache_timeout_msecs which controls the time between page cache invalidations. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:39 -04:00
Martin Brandenburg	c472ebc255	orangefs: implement writepages Go through pages and look for a consecutive writable region. After finding a number of consecutive writable pages or when finding that the next page's dirty range is not contiguous and cannot be written as one request, send the write to the server. The number of pages is determined by the client-core's buffer size. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:39 -04:00
Martin Brandenburg	52e2d0a380	orangefs: write range tracking Attach the actual range of bytes written to plus the responsible uid/gid to each dirty page. This information must be sent to the server when the page is written out. Now write_begin, page_mkwrite, and invalidatepage keep up with this information. There are several conditions where they must write out the page immediately to store the new range. Two non-contiguous ranges cannot be stored on a single page. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	90fc07065a	orangefs: avoid fsync service operation on flush Without this, an fsync call is sent to the server even if no data changed. This resulted in a rather severe (50%) performance regression under certain metadata-heavy workloads. In the past, everything was direct IO. Nothing happend on a close call. An explicit fsync call would send an fsync request to the server which in turn fsynced the underlying file. Now there are cached writes. Then fsync began writing out dirty pages in addition to making an fsync request to the server, and close began calling fsync. With this commit, close only writes out dirty pages, and does not make the fsync request. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	8a88bbce6f	orangefs: skip inode writeout if nothing to write Would happen if an inode is dirty but whatever happened is not something that can be written out to OrangeFS. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	3e9dfc6e1e	orangefs: move do_readv_writev to direct_IO direct_IO was the only caller and all direct_IO did was call it, so there's no use in having the code spread out into so many functions. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	43f3457604	orangefs: do not return successful read when the client-core disappeared Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	85ac799cf9	orangefs: implement writepage Now orangefs_inode_getattr fills from cache if an inode has dirty pages. also if attr_valid and dirty pages and !flags, we spin on inode writeback before returning if pages still dirty after: should it be other way Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	c453dcfc79	orangefs: migrate to generic_file_read_iter Remove orangefs_inode_read. It was used by readpage. Calling wait_for_direct_io directly serves the purpose just as well. There is now no check of the bufmap size in the readpage path. There are already other places the bufmap size is assumed to be greater than PAGE_SIZE. Important to call truncate_inode_pages now in the write path so a subsequent read sees the new data. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	0dcac0f781	orangefs: service ops done for writeback are not killable Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	a68d9c606a	orangefs: remove orangefs_readpages It's a copy of the loop which would run in read_pages from mm/readahead.c. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	afd9fb2a31	orangefs: reorganize setattr functions to track attribute changes OrangeFS accepts a mask indicating which attributes were changed. The kernel must not set any bits except those that were actually changed. The kernel must set the uid/gid of the request to the actual uid/gid responsible for the change. Code path for notify_change initiated setattrs is orangefs_setattr(dentry, iattr) -> __orangefs_setattr(inode, iattr) In kernel changes are initiated by calling __orangefs_setattr. Code path for writeback is orangefs_write_inode -> orangefs_inode_setattr attr_valid and attr_uid and attr_gid change together under i_lock. I_DIRTY changes separately. __orangefs_setattr lock if needs to be cleaned first, unlock and retry set attr_valid copy data in unlock mark_inode_dirty orangefs_inode_setattr lock copy attributes out unlock clear getattr_time # __writeback_single_inode clears dirty orangefs_inode_getattr # possible to get here with attr_valid set and not dirty lock if getattr_time ok or attr_valid set, unlock and return unlock do server operation # another thread may getattr or setattr, so check for that lock if getattr_time ok or attr_valid, unlock and return else, copy in update getattr_time unlock Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	df2d7337b5	orangefs: let setattr write to cached inode This is a fairly big change, but ultimately it's not a lot of code. Implement write_inode and then avoid the call to orangefs_inode_setattr within orangefs_setattr. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	f2d34c738c	orangefs: set up and use backing_dev_info Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	5e4f606e26	orangefs: hold i_lock during inode_getattr This should be a no-op now. When inode writeback works, this will prevent a getattr from overwriting inode data while an inode is transitioning to dirty. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	5e7f1d4338	orangefs: update attributes rather than relying on server This should be a no-op now, but once inode writeback works, it'll be necessary to have the correct attribute in the dirty inode. Previously the attribute fetch timeout was marked invalid and the server provided the updated attribute. When the inode is dirty, the server cannot be consulted since it does not yet know the pending setattr. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	8b60785c1d	orangefs: simplify orangefs_inode_getattr interface No need to store the received mask. It is either STATX_BASIC_STATS or STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache is bypassed anyway, so the cached mask is unnecessary to decide whether to do a real getattr. This is a change. Previously a getattr would want size and use the cached size. All of the in-kernel callers that wanted size did not want a cached size. Now a getattr cannot use the cached size if it wants size at all. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:38 -04:00
Martin Brandenburg	66d5477d70	orangefs: do not invalidate attributes on inode create When an inode is created, we fetch attributes from the server. There is no need to turn around and invalidate them. No need to initialize attributes after the getattr either. Either it'll be exactly the same, or it'll be something else and wrong. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:37 -04:00
Martin Brandenburg	fc2e2e9c43	orangefs: implement xattr cache This uses the same timeout as the getattr cache. This substantially increases performance when writing files with smaller buffer sizes. When writing, the size is (often) changed, which causes a call to notify_change which calls security_inode_need_killpriv which needs a getxattr. Caching it reduces traffic to the server. Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2019-05-03 14:32:37 -04:00
Scott Mayhew	1c73b9d24f	nfsd: update callback done processing Instead of having the convention where individual nfsd4_callback_ops->done operations return -1 to indicate the callback path is down, move the check to nfsd4_cb_done. Only mark the callback path down on transport-level errors, not NFS-level errors. The existing logic causes the server to set SEQ4_STATUS_CB_PATH_DOWN just because the client returned an error to a CB_RECALL for a delegation that the client had already done a FREE_STATEID for. But clearly that error doesn't mean that there's anything wrong with the backchannel. Additionally, handle NFS4ERR_DELAY in nfsd4_cb_recall_done. The client returns NFS4ERR_DELAY if it is already in the process of returning the delegation. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2019-05-03 11:01:38 -04:00
David S. Miller	ff24e4980a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Three trivial overlapping conflicts. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-05-02 22:14:21 -04:00
Stefan Bühler	5dcf877fb1	req->error only used for iopoll No need to set it in io_poll_add; io_poll_complete doesn't use it to set the result in the CQE. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:54 -06:00
Jens Axboe	9b402849e8	io_uring: add support for eventfd notifications Allow registration of an eventfd, which will trigger an event every time a completion event happens for this io_uring instance. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:54 -06:00
Jens Axboe	5d17b4a4b7	io_uring: add support for IORING_OP_SYNC_FILE_RANGE This behaves just like sync_file_range(2) does. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:54 -06:00
Jens Axboe	22f96b3808	fs: add sync_file_range() helper This just pulls out the ksys_sync_file_range() code to work on a struct file instead of an fd, so we can use it elsewhere. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:53 -06:00
Jens Axboe	de0617e467	io_uring: add support for marking commands as draining There are no ordering constraints between the submission and completion side of io_uring. But sometimes that would be useful to have. One common example is doing an fsync, for instance, and have it ordered with previous writes. Without support for that, the application must do this tracking itself. This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked with this flag, then it will not be issued before previous commands have completed, and subsequent commands submitted after the drain will not be issued before the drain is started.. If there are no pending commands, setting this flag will not change the behavior of the issue of the command. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-02 14:08:53 -06:00
Linus Torvalds	5ce3307b6d	for-linus-20190502 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzLBg4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsG8EACHxeAgDY3j+ZSLqn9sCFzvDo+hc9q2x64o ydscoxcdRSZYAU1dP6bnjQBFZbwVHaDzYG11L9aPwlvbsxis9/XWASc/gCo8sumN itmk3jKW5zYXlLw57ojh2+/UbPKnO5fKJcA66AakDGTopp3nYZQvov1kUJKOwTiz Ir48vEwST4DU9szkfGGeKOGG735veU/yarUNyADwc+CIigBcIa/A/bZtY8P4xV5d 6bSLnMayehntZaHpVUPwDqOgP7+nSqIzQ88BPAKQnSzDat/SK4C0pgIX9L2nS1/4 7fBpg3/HN0sFgUuhAG/3ILD82yD3CWDGbt2xif2CP0ylvVOMB6d/ObDAIxwTBKVd MwJ3U0RK+Ph2rvaglp6ifk3YgS0Nzt4N5F9AVmp32RR3QpqzJ2pLN4PUjbdWaEbr lugu6uengn0flu90mesBazrVNrZkrZ2w859tq9TNWPYpdIthKrvNn44J1bXx+yUj F1d+Bf0Sxxnc1b8hEKn9KOXPRoIY07hd/wUi2a/F4RVGSFa2b9CnaengqXIDzQNC duEErt+Pd2xlBA4WhPlKSI2PmrWJvpDr7iuAiqdC8cC69hpF4F2/yWCiL+JYI4pF v0zPn214KZ2rg9XZfb0Me8ff1BNvmXJMKzvecX2mOcclcO97PFVkqJhcx/75JE/6 ooQiTofl+w== =KjZ4 -----END PGP SIGNATURE----- Merge tag 'for-linus-20190502' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "This is mostly io_uring fixes/tweaks. Most of these were actually done in time for the last -rc, but I wanted to ensure that everything tested out great before including them. The code delta looks larger than it really is, as it's mostly just comment additions/changes. Outside of the comment additions/changes, this is mostly removal of unnecessary barriers. In all, this pull request contains: - Tweak to how we handle errors at submission time. We now post a completion event if the error occurs on behalf of an sqe, instead of returning it through the system call. If the error happens outside of a specific sqe, we return the error through the system call. This makes it nicer to use and makes the "normal" use case behave the same as the offload cases. (me) - Fix for a missing req reference drop from async context (me) - If an sqe is submitted with RWF_NOWAIT, don't punt it to async context. Return -EAGAIN directly, instead of using it as a hint to do async punt. (Stefan) - Fix notes on barriers (Stefan) - Remove unnecessary barriers (Stefan) - Fix potential double free of memory in setup error (Mark) - Further improve sq poll CPU validation (Mark) - Fix page allocation warning and leak on buffer registration error (Mark) - Fix iov_iter_type() for new no-ref flag (Ming) - Fix a case where dio doesn't honor bio no-page-ref (Ming)" * tag 'for-linus-20190502' of git://git.kernel.dk/linux-block: io_uring: avoid page allocation warnings iov_iter: fix iov_iter_type block: fix handling for BIO_NO_PAGE_REF io_uring: drop req submit reference always in async punt io_uring: free allocated io_memory once io_uring: fix SQPOLL cpu validation io_uring: have submission side sqe errors post a cqe io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP io_uring: remove unnecessary barrier after incrementing dropped counter io_uring: remove unnecessary barrier before reading SQ tail io_uring: remove unnecessary barrier after updating SQ head io_uring: remove unnecessary barrier before reading cq head io_uring: remove unnecessary barrier before wq_has_sleeper io_uring: fix notes on barriers io_uring: fix handling SQEs requesting NOWAIT	2019-05-02 09:55:04 -07:00
Nikolay Borisov	b1c16ac978	btrfs: Use kvmalloc for allocating compressed path context Recent refactoring of cow_file_range_async means it's now possible to request a rather large physically contiguous memory via kmalloc. The size is dependent on the number of 512k chunks that the compressed range consists of. David reported multiple OOM messages on such large allocations. Fix it by switching to using kvmalloc. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	7447555fe7	btrfs: Factor out common extent locking code in submit_compressed_extents Irrespective of whether the compress code fell back to uncompressed or a compressed extent has to be submitted, the extent range is always locked. So factor out the common lock_extent call at the beginning of the loop. No functional changes just removes one duplicate lock_extent call. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	4336650aff	btrfs: Set io_tree only once in submit_compressed_extents The inode never changes so it's sufficient to dereference it and get the iotree only once, before the execution of the main loop. No functional changes, only the size of the function is decreased: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44) Function old new delta submit_compressed_extents 1240 1196 -44 Total: Before=88476, After=88432, chg -0.05% Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	69684c5a88	btrfs: Replace clear_extent_bit with unlock_extent Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	1368c6dac7	btrfs: Make compress_file_range take only struct async_chunk All context this function needs is held within struct async_chunk. Currently we not only pass the struct but also every individual member. This is redundant, simplify it by only passing struct async_chunk and leaving it to compress_file_range to extract the values it requires. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	c5a68aec4e	btrfs: Remove fs_info from struct async_chunk The associated btrfs_work already contains a reference to the fs_info so use that instead of passing it via async_chunk. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:19 +02:00
Nikolay Borisov	b5326271e7	btrfs: Rename async_cow to async_chunk Now that we have an explicit async_chunk struct rename references to variables of this type to async_chunk. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:18 +02:00
Nikolay Borisov	97db120451	btrfs: Preallocate chunks in cow_file_range_async This commit changes the implementation of cow_file_range_async in order to get rid of the BUG_ON in the middle of the loop. Additionally it reworks the inner loop in the hopes of making it more understandable. The idea is to make async_cow be a top-level structured, shared amongst all chunks being sent for compression. This allows to perform one memory allocation at the beginning and gracefully fail the IO if there isn't enough memory. Now, each chunk is going to be described by an async_chunk struct. It's the responsibility of the final chunk to actually free the memory. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:48:18 +02:00
Josef Bacik	c8eaeac7b7	btrfs: reserve delalloc metadata differently With the per-inode block reserves we started refilling the reserve based on the calculated size of the outstanding csum bytes and extents for the inode, including the amount we were adding with the new operation. However, generic/224 exposed a problem with this approach. With 1000 files all writing at the same time we ended up with a bunch of bytes being reserved but unusable. When you write to a file we reserve space for the csum leaves for those bytes, the number of extent items required to cover those bytes, and a single transaction item for updating the inode at ordered extent finish for that range of bytes. This is held until the ordered extent finishes and we release all of the reserved space. If a second write comes in at this point we would add a single reservation for the new outstanding extent and however many reservations for the csum leaves. At this point we find the delta of how much we have reserved and how much outstanding size this is and attempt to reserve this delta. If the first write finishes it will not release any space, because the space it had reserved for the initial write is still needed for the second write. However some space would have been used, as we have added csums, extent items, and dirtied the inode. Our reserved space would be > 0 but less than the total needed reserved space. This is just for a single inode, now consider generic/224. This has 1000 inodes writing in parallel to a very small file system, 1GiB. In my testing this usually means we get about a 120MiB metadata area to work with, more than enough to allow the writes to continue, but not enough if all of the inodes are stuck trying to reserve the slack space while continuing to hold their leftovers from their initial writes. Fix this by pre-reserved _only_ for the space we are currently trying to add. Then once that is successful modify our inodes csum count and outstanding extents, and then add the newly reserved space to the inodes block_rsv. This allows us to actually pass generic/224 without running out of metadata space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-05-02 13:47:12 +02:00
Al Viro	4e9036042f	ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour To choose whether to pick the GID from the old (16bit) or new (32bit) field, we should check if the old gid field is set to 0xffff. Mainline checks the old UID field instead - cut'n'paste from the corresponding code in ufs_get_inode_uid(). Fixes: `252e211e90` Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-02 02:24:50 -04:00
Eric Sandeen	910832697c	xfs: change some error-less functions to void types There are several functions which have no opportunity to return an error, and don't contain any ASSERTs which could be argued to be better constructed as error cases. So, make them voids to simplify the callers. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2019-05-01 20:26:30 -07:00
Christoph Hellwig	cbbf4c0be8	iomap: move iomap_read_inline_data around iomap_read_inline_data ended up being placed in the middle of the bio based read I/O completion handling, which tends to confuse the heck out of me whenever I follow the code. Move it to a more suitable place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-05-01 20:16:40 -07:00
Al Viro	f276ae0dd6	orangefs: make use of ->free_inode() Acked-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:27 -04:00
Al Viro	b62de32257	hugetlb: make use of ->free_inode() moving synchronous parts of ->destroy_inode() to ->evict_inode() is not possible here - they are balancing the stuff done in ->alloc_inode(), not the things acquired while using it or sanity checks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:27 -04:00
Al Viro	0b269ded4e	overlayfs: make use of ->free_inode() synchronous parts are left in ->destroy_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:27 -04:00
Al Viro	b3b4a6e356	jfs: switch to ->free_inode() synchronous part can be moved to ->evict_inode(), the rest - ->free_inode() fodder Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	9baf28bbfe	fuse: switch to ->free_inode() fuse_destroy_inode() is gone - sanity checks that need the stack trace of the caller get moved into ->evict_inode(), the rest joins the RCU-delayed part which becomes ->free_inode(). While we are at it, don't just pass the address of what happens to be the first member of structure to kmem_cache_free() - get_fuse_inode() is there for purpose and it gives the proper container_of() use. No behaviour change, but verifying correctness is easier that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	94053139d4	ext4: make use of ->free_inode() the rest of this ->destroy_inode() instance could probably be folded into ext4_evict_inode() Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	586a94fdc9	ecryptfs: make use of ->free_inode() no idea if crypto destruction could be moved there as well Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	cfa6d41263	ceph: use ->free_inode() a lot of non-delayed work in this case; all of that is left in ->destroy_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	26602cab41	btrfs: use ->free_inode() a lot of stuff remains in ->destroy_inode() Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	51b9fe48c4	afs: switch to use of ->free_inode() debugging printks left in ->destroy_inode() and so's the update of inode count; we could take the latter to RCU-delayed part (would take only moving the check on module exit past rcu_barrier() there), but debugging output ought to either stay where it is or go into ->evict_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	a2b757fe0f	ntfs: switch to ->free_inode() move the synchronous stuff from ->destroy_inode() to ->evict_inode(), turn the RCU-delayed part into ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	98835e884c	ufs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	d984892bd7	coda: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:26 -04:00
Al Viro	6becf8edf1	sysv: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	a78bb3838d	udf: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	dc43175996	ubifs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	56b5af1931	squashfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	bcb8d71bda	romfs: convert to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	a5a8cbea63	reiserfs: convert to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	45c2a3ff3a	qnx6: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	bc40ddd12c	qnx4: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	4aa6b55c05	procfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	363db959ae	openpromfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	e91b9194bc	ocfs2: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	9fbc000786	dlmfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	977c3d1894	nilfs2: switch to ->free_inode() kill an extern that went stale 9 years ago, while we are at it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	ca1a199e3b	nfs{,4}: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	d67a398a5f	minix: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	db0bd7b719	jffs2: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:25 -04:00
Al Viro	07b0120710	isofs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	4d436d5cd5	hpfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	08ccfc5c36	hostfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	08ab229393	hfsplus: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	6d845e2286	hfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	784494e1d7	gfs2: switch to ->free_inode() ... and use GFS2_I() to get the containing gfs2_inode by inode; yes, we can feed the address of the first member of structure to kmem_cache_free(), but let's do it in an obviously safe way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	9f179271e7	freevxfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	f9ec991d41	fat: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	d01718a050	f2fs: switch to ->free_inode() Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	a2d1b88bec	ext2: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	f415c51123	efs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	6234ddf429	debugfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	c2e6802e7b	cifs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	41149cb08a	bdev: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	8d8fc9cbc7	bfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:24 -04:00
Al Viro	49f82a808b	befs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:23 -04:00
Al Viro	312a679183	affs: switch to ->free_inode() Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:23 -04:00
Al Viro	8f05a79535	adfs: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:23 -04:00
Al Viro	5e8a0770c0	9p: switch to ->free_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:43:23 -04:00
Al Viro	fdb0da89f4	new inode method: ->free_inode() A lot of ->destroy_inode() instances end with call_rcu() of a callback that does RCU-delayed part of freeing. Introduce a new method for doing just that, with saner signature. Rules: ->destroy_inode ->free_inode f g immediate call of f(), RCU-delayed call of g() f NULL immediate call of f(), no RCU-delayed calls NULL g RCU-delayed call of g() NULL NULL RCU-delayed default freeing IOW, NULL ->free_inode gives the same behaviour as now. Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could mandate the latter form, but that would have very little benefit beyond making rules a bit more symmetric. It would break backwards compatibility, require extra boilerplate and expected semantics for (NULL, NULL) pair would have no use whatsoever... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-05-01 22:37:39 -04:00
Linus Torvalds	7e74e235bb	gcc-9: don't warn about uninitialized btrfs extent_type variable The 'extent_type' variable does seem to be reliably initialized, but it's _very_ non-obvious, since there's a "goto next" case that jumps over the normal initialization. That will then always trigger the "start >= extent_end" test, which will end up never falling through to the use of that variable. But the code is certainly not obvious, and the compiler warning looks reasonable. Make 'extent_type' an int, and initialize it to an invalid negative value, which seems to be the common pattern in other places. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-05-01 12:19:20 -07:00
Jan Kara	11a6f8e2db	fsnotify: Clarify connector assignment in fsnotify_add_mark_list() Add a comment explaining why WRITE_ONCE() is enough when setting mark->connector which can get dereferenced by RCU protected readers. Signed-off-by: Jan Kara <jack@suse.cz>	2019-05-01 18:05:11 +02:00
Mark Rutland	d4ef647510	io_uring: avoid page allocation warnings In io_sqe_buffer_register() we allocate a number of arrays based on the iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we can still attempt to allocate arrays exceeding MAX_ORDER. On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320 bytes, which is greater than 4MiB. This results in SLUB warning that we're trying to allocate greater than MAX_ORDER, and failing the allocation. Avoid this by using kvmalloc() for allocations dependent on the user-provided iov_len. At the same time, fix a leak of imu->bvec when registration fails. Full splat from before this patch: WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132 alloc_pages include/linux/gfp.h:509 [inline] kmalloc_order+0x20/0x50 mm/slab_common.c:1231 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243 kmalloc_large include/linux/slab.h:480 [inline] __kmalloc+0x3dc/0x4f0 mm/slub.c:3791 kmalloc_array include/linux/slab.h:670 [inline] io_sqe_buffer_register fs/io_uring.c:2472 [inline] __io_uring_register fs/io_uring.c:2962 [inline] __do_sys_io_uring_register fs/io_uring.c:3008 [inline] __se_sys_io_uring_register fs/io_uring.c:2990 [inline] __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds.. Fixes: `edafccee56` ("io_uring: add support for pre-mapped user IO buffers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 10:00:25 -06:00
Andreas Gruenbacher	df0db3ecdb	iomap: Add a page_prepare callback Move the page_done callback into a separate iomap_page_ops structure and add a page_prepare calback to be called before the next page is written to. In gfs2, we'll want to start a transaction in page_prepare and end it in page_done. Other filesystems that implement data journaling will require the same kind of mechanism. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-05-01 07:47:37 -07:00
Andreas Gruenbacher	7a77dad7e3	iomap: Fix use-after-free error in page_done callback In iomap_write_end, we're not holding a page reference anymore when calling the page_done callback, but the callback needs that reference to access the page. To fix that, move the put_page call in __generic_write_end into the callers of __generic_write_end. Then, in iomap_write_end, put the page after calling the page_done callback. Reported-by: Jan Kara <jack@suse.cz> Fixes: `63899c6f88` ("iomap: add a page_done callback") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-05-01 07:47:37 -07:00
Andreas Gruenbacher	26ddb1f4fd	fs: Turn __generic_write_end into a void function The VFS-internal __generic_write_end helper always returns the value of its @copied argument. This can be confusing, and it isn't very useful anyway, so turn __generic_write_end into a function returning void instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-05-01 07:47:37 -07:00
Christoph Hellwig	dbc582b6fb	iomap: Clean up __generic_write_end calling Move the call to __generic_write_end into iomap_write_end instead of duplicating it in each of the three branches. This requires open coding the generic_write_end for the buffer_head case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-05-01 07:47:37 -07:00
Ming Lei	60a27b906d	block: fix handling for BIO_NO_PAGE_REF Commit `399254aaf4` ("block: add BIO_NO_PAGE_REF flag") introduces BIO_NO_PAGE_REF, and once this flag is set for one bio, all pages in the bio won't be get/put during IO. However, if one bio is submitted via __blkdev_direct_IO_simple(), even though BIO_NO_PAGE_REF is set, pages still may be put. Fixes this issue by avoiding to put pages if BIO_NO_PAGE_REF is set. Fixes: `399254aaf4` ("block: add BIO_NO_PAGE_REF flag") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 08:38:47 -06:00
Jens Axboe	817869d251	io_uring: drop req submit reference always in async punt If we don't end up actually calling submit in io_sq_wq_submit_work(), we still need to drop the submit reference to the request. If we don't, then we can leak the request. This can happen if we race with ring shutdown while flushing the workqueue for requests that require use of the mm_struct. Fixes: `e65ef56db4` ("io_uring: use regular request ref counts") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 08:38:47 -06:00
Mark Rutland	52e04ef4c9	io_uring: free allocated io_memory once If io_allocate_scq_urings() fails to allocate an sq_* region, it will call io_mem_free() for any previously allocated regions, but leave dangling pointers to these regions in the ctx. Any regions which have not yet been allocated are left NULL. Note that when returning -EOVERFLOW, the previously allocated sq_ring is not freed, which appears to be an unintentional leak. When io_allocate_scq_urings() fails, io_uring_create() will call io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_* regions, assuming the pointers are valid and not NULL. This can result in pages being freed multiple times, which has been observed to corrupt the page state, leading to subsequent fun. This can also result in virt_to_page() on NULL, resulting in the use of bogus page addresses, and yet more subsequent fun. The latter can be detected with CONFIG_DEBUG_VIRTUAL on arm64. Adding a cleanup path to io_allocate_scq_urings() complicates the logic, so let's leave it to io_ring_ctx_free() to consistently free these pointers, and simplify the io_allocate_scq_urings() error paths. Full splats from before this patch below. Note that the pointer logged by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and is actually NULL. [ 26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0 [ 26.102976] flags: 0x63fffc000000() [ 26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000 [ 26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 [ 26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 26.143960] ------------[ cut here ]------------ [ 26.146020] kernel BUG at include/linux/mm.h:547! [ 26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 26.149163] Modules linked in: [ 26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb) [ 26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18 [ 26.156566] Hardware name: linux,dummy-virt (DT) [ 26.158089] pstate: 40400005 (nZcv daif +PAN -UAO) [ 26.159869] pc : io_mem_free+0x9c/0xa8 [ 26.161436] lr : io_mem_free+0x9c/0xa8 [ 26.162720] sp : ffff000013003d60 [ 26.164048] x29: ffff000013003d60 x28: ffff800025048040 [ 26.165804] x27: 0000000000000000 x26: ffff800025048040 [ 26.167352] x25: 00000000000000c0 x24: ffff0000112c2820 [ 26.169682] x23: 0000000000000000 x22: 0000000020000080 [ 26.171899] x21: ffff80002143b418 x20: ffff80002143b400 [ 26.174236] x19: ffff80002143b280 x18: 0000000000000000 [ 26.176607] x17: 0000000000000000 x16: 0000000000000000 [ 26.178997] x15: 0000000000000000 x14: 0000000000000000 [ 26.181508] x13: 00009178a5e077b2 x12: 0000000000000001 [ 26.183863] x11: 0000000000000000 x10: 0000000000000980 [ 26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20 [ 26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118 [ 26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000 [ 26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00 [ 26.196642] x1 : 0000000000000000 x0 : 000000000000003e [ 26.198892] Call trace: [ 26.199893] io_mem_free+0x9c/0xa8 [ 26.201155] io_ring_ctx_wait_and_kill+0xec/0x180 [ 26.202688] io_uring_setup+0x6c4/0x6f0 [ 26.204091] __arm64_sys_io_uring_setup+0x18/0x20 [ 26.205576] el0_svc_common.constprop.0+0x7c/0xe8 [ 26.207186] el0_svc_handler+0x28/0x78 [ 26.208389] el0_svc+0x8/0xc [ 26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000) [ 26.211995] ---[ end trace bdb81cd43a21e50d ]--- [ 81.770626] ------------[ cut here ]------------ [ 81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 ( (null)) [ 81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68 [ 81.831202] Modules linked in: [ 81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19 [ 81.835616] Hardware name: linux,dummy-virt (DT) [ 81.836863] pstate: 60400005 (nZCv daif +PAN -UAO) [ 81.838727] pc : __virt_to_phys+0x48/0x68 [ 81.840572] lr : __virt_to_phys+0x48/0x68 [ 81.842264] sp : ffff80002cf67c70 [ 81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18 [ 81.846463] x27: 0000000000000000 x26: 0000000020000080 [ 81.849148] x25: 0000000000000000 x24: ffff80001bb01f40 [ 81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60 [ 81.854351] x21: ffff800014358cc0 x20: ffff800014358d98 [ 81.856711] x19: 0000000000000000 x18: 0000000000000000 [ 81.859132] x17: 0000000000000000 x16: 0000000000000000 [ 81.861586] x15: 0000000000000000 x14: 0000000000000000 [ 81.863905] x13: 0000000000000000 x12: ffff1000037603e9 [ 81.866226] x11: 1ffff000037603e8 x10: 0000000000000980 [ 81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920 [ 81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47 [ 81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000 [ 81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58 [ 81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000 [ 81.880453] Call trace: [ 81.881164] __virt_to_phys+0x48/0x68 [ 81.882919] io_mem_free+0x18/0x110 [ 81.886585] io_ring_ctx_wait_and_kill+0x13c/0x1f0 [ 81.891212] io_uring_setup+0xa60/0xad0 [ 81.892881] __arm64_sys_io_uring_setup+0x2c/0x38 [ 81.894398] el0_svc_common.constprop.0+0xac/0x150 [ 81.896306] el0_svc_handler+0x34/0x88 [ 81.897744] el0_svc+0x8/0xc [ 81.898715] ---[ end trace b4a703802243cbba ]--- Fixes: `2b188cc1bb` ("Add io_uring IO interface") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 08:38:47 -06:00
Mark Rutland	975554b03e	io_uring: fix SQPOLL cpu validation In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu value from userspace. On v5.1-rc7 on arm64 with CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat: WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline] There was an attempt to fix this in commit: `917257daa0` ("io_uring: only test SQPOLL cpu after we've verified it") ... by adding a check after the cpu value had been limited to NR_CPU_IDS using array_index_nospec(). However, this left an unbound check at the start of the function, for which the warning still fires. Let's fix this correctly by checking that the cpu value is bound by nr_cpu_ids before passing it to cpu_possible(). Note that only nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs. Following the intent from the commit message for `917257daa0`, the check is moved under the SQ_AFF branch, which is the only branch where the cpu values is consumed. The check is performed before bounding the value with array_index_nospec() so that we don't silently accept bogus cpu values from userspace, where array_index_nospec() would force these values to 0. I suspect we can remove the array_index_nospec() call entirely, but I've conservatively left that in place, updated to use nr_cpu_ids to match the prior check. Tested on arm64 with the Syzkaller reproducer: https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293 https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000 Full splat from before this patch: WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c cpu_max_bits_warn include/linux/cpumask.h:121 [inline] cpumask_check include/linux/cpumask.h:128 [inline] cpumask_test_cpu include/linux/cpumask.h:344 [inline] io_sq_offload_start fs/io_uring.c:2244 [inline] io_uring_create fs/io_uring.c:2864 [inline] io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 __do_sys_io_uring_setup fs/io_uring.c:2929 [inline] __se_sys_io_uring_setup fs/io_uring.c:2926 [inline] __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds.. Fixes: `917257daa0` ("io_uring: only test SQPOLL cpu after we've verified it") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Simplied the logic Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 08:38:37 -06:00
Jens Axboe	5c8b0b54db	io_uring: have submission side sqe errors post a cqe Currently we only post a cqe if we get an error OUTSIDE of submission. For submission, we return the error directly through io_uring_enter(). This is a bit awkward for applications, and it makes more sense to always post a cqe with an error, if the error happens on behalf of an sqe. This changes submission behavior a bit. io_uring_enter() returns -ERROR for an error, and > 0 for number of sqes submitted. Before this change, if you wanted to submit 8 entries and had an error on the 5th entry, io_uring_enter() would return 4 (for number of entries successfully submitted) and rewind the sqring. The application would then have to peek at the sqring and figure out what was wrong with the head sqe, and then skip it itself. With this change, we'll return 5 since we did consume 5 sqes, and the last sqe (with the error) will result in a cqe being posted with the error. This makes the logic easier to handle in the application, and it cleans up the submission part. Suggested-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-01 06:37:55 -06:00
Eric Biggers	6ee9706aa2	libfs: document simple_get_link() Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-04-30 23:59:25 -04:00
Debabrata Banerjee	50b29d8f03	ext4: fix ext4_show_options for file systems w/o journal Instead of removing EXT4_MOUNT_JOURNAL_CHECKSUM from s_def_mount_opt as I assume was intended, all other options were blown away leading to _ext4_show_options() output being incorrect. Fixes: `1e381f60da` ("ext4: do not allow journal_opts for fs w/o journal") Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org	2019-04-30 23:08:15 -04:00
Linus Torvalds	f2bc9c908d	Merge tag 'fsnotify_for_v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fix from Jan Kara: "A fix of user trigerable NULL pointer dereference syzbot has recently spotted. The problem was introduced in this merge window so no CC stable is needed" * tag 'fsnotify_for_v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: Fix NULL ptr deref in fanotify_get_fsid()	2019-04-30 15:03:00 -07:00
Chengguang Xu	632a9f3acd	quota: check time limit when back out space/inode change When we fail from allocating inode/space, we back out the change we already did. In a special case which has exceeded soft limit by the change, we should also check time limit and reset it properly. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jan Kara <jack@suse.cz>	2019-04-30 18:05:55 +02:00
Stefan Bühler	62977281a6	io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP There is no operation to order with afterwards, and removing the flag is not critical in any way. There will always be a "race condition" where the application will trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	b841f19524	io_uring: remove unnecessary barrier after incrementing dropped counter smp_store_release in io_commit_sqring already orders the store to dropped before the update to SQ head. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	82ab082c0e	io_uring: remove unnecessary barrier before reading SQ tail There is no operation before to order with. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	9e4c15a393	io_uring: remove unnecessary barrier after updating SQ head There is no operation afterwards to order with. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	115e12e58d	io_uring: remove unnecessary barrier before reading cq head The memory operations before reading cq head are unrelated and we don't care about their order. Document that the control dependency in combination with READ_ONCE and WRITE_ONCE forms a barrier we need. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	4f7067c3fb	io_uring: remove unnecessary barrier before wq_has_sleeper wq_has_sleeper has a full barrier internally. The smp_rmb barrier in io_uring_poll synchronizes with it. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	1e84b97b73	io_uring: fix notes on barriers The application reading the CQ ring needs a barrier to pair with the smp_store_release in io_commit_cqring, not the barrier after it. Also a write barrier after writing something (but not before writing anything interesting) doesn't order anything, so an smp_wmb() after writing SQ tail is not needed. Additionally consider reading SQ head and writing CQ tail in the notes. Also add some clarifications how the various other fields in the ring buffers are used. Signed-off-by: Stefan Bühler <source@stbuehler.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Stefan Bühler	8449eedaa1	io_uring: fix handling SQEs requesting NOWAIT Not all request types set REQ_F_FORCE_NONBLOCK when they needed async punting; reverse logic instead and set REQ_F_NOWAIT if request mustn't be punted. Signed-off-by: Stefan Bühler <source@stbuehler.de> Merged with my previous patch for this. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:40:02 -06:00
Christoph Hellwig	2b070cfe58	block: remove the i argument to bio_for_each_segment_all We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-30 09:26:13 -06:00
Darrick J. Wong	75efa57d0b	xfs: add online scrub for superblock counters Teach online scrub how to check the filesystem summary counters. We use the incore delalloc block counter along with the incore AG headers to compute expected values for fdblocks, icount, and ifree, and then check that the percpu counter is within a certain threshold of the expected value. This is done to avoid having to freeze or otherwise lock the filesystem, which means that we're only checking that the counters are fairly close, not that they're exactly correct. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-04-30 08:19:13 -07:00
Christoph Hellwig	9407928575	xfs: don't parse the mtpt mount option The text isn't really any more useful than the default unknown option handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2019-04-30 08:19:13 -07:00
Darrick J. Wong	710d707d2f	xfs: always rejoin held resources during defer roll During testing of xfs/141 on a V4 filesystem, I observed some inconsistent behavior with regards to resources that are held (i.e. remain locked) across a defer roll. The transaction roll always gives the defer roll function a new transaction, even if committing the old transaction fails. However, the defer roll function only rejoins the held resources if the transaction commit succeedied. This means that callers of defer roll have to figure out whether the held resources are attached to the transaction being passed back. Worse yet, if the defer roll was part of a defer finish call, we have a third possibility: the defer finish could pass back a dirty transaction with dirty held resources and an error code. The only sane way to handle all of these scenarios is to require that the code that held the resource either cancel the transaction before unlocking and releasing the resources, or use functions that detach resources from a transaction properly (e.g. xfs_trans_brelse) if they need to drop the reference before committing or cancelling the transaction. In order to make this so, change the defer roll code to join held resources to the new transaction unconditionally and fix all the bhold callers to release the held buffers correctly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2019-04-30 08:19:13 -07:00
Josef Bacik	4297ff84dc	btrfs: track DIO bytes in flight When diagnosing a slowdown of generic/224 I noticed we were not doing anything when calling into shrink_delalloc(). This is because all writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes counter is 0, which short circuits most of the work inside of shrink_delalloc(). However O_DIRECT writes still consume metadata resources and generate ordered extents, which we can still wait on. Fix this by tracking outstanding DIO write bytes, and use this as well as the delalloc bytes counter to decide if we need to lookup and wait on any ordered extents. If we have more DIO writes than delalloc bytes we'll go ahead and wait on any ordered extents regardless of our flush state as flushing delalloc is likely to not gain us anything. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ use dio instead of odirect in identifiers ] Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:25:37 +02:00
Anand Jain	da9b6ec829	btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop Since now the trans argument is never NULL in btrfs_set_prop we don't have to check. So delete it and use btrfs_setxattr that makes use of that. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:54 +02:00
Anand Jain	717ebdc320	btrfs: delete unused function btrfs_set_prop_trans The last consumer of btrfs_set_prop_trans() was taken away by the patch ("btrfs: start transaction in xattr_handler_set_prop") so now this function can be deleted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:54 +02:00
Anand Jain	b3f6a4be13	btrfs: start transaction in xattr_handler_set_prop btrfs specific extended attributes on the inode are set using btrfs_xattr_handler_set_prop(), and the required transaction for this update is started by btrfs_setxattr(). For better visibility of the transaction start and end, do this in btrfs_xattr_handler_set_prop(). For which this patch copied code of btrfs_setxattr() as it is in the original, which needs proper error handling. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:54 +02:00
Anand Jain	44e5194b5e	btrfs: drop local copy of inode i_mode There isn't real use of making struct inode::i_mode a local copy, it saves a dereference one time, not much. Just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	3c8d8b6357	btrfs: drop old_fsflags in btrfs_ioctl_setflags btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only once. Instead used it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	d2b8fcfe43	btrfs: modify local copy of btrfs_inode flags Instead of updating the binode::flags directly, update a local copy, and then at the point of no error, store copy it to the binode::flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	11d3cd5c62	btrfs: drop useless inode i_flags copy and restore The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the inode::i_flags update functions such as btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called in btrfs_ioctl_setflags() instead of btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the inode::i_flags remains unmodified until the thread has checked all the conditions. So drop the saved inode::i_flags in out_i_flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	ff9fef559b	btrfs: start transaction in btrfs_ioctl_setflags() Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This flags also includes compression attribute for which we would set/reset the compression extended attribute. While doing this there is a bit of duplicate code, the following things happens twice: - start/end_transaction - inode_inc_iversion() - current_time update to inode->i_ctime - and btrfs_update_inode() These are updated both at btrfs_ioctl_setflags() and btrfs_set_props() as well. This patch merges these two duplicate codes at btrfs_ioctl_setflags(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	cd31af158b	btrfs: export btrfs_set_prop Make btrfs_set_prop() a non-static function, so that it can be called from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of btrfs_set_prop_trans() so that we can use the transaction which is already started in the current thread. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:53 +02:00
Anand Jain	f22125e5d8	btrfs: refactor btrfs_set_props to validate externally In preparation to merge multiple transactions when setting the compression flags, split btrfs_set_props() validation part outside of it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Qu Wenruo	7c15d41016	btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe We have a long standing problem with reversed keys that's detected by btrfs_set_item_key_safe. This is hard to reproduce so we'd like to capture more information for later analysis. Let's dump the leaf content before triggering BUG_ON() so that we can have some clue on what's going wrong. The output of tree locks should help us to debug such problem. Sample stacktrace: generic/522 [00:07:05] [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05 [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192! [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G W 5.1.0-rc5-default+ #562 [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014 [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.493845] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.495608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 [27161.498100] Call Trace: [27161.498771] __btrfs_drop_extents+0x6ec/0xdf0 [btrfs] [27161.499872] btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs] [27161.501114] btrfs_log_inode+0x7ff/0xdc0 [btrfs] [27161.502114] ? __mutex_unlock_slowpath+0x4b/0x2b0 [27161.503172] btrfs_log_inode_parent+0x237/0x9c0 [btrfs] [27161.504348] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [27161.505374] btrfs_sync_file+0x1b7/0x480 [btrfs] [27161.506371] __x64_sys_msync+0x180/0x210 [27161.507208] do_syscall_64+0x54/0x180 [27161.507932] entry_SYSCALL_64_after_hwframe+0x49/0xbe [27161.508839] RIP: 0033:0x7f8ea5aa9c61 [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61 [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000 [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000 [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938 [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000 [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug] [27161.522272] ---[ end trace d5afec7ccac6a252 ]--- [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.534405] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.536048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Qu Wenruo	02529d7a10	btrfs: tree-checker: Allow error injection for tree-checker Allowing error injection for btrfs_check_leaf_full() and btrfs_check_node() is useful to test the failure path of btrfs write time tree check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Nikolay Borisov	51d470aeaa	btrfs: Document btrfs_csum_one_bio Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Filipe Manana	b8aa330d2a	Btrfs: improve performance on fsync of files with multiple hardlinks Commit `41bd606769` ("Btrfs: fix fsync of files with multiple hard links in new directories") introduced a path that makes fsync fallback to a full transaction commit in order to avoid losing hard links and new ancestors of the fsynced inode. That path is triggered only when the inode has more than one hard link and either has a new hard link created in the current transaction or the inode was evicted and reloaded in the current transaction. That path ends up getting triggered very often (hundreds of times) during the course of pgbench benchmarks, resulting in performance drops of about 20%. This change restores the performance by not triggering the full transaction commit in those cases, and instead iterate the fs/subvolume tree in search of all possible new ancestors, for all hard links, to log them. Reported-by: Zhao Yuhu <zyuhu@suse.com> Tested-by: James Wang <jnwang@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Filipe Manana	62d54f3a7f	Btrfs: fix race between send and deduplication that lead to failures and crashes Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expection is due to two different reasons: 1) When it was introduced, no operations were allowed to modifiy read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currenly being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:52 +02:00
Filipe Manana	9f89d5de86	Btrfs: send, flush dellaloc in order to avoid data loss When we set a subvolume to read-only mode we do not flush dellaloc for any of its inodes (except if the filesystem is mounted with -o flushoncommit), since it does not affect correctness for any subsequent operations - except for a future send operation. The send operation will not be able to see the delalloc data since the respective file extent items, inode item updates, backreferences, etc, have not hit yet the subvolume and extent trees. Effectively this means data loss, since the send stream will not contain any data from existing delalloc. Another problem from this is that if the writeback starts and finishes while the send operation is in progress, we have the subvolume tree being being modified concurrently which can result in send failing unexpectedly with EIO or hitting runtime errors, assertion failures or hitting BUG_ONs, etc. Simple reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo $ btrfs property set /mnt/sv ro true $ btrfs send -f /tmp/send.stream /mnt/sv $ od -t x1 -A d /mnt/sv/foo 0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea * 0110592 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /tmp/send.stream /mnt $ echo $? 0 $ od -t x1 -A d /mnt/sv/foo 0000000 # ---> empty file Since this a problem that affects send only, fix it in send by flushing dellaloc for all the roots used by the send operation before send starts to process the commit roots. This is a problem that affects send since it was introduced (commit `31db9f7c23` ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")) but backporting it to older kernels has some dependencies: - For kernels between 3.19 and 4.20, it depends on commit `3cd24c6980` ("btrfs: use tagged writepage to mitigate livelock of snapshot") because the function btrfs_start_delalloc_snapshot() does not exist before that commit. So one has to either pick that commit or replace the calls to btrfs_start_delalloc_snapshot() in this patch with calls to btrfs_start_delalloc_inodes(). - For kernels older than 3.19 it also requires commit `e5fa8f865b` ("Btrfs: ensure send always works on roots without orphans") because it depends on the function ensure_commit_roots_uptodate() which that commits introduced. - No dependencies for 5.0+ kernels. A test case for fstests follows soon. CC: stable@vger.kernel.org # 3.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
Filipe Manana	03628cdbc6	Btrfs: do not start a transaction during fiemap During fiemap, for regular extents (non inline) we need to check if they are shared and if they are, set the shared bit. Checking if an extent is shared requires checking the delayed references of the currently running transaction, since some reference might have not yet hit the extent tree and be only in the in-memory delayed references. However we were using a transaction join for this, which creates a new transaction when there is no transaction currently running. That means that two more potential failures can happen: creating the transaction and committing it. Further, if no write activity is currently happening in the system, and fiemap calls keep being done, we end up creating and committing transactions that do nothing. In some extreme cases this can result in the commit of the transaction created by fiemap to fail with ENOSPC when updating the root item of a subvolume tree because a join does not reserve any space, leading to a trace like the following: heisenberg kernel: ------------[ cut here ]------------ heisenberg kernel: BTRFS: Transaction aborted (error -28) heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 heisenberg kernel: Call Trace: heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] heisenberg kernel: ? _cond_resched+0x15/0x30 heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] heisenberg kernel: kthread+0x112/0x130 heisenberg kernel: ? kthread_bind+0x30/0x30 heisenberg kernel: ret_from_fork+0x35/0x40 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do a transaction join to avoid the overhead of creating a new transaction (if there is currently no running transaction) and introducing a potential point of failure when the new transaction gets committed, instead use a transaction attach to grab a handle for the currently running transaction if any. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ Fixes: `afce772e87` ("btrfs: fix check_shared for fiemap ioctl") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
David Sterba	f5c8daa5b2	btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
David Sterba	c6e340bc1c	btrfs: remove unused parameter fs_info from btrfs_add_delayed_extent_op Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
David Sterba	5c5aff98f8	btrfs: remove unused parameter fs_info from emit_last_fiemap_cache Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
David Sterba	033774dc5a	btrfs: remove unused parameter fs_info from CHECK_FE_ALIGNED Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:51 +02:00
David Sterba	179d1e6a3b	btrfs: remove unused parameter fs_info from from tree_advance Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:50 +02:00
David Sterba	c7da9597fe	btrfs: remove unused parameter fs_info from tree_move_down Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:50 +02:00
David Sterba	c71dd88007	btrfs: remove unused parameter fs_info from btrfs_extend_item Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:50 +02:00
David Sterba	78ac4f9e5a	btrfs: remove unused parameter fs_info from btrfs_truncate_item Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:50 +02:00
David Sterba	25263cd7ce	btrfs: remove unused parameter fs_info from split_item Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:50 +02:00
Qu Wenruo	c4140cbf35	btrfs: qgroup: Don't scan leaf if we're modifying reloc tree Since reloc tree doesn't contribute to qgroup numbers, just skip them. This should catch the final cause of unnecessary data ref processing when running balance of metadata with qgroups on. The 4G data 16 snapshots test (*) should explain it pretty well: \| delayed subtree \| refactor delayed ref \| this patch --------------------------------------------------------------------- relocated \| 22653 \| 22673 \| 22744 qgroup dirty \| 122792 \| 48360 \| 70 time \| 24.494 \| 11.606 \| 3.944 Finally, we're at the stage where qgroup + metadata balance cost no obvious overhead. Test environment: Test VM: - vRAM 8G - vCPU 8 - block dev vitrio-blk, 'unsafe' cache mode - host block 850evo Test workload: - Copy 4G data from /usr/ to one subvolume - Create 16 snapshots of that subvolume, and modify 3 files in each snapshot - Enable quota, rescan - Time "btrfs balance start -m" Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	ffd4bb2a19	btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long parameter list and the confusing @owner parameter. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	82fa113fcc	btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	ddf30cf03f	btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes() Since add_pinned_bytes() only needs to know if the extent is metadata and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as we don't need various owner/level trick to determine extent type. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	8a5040f7d9	btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod() It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as btrfs_ref describes a metadata/data reference update comprehensively. Now we have one less function use confusing owner/level trick. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	76675593b6	btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref() Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor btrfs_add_delayed_data_ref(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:49 +02:00
Qu Wenruo	ed4f255b9b	btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref() btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and some callers like btrfs_inc_extent_ref() are using @owner as level for delayed tree ref. Instead of making the parameter list longer, use btrfs_ref to refactor it, so each parameter assignment should be self-explaining without dirty level/owner trick, and provides the basis for later refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
Qu Wenruo	dd28b6a5aa	btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref The process_func function pointer is local to __btrfs_mod_ref() and points to either btrfs_inc_extent_ref() or btrfs_free_extent(). Open code it to make later delayed ref refactor easier, so we can refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
Qu Wenruo	b28b1f0ce4	btrfs: delayed-ref: Introduce better documented delayed ref structures Current delayed ref interface has several problems: - Longer and longer parameter lists bytenr num_bytes parent ---------- so far so good ref_root owner offset ---------- I don't feel good now - Different interpretation of the same parameter Above @owner for data ref is inode number (u64), while for tree ref, it's level (int). They are even in different size range. For level we only need 0 ~ 8, while for ino it's BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID. And @offset doesn't even make sense for tree ref. Such parameter reuse may look clever as an hidden union, but it destroys code readability. To solve both problems, we introduce a new structure, btrfs_ref to solve them: - Structure instead of long parameter list This makes later expansion easier, and is better documented. - Use btrfs_ref::type to distinguish data and tree ref - Use proper union to store data/tree ref specific structures. - Use separate functions to fill data/tree ref data, with a common generic function to fill common bytenr/num_bytes members. All parameters will find its place in btrfs_ref, and an extra member, @real_root, inspired by ref-verify code, is newly introduced for later qgroup code, to record which tree is triggered by this extent modification. This patch doesn't touch any code, but provides the basis for further refactoring. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
Filipe Manana	bfc61c3626	Btrfs: do not start a transaction at iterate_extent_inodes() When finding out which inodes have references on a particular extent, done by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both v1 and v2) ioctl and from scrub we use the transaction join API to grab a reference on the currently running transaction, since in order to give accurate results we need to inspect the delayed references of the currently running transaction. However, if there is currently no running transaction, the join operation will create a new transaction. This is inefficient as the transaction will eventually be committed, doing unnecessary IO and introducing a potential point of failure that will lead to a transaction abort due to -ENOSPC, as recently reported [1]. That's because the join, creates the transaction but does not reserve any space, so when attempting to update the root item of the root passed to btrfs_join_transaction(), during the transaction commit, we can end up failling with -ENOSPC. Users of a join operation are supposed to actually do some filesystem changes and reserve space by some means, which is not the case of iterate_extent_inodes(), it is a read-only operation for all contextes from which it is called. The reported [1] -ENOSPC failure stack trace is the following: heisenberg kernel: ------------[ cut here ]------------ heisenberg kernel: BTRFS: Transaction aborted (error -28) heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 heisenberg kernel: Call Trace: heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] heisenberg kernel: ? _cond_resched+0x15/0x30 heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] heisenberg kernel: kthread+0x112/0x130 heisenberg kernel: ? kthread_bind+0x30/0x30 heisenberg kernel: ret_from_fork+0x35/0x40 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- So fix that by using the attach API, which does not create a transaction when there is currently no running transaction. [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
David Sterba	65237ee3b6	btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:48 +02:00
David Sterba	163e97ee0d	btrfs: get fs_info from device in btrfs_scrub_cancel_dev We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
David Sterba	f331a9525f	btrfs: get fs_info from device in btrfs_rm_dev_item We can read fs_info from the device and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
David Sterba	8087c19345	btrfs: get fs_info from eb in __push_leaf_left We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
David Sterba	f72f0010b2	btrfs: get fs_info from eb in __push_leaf_right We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
Nikolay Borisov	50489a5734	btrfs: Remove bio_offset argument from submit_bio_hook None of the implementers of the submit_bio_hook use the bio_offset parameter, simply remove it. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
Nikolay Borisov	e68f2ee721	btrfs: Always pass 0 bio_offset for btree_submit_bio_start The btree submit hook queues the async csum and forwards the bio_offset parameter passed to btree_submit_bio_hook. This is redundant since btree_submit_bio_start calls btree_csum_one_bio which doesn't use the offset at all. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:47 +02:00
Nikolay Borisov	e7681167c3	btrfs: Pass 0 for bio_offset to btrfs_wq_submit_bio Buffered writeback always calls btrfs_csum_one_bio with the last 2 arguments being 0 irrespective of what the bio_offset has been passed to btrfs_submit_bio_start. Make this apparent by explicitly passing 0 for bio_offset when calling btrfs_wq_submit_bio from btrfs_submit_bio_hook. This will allow for further simplifications down the line. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
Nikolay Borisov	c2ccfbc62e	btrfs: Remove 'tree' argument from read_extent_buffer_pages This function always uses the btree inode's io_tree. Stop taking the tree as a function argument and instead access it internally from read_extent_buffer_pages. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
Nikolay Borisov	a56b1c7bc8	btrfs: Change submit_bio_hook to taking an inode directly The only possible 'private_data' that is passed to this function is actually an inode. Make that explicit by changing the signature of the call back. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
Nikolay Borisov	a9355a0ef3	btrfs: Define submit_bio_hook's type directly There is no need to use a typedef to define the type of the function and then use that to define the respective member in extent_io_ops. Define struct's member directly. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
David Sterba	2ccf545e0d	btrfs: get fs_info from block group in search_free_space_info We can read fs_info from the block group cache structure and can drop it from the parameters. Though the transaction is also availabe, it's not guaranteed to be non-NULL. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
David Sterba	2ceeae2e4c	btrfs: get fs_info from block group in btrfs_find_space_cluster We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:46 +02:00
David Sterba	6701bdb39c	btrfs: get fs_info from block group in write_pinned_extent_entries We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
David Sterba	bb6cb1c5b9	btrfs: get fs_info from block group in load_free_space_cache We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
David Sterba	7949f3392e	btrfs: get fs_info from block group in lookup_free_space_inode We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
David Sterba	fdf08605b9	btrfs: get fs_info from block group in pin_down_extent We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
David Sterba	f87b7eb821	btrfs: get fs_info from block group in next_block_group We can read fs_info from the block group cache structure and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
Filipe Manana	32b593bfcb	Btrfs: remove no longer used function to run delayed refs asynchronously It used to be called from only two places (truncate path and releasing a transaction handle), but commits `28bad21257` ("btrfs: fix truncate throttling") and `db2462a6ad` ("btrfs: don't run delayed refs in the end transaction logic") removed their calls to this function, so it's not used anymore. Just remove it and all its helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:45 +02:00
Anand Jain	e3de9b159a	btrfs: cleanup btrfs_setxattr_trans and drop transaction parameter Previous patch made sure that btrfs_setxattr_trans() is called only when transaction NULL. Clean up btrfs_setxattr_trans() and drop the parameter. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Anand Jain	04e6863b19	btrfs: split btrfs_setxattr calls regarding transaction When the caller has already created the transaction handle, btrfs_setxattr() will use it. Also adds assert in btrfs_setxattr(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Anand Jain	353c2ea735	btrfs: remove redundant readonly root check in btrfs_setxattr_trans btrfs_setxattr_trans() is called by 5 functions as below and all of them do updates. None of them would be roun on a read-only root. So its ok to remove the readonly root check here as it's a high-level conditon. 1. __btrfs_set_acl() btrfs_init_acl() btrfs_init_inode_security() 2. __btrfs_set_acl() btrfs_set_acl() 3. btrfs_set_prop() btrfs_set_prop_trans() / \ btrfs_ioctl_setflags() btrfs_xattr_handler_set_prop() 4. btrfs_xattr_handler_set() 5. btrfs_initxattrs() btrfs_xattr_security_init() btrfs_init_inode_security() Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Anand Jain	3e125a74fb	btrfs: export btrfs_setxattr Preparatory patch, as we are going split the calls with and without transaction to use the respective btrfs_setxattr() and btrfs_setxattr_trans() functions. Export btrfs_setxattr() for calls outside of xattr.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Anand Jain	2d74fa3efc	btrfs: rename do_setxattr to btrfs_setxattr When trans is not NULL btrfs_setxattr() calls do_setxattr() directly with a check for readonly root. Rename do_setxattr() btrfs_setxattr() in preparation to call do_setxattr() directly instead. Preparatory patch, no functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Anand Jain	cac237ae09	btrfs: rename btrfs_setxattr to btrfs_setxattr_trans Rename btrfs_setxattr() to btrfs_setxattr_trans(), so that do_setxattr() can be renamed to btrfs_setxattr(). Preparatory patch, no functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:44 +02:00
Qu Wenruo	31aab40207	btrfs: trace: Introduce trace events for all btrfs tree locking events Unlike btrfs_tree_lock() and btrfs_tree_read_lock(), the remaining functions in locking.c will not sleep, thus doesn't make much sense to record their execution time. Those events are introduced mainly for user space tool to audit and detect lock leakage or dead lock. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
Qu Wenruo	34e73cc930	btrfs: trace: Introduce trace events for sleepable tree lock There are two tree lock events which can sleep: - btrfs_tree_read_lock() - btrfs_tree_lock() Sometimes we may need to look into the concurrency picture of the fs. For that case, we need the execution time of above two functions and the owner of @eb. Here we introduce a trace events for user space tools like bcc, to get the execution time of above two functions, and get detailed owner info where eBPF code can't. All the overhead is hidden behind the trace events, so if events are not enabled, there is no overhead. These trace events also output bytenr and generation, allow them to be pared with unlock events to pin down deadlock. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
Filipe Manana	74f657d89c	Btrfs: remove no longer used member num_dirty_bgs from transaction The member num_dirty_bgs of struct btrfs_transaction is not used anymore, it is set and incremented but nothing reads its value anymore. Its last read use was removed by commit `64403612b7` ("btrfs: rework btrfs_check_space_for_delayed_refs"). So just remove that member. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
David Sterba	2b584c688b	btrfs: get fs_info from trans in btrfs_run_dev_replace We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
David Sterba	196c9d8de8	btrfs: get fs_info from trans in btrfs_run_dev_stats We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:43 +02:00
David Sterba	5c466629e2	btrfs: get fs_info from trans in btrfs_finish_sprout We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	6f8e0fc77c	btrfs: get fs_info from trans in init_first_rw_device We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	94f94ad972	btrfs: get fs_info from trans in copy_for_split We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	6ad3cf6df0	btrfs: get fs_info from trans in insert_ptr We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	55d32ed8d3	btrfs: get fs_info from trans in balance_node_right We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	d30a668f1b	btrfs: get fs_info from trans in push_node_left We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:42 +02:00
David Sterba	fe04153452	btrfs: get fs_info from trans in btrfs_write_out_cache We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	4ca75f1bd4	btrfs: get fs_info from trans in create_free_space_inode We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	907877664e	btrfs: get fs_info from trans in btrfs_set_log_full_commit We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	4884b8e8eb	btrfs: get fs_info from trans in btrfs_need_log_full_commit We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	9b7a2440ae	btrfs: get fs_info from trans in btrfs_create_tree We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	6b27940843	btrfs: get fs_info from trans in update_block_group We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:41 +02:00
David Sterba	5742d15fa7	btrfs: get fs_info from trans in btrfs_write_dirty_block_groups We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:40 +02:00
David Sterba	bbebb3e0ba	btrfs: get fs_info from trans in btrfs_setup_space_cache We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:40 +02:00
David Sterba	39db232dae	btrfs: get fs_info from trans in write_one_cache_group We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:40 +02:00
Nikolay Borisov	f9756261c2	btrfs: Remove redundant inode argument from btrfs_add_ordered_sum Ordered csums are keyed off of a btrfs_ordered_extent, which already has a reference to the inode. This implies that an explicit inode argument is redundant. So remove it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:40 +02:00
Qu Wenruo	8d47a0d8f7	btrfs: Do mandatory tree block check before submitting bio There are at least 2 reports about a memory bit flip sneaking into on-disk data. Currently we only have a relaxed check triggered at btrfs_mark_buffer_dirty() time, as it's not mandatory and only for CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build, it doesn't help users to detect such problem. This patch will address the hole by triggering comprehensive check on tree blocks before writing it back to disk. The design points are: - Timing of the check: Tree block write hook This timing is chosen to reduce the overhead. The comprehensive check should be as expensive as a checksum calculation. Doing full check at btrfs_mark_buffer_dirty() is too expensive for end user. - Loose empty leaf check Originally for an empty leaf, tree-checker will report error if it's not a tree root. The problem for such check at write time is: * False alert for tree root created in current transaction In that case, the commit root still needs to be written to disk. And since current root can differ from commit root, then it will cause false alert. This happens for log tree. * False alert for relocated tree block Relocated tree block can be written to disk due to memory pressure, in that case an empty csum tree root can be written to disk and cause false alert, since csum root node hasn't been updated. Previous patch of removing comprehensive empty leaf owner check has paved the way for this patch. The example error output will be something like: BTRFS critical (device dm-3): corrupt leaf: root=2 block=1350630375424 slot=68, bad key order, prev (10510212874240 169 0) current (1714119868416 169 0) BTRFS error (device dm-3): block=1350630375424 write time tree block corruption detected BTRFS: error (device dm-3) in btrfs_commit_transaction:2220: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-3): forced readonly BTRFS warning (device dm-3): Skipping commit of aborted transaction. BTRFS: error (device dm-3) in cleanup_transaction:1839: errno=-5 IO failure BTRFS info (device dm-3): delayed_refs has NO entry Reported-by: Leonard Lausen <leonard@lausen.nl> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:40 +02:00
Qu Wenruo	ff2ac107fa	btrfs: tree-checker: Remove comprehensive root owner check Commit `1ba98d086f` ("Btrfs: detect corruption when non-root leaf has zero item") introduced comprehensive root owner checker. However it's pretty expensive tree search to locate the owner root, especially when it get reused by mandatory read and write time tree-checker. This patch will remove that check, and completely rely on owner based empty leaf check, which is much faster and still works fine for most case. And since we skip the old root owner check, now write time tree check can be merged with btrfs_check_leaf_full(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
Robbie Ko	39ad317315	Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve When doing fallocate, we first add the range to the reserve_list and then reserve the quota. If quota reservation fails, we'll release all reserved parts of reserve_list. However, cur_offset is not updated to indicate that this range is already been inserted into the list. Therefore, the same range is freed twice. Once at list_for_each_entry loop, and once at the end of the function. This will result in WARN_ON on bytes_may_use when we free the remaining space. At the end, under the 'out' label we have a call to: btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset); The start offset, third argument, should be cur_offset. Everything from alloc_start to cur_offset was freed by the list_for_each_entry_safe_loop. Fixes: `18513091af` ("btrfs: update btrfs_space_info's bytes_may_use timely") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	178507595c	btrfs: get fs_info from eb in read_one_dev We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	9690ac0987	btrfs: get fs_info from eb in read_one_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	ddaf1d5aef	btrfs: get fs_info from eb in btrfs_check_chunk_valid We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	6ec0896c4c	btrfs: get fs_info from eb in should_balance_chunk We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:39 +02:00
David Sterba	813fd1dcab	btrfs: get fs_info from eb in btrfs_check_node We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
David Sterba	cfdaad5e5f	btrfs: get fs_info from eb in btrfs_check_leaf_relaxed We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
David Sterba	1c4360ee05	btrfs: get fs_info from eb in btrfs_check_leaf_full We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
Nikolay Borisov	929be17a9b	btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit Instead of always calling the allocator to search for a free extent, that satisfies the input criteria, switch btrfs_trim_free_extents to using find_first_clear_extent_bit. With this change it's no longer necessary to read the device tree in order to figure out holes in the devices. Now the code always searches in-memory data structure to figure out the space range which contains the requested which should result in speed improvements. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
Nikolay Borisov	45bfcfc168	btrfs: Implement find_first_clear_extent_bit This function is very similar to find_first_extent_bit except that it locates the first contiguous span of space which does not have bits set. It's intended use is in the freespace trimming code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
Nikolay Borisov	8811133d8a	btrfs: Optimize unallocated chunks discard Currently unallocated chunks are always trimmed. For example 2 consecutive trims on large storage would trim freespace twice irrespective of whether the space was actually allocated or not between those trims. Optimise this behavior by exploiting the newly introduced alloc_state tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark those unallocated chunks which have been trimmed and have not been allocated afterwards. On chunk allocation the respective underlying devices' physical space will have its CHUNK_TRIMMED flag cleared. This avoids submitting discards for space which hasn't been changed since the last time discard was issued. This applies to the single mount period of the filesystem as the information is not stored permanently. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:38 +02:00
Nikolay Borisov	e74e3993bc	btrfs: Factor out in_range macro This is used in more than one places so let's factor it out in ctree.h. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	60dfdf25bd	btrfs: Remove 'trans' argument from find_free_dev_extent(_start) Now that these functions no longer require a handle to transaction to inspect pending/pinned chunks the argument can be removed. At the same time also remove any surrounding code which acquired the handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Jeff Mahoney	1c11b63eff	btrfs: replace pending/pinned chunks lists with io tree The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	68c94e55e1	btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree Following the introduction of the alloc_state tree, some of the callees of btrfs_mapping_tree_free will have to interact with the btrfs_device of the constituent devices. Enable this by moving the code responsible for freeing devices after the last user (btrfs_mapping_tree_free). Otherwise the kernel could crash due to use-after-free. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	8e75fd893b	btrfs: Stop using call_rcu for device freeing btrfs_device structs are freed from RCU context since device iteration is protected by RCU. Currently this is achieved by using call_rcu since no blocking functions are called within btrfs_free_device. Future refactoring of pending/pinned chunks will require calling sleeping functions. This patch is in preparation for these changes by simply switching from RCU callbacks to explicit calls of synchronize_rcu and calling btrfs_free_device directly. This is functionally equivalent, making sure that there are no readers at that time. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:37 +02:00
Nikolay Borisov	4ca7365606	btrfs: Implement set_extent_bits_nowait It will be used in a future patch that will require modifying an extent_io_tree struct under a spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	930b090729	btrfs: Introduce new bits for device allocation tree Rather than hijacking the existing defines let's just define new bits, with more descriptive names. Instead of using yet more (currently at 18) bits for the new flags, use the fact those flags will be specific to the device allocation tree so define them using existing EXTENT_* flags. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	39e264a40d	btrfs: Populate ->orig_block_len during read_one_chunk Chunks read from disk currently don't get their ->orig_block_len member set, in contrast when a new chunk is allocated, the respective extent_map's ->orig_block_len is assigned the size of the stripe of this chunk. Let's apply the same strategy for chunks which are read from disk, not only does this codify the invariant that ->orig_block_len always contains the size of the stripe for a chunk (when the em belongs to the mapping tree). But it's also a preparatory patch for further work around tracking chunk allocation in an extent tree rather than pinned/pending lists. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	41e7acd38c	btrfs: Rename and export clear_btree_io_tree This function is going to be used to clear out the device extent allocation information. Give it a more generic name and export it. This is in preparation to replacing the pending/pinned chunk lists with an extent tree. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	61d0d0d2cb	btrfs: Handle pending/pinned chunks before blockgroup relocation during device shrink During device shrink pinned/pending chunks (i.e. those which have been deleted/created respectively, in the current transaction and haven't touched disk) need to be accounted when doing device shrink. Presently this happens after the main relocation loop in btrfs_shrink_device, which could lead to making another go in the body of the function. Since there is no hard requirement to perform pinned/pending chunks handling after the relocation loop, move the code before it. This leads to simplifying the code flow around - i.e. no need to use 'goto again'. A notable side effect of this change is that modification of the device's size requires a transaction to be started and committed before the relocation loop starts. This is necessary to ensure that relocation process sees the shrunk device size. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	bbbf7243d6	btrfs: combine device update operations during transaction commit We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose. We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:36 +02:00
Nikolay Borisov	c2d1b3aae3	btrfs: Honour FITRIM range constraints during free space trim Up until now trimming the freespace was done irrespective of what the arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments will be entirely ignored. Fix it by correctly handling those paramter. This requires breaking if the found freespace extent is after the end of the passed range as well as completing trim after trimming fstrim_range::len bytes. Fixes: `499f377f49` ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
Robbie Ko	040ee6120c	Btrfs: send, improve clone range Improve clone_range in two scenarios. 1. Remove the limit of inode size when find clone inodes We can do partial clone, so there is no need to limit the size of the candidate inode. When clone a range, we clone the legal range only by bytenr, offset, len, inode size. 2. In the scenarios of rewrite or clone_range, data_offset rarely matches exactly, so the chance of a clone is missed. e.g. 1. Write a 1M file dd if=/dev/zero of=1M bs=1M count=1 2. Clone 1M file cp --reflink 1M clone 3. Rewrite 4k on the clone file dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc The disk layout is as follows: item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53 extent data disk byte 1103101952 nr 1048576 extent data offset 0 nr 1048576 ram 1048576 extent compression(none) ... item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53 extent data disk byte 1104150528 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression(none) item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53 extent data disk byte 1103101952 nr 1048576 extent data offset 4096 nr 1044480 ram 1048576 extent compression(none) When send, inode 258 file offset 4096~1048576 (item 23) has a chance to clone_range, but because data_offset does not match inode 257 (item 16), it causes missed clone and can only transfer actual data. Improve the problem by judging whether the current data_offset has overlap with the file extent item, and if so, adjusting offset and extent_len so that we can clone correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
Anand Jain	8b4d1efc9e	btrfs: prop: open code btrfs_set_prop in inherit_prop When an inode inherits property from its parent, we call btrfs_set_prop(). btrfs_set_prop() does an elaborate checks, which is not required in the context of inheriting a property. Instead just open-code only the required items from btrfs_set_prop() and then call btrfs_setxattr() directly. So now the only user of btrfs_set_prop() is gone, (except for the wraper function btrfs_set_prop_trans()). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
Anand Jain	ae0bc86310	btrfs: drop unused parameter in mount_subvol @device_name in mount_subvol() is not used, drop it. Also see: `5bedc48a8f` ("btrfs: drop unused parameters from mount_subvol"). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
David Sterba	39e57f495b	btrfs: tree-checker: get fs_info from eb in check_inode_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
David Sterba	412a23127c	btrfs: tree-checker: get fs_info from eb in check_dev_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:35 +02:00
David Sterba	5617ed80cb	btrfs: tree-checker: get fs_info from eb in dev_item_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	d001e4a3fe	btrfs: tree-checker: get fs_info from eb in chunk_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	e2ccd361ef	btrfs: tree-checker: get fs_info from eb in check_leaf We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	0076bc89a7	btrfs: tree-checker: get fs_info from eb in check_leaf_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	ae2a19d8ad	btrfs: tree-checker: get fs_info from eb in check_extent_data_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	af60ce2b93	btrfs: tree-checker: get fs_info from eb in check_block_group_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:34 +02:00
David Sterba	4806bd886a	btrfs: tree-checker: get fs_info from eb in block_group_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
David Sterba	ce4252c049	btrfs: tree-checker: get fs_info from eb in check_dir_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
David Sterba	d98ced688f	btrfs: tree-checker: get fs_info from eb in dir_item_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
David Sterba	68128ce756	btrfs: tree-checker: get fs_info from eb in check_csum_item We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
David Sterba	1fd715ffdd	btrfs: tree-checker: get fs_info from eb in file_extent_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
David Sterba	86a6be3abe	btrfs: tree-checker: get fs_info from eb in generic_err We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:33 +02:00
Qu Wenruo	6bf9e4bd6a	btrfs: inode: Verify inode mode to avoid NULL pointer dereference [BUG] When accessing a file on a crafted image, btrfs can crash in block layer: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 136501067 P4D 136501067 PUD 124519067 PMD 0 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.0.0-rc8-default #252 RIP: 0010:end_bio_extent_readpage+0x144/0x700 Call Trace: <IRQ> blk_update_request+0x8f/0x350 blk_mq_end_request+0x1a/0x120 blk_done_softirq+0x99/0xc0 __do_softirq+0xc7/0x467 irq_exit+0xd1/0xe0 call_function_single_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x1e/0x170 [CAUSE] The crafted image has a tricky corruption, the INODE_ITEM has a different type against its parent dir: item 20 key (268 INODE_ITEM 0) itemoff 2808 itemsize 160 generation 13 transid 13 size 1048576 nbytes 1048576 block group 0 mode 121644 links 1 uid 0 gid 0 rdev 0 sequence 9 flags 0x0(none) This mode number 0120000 means it's a symlink. But the dir item think it's still a regular file: item 8 key (264 DIR_INDEX 5) itemoff 3707 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 item 40 key (264 DIR_ITEM 51821248) itemoff 1573 itemsize 32 location key (268 INODE_ITEM 0) type FILE transid 13 data_len 0 name_len 2 name: f4 For symlink, we don't set BTRFS_I(inode)->io_tree.ops and leave it empty, as symlink is only designed to have inlined extent, all handled by tree block read. Thus no need to trigger btrfs_submit_bio_hook() for inline file extent. However end_bio_extent_readpage() expects tree->ops populated, as it's reading regular data extent. This causes NULL pointer dereference. [FIX] This patch fixes the problem in two ways: - Verify inode mode against its dir item when looking up inode So in btrfs_lookup_dentry() if we find inode mode mismatch with dir item, we error out so that corrupted inode will not be accessed. - Verify inode mode when getting extent mapping Only regular file should have regular or preallocated extent. If we found regular/preallocated file extent for symlink or the rest, we error out before submitting the read bio. With this fix that crafted image can be rejected gracefully: BTRFS critical (device loop0): inode mode mismatch with dir: inode mode=0121644 btrfs type=7 dir type=1 Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202763 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	496245cac5	btrfs: tree-checker: Verify inode item There is a report in kernel bugzilla about mismatch file type in dir item and inode item. This inspires us to check inode mode in inode item. This patch will check the following members: - inode key objectid Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO. - inode key offset Should be 0 - inode item generation - inode item transid No newer than sb generation + 1. The +1 is for log tree. - inode item mode No unknown bits. No invalid S_IF* bit. NOTE: S_IFMT check is not enough, need to check every know type. - inode item nlink Dir should have no more link than 1. - inode item flags Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	80e46cf22b	btrfs: tree-checker: Enhance chunk checker to validate chunk profile Btrfs-progs already have a comprehensive type checker, to ensure there is only 0 (SINGLE profile) or 1 (DUP/RAID0/1/5/6/10) bit set for chunk profile bits. Do the same work for kernel. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202765 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	ab4ba2e133	btrfs: tree-checker: Verify dev item [BUG] For fuzzed image whose DEV_ITEM has invalid total_bytes as 0, then kernel will just panic: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 #PF error: [normal kernel read fault] PGD 800000022b2bd067 P4D 800000022b2bd067 PUD 22b2bc067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 1106 Comm: mount Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_verify_dev_extents+0x2a5/0x5a0 Call Trace: open_ctree+0x160d/0x2149 btrfs_mount_root+0x5b2/0x680 [CAUSE] If device extent verification finds a deivce with 0 total_bytes, then it assumes it's a seed dummy, then search for seed devices. But in this case, there is no seed device at all, causing NULL pointer. [FIX] Since this is caused by fuzzed image, let's go the tree-check way, just add a new verification for device item. Reported-by: Yoon Jungyeon <jungyeon@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202691 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	075cb3c78f	btrfs: tree-checker: Check chunk item at tree block read time Since we have btrfs_check_chunk_valid() in tree-checker, let's do chunk item verification in tree-checker too. Since the tree-checker is run at endio time, if one chunk leaf fails chunk verification, we can still retry the other copy, making btrfs more robust to fuzzed image as we may still get a good chunk item. Also since we have done chunk verification in tree block read time, skip the btrfs_check_chunk_valid() call in read_one_chunk() if we're reading chunk items from leaf. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	bf871c3b43	btrfs: tree-checker: Make btrfs_check_chunk_valid() return EUCLEAN instead of EIO To follow the standard behavior of tree-checker. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:32 +02:00
Qu Wenruo	f114024376	btrfs: tree-checker: Make chunk item checker messages more readable Old error message would be something like: BTRFS error (device dm-3): invalid chunk num_stipres: 0 New error message would be: Btrfs critical (device dm-3): corrupt superblock syschunk array: chunk_start=2097152, invalid chunk num_stripes: 0 Or Btrfs critical (device dm-3): corrupt leaf: root=3 block=8388608 slot=3 chunk_start=2097152, invalid chunk num_stripes: 0 And for certain error message, also output expected value. The error message levels are changed from error to critical. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
Qu Wenruo	82fc28fbed	btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export it By function, chunk item verification is more suitable to be done inside tree-checker. So move btrfs_check_chunk_valid() to tree-checker.c and export it. And since it's now moved to tree-checker, also add a better comment for what this function is doing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
David Sterba	90b1377daa	btrfs: qgroup: remove obsolete fs_info members The commit `fcebe4562d` ("Btrfs: rework qgroup accounting") reworked qgroups and added some new structures. Another rework of qgroup mechanics `e69bcee376` ("btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.") stopped using them and left uncleaned. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
David Sterba	e064d5e9f0	btrfs: get fs_info from eb in btrfs_verify_level_key We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
David Sterba	5ab12d1ff8	btrfs: get fs_info from eb in btree_read_extent_buffer_pages We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
David Sterba	d0d20b0f5c	btrfs: get fs_info from eb in read_node_slot We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:31 +02:00
David Sterba	e902baac65	btrfs: get fs_info from eb in btrfs_leaf_free_space We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	6a884d7d52	btrfs: get fs_info from eb in clean_tree_block We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	ed874f0db8	btrfs: get fs_info from eb in tree_mod_log_eb_copy We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	b0c9b3b05d	btrfs: get fs_info from eb in check_tree_block_fsid We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	bcdc428cfe	btrfs: get fs_info from eb in btrfs_exclude_logged_extents We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	8f881e8c18	btrfs: get fs_info from eb in leaf_data_end We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:30 +02:00
David Sterba	0ab0206328	btrfs: get fs_info from eb in write_one_eb We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:29 +02:00
David Sterba	20a1fbf97e	btrfs: get fs_info from eb in repair_eb_io_failure We can read fs_info from extent buffer and can drop it from the parameters. As all callsites are updated, add the btrfs_ prefix as the function is exported. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:29 +02:00
David Sterba	9df76fb544	btrfs: get fs_info from eb in lock_extent_buffer_for_io We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:29 +02:00
Phillip Potter	7d157c3d48	btrfs: use common file type conversion Deduplicate the btrfs file type conversion implementation - file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions decared in fs_types.h and implemented in fs_types.c Common implementation can be found via commit: `bbe7449e25` "fs: common implementation of file type" Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:29 +02:00
Goldwyn Rodrigues	7984ae52bb	btrfs: Perform locking/unlocking in btrfs_remap_file_range() Move code to make it more readable, so as locking and unlocking is done in the same function. The generic checks that are now performed in the locked section are unaffected. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:29 +02:00
Arnd Bergmann	290342f661	btrfs: use BUG() instead of BUG_ON(1) BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' # define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x))) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here max_chunk_size); ^~~~~~~~~~~~~~ include/linux/kernel.h:860:36: note: expanded from macro 'min' #define min(x, y) __careful_cmp(x, y, <) ^ include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp' __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op)) ^ include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once' typeof(y) unique_y = (y); \ ^ fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0) ^ fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning u64 max_chunk_size; ^ = 0 Change it to BUG() so clang can see that this code path can never continue. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:28 +02:00
David Sterba	247462a5ac	btrfs: move tree block wait and write helpers to tree-log The wrapper names better describe what's happening so they're not deleted though they're trivial, but at least moved closer to their place of use. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:28 +02:00
David Sterba	d4eb671a08	btrfs: remove stale definition of BUFFER_LRU_MAX Long time ago (2008), the extent buffers were organized in a LRU list and switched to rb-tree in `6af118ce51` ("Btrfs: Index extent buffers in an rbtree"). There was one stale macro definition left. Signed-off-by: David Sterba <dsterba@suse.com>	2019-04-29 19:02:28 +02:00

... 4 5 6 7 8 ...

58798 Commits