linux

Commit Graph

Author	SHA1	Message	Date
Dominik Brodowski	2dae024806	fs: add do_readlinkat() helper; remove internal call to sys_readlinkat() Using the do_readlinkat() helper removes an in-kernel call to the sys_readlinkat() syscall. This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:34 +02:00
Steve French	07108d0e7c	cifs: Add minor debug message during negprot Check for unknown security mode flags during negotiate protocol if debugging enabled. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-04-02 13:11:15 -05:00
Steve French	7ea884c77e	smb3: Fix root directory when server returns inode number of zero Some servers return inode number zero for the root directory, which causes ls to display incorrect data (missing "." and ".."). If the server returns zero for the inode number of the root directory, fake an inode number for it. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org>	2018-04-02 13:11:03 -05:00
Steve French	6c4ba31133	cifs: fix sparse warning on previous patch in a few printks Signed-off-by: Steve French <smfrench@gmail.com> CC: Ronnie Sahlberg <lsahlber@redhat.com>	2018-04-02 13:10:40 -05:00
Ronnie Sahlberg	93012bf984	cifs: add server->vals->header_preamble_size This variable is set to 4 for all protocol versions and replaces the hardcoded constant 4 throughought the code. This will later be updated to reflect whether a response packet has a 4 byte length preamble or not once we start removing this field from the SMB2+ dialects. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-04-02 13:09:44 -05:00
Martin Brandenburg	dbcb5e7fc4	orangefs: open code short single-use functions Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-04-02 08:10:17 -04:00
Colin Ian King	81e3d0253f	orangefs: replace vmalloc and memset with vzalloc Use vzalloc instead of the vmalloc, memset combo Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-04-02 08:10:17 -04:00
David Reynolds	c2676ef801	orangefs: bug fix for a race condition when getting a slot When a slot becomes free, call wake_up_locked regardless of the number of slots available. Without this patch, wake_up_locked is only called when going from no free slots to one. This means that there is a chance a waiting task will not be woken up. In many cases, the system will bounce between 0 and 1 free slots, and the waiting tasks will be woken up. But if there is still a waiting task and another slot becomes available before the number of free slots reaches zero, that waiting task may never be woken up since the number of free slots may never reach zero again. The bug behavior is easy to reproduce with the following script, where /mnt/orangefs is an OrangeFS file system. for i in {1..100}; do for j in {1..20}; do dd if=/dev/zero of=/mnt/orangefs/tmp$j bs=32768 count=32 & done wait done Signed-off-by: David Reynolds <david@omnibond.com> Reviewed-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2018-04-02 08:10:17 -04:00
Luis Henriques	9122eed528	ceph: quota: report root dir quota usage in statfs This commit changes statfs default behaviour when reporting usage statistics. Instead of using the overall filesystem usage, statfs now reports the quota for the filesystem root, if ceph.quota.max_bytes has been set for this inode. If quota hasn't been set, it falls back to the old statfs behaviour. A new mount option is also added ('noquotadf') to disable this behaviour. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:53 +02:00
Luis Henriques	d557c48db7	ceph: quota: add counter for snaprealms with quota By keeping a counter with the number of snaprealms that have quota set allows to optimize the functions that need to walk throught the realms hierarchy looking for quotas. Thus, if this counter is zero it's safe to assume that there are no realms with quota. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:53 +02:00
Luis Henriques	e3161f17d9	ceph: quota: cache inode pointer in ceph_snap_realm Keep a pointer to the inode in struct ceph_snap_realm. This allows to optimize functions that walk the realms hierarchy (e.g. in quotas). Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:53 +02:00
Yan, Zheng	0eb6bbe4d9	ceph: fix root quota realm check Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:52 +02:00
Yan, Zheng	2596366907	ceph: don't check quota for snap inode snap inode's i_snap_realm is not pointing to ceph_snap_realm. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:52 +02:00
Luis Henriques	1ab302a0cb	ceph: quota: update MDS when max_bytes is approaching When we're reaching the ceph.quota.max_bytes limit, i.e., when writing more than 1/16th of the space left in a quota realm, update the MDS with the new file size. This mirrors the fuse-client approach with commit 122c50315ed1 ("client: Inform mds file size when approaching quota limit"), in the ceph git tree. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:52 +02:00
Luis Henriques	2b83845f8b	ceph: quota: support for ceph.quota.max_bytes Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:52 +02:00
Luis Henriques	cafe21a4fb	ceph: quota: don't allow cross-quota renames This patch changes ceph_rename so that -EXDEV is returned if an attempt is made to mv a file between two different dir trees with different quotas setup. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:52 +02:00
Luis Henriques	b7a2921765	ceph: quota: support for ceph.quota.max_files This patch adds support for the max_files quota. It hooks into all the ceph functions that add new filesystem objects that need to be checked against the quota limits. When these limits are hit, -EDQUOT is returned. Note that we're not checking quotas on ceph_link(). ceph_link doesn't really create a new inode, and since the MDS doesn't update the directory statistics when a new (hard) link is created (only with symlinks), they are not accounted as a new file. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:51 +02:00
Luis Henriques	fb18a57568	ceph: quota: add initial infrastructure to support cephfs quotas This patch adds the infrastructure required to support cephfs quotas as it is currently implemented in the ceph fuse client. Cephfs quotas can be set on any directory, and can restrict the number of bytes or the number of files stored beneath that point in the directory hierarchy. Quotas are set using the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', and can be removed by setting these attributes to '0'. Link: http://tracker.ceph.com/issues/22372 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 11:17:51 +02:00
Yan, Zheng	7aac453a03	ceph: rename function drop_leases() to a more descriptive name Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:49 +02:00
Chengguang Xu	50c55aeca2	ceph: fix invalid point dereference for error case in mdsc destroy 1. set fsc->mdsc after successfully allocate all necessary memory in mdsc init. 2. if fsc->mdsc is NULL, just skip destroy operation in mdsc destroy. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:49 +02:00
Chengguang Xu	98cfda8104	ceph: return proper bool type to caller instead of pointer Change to return true/false only for bool type return code. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:49 +02:00
Chengguang Xu	bb48bd4dc4	ceph: optimize memory usage In current code, regular file and directory use same struct ceph_file_info to store fs specific data so the struct has to include some fields which are only used for directory (e.g., readdir related info), when having plenty of regular files, it will lead to memory waste. This patch introduces dedicated ceph_dir_file_info cache for readdir related thins. So that regular file does not include those unused fields anymore. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:49 +02:00
Chengguang Xu	47474d0b01	ceph: optimize mds session register Do memory allocation first, so that avoid unnecessary initialization of newly allocated session in error case. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:48 +02:00
Chengguang Xu	57a35dfb52	libceph, ceph: add __init attribution to init funcitons Add __init attribution to the functions which are called only once during initiating/registering operations and deleting unnecessary symbol exports. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:48 +02:00
Chengguang Xu	51b10f3fe4	ceph: filter out used flags when printing unused open flags Filter out used access mode flags when printing unused open flags. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:48 +02:00
Yan, Zheng	1582af2eaa	ceph: don't wait on writeback when there is no more dirty pages In sync mode, writepages() needs to write all dirty pages. But it can only write dirty pages associated with the oldest snapc. To write dirty pages associated with next snapc, it needs to wait until current writes complete. If there is no more dirty pages, writepages() should not wait on writeback. Otherwise, dirty page writeback becomes very slow. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:48 +02:00
Yan, Zheng	af9cc401ce	ceph: invalidate pages that beyond EOF in ceph_writepages_start() Dirty pages can be associated with different capsnap. Different capsnap may have different EOF value. So invalidating dirty pages according to the largest EOF value is wrong. Dirty pages beyond EOF, but associated with other capsnap, do not get invalidated. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:47 +02:00
Chengguang Xu	bc4b5ad3a6	ceph: mark the cap cache as unreclaimable Releasing cap is affected by many factors (e.g., avail_count/reserve_count/min_count) and min_count could be specified high volume in client mount option. Hence it's better to mark cap cache as unreclaimable in case of non-trivial discrepancies between memory shown as reclaimable and what is actually reclaimed. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:47 +02:00
Chengguang Xu	73737682e0	ceph: change variable name to follow common rule Variable name ci is mostly used for ceph_inode_info. Variable name fi is mostly used for ceph_file_info. Variable name cf is mostly used for ceph_cap_flush. Change variable name to follow above common rules in case of confusing. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:47 +02:00
Chengguang Xu	79cd674aed	ceph: optimizing cap reservation When caps_avail_count is in a low level, most newly trimmed caps will probably go into ->caps_list and caps_avail_count will be increased. Hence after trimming, should recheck caps_avail_count to effectly reuse newly trimmed caps. Also, when releasing unnecessary caps follow the same rule of ceph_put_cap. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:47 +02:00
Chengguang Xu	b517c1d87f	ceph: release unreserved caps if having enough available caps When unreserving caps check if there is too mamy available caps in the ->caps_list, if so release unreserved caps. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:47 +02:00
Chengguang Xu	e327ce0685	ceph: optimizing cap allocation When setting high volume of caps_min_count or having many unreserved caps, unused caps may always keep in the ->caps_list even can't get new cap from kmem_cache_alloc because lack of maximum limitation of caps_avail_count. Hence reuse caps in ->caps_list if available, it's maybe better than setting max limitation of caps_avail_count and releasing unused caps when reaching the limit. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:46 +02:00
Chengguang Xu	b884014a91	ceph: adding protection for showing cap reservation info Adding spinlock protection during getting cap reservation ralated fields so that the numbers match below BUG_ON condition in the code. BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count + mdsc->caps_reserve_count + mdsc->caps_avail_count); Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:46 +02:00
Chengguang Xu	4d8969af28	ceph: use seq_show_option for string type options Using seq_show_option to replace seq_printf for string type options. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:45 +02:00
Chengguang Xu	11e1478df9	libceph, ceph: change permission for readonly debugfs entries Remove write permission for debugfs entries which only have readonly function. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:45 +02:00
Chengguang Xu	7ae7a828d9	ceph: keep consistent semantic in fscache related option combination When specifying multiple fscache related options, the result isn't always the same as option order, this fix will keep strict consistent meaning by order. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:45 +02:00
Chengguang Xu	4c069a5821	ceph: add newline to end of debug message format Some of dout format do not include newline in the end, fix for the files which are in fs/ceph and net/ceph directories, and changing printk to dout for printing debug info in super.c Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:44 +02:00
Ilya Dryomov	08c1ac508b	libceph, ceph: move ceph_calc_file_object_mapping() to striper.c ceph_calc_file_object_mapping() has nothing to do with osdmaps. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-04-02 10:12:43 +02:00
Ilya Dryomov	dccbf08005	libceph, ceph: change ceph_calc_file_object_mapping() signature - make it void - xlen (object extent length) out parameter should be u32 because only a single stripe unit is mapped at a time Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2018-04-02 10:12:38 +02:00
Theodore Ts'o	e40ff21389	ext4: force revalidation of directory pointer after seekdir(2) A malicious user could force the directory pointer to be in an invalid spot by using seekdir(2). Use the mechanism we already have to notice if the directory has changed since the last time we called ext4_readdir() to force a revalidation of the pointer. Reported-by: syzbot+1236ce66f79263e8a862@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-04-01 23:21:03 -04:00
Long Li	21a4e14aae	cifs: smbd: disconnect transport on RDMA errors On RDMA errors, transport should disconnect the RDMA CM connection. This will notify the upper layer, and it will attempt transport reconnect. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:40 -05:00
Long Li	48f238a79f	cifs: smbd: avoid reconnect lockup During transport reconnect, other processes may have registered memory and blocked on transport. This creates a deadlock situation because the transport resources can't be freed, and reconnect is blocked. Fix this by returning to upper layer on timeout. Before returning, transport status is set to reconnecting so other processes will release memory registration resources. Upper layer will retry the reconnect. This is not in fast I/O path so setting the timeout to 5 seconds. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:40 -05:00
Steve French	2a18287b54	Don't log confusing message on reconnect by default Change the following message (which can occur on reconnect) from a warning to an FYI message. It is confusing to users. [58360.523634] CIFS VFS: Free previous auth_key.response = 00000000a91cdc84 By default this message won't show up on reconnect unless the user bumps up the log level to include FYI messages. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2018-04-01 20:24:40 -05:00
Steve French	2564f2ff83	Don't log expected error on DFS referral request STATUS_FS_DRIVER_REQUIRED is expected when DFS is not turned on on the server. Do not log it on DFS referral response. It clutters the dmesg log unnecessarily at mount time. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>	2018-04-01 20:24:40 -05:00
Phillip Potter	31cd106bb1	fs: cifs: Replace _free_xid call in cifs_root_iget function Modify end of cifs_root_iget function in fs/cifs/inode.c to call free_xid(xid) instead of _free_xid(xid), thereby allowing debug notification of this action when enabled. Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Signed-off-by: Steve French <smfrench@gmail.com>	2018-04-01 20:24:40 -05:00
Steve French	d68f353fc9	SMB3.1.1 dialect is no longer experimental SMB3.1.1 is a very important dialect, with much improved security. We can remove the ExPERIMENTAL comments about it. It is widely supported by servers. Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:40 -05:00
Steve French	6188f28bf6	Tree connect for SMB3.1.1 must be signed for non-encrypted shares SMB3.1.1 tree connect was only being signed when signing was mandatory but needs to always be signed (for non-guest users). See MS-SMB2 section 3.2.4.1.1 Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:40 -05:00
Ronnie Sahlberg	262916bc69	fix smb3-encryption breakage when CONFIG_DEBUG_SG=y We can not use the standard sg_set_buf() fucntion since when CONFIG_DEBUG_SG=y this adds a check that will BUG_ON for cifs.ko when we pass it an object from the stack. Create a new wrapper smb2_sg_set_buf() which avoids doing that particular check and use it for smb3 encryption instead. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:40 -05:00
Gustavo A. R. Silva	70e80655f5	CIFS: fix sha512 check in cifs_crypto_secmech_release It seems this is a copy-paste error and that the proper variable to use in this particular case is _sha512_ instead of _md5_. Addresses-Coverity-ID: 1465358 ("Copy-paste error") Fixes: 1c6614d229e7 ("CIFS: add sha512 secmech") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>	2018-04-01 20:24:40 -05:00
Aurelien Aptel	8bd68c6e47	CIFS: implement v3.11 preauth integrity SMB3.11 clients must implement pre-authentification integrity. * new mechanism to certify requests/responses happening before Tree Connect. * supersedes VALIDATE_NEGOTIATE * fixes signing for SMB3.11 Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-04-01 20:24:40 -05:00
Aurelien Aptel	5fcd7f3f96	CIFS: add sha512 secmech * prepare for SMB3.11 pre-auth integrity * enable sha512 when SMB311 is enabled in Kconfig * add sha512 as a soft dependency Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-04-01 20:24:39 -05:00
Aurelien Aptel	82fb82be05	CIFS: refactor crypto shash/sdesc allocation&free shash and sdesc and always allocated and freed together. * abstract this in new functions cifs_alloc_hash() and cifs_free_hash(). * make smb2/3 crypto allocation independent from each other. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:39 -05:00
Ronnie Sahlberg	b7a73c84eb	cifs: fix memory leak in SMB2_open() Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org>	2018-04-01 20:24:39 -05:00
Colin Ian King	ac65cb6203	CIFS: SMBD: fix spelling mistake: "faield" and "legnth" Trivial fix to spelling mistake in log_rdma_send and log_rdma_mr message text. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>	2018-04-01 20:24:39 -05:00
David S. Miller	c0b458a946	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-04-01 19:49:34 -04:00
Theodore Ts'o	54dd0e0a1b	ext4: add extra checks to ext4_xattr_block_get() Add explicit checks in ext4_xattr_block_get() just in case the e_value_offs and e_value_size fields in the the xattr block are corrupted in memory after the buffer_verified bit is set on the xattr block. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2018-03-30 20:04:11 -04:00
David Sterba	57599c7e77	btrfs: lift errors from add_extent_changeset to the callers The missing error handling in add_extent_changeset was hidden, so make it at least visible in the callers. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:03:25 +02:00
Liu Bo	f50f435390	Btrfs: print error messages when failing to read trees When mount fails to read trees like fs tree, checksum tree, extent tree, etc, there is not enough information about where went wrong. With this, messages like "BTRFS warning (device sdf): failed to read root (objectid=7): -5" would help us a bit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:02:14 +02:00
David Sterba	38e82de8cc	btrfs: user proper type for btrfs_mask_flags flags All users pass a local unsigned int and not the __uXX types that are supposed to be used for userspace interfaces. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:07 +02:00
David Sterba	7e79cb86be	btrfs: split dev-replace locking helpers for read and write The current calls are unclear in what way btrfs_dev_replace_lock takes the locks, so drop the argument, split the helpers and use similar naming as for read and write locks. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:07 +02:00
David Sterba	e7ab0af6c3	btrfs: remove stale comments about fs_mutex The fs_mutex has been killed in 2008, `a213501153` ("Btrfs: Replace the big fs_mutex with a collection of other locks"), still remembered in some comments. We don't have any extra needs for locking in the ACL handlers. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:07 +02:00
David Sterba	88c14590cd	btrfs: use RCU in btrfs_show_devname for device list traversal The show_devname callback is used to print device name in /proc/self/mounts, we need to traverse the device list consistently and read the name that's copied to a seq buffer so we don't need further locking. If the first device is being deleted at the same time, the RCU will allow us to read the device name, though it will become stale right after the RCU protection ends. This is unavoidable and the user can expect that the device will disappear from the filesystem's list at some point. The device_list_mutex was pretty heavy as it is used eg. for writing superblock and a few other IO related contexts. This can stall any application that reads the proc file for no reason. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
David Sterba	d1980131ca	btrfs: update barrier in should_cow_block Once there was a simple int force_cow that was used with the plain barriers, and then converted to a bit, so we should use the appropriate barrier helper. Other variables in the complex if condition do not depend on a barrier, so we should be fine in case the atomic barrier becomes a no-op. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
David Sterba	a32bf9a302	btrfs: use lockdep_assert_held for mutexes Using lockdep_assert_held is preferred, replace mutex_is_locked. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
David Sterba	a4666e688f	btrfs: use lockdep_assert_held for spinlocks Using lockdep_assert_held is preferred, replace assert_spin_locked. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
Qu Wenruo	581c176041	btrfs: Validate child tree block's level and first key We have several reports about node pointer points to incorrect child tree blocks, which could have even wrong owner and level but still with valid generation and checksum. Although btrfs check could handle it and print error message like: leaf parent key incorrect 60670574592 Kernel doesn't have enough check on this type of corruption correctly. At least add such check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block. The new @level check is mandatory and all call sites are already modified to extract expected level from its call chain. While @first_key is optional, the following call sites are skipping such check: 1) Root node/leaf As ROOT_ITEM doesn't contain the first key, skip @first_key check. 2) Direct backref Only parent bytenr and level is known and we need to resolve the key all by ourselves, skip @first_key check. Another note of this verification is, it needs extra info from nodeptr or ROOT_ITEM, so it can't fit into current tree-checker framework, which is limited to node/leaf boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:06 +02:00
Qu Wenruo	3c0efdf03b	btrfs: tests/qgroup: Fix wrong tree backref level The extent tree of the test fs is like the following: BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919 item 0 key (4096 168 4096) itemoff 3944 itemsize 51 extent refs 1 gen 1 flags 2 tree block key (68719476736 0 0) level 1 ^^^^^^^ ref#0: tree block backref root 5 And it's using an empty tree for fs tree, so there is no way that its level can be 1. For REAL (created by mkfs) fs tree backref with no skinny metadata, the result should look like: item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51 refs 1 gen 4 flags TREE_BLOCK tree block key (256 INODE_ITEM 0) level 0 ^^^^^^^ tree block backref root 5 Fix the level to 0, so it won't break later tree level checker. Fixes: `faa2dbf004` ("Btrfs: add sanity tests for new qgroup accounting code") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
Filipe Manana	8434ec46c6	Btrfs: fix copy_items() return value when logging an inode When logging an inode, at tree-log.c:copy_items(), if we call btrfs_next_leaf() at the loop which checks for the need to log holes, we need to make sure copy_items() returns the value 1 to its caller and not 0 (on success). This is because the path the caller passed was released and is now different from what is was before, and the caller expects a return value of 0 to mean both success and that the path has not changed, while a return value of 1 means both success and signals the caller that it can not reuse the path, it has to perform another tree search. Even though this is a case that should not be triggered on normal circumstances or very rare at least, its consequences can be very unpredictable (especially when replaying a log tree). Fixes: `16e7549f04` ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
Filipe Manana	4ee3fad34a	Btrfs: fix fsync after hole punching when using no-holes feature When we have the no-holes mode enabled and fsync a file after punching a hole in it, we can end up not logging the whole hole range in the log tree. This happens if the file has extent items that span more than one leaf and we punch a hole that covers a range that starts in a leaf but does not go beyond the offset of the first extent in the next leaf. Example: $ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb $ mount /dev/sdb /mnt $ for ((i = 0; i <= 831; i++)); do offset=$((i * 2 * 256 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \ /mnt/foobar >/dev/null done $ sync # We now have 2 leafs in our filesystem fs tree, the first leaf has an # item corresponding the extent at file offset 216530944 and the second # leaf has a first item corresponding to the extent at offset 217055232. # Now we punch a hole that partially covers the range of the extent at # offset 216530944 but does go beyond the offset 217055232. $ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar <power fail> # mount to replay the log $ mount /dev/sdb /mnt # Before this patch, only the subrange [216658016, 216662016[ (length of # 4000 bytes) was logged, leaving an incorrect file layout after log # replay. Fix this by checking if there is a hole between the last extent item that we processed and the first extent item in the next leaf, and if there is one, log an explicit hole extent item. Fixes: `16e7549f04` ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
David Sterba	a1840b5023	btrfs: use helper to set ulist aux from a qgroup We have a nice helper to do proper casting of a qgroup to a ulist aux value. And several places that could make use of it. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
Qu Wenruo	0b78877a2a	Revert "btrfs: qgroups: Retry after commit on getting EDQUOT" This reverts commit `48a89bc4f2`. The idea to commit transaction and free some space after hitting qgroup limit is good, although the problem is it can easily cause deadlocks. One deadlock example is caused by trying to flush data while still holding it: Call Trace: __schedule+0x49d/0x10f0 schedule+0xc6/0x290 schedule_timeout+0x187/0x1c0 wait_for_completion+0x204/0x3a0 btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs] qgroup_reserve+0x913/0xa10 [btrfs] btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs] btrfs_check_data_free_space+0x96/0xd0 [btrfs] __btrfs_buffered_write+0x3ac/0xd40 [btrfs] btrfs_file_write_iter+0x62a/0xba0 [btrfs] __vfs_write+0x320/0x430 vfs_write+0x107/0x270 SyS_write+0xbf/0x150 do_syscall_64+0x1b0/0x3d0 entry_SYSCALL64_slow_path+0x25/0x25 Another can be caused by trying to commit one transaction while nesting with trans handle held by ourselves: btrfs_start_transaction() \|- btrfs_qgroup_reserve_meta_pertrans() \|- qgroup_reserve() \|- btrfs_join_transaction() \|- btrfs_commit_transaction() The retry is causing more problems than exppected when limit is enabled. At least a graceful EDQUOT is way better than deadlock. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
Qu Wenruo	4ee0d8832c	btrfs: qgroup: Update trace events for metadata reservation Now trace_qgroup_meta_reserve() will have extra type parameter. And introduce two new trace events: 1) trace_qgroup_meta_free_all_pertrans() For btrfs_qgroup_free_meta_all_pertrans() 2) trace_qgroup_meta_convert() For btrfs_qgroup_convert_reserved_meta() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:05 +02:00
Qu Wenruo	8287475a20	btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space For quota disabled->enable case, it's possible that at reservation time quota was not enabled so no bytes were really reserved, while at release time, quota was enabled so we will try to release some bytes we didn't really own. Such situation can cause metadata reserveation underflow, for both types, also less possible for per-trans type since quota enable will commit transaction. To address this, record qgroup meta reserved bytes into root::qgroup_meta_rsv_pertrans and ::prealloc. So at releasing time we won't free any bytes we didn't reserve. For DATA, it's already handled by io_tree, so nothing needs to be done there. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:04 +02:00
Qu Wenruo	4f5427ccce	btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item Quite similar for delalloc, some modification to delayed-inode and delayed-item reservation. Also needs extra parameter for release case to distinguish normal release and error release. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 02:01:03 +02:00
Theodore Ts'o	9496005d6c	ext4: add bounds checking to ext4_xattr_find_entry() Add some paranoia checks to make sure we don't stray beyond the end of the valid memory region containing ext4 xattr entries while we are scanning for a match. Also rename the function to xattr_find_entry() since it is static and thus only used in fs/ext4/xattr.c Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2018-03-30 20:00:56 -04:00
Qu Wenruo	43b18595d6	btrfs: qgroup: Use separate meta reservation type for delalloc Before this patch, btrfs qgroup is mixing per-transcation meta rsv with preallocated meta rsv, making it quite easy to underflow qgroup meta reservation. Since we have the new qgroup meta rsv types, apply it to delalloc reservation. Now for delalloc, most of its reserved space will use META_PREALLOC qgroup rsv type. And for callers reducing outstanding extent like btrfs_finish_ordered_io(), they will convert corresponding META_PREALLOC reservation to META_PERTRANS. This is mainly due to the fact that current qgroup numbers will only be updated in btrfs_commit_transaction(), that's to say if we don't keep such placeholder reservation, we can exceed qgroup limitation. And for callers freeing outstanding extent in error handler, we will just free META_PREALLOC bytes. This behavior makes callers of btrfs_qgroup_release_meta() or btrfs_qgroup_convert_meta() to be aware of which type they are. So in this patch, btrfs_delalloc_release_metadata() and its callers get an extra parameter to info qgroup to do correct meta convert/release. The good news is, even we use the wrong type (convert or free), it won't cause obvious bug, as prealloc type is always in good shape, and the type only affects how per-trans meta is increased or not. So the worst case will be at most metadata limitation can be sometimes exceeded (no convert at all) or metadata limitation is reached too soon (no free at all). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:14 +02:00
Qu Wenruo	64cfaef636	btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS For meta_prealloc reservation users, after btrfs_join_transaction() caller will modify tree so part (or even all) meta_prealloc reservation should be converted to meta_pertrans until transaction commit time. This patch introduces a new function, btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC reservation user. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:14 +02:00
Qu Wenruo	e1211d0e89	btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup Since qgroup has seperate metadata reservation types now, we can completely get rid of the old root->qgroup_meta_rsv, which mostly acts as current META_PERTRANS reservation type. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:14 +02:00
Qu Wenruo	733e03a0b2	btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans Btrfs uses 2 different methods to reseve metadata qgroup space. 1) Reserve at btrfs_start_transaction() time This is quite straightforward, caller will use the trans handler allocated to modify b-trees. In this case, reserved metadata should be kept until qgroup numbers are updated. 2) Reserve by using block_rsv first, and later btrfs_join_transaction() This is more complicated, caller will reserve space using block_rsv first, and then later call btrfs_join_transaction() to get a trans handle. In this case, before we modify trees, the reserved space can be modified on demand, and after btrfs_join_transaction(), such reserved space should also be kept until qgroup numbers are updated. Since these two types behave differently, split the original "META" reservation type into 2 sub-types: META_PERTRANS: For above case 1) META_PREALLOC: For reservations that happened before btrfs_join_transaction() of case 2) NOTE: This patch will only convert existing qgroup meta reservation callers according to its situation, not ensuring all callers are at correct timing. Such fix will be added in later patches. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:14 +02:00
Qu Wenruo	5c40507ffb	btrfs: qgroup: Cleanup the remaining old reservation counters So qgroup is switched to new separate types reservation system. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Qu Wenruo	64ee4e751a	btrfs: qgroup: Update trace events to use new separate rsv types Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Qu Wenruo	429d6275d5	btrfs: qgroup: Fix wrong qgroup reservation update for relationship modification When modifying qgroup relationship, for qgroup which only owns exclusive extents, we will go through quick update path. In this path, we will add/subtract exclusive and reference number for parent qgroup, since the source (child) qgroup only has exclusive extents, destination (parent) qgroup will also own or lose those extents exclusively. The same should be the same for reservation, since later reservation adding/releasing will also affect parent qgroup, without the reservation carried from child, parent will underflow reservation or have dead reservation which will never be freed. However original code doesn't do the same thing for reservation. It handles qgroup reservation quite differently: It removes qgroup reservation, as it's allocating space from the reserved qgroup for relationship adding. But does nothing for qgroup reservation if we're removing a qgroup relationship. According to the original code, it looks just like because we're adding qgroup->rfer, the code assumes we're writing new data, so it's follows the normal write routine, by reducing qgroup->reserved and adding qgroup->rfer/excl. This old behavior is wrong, and should be fixed to follow the same excl/rfer behavior. Just fix it by using the correct behavior described above. Fixes: `31193213f1` ("Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Qu Wenruo	dba213242f	btrfs: qgroup: Make qgroup_reserve and its callers to use separate reservation type Since most callers of qgroup_reserve() are already defined by type, converting qgroup_reserve() is quite an easy work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Qu Wenruo	f59c0347d4	btrfs: qgroup: Introduce helpers to update and access new qgroup rsv Introduce helpers to: 1) Get total reserved space For limit calculation 2) Add/release reserved space for given type With underflow detection and warning 3) Add/release reserved space according to child qgroup Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Qu Wenruo	d4e5c92055	btrfs: qgroup: Skeleton to support separate qgroup reservation type Instead of single qgroup->reserved, use a new structure btrfs_qgroup_rsv to store different types of reservation. This patch only updates the header needed to compile. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:13 +02:00
Omar Sandoval	0a0d4415e3	Btrfs: delete dead code in btrfs_orphan_add() btrfs_orphan_add() has had this case commented out since it was first introduced in commit `d68fc57b7e` ("Btrfs: Metadata reservation for orphan inodes"). Most of the orphan cleanup code has been rewritten since then, so it's safe to say that this code isn't needed. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ switch to bool ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Misono, Tomohiro	4408ea7c5f	btrfs: ctree.h: Fix wrong comment position about csum size Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Jeff Mahoney	75cb379d26	btrfs: defer adding raid type kobject until after chunk relocation Any time the first block group of a new type is created, we add a new kobject to sysfs to hold the attributes for that type. Kobject-internal allocations always use GFP_KERNEL, making them prone to fs-reclaim races. While it appears as if this can occur any time a block group is created, the only times the first block group of a new type can be created in memory is at mount and when we create the first new block group during raid conversion. This patch adds a new list to track pending kobject additions and then handles them after we do chunk relocation. Between relocating the target chunk (or forcing allocation of a new chunk in the case of data) and removing the old chunk, we're in a safe place for fs-reclaim to occur. We're holding the volume mutex, which is already held across page faults, and the delete_unused_bgs_mutex, which will only stall the cleaner thread. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Jeff Mahoney	dc2d3005d2	btrfs: remove dead create_space_info calls Since commit `2be12ef79` (btrfs: Separate space_info create/update), we've separated out the creation and updating of the space info structures. That commit was a straightforward refactoring of the two parts of update_space_info, but we can go a step further. Since commits `c59021f84` (Btrfs: fix OOPS of empty filesystem after balance) and `b742bb82f` (Btrfs: Link block groups of different raid types), we know that the space_info structures will be created at mount and there will only ever be, at most, three of them. This patch cleans out the create_space_info calls after __find_space_info returns NULL since __find_space_info can't return NULL. The initial cause for reviewing this was the kobject_add calls from create_space_info occuring in sites where fs-reclaim wasn't allowed. Now we are certain they occur only early in the mount process and are safe. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Liu Bo	580c6efaf9	Btrfs: replace: cache rbio when rebuild data on missing device Rebuild on missing device is as same as recover, after it's done, rbio has data which is consistent with on-disk data, so it can be cached to avoid further reads. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:12 +02:00
Jeff Mahoney	8a5a916d9a	btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers While running btrfs/011, I hit the following lockdep splat. This is the important bit: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] The percpu_counter_init call in btrfs_alloc_subvolume_writers uses GFP_KERNEL, which we can't do during transaction commit. This switches it to GFP_NOFS. ======================================================== WARNING: possible irq lock inversion dependency detected 4.12.14-kvmsmall #8 Tainted: G W -------------------------------------------------------- kswapd0/50 just changed the state of lock: (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] but this lock took another, RECLAIM_FS-unsafe lock in the past: (pcpu_alloc_mutex){+.+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_alloc_mutex); local_irq_disable(); lock(&delayed_node->mutex); lock(&found->groups_sem); <Interrupt> lock(&delayed_node->mutex); * DEADLOCK * 2 locks held by kswapd0/50: #0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0 #1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50 the shortest dependencies between 2nd lock and 1st lock: -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 RECLAIM_FS-ON-W at: __kmalloc+0x47/0x310 pcpu_extend_area_map+0x2b/0xc0 pcpu_alloc+0x3ec/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 __kmem_cache_create+0x1bf/0x390 create_cache+0xba/0x1b0 kmem_cache_create+0x1f8/0x2b0 ksm_init+0x6f/0x19d do_one_initcall+0x50/0x1b0 kernel_init_freeable+0x201/0x289 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 setup_cpu_cache+0x2f/0x1f0 __kmem_cache_create+0x1bf/0x390 create_boot_cache+0x8b/0xb1 kmem_cache_init+0xa1/0x19e start_kernel+0x270/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 } ... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0 ... acquired at: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] transaction_kthread+0x176/0x1b0 [btrfs] kthread+0x102/0x140 ret_from_fork+0x3a/0x50 -> (&fs_info->commit_root_sem){++++..} ops: 1566382 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs] ... acquired at: cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] btrfs_create_tree+0xbb/0x2a0 [btrfs] btrfs_create_uuid_tree+0x37/0x140 [btrfs] open_ctree+0x23c0/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&found->groups_sem){++++..} ops: 2134587 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 INITIAL USE at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs] ... acquired at: find_free_extent+0xcb4/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] __btrfs_cow_block+0x110/0x5b0 [btrfs] btrfs_cow_block+0xd7/0x290 [btrfs] btrfs_search_slot+0x1f6/0x960 [btrfs] btrfs_lookup_inode+0x2a/0x90 [btrfs] __btrfs_update_delayed_inode+0x65/0x210 [btrfs] btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs] btrfs_evict_inode+0x3fe/0x6a0 [btrfs] evict+0xc4/0x190 __dentry_kill+0xbf/0x170 dput+0x2ae/0x2f0 SyS_rename+0x2a6/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&delayed_node->mutex){+.+.-.} ops: 5580204 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 IN-RECLAIM_FS-W at: __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs] ... acquired at: __lock_acquire+0x264/0x11c0 lock_acquire+0xbd/0x1e0 __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 stack backtrace: CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x78/0xb7 print_irq_inversion_bug.part.38+0x19f/0x1aa check_usage_forwards+0x102/0x120 ? ret_from_fork+0x3a/0x50 ? check_usage_backwards+0x110/0x110 mark_lock+0x16c/0x270 __lock_acquire+0x264/0x11c0 ? pagevec_lookup_entries+0x1a/0x30 ? truncate_inode_pages_range+0x2b3/0x7f0 lock_acquire+0xbd/0x1e0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] __mutex_lock+0x4e/0x8c0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ? mem_cgroup_shrink_node+0x2c0/0x2c0 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x3a/0x50 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Anand Jain	8ba0ae7821	btrfs: drop optimal argument from find_live_mirror() Drop optimal argument from the function find_live_mirror() as we can deduce it in the function itself. Also rename optimal to preferred_mirror. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Anand Jain	99f92a7c1e	btrfs: drop num argument from find_live_mirror() Obtain the stripes info from the map directly and so no need to pass it as an argument. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Nikolay Borisov	0a1e458a1e	btrfs: Drop fs_info parameter from __btrfs_run_delayed_refs It's provided by transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Nikolay Borisov	5ead2dd02c	btrfs: Drop fs_info parameter from btrfs_finish_extent_commit It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:11 +02:00
Nikolay Borisov	460fb20a4b	btrfs: Drop fs_info parameter from btrfs_qgroup_account_extents It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:10 +02:00
Nikolay Borisov	c79a70b133	btrfs: drop fs_info parameter from btrfs_run_delayed_refs It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:10 +02:00
Nikolay Borisov	39d7d09dc2	btrfs: Remove unused flush var in shrink_delalloc Added by `08e007d2e5` ("Btrfs: improve the noflush reservation") and made redundant by `17024ad0a0` ("Btrfs: fix early ENOSPC due to delalloc"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:10 +02:00
Nikolay Borisov	101d2dc0b2	btrfs: Remove unused extent_root var from caching_thread Added by `b4570aa994` ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled.") and obsoleted by `2ff7e61e0d` ("btrfs: take an fs_info directly when the root is not used otherwise"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:41:10 +02:00
Nikolay Borisov	338dae1ae6	btrfs: remove max_active var from open_ctree Introduced by `5cdc7ad337` ("btrfs: Replace fs_info->workers with btrfs_workqueue.") but obsoleted by `2a4581983f` ("btrfs: factor btrfs_init_workqueues() out of open_ctree()"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:57 +02:00
Nikolay Borisov	8535dc1967	btrfs: Remove unused root var from relink_file_extents Added in `38c227d87c` ("Btrfs: snapshot-aware defrag") but subsequently made redundant by `0b246afa62` ("btrfs: root->fs_info cleanup, add fs_info convenience variables"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:57 +02:00
Nikolay Borisov	4eeb97c67a	btrfs: Remove unused tot_len var from lzo_decompress Added already unused in `a6fa6fae40` ("btrfs: Add lzo compression support"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:57 +02:00
Nikolay Borisov	d6e823a578	btrfs: Remove unused length var from scrub_handle_errored_block Added in `b5d67f64f9` ("Btrfs: change scrub to support big blocks") but rendered redundant by `be50a8ddaa` ("Btrfs: Simplify scrub_setup_recheck_block()'s argument"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:57 +02:00
Nikolay Borisov	a6dbceafb9	btrfs: Remove unused op_key var from add_delayed_refs Added as part of `86d5f99442` ("btrfs: convert prelimary reference tracking to use rbtrees") but never used. tmp_op_key essentially subsumed that variable. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:57 +02:00
Qu Wenruo	ba89b80268	btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk When checking the minimal nr_devs, there is one dead and meaningless condition: if (ndevs < devs_increment * sub_stripes \|\| ndevs < devs_min) { ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This condition is meaningless, @devs_increment has nothing to do with @sub_stripes. In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1 (RAID10) already has the @devs_increment set to 2. So no need to multiple it by @sub_stripes. And above condition is also dead. For RAID10, @devs_increment * @sub_stripes equals 4, which is also the @devs_min of RAID10. For other profiles, @sub_stripes is always 1, and since @ndevs is rounded down to @devs_increment, the condition will always be true. Remove the meaningless condition to make later reader wander less. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
Nikolay Borisov	6f47c706d9	btrfs: Document parameters of btrfs_reserve_extent This function is the entry to the extent allocator and as such has quite a number of parameters. Some of those have subtle effects on the allocation algorithm. Document the parameters. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
Nikolay Borisov	d87ff75863	btrfs: Handle error from btrfs_uuid_tree_rem call in _btrfs_ioctl_set_received_subvol As with every function which deals with modifying the btree btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM). Handle return error value from this function gracefully by aborting the transaction. Fixes: `dd5f9615fc` ("Btrfs: maintain subvolume items in the UUID tree") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
Nikolay Borisov	776c4a7ce8	btrfs: Use sizeof directly instead of a constant variable The kernel would like to have all stack VLA usage removed[1]. Unfortunately using an integer constant variable as the size of an array is still considered a VLA. Instead let's use directly sizeof(var) which removes the VLA usage. Use the occasion to remove csum_size altogether and use sizeof() also for the size passed to memcmp [1]: https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
David Sterba	d0ee393493	btrfs: rename submit callbacks and drop double underscores Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:56 +02:00
David Sterba	6c55343587	btrfs: remove unused parameters from extent_submit_bio_done_t Remove parameters not used by any of the callbacks. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	d0779291b1	btrfs: remove unused parameters from extent_submit_bio_start_t Remove parameters not used by any of the callbacks. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	a758781d4b	btrfs: separate types for submit_bio_start and submit_bio_done The callbacks make use of different parameters that are passed to the other type unnecessarily. This patch adds separate types for each and the unused parameters will be removed. The type extent_submit_bio_hook_t keeps all parameters and can be used where the start/done types are not appropriate. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	d9d19a010b	btrfs: kill tree_mod_log_set_root_pointer helper A useless wrapper around tree_mod_log_insert_root that hides missing error handling. Move it to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	0e82bcfe3c	btrfs: kill tree_mod_log_set_node_key helper A trivial wrapper that can be simply opencoded and makes the GFP allocation request more visible. The error handling is now moved to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	bf1d342510	btrfs: kill trivial wrapper tree_mod_log_eb_move The wrapper is effectively an alias for tree_mod_log_insert_move but also hides the missing error handling. To make that more visible, lift the BUG_ON to the callers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:55 +02:00
David Sterba	b1a09f1ec5	btrfs: remove trivial locking wrappers of tree mod log The wrappers are trivial and do not bring any extra value on top of the plain locking primitives. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:54 +02:00
David Sterba	bcd24dabe0	btrfs: drop fs_info parameter from __tree_mod_log_oldest_root It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:54 +02:00
David Sterba	b6dfa35bd5	btrfs: embed tree_mod_move structure to tree_mod_elem The tree_mod_move is not used anywhere and can be embedded as anonymous structure. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:54 +02:00
David Sterba	a446a979ff	btrfs: drop unused fs_info parameter from tree_mod_log_eb_move Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:54 +02:00
David Sterba	95b757c164	btrfs: drop fs_info parameter from tree_mod_log_free_eb It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:54 +02:00
David Sterba	db7279a20b	btrfs: drop fs_info parameter from tree_mod_log_free_eb It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	e09c2efe7e	btrfs: drop fs_info parameter from tree_mod_log_insert_key It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	6074d45f60	btrfs: drop fs_info parameter from tree_mod_log_insert_move It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	3ac6de1abd	btrfs: drop fs_info parameter from tree_mod_log_set_node_key It's provided by the extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	b8b3d625ce	btrfs: document more parameters of submit_extent_page Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	0c8508a6e7	btrfs: cleanup merging conditions in submit_extent_page The merge call was factored out to a separate helper but it's a trivial one and arguably we can opencode it and cache the value. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:53 +02:00
David Sterba	8eec8296a0	btrfs: remove redundant variable in __do_readpage The value of page_end is only stored to end, no other use. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:52 +02:00
David Sterba	5c2b1fd753	btrfs: assume that bio_ret is always valid in submit_extent_page All callers pass a valid pointer so we can drop the redundant checks. The call to submit_one_bio never happend and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:52 +02:00
Liu Bo	6ca1765b36	Btrfs: scrub: batch rebuild for raid56 In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K) as unit, however, scrub_extent() sets blocksize as unit, so rebuild process may be triggered on every block on a same stripe. A typical example would be that when we're replacing a disappeared disk, all reads on the disks get -EIO, every block (size is 4K if blocksize is 4K) would go thru these, scrub_handle_errored_block scrub_recheck_block # re-read pages one by one scrub_recheck_block # rebuild by calling raid56_parity_recover() page by page Although with raid56 stripe cache most of reads during rebuild can be avoided, the parity recover calculation(xor or raid6 algorithms) needs to be done $(BTRFS_STRIPE_LEN / blocksize) times. This makes it smarter by doing raid56 scrub/replace on stripe length. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:52 +02:00
David Sterba	416a72022e	btrfs: sort and group mount option definitions Sort mount options by the primary name, followed by the 'no-' counterpart if it exists. Group the deprecated and debugging options. Enum and token defintions are synced. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:52 +02:00
Howard McLauchlan	62b8e07731	btrfs: Add nossd_spread mount option Btrfs has two mount options for SSD optimizations: ssd and ssd_spread. Presently there is an option to disable all SSD optimizations, but there isn't an option to disable just ssd_spread. This patch adds a mount option nossd_spread that disables ssd_spread only. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:52 +02:00
Nikolay Borisov	92e2f7e370	btrfs: Remove btrfs_fs_info::open_ioctl_trans Since userspace transaction have been removed we no longer have use for this field so delete it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Nikolay Borisov	bcf3a3e7fb	btrfs: Remove code referencing unused TRANS_USERSPACE Now that the userspace transaction ioctls have been removed, TRANS_USERSPACE is no longer used hence we can remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Nikolay Borisov	859e682d58	btrfs: Remove btrfs_file_private::trans Now that the userspace transaction IOCTL have been removed, this member is no longer used so just remove it Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Nikolay Borisov	7a5a07a810	btrfs: Remove userspace transaction ioctls Commit `3558d4f88e` ("btrfs: Deprecate userspace transaction ioctls") marked the beginning of the end of userspace transaction. This commit finishes the job! There are no known users and ceph does not use the ioctl anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Sage Weil <sage@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Qu Wenruo	4d31778aa2	btrfs: qgroup: Fix root item corruption when multiple same source snapshots are created with quota enabled When multiple pending snapshots referring to the same source subvolume are executed, enabled quota will cause root item corruption, where root items are using old bytenr (no backref in extent tree). This can be triggered by fstests btrfs/152. The cause is when source subvolume is still dirty, extra commit (simplied transaction commit) of qgroup_account_snapshot() can skip dirty roots not recorded in current transaction, making root item of source subvolume not updated. Fix it by forcing recording source subvolume in current transaction before qgroup sub-transaction commit. Reported-by: Justin Maggard <jmaggard@netgear.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Nikolay Borisov	2e32ef87b0	btrfs: Relax memory barrier in btrfs_tree_unlock When performing an unlock on an extent buffer we'd like to order the decrement of extent_buffer::blocking_writers with waking up any waiters. In such situations it's sufficient to use smp_mb__after_atomic rather than the heavy smp_mb. On architectures where atomic operations are fully ordered (such as x86 or s390) unconditionally executing a heavyweight smp_mb instruction causes a severe hit to performance while bringin no improvements in terms of correctness. The better thing is to use the appropriate smp_mb__after_atomic routine which will do the correct thing (invoke a full smp_mb or in the case of ordered atomics insert a compiler barrier). Put another way, an RMW atomic op + smp_load__after_atomic equals, in terms of semantics, to a full smp_mb. This ensures that none of the problems described in the accompanying comment of waitqueue_active occur. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:51 +02:00
Anand Jain	7c829b722d	btrfs: add define for oldest generation Some functions can filter metadata by the generation. Add a define that will annotate such arguments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:50 +02:00
David Sterba	051c98eb11	btrfs: open code trivial helper btrfs_page_exists_in_range The called function name is self explanatory. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:50 +02:00
Matthew Wilcox	965aab1cfc	btrfs: Use filemap_range_has_page() The current implementation of btrfs_page_exists_in_range() gives the wrong answer if the workingset code has stored a shadow entry in the page cache. The filemap_range_has_page() function does not have this problem, and it's shared code, so use it instead. eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-31 01:26:45 +02:00
Theodore Ts'o	de05ca8526	ext4: move call to ext4_error() into ext4_xattr_check_block() Refactor the call to EXT4_ERROR_INODE() into ext4_xattr_check_block(). This simplifies the code, and fixes a problem where not all callers of ext4_xattr_check_block() were not resulting in ext4_error() getting called when the xattr block is corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-30 15:42:25 -04:00
Dan Williams	5f0663bb4a	ext4, dax: introduce ext4_dax_aops In preparation for the dax implementation to start associating dax pages to inodes via page->mapping, we need to provide a 'struct address_space_operations' instance for dax. Otherwise, direct-I/O triggers incorrect page cache assumptions and warnings. Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-03-30 11:34:55 -07:00
Dan Williams	6e2608dfd9	xfs, dax: introduce xfs_dax_aops In preparation for the dax implementation to start associating dax pages to inodes via page->mapping, we need to provide a 'struct address_space_operations' instance for dax. Otherwise, direct-I/O triggers incorrect page cache assumptions and warnings like the following: WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs] [..] CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O 4.15.0-rc2+ #984 [..] Call Trace: set_page_dirty_lock+0x40/0x60 bio_set_pages_dirty+0x37/0x50 iomap_dio_actor+0x2b7/0x3b0 ? iomap_dio_zero+0x110/0x110 iomap_apply+0xa4/0x110 iomap_dio_rw+0x29e/0x3b0 ? iomap_dio_zero+0x110/0x110 ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs] xfs_file_dio_aio_read+0x7c/0x1a0 [xfs] xfs_file_read_iter+0xa0/0xc0 [xfs] __vfs_read+0xf9/0x170 vfs_read+0xa6/0x150 SyS_pread64+0x93/0xb0 entry_SYSCALL_64_fastpath+0x1f/0x96 ...where the default set_page_dirty() handler assumes that dirty state is being tracked in 'struct page' flags. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-03-30 11:34:55 -07:00
Dan Williams	15aa8a0118	block, dax: remove dead code in blkdev_writepages() Block device inodes never have S_DAX set, so kill the check for DAX and diversion to dax_writeback_mapping_range(). Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-03-30 11:34:55 -07:00
Dan Williams	f44c77630d	fs, dax: prepare for dax-specific address_space_operations In preparation for the dax implementation to start associating dax pages to inodes via page->mapping, we need to provide a 'struct address_space_operations' instance for dax. Define some generic VFS aops helpers for dax. These noop implementations are there in the dax case to prevent the VFS from falling back to operations with page-cache assumptions, dax_writeback_mapping_range() may not be referenced in the FS_DAX=n case. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Matthew Wilcox <mawilcox@microsoft.com> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-03-30 11:34:55 -07:00
Dan Williams	3fe0791c29	dax: store pfns in the radix In preparation for examining the busy state of dax pages in the truncate path, switch from sectors to pfns in the radix. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-03-30 11:34:54 -07:00
Yan, Zheng	85784f9395	ceph: only dirty ITER_IOVEC pages for direct read If a page is already locked, attempting to dirty it leads to a deadlock in lock_page(). This is what currently happens to ITER_BVEC pages when a dio-enabled loop device is backed by ceph: $ losetup --direct-io /dev/loop0 /mnt/cephfs/img $ xfs_io -c 'pread 0 4k' /dev/loop0 Follow other file systems and only dirty ITER_IOVEC pages. Cc: stable@kernel.org Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2018-03-30 11:17:48 +02:00
Tyson Nottingham	27f394a771	ext4: don't show data=<mode> option if defaulted Previously, mount -l would show data=<mode> even if the ext4 default journaling mode was being used. Change this to be consistent with the rest of the options. Ext4 already did the right thing when the journaling mode being used matched the one specified in the superblock's default mount options. The reason it failed to do the right thing for the ext4 defaults is that, when set, they were never included in sbi->s_def_mount_opt (unlike the superblock's defaults, which were). Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:56:10 -04:00
Tyson Nottingham	ceec03764a	ext4: omit init_itable=n in procfs when disabled Don't show init_itable=n in /proc/fs/ext4/<dev>/options when filesystem is mounted with noinit_itable. Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:53:33 -04:00
Tyson Nottingham	68afa7e083	ext4: show more binary mount options in procfs Previously, /proc/fs/ext4/<dev>/options would only show binary options if they were set (1 in the options bit mask). E.g. it would show "grpid" if it was set, but it would not show "nogrpid" if grpid was not set. This seems sensible, but when an option is absent from the file, it can be hard for the unfamiliar to know what is being used. E.g. if there isn't a (no)grpid entry, nogrpid is in effect. But if there isn't a (no)auto_da_alloc entry, auto_da_alloc is in effect. If there isn't a (minixdf\|bsddf) entry, it turns out bsddf is in effect. It all depends on how the option is implemented. It's clearer to be explicit, so print the corresponding option regardless of whether it means a 1 or a 0 in the bit mask. Note that options which do not have an explicit disable option aren't indicated as being disabled even with this change (e.g. dax). Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:51:10 -04:00
Tyson Nottingham	bc1420ae56	ext4: simplify kobject usage Replace kset with generic kobject provided by kobject_create_and_add(), since the latter is sufficient. Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:41:34 -04:00
Tyson Nottingham	6ca06829fb	ext4: remove unused parameters in sysfs code Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:13:10 -04:00
Tyson Nottingham	c2e5df7626	ext4: null out kobject* during sysfs cleanup Make cleanup of ext4_feat kobject consistent with similar objects. Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-30 00:03:38 -04:00
Theodore Ts'o	18db4b4e6f	ext4: don't allow r/w mounts if metadata blocks overlap the superblock If some metadata block, such as an allocation bitmap, overlaps the superblock, it's very likely that if the file system is mounted read/write, the results will not be pretty. So disallow r/w mounts for file systems corrupted in this particular way. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-29 22:10:35 -04:00
Theodore Ts'o	a45403b515	ext4: always initialize the crc32c checksum driver The extended attribute code now uses the crc32c checksum for hashing purposes, so we should just always always initialize it. We also want to prevent NULL pointer dereferences if one of the metadata checksum features is enabled after the file sytsem is originally mounted. This issue has been assigned CVE-2018-1094. https://bugzilla.kernel.org/show_bug.cgi?id=199183 https://bugzilla.redhat.com/show_bug.cgi?id=1560788 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-29 22:10:31 -04:00
Theodore Ts'o	8e4b5eae5d	ext4: fail ext4_iget for root directory if unallocated If the root directory has an i_links_count of zero, then when the file system is mounted, then when ext4_fill_super() notices the problem and tries to call iput() the root directory in the error return path, ext4_evict_inode() will try to free the inode on disk, before all of the file system structures are set up, and this will result in an OOPS caused by a NULL pointer dereference. This issue has been assigned CVE-2018-1092. https://bugzilla.kernel.org/show_bug.cgi?id=199179 https://bugzilla.redhat.com/show_bug.cgi?id=1560777 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-29 21:56:09 -04:00
Al Viro	cbd4a5bcb2	d_genocide: move export to definition Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:08:21 -04:00
Al Viro	42177007aa	fold dentry_lock_for_move() into its sole caller and clean it up Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:49 -04:00
Al Viro	076515fc92	make non-exchanging __d_move() copy ->d_parent rather than swap them Currently d_move(from, to) does the following: * name/parent of from <- old name/parent of to, from hashed there * to is unhashed * name of to is preserved * if from used to be detached, to gets detached * if from used to be attached, parent of to <- old parent of from. That's both user-visibly bogus and complicates reasoning a lot. Much saner semantics would be * name/parent of from <- name/parent of to, from hashed there. * to is unhashed * name/parent of to is unchanged. The price, of course, is that old parent of from might lose a reference. However, * all potentially cross-directory callers of d_move() have both parents pinned directly; typically, dentries themselves are grabbed only after we have grabbed and locked both parents. IOW, the decrement of old parent's refcount in case of d_move() won't reach zero. * __d_move() from d_splice_alias() is done to detached alias. No refcount decrements in that case * __d_move() from __d_unalias() can get the refcount to zero. So let's grab a reference to alias' old parent before calling __d_unalias() and dput() it after we'd dropped rename_lock. That does make d_splice_alias() potentially blocking. However, it has no callers in non-sleepable contexts (and the case where we'd grown that dget/dput pair is _very_ rare, so performance is not an issue). Another thing that needs adjustment is unlocking in the end of __d_move(); folded it in. And cleaned the remnants of bogus ordering from the "lock them in the beginning" counterpart - it's never been right and now (well, for 7 years now) we have that thing always serialized on rename_lock anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:49 -04:00
Al Viro	cd1c0c9321	debugfs_lookup(): switch to lookup_one_len_unlocked() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:47 -04:00
Al Viro	a03ece5ff2	fold lookup_real() into __lookup_hash() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:47 -04:00
Al Viro	7a5cf791a7	split d_path() and friends into a separate file Those parts of fs/dcache.c are pretty much self-contained. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:46 -04:00
Al Viro	43986d63b6	dcache.c: trim includes Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:45 -04:00
John Ogness	8f04da2adb	fs/dcache: Avoid a try_lock loop in shrink_dentry_list() shrink_dentry_list() holds dentry->d_lock and needs to acquire dentry->d_inode->i_lock. This cannot be done with a spin_lock() operation because it's the reverse of the regular lock order. To avoid ABBA deadlocks it is done with a trylock loop. Trylock loops are problematic in two scenarios: 1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are preemptible. As a consequence the i_lock holder can be preempted by a higher priority task. If that task executes the trylock loop it will do so forever and live lock. 2) In virtual machines trylock loops are problematic as well. The VCPU on which the i_lock holder runs can be scheduled out and a task on a different VCPU can loop for a whole time slice. In the worst case this can lead to starvation. Commits `47be61845c` ("fs/dcache.c: avoid soft-lockup in dput()") and `046b961b45` ("shrink_dentry_list(): take parent's d_lock earlier") are addressing exactly those symptoms. Avoid the trylock loop by using dentry_kill(). When pruning ancestors, the same code applies that is used to kill a dentry in dput(). This also has the benefit that the locking order is now the same. First the inode is locked, then the parent. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:44 -04:00
Al Viro	f657a666fd	get rid of trylock loop around dentry_kill() In case when trylock in there fails, deal with it directly in dentry_kill(). Note that in cases when we drop and retake ->d_lock, we need to recheck whether to retain the dentry. Another thing is that dropping/retaking ->d_lock might have ended up with negative dentry turning into positive; that, of course, can happen only once... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:44 -04:00
Al Viro	62d9956cef	handle move to LRU in retain_dentry() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:43 -04:00
Al Viro	a338579f2f	dput(): consolidate the "do we need to retain it?" into an inlined helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:43 -04:00
Al Viro	8b987a46a1	split the slow part of lock_parent() off Turn the "trylock failed" part into uninlined __lock_parent(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:42 -04:00
Al Viro	65d8eb5a8f	now lock_parent() can't run into killed dentry all remaining callers hold either a reference or ->i_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:42 -04:00
Al Viro	3b3f09f48b	get rid of trylock loop in locking dentries on shrink list In case of trylock failure don't re-add to the list - drop the locks and carefully get them in the right order. For shrink_dentry_list(), somebody having grabbed a reference to dentry means that we can kick it off-list, so if we find dentry being modified under us we don't need to play silly buggers with retries anyway - off the list it is. The locking logics taken out into a helper of its own; lock_parent() is no longer used for dentries that can be killed under us. [fix from Eric Biggers folded] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-29 15:07:02 -04:00
Eric Biggers	ce3fd194fc	ext4: limit xattr size to INT_MAX ext4 isn't validating the sizes of xattrs where the value of the xattr is stored in an external inode. This is problematic because ->e_value_size is a u32, but ext4_xattr_get() returns an int. A very large size is misinterpreted as an error code, which ext4_get_acl() translates into a bogus ERR_PTR() for which IS_ERR() returns false, causing a crash. Fix this by validating that all xattrs are <= INT_MAX bytes. This issue has been assigned CVE-2018-1095. https://bugzilla.kernel.org/show_bug.cgi?id=199185 https://bugzilla.redhat.com/show_bug.cgi?id=1560793 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Fixes: `e50e5129f3` ("ext4: xattr-in-inode support")	2018-03-29 14:31:42 -04:00
Abhi Das	5e86d9d122	gfs2: time journal recovery steps accurately This patch spits out the time taken by the various steps in the journal recover process. Previously, the journal recovery time didn't account for finding the journal head in the log which takes up a significant portion of time. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-29 10:41:27 -07:00
Eric Sandeen	dc1baa715b	xfs: do not log/recover swapext extent owner changes for deleted inodes Today if we run xfs_fsr and crash[1], log replay can fail because the recovery code tries to instantiate the donor inode from disk to replay the swapext, but it's been deleted and we get verifier failures when we try to read the inode off disk with i_mode == 0. This fixes both sides: We don't log the swapext change if the inode has been deleted, and we don't try to recover it either. [1] or if systemd doesn't cleanly unmount root, as it is wont to do ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-29 10:19:15 -07:00
Andreas Gruenbacher	fffb64127a	gfs2: Zero out fallocated blocks in fallocate_chunk Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them out in fallocate_chunk, much higher up the call stack. This gets rid of gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew buffer flag. I can't think of a reason why zeroing out the blocks in gfs2_iomap_alloc would have any benefits: there is no additional locking at that level that would add protection to the newly allocated blocks. While at it, change fallocate over from gs2_block_map to gfs2_iomap_begin. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de>	2018-03-29 06:50:32 -07:00
Colin Ian King	f62fd7a777	ecryptfs: fix spelling mistake: "cadidate" -> "candidate" Trivial fix to spelling mistake in debug message text. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>	2018-03-29 01:33:30 +00:00
Guenter Roeck	ab13a9218c	ecryptfs: lookup: Don't check if mount_crypt_stat is NULL mount_crypt_stat is assigned to &ecryptfs_superblock_to_private(ecryptfs_dentry->d_sb)->mount_crypt_stat, and mount_crypt_stat is not the first object in struct ecryptfs_sb_info. mount_crypt_stat is therefore never NULL. At the same time, no crash in ecryptfs_lookup() has been reported, and the lookup functions in other file systems don't check if d_sb is NULL either. Given that, remove the NULL check. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>	2018-03-29 01:33:29 +00:00
Eric Biggers	53fedcc09c	f2fs: reserve bits for fs-verity Reserve an F2FS feature flag and inode flag for fs-verity. This is an in-development feature that is planned be discussed at LSF/MM 2018 [1]. It will provide file-based integrity and authenticity for read-only files. Most code will be in a filesystem-independent module, with smaller changes needed to individual filesystems that opt-in to supporting the feature. An early prototype supporting F2FS is available [2]. Reserving the F2FS on-disk bits for fs-verity will prevent users of the prototype from conflicting with other new F2FS features. Note that we're reserving the inode flag in f2fs_inode.i_advise, which isn't really appropriate since it's not a hint or advice. But ->i_advise is already being used to hold the 'encrypt' flag; and F2FS's ->i_flags uses the generic FS_* values, so it seems ->i_flags can't be used for an F2FS-specific flag without additional work to remove the assumption that ->i_flags uses the generic flags namespace. [1] https://marc.info/?l=linux-fsdevel&m=151690752225644 [2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-28 12:56:49 -07:00
Greg Kroah-Hartman	b24d0d5b12	Merge 4.16-rc7 into char-misc-next We want the hyperv fix in here for merging and testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-28 12:27:35 +02:00
Christoph Hellwig	0e11f6443f	fs: move I_DIRTY_INODE to fs.h And use it in a few more places rather than opencoding the values. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-28 01:39:02 -04:00
Christoph Hellwig	f35562549f	ubifs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC \| I_DIRTY_DATASYNC) call I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as in mark dirty to be written out by fdatasync as well. So dirtying for both flags makes no sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-28 01:39:02 -04:00
Christoph Hellwig	2c2acd2d19	ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC \| I_DIRTY_DATASYNC) call I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as in mark dirty to be written out by fdatasync as well. So dirtying for both flags makes no sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-28 01:39:02 -04:00
Christoph Hellwig	937d330512	gfs2: fix bogus __mark_inode_dirty(I_DIRTY_SYNC \| I_DIRTY_DATASYNC) calls I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as in mark dirty to be written out by fdatasync as well. So dirtying for both flags makes no sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-28 01:39:01 -04:00
Christoph Hellwig	cab64df194	fs: fold open_check_o_direct into do_dentry_open do_dentry_open is where we do the actual open of the file, so this is where we should do our O_DIRECT sanity check to cover all potential callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-28 01:39:01 -04:00
Yunlei He	d21b0f238a	f2fs: Add a segment type check in inplace write This patch add a segment type check in IPU, in case of something wrong with blkadd in dnode. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-27 20:13:54 -07:00
Yunlong Song	2f52052929	f2fs: no need to initialize zero value for GFP_F2FS_ZERO Since f2fs_inode_info is allocated with flag GFP_F2FS_ZERO, so we do not need to initialize zero value for its member any more. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-27 20:12:53 -07:00
Chao Yu	780de47cf6	f2fs: don't track new nat entry in nat set Nat entry set is used only in checkpoint(), and during checkpoint() we won't flush new nat entry with unallocated address, so we don't need to add new nat entry into nat set, then nat_entry_set::entry_cnt can indicate actual entry count we need to flush in checkpoint(). Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-27 20:10:29 -07:00
Chao Yu	df033caf51	f2fs: clean up with F2FS_BLK_ALIGN Clean up F2FS_BYTES_TO_BLK(x + F2FS_BLKSIZE - 1) with F2FS_BLK_ALIGN(x). Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-27 20:09:32 -07:00
David Howells	a25e21f0bc	rxrpc, afs: Use debug_ids rather than pointers in traces In rxrpc and afs, use the debug_ids that are monotonically allocated to various objects as they're allocated rather than pointers as kernel pointers are now hashed making them less useful. Further, the debug ids aren't reused anywhere nearly as quickly. In addition, allow kernel services that use rxrpc, such as afs, to take numbers from the rxrpc counter, assign them to their own call struct and pass them in to rxrpc for both client and service calls so that the trace lines for each will have the same ID tag. Signed-off-by: David Howells <dhowells@redhat.com>	2018-03-27 23:03:00 +01:00
Kirill Tkhai	2f635ceeb2	net: Drop pernet_operations::async Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:09 -04:00
Kirill Tkhai	67441c2472	net: Convert nfsd_net_ops These pernet_operations look similar to rpcsec_gss_net_ops, they just create and destroy another caches. So, they also can be async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-27 13:18:07 -04:00
Masanari Iida	bc8282a730	treewide: Fix typos in printk This patch fixes spelling typos found in printk. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2018-03-27 09:51:22 +02:00
Theodore Ts'o	7dac4a1726	ext4: add validity checks for bitmap block numbers An privileged attacker can cause a crash by mounting a crafted ext4 image which triggers a out-of-bounds read in the function ext4_valid_block_bitmap() in fs/ext4/balloc.c. This issue has been assigned CVE-2018-1093. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199181 BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1560782 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-26 23:54:10 -04:00
Kirill Tkhai	dbf7bb4437	net: Convert nfs4blocklayout_net_ops These pernet_operations create and destroy per-net pipe and dentry, and they seem safe to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:03:26 -04:00
Kirill Tkhai	436de50094	net: Convert nfs4_dns_resolver_ops These pernet_operations look similar to rpcsec_gss_net_ops, they just create and destroy another cache. Also they create and destroy directory. So, they also look safe to be async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-26 13:03:26 -04:00
Brian Foster	72c44e35f0	xfs: clean up xfs_mount allocation and dynamic initializers Most of the generic data structures embedded in xfs_mount are dynamically initialized immediately after mp is allocated. A few fields are left out and initialized during the xfs_mountfs() sequence, after mp has been attached to the superblock. To clean this up and help prevent premature access of associated fields, refactor xfs_mount allocation and all dependent init calls into a new helper. This self-documents that all low level data structures (i.e., locks, trees, etc.) should be initialized before xfs_mount is attached to the superblock. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-26 08:54:15 -07:00
Arnd Bergmann	a687a53370	treewide: simplify Kconfig dependencies for removed archs A lot of Kconfig symbols have architecture specific dependencies. In those cases that depend on architectures we have already removed, they can be omitted. Acked-by: Kalle Valo <kvalo@codeaurora.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2018-03-26 15:55:57 +02:00
Liu Bo	4759700a71	Btrfs: dev-replace: make sure target is identical to source when raid56 rebuild fails In the last step of scrub_handle_error_block, we try to combine good copies on all possible mirrors, this works fine for raid1 and raid10, but not for raid56 as it's doing parity rebuild. If parity rebuild doesn't get back with correct data which matches its checksum, in case of replace we'd rather write what is stored in the source device than the data calculuated from parity. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:43 +02:00
Liu Bo	d6a691350b	Btrfs: raid56: remove redundant async_missing_raid56 async_missing_raid56() is identical to async_read_rebuild(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:43 +02:00
Su Yue	005d67127f	btrfs: adjust return values of btrfs_inode_by_name Previously, btrfs_inode_by_name() returned 0 which left caller to check objectid of location even location if the type was invalid. Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a dir entry is found. Removal of label out_err also simplifies the function. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ drop unlikely ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:43 +02:00
Anand Jain	9b99b11564	btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids This function btrfs_close_extra_devices() is about freeing extra devids which once it may have belonged to this filesystem. So rename it and add the comment. The _devid suffix is appropriate as this function won't handle devices which are outside of the filesytem being mounted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Nikolay Borisov	d02c0e2019	btrfs: Remove root argument from cow_file_range_inline This argument is always set to the root of the inode, which is also passed. So let's get a reference inside the function and simplify the arg list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Liu Bo	895a72be41	Btrfs: send: fix typo in TLV_PUT According to tlv_put()'s prototype, data and attrlen needs to be exchanged in the macro, but seems all callers are already aware of this misorder and are therefore not affected. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Nikolay Borisov	e5b84f7a25	btrfs: Remove root argument from btrfs_log_dentry_safe Now that nothing uses the root arg of btrfs_log_dentry_safe it can be safely removed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Nikolay Borisov	f882274b2d	btrfs: Remove root arg from btrfs_log_inode_parent btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe and btrfs_log_new_name) both of which pass inode->root as the root argument and the inode itself. Remove the redundant root argument and get a reference to the root directly from the inode, also remove redundant root != inode->root check from the same function. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:42 +02:00
Nikolay Borisov	448f3a17ac	btrfs: Remove redundant comment from btrfs_search_forward This function always sets keep_locks to 1 and saves the old value of keep_locks which is restored at the end. So there is no way it can be called without keep_locks being set. Remove comment imposing redundant requirement on callers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
David Sterba	738c93d42c	btrfs: move btrfs_listxattr prototype to xattr.h There's a proper header for xattr handlers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
David Sterba	bcadd7050a	btrfs: adjust return type of btrfs_getxattr The xattr_handler::get prototype returns int, use it. The only ssize_t exception is the per-inode listxattr handler. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
David Sterba	ab0d093616	btrfs: drop extern from function declarations Extern for functions does not make any difference, there are only a few so let's remove them before it's too late. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
David Sterba	7852781d94	btrfs: drop underscores from exported xattr functions Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
Filipe Manana	ffa7c4296e	Btrfs: send, do not issue unnecessary truncate operations When send finishes processing an inode representing a regular file, it always issues a truncate operation for that file, even if its size did not change or the last write sets the file size correctly. In the most common cases, the issued write operations set the file to correct size (either full or incremental sends) or the file size did not change (for incremental sends), so the only case where a truncate operation is needed is when a file size becomes smaller in the send snapshot when compared to the parent snapshot. By not issuing unnecessary truncate operations we reduce the stream size and save time in the receiver. Currently truncating a file to the same size triggers writeback of its last page (if it's dirty) and waits for it to complete (only if the file size is not aligned with the filesystem's sector size). This is being fixed by another patch and is independent of this change (that patch's title is "Btrfs: skip writeback of last page when truncating file to same size"). The following script was used to measure time spent by a receiver without this change applied, with this change applied, and without this change and with the truncate fix applied (the fix to not make it start and wait for writeback to complete). $ cat test_send.sh #!/bin/bash SRC_DEV=/dev/sdc DST_DEV=/dev/sdd SRC_MNT=/mnt/sdc DST_MNT=/mnt/sdd mkfs.btrfs -f $SRC_DEV >/dev/null mkfs.btrfs -f $DST_DEV >/dev/null mount $SRC_DEV $SRC_MNT mount $DST_DEV $DST_MNT echo "Creating source filesystem" for ((t = 0; t < 10; t++)); do ( for ((i = 1; i <= 20000; i++)); do xfs_io -f -c "pwrite -S 0xab 0 5000" \ $SRC_MNT/file_$i > /dev/null done ) & worker_pids[$t]=$! done wait ${worker_pids[@]} echo "Creating and sending snapshot" btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null /usr/bin/time -f "send took %e seconds" \ btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1 /usr/bin/time -f "receive took %e seconds" \ btrfs receive -f $SRC_MNT/send_file $DST_MNT umount $SRC_MNT umount $DST_MNT The results, which are averages for 5 runs for each case, were the following: * Without this change average receive time was 26.49 seconds standard deviation of 2.53 seconds * Without this change and with the truncate fix average receive time was 12.51 seconds standard deviation of 0.32 seconds * With this change and without the truncate fix average receive time was 10.02 seconds standard deviation of 1.11 seconds Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:41 +02:00
Filipe Manana	213e8c5520	Btrfs: skip writeback of last page when truncating file to same size When we truncate a file to the same size and that size is not aligned with the sector size, we end up triggering writeback (and wait for it to complete) of the last page. This is unncessary as we can not have delayed allocation beyond the inode's i_size and the goal of truncating a file to its own size is to discard prealloc extents (allocated via the fallocate(2) system call). Besides the unnecessary IO start and wait, it also breaks the oppurtunity for larger contiguous extents on disk, as before the last dirty page there might be other dirty pages. This scenario is probably not very common in general, however it is common for btrfs receive implementations because currently the send stream always issues a truncate operation for each processed inode as the last operation for that inode (this truncate operation is not always needed and the send implementation will be addressed to avoid them). So improve this by not starting and waiting for writeback of the inode's last page when we are truncating to exactly the same size. The following script was used to quickly measure the time a receive operation takes: $ cat test_send.sh #!/bin/bash SRC_DEV=/dev/sdc DST_DEV=/dev/sdd SRC_MNT=/mnt/sdc DST_MNT=/mnt/sdd mkfs.btrfs -f $SRC_DEV >/dev/null mkfs.btrfs -f $DST_DEV >/dev/null mount $SRC_DEV $SRC_MNT mount $DST_DEV $DST_MNT echo "Creating source filesystem" for ((t = 0; t < 10; t++)); do ( for ((i = 1; i <= 20000; i++)); do xfs_io -f -c "pwrite -S 0xab 0 5000" \ $SRC_MNT/file_$i > /dev/null done ) & worker_pids[$t]=$! done wait ${worker_pids[@]} echo "Creating and sending snapshot" btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null /usr/bin/time -f "send took %e seconds" \ btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1 /usr/bin/time -f "receive took %e seconds" \ btrfs receive -f $SRC_MNT/send_file $DST_MNT umount $SRC_MNT umount $DST_MNT The results for 5 runs were the following: * Without this change average receive time was 26.49 seconds standard deviation of 2.53 seconds * With this change average receive time was 12.51 seconds standard deviation of 0.32 seconds Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
Liu Bo	ed5d5f37e6	Btrfs: dev-replace: skip prealloc extents when copy nocow pages It doens't make sense to process prealloc extents as pages will be filled with zero when reading prealloc extents. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
Anand Jain	d612ac59ef	btrfs: unify types for metadata_ratio and data_chunk_allocations We have btrfs_fs_info::data_chunk_allocations and btrfs_fs_info::metadata_ratio declared as unsigned which would be unsinged int and kernel style prefers unsigned int over bare unsigned. So this patch changes them to u32. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
Nikolay Borisov	de224b7c56	btrfs: Remove redundant memory barriers around dio_private error status Using any kind of memory barriers around atomic operations which have a return value is redundant, since those operations themselves are fully ordered. atomic_t.txt states: - RMW operations that have a return value are fully ordered; Fully ordered primitives are ordered against everything prior and everything subsequent. Therefore a fully ordered primitive is like having an smp_mb() before and an smp_mb() after the primitive. Given this let's replace the extra memory barriers with comments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
Anand Jain	16db5758fe	btrfs: remove assert in btrfs_init_dev_replace_tgtdev() In the same function we just ran btrfs_alloc_device() which means the btrfs_device::resized_list is sure to be empty and we are protected with the btrfs_fs_info::volume_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:40 +02:00
David Sterba	e67c718b5b	btrfs: add more __cold annotations The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
David Sterba	ffc5a3794f	btrfs: add (the only possible) __exit annotation Recently, the __init annotations have been added. There's unfortunatelly only one case where we can add __exit, because most of the cleanup helpers are also called from the __init phase. As the __exit annotated functions get discarded completely for a built-in code, we'd miss them from the init phase. Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Anand Jain	ccb0e7d1c1	btrfs: verify subvolid mount parameter We aren't verifying the parameter passed to the subvolid mount option, so we won't report and fail the mount if a junk value is specified for example, -o subvolid=abc. This patch verifies the subvolid option with match_u64. Up to now the memparse function accepts the K/M/G/ suffixes, that are usually meant for size values and do not make sense for a subvolume it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Liu Bo	5811375325	Btrfs: fix unexpected cow in run_delalloc_nocow Fstests generic/475 provides a way to fail metadata reads while checking if checksum exists for the inode inside run_delalloc_nocow(), and csum_exist_in_range() interprets error (-EIO) as inode having checksum and makes its caller enter the cow path. In case of free space inode, this ends up with a warning in cow_file_range(). The same problem applies to btrfs_cross_ref_exist() since it may also read metadata in between. With this, run_delalloc_nocow() bails out when errors occur at the two places. cc: <stable@vger.kernel.org> v2.6.28+ Fixes: `17d217fe97` ("Btrfs: fix nodatasum handling in balancing code") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Nikolay Borisov	9678c54388	btrfs: Remove custom crc32c init code The custom crc32 init code was introduced in `14a958e678` ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by `60efa5eb2e` ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initiliased. The latter commit superseeded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selectiong the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:39 +02:00
Qu Wenruo	3e72ee8874	btrfs: Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index() Function __get_raid_index() is used to convert block group flags into raid index, which can be used to get various info directly from btrfs_raid_array[]. Refactor this function a little: 1) Rename to btrfs_bg_flags_to_raid_index() Double underline prefix is normally for internal functions, while the function is used by both extent-tree and volumes. Although the name is a little longer, but it should explain its usage quite well. 2) Move it to volumes.h and make it static inline Just several if-else branches, really no need to define it as a normal function. This also makes later code re-use between kernel and btrfs-progs easier. 3) Remove function get_block_group_index() Really no need to do such a simple thing as an exported function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:38 +02:00
Qu Wenruo	2f659546c9	btrfs: tree-checker: Replace root parameter with fs_info When inspecting the error message with real corruption, the "root=%llu" always shows "1" (root tree), instead of the correct owner. The problem is that we are getting @root from page->mapping->host, which points the same btree inode, so we will always get the same root. This makes the root owner output meaningless, and harder to port tree-checker to btrfs-progs. So get rid of the false and meaningless @root parameter and replace it with @fs_info. To get the owner, we can only rely on btrfs_header_owner() now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:38 +02:00
Liu Bo	393da91819	Btrfs: add tracepoint for em's EEXIST case This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the subtle bugs around merge_extent_mapping. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:38 +02:00
Nikolay Borisov	5d23515be6	btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable Currently btrfs_run_qgroups is doing a bit too much. Not only is it responsible for synchronizing in-memory state of qgroups to disk but it also contains code to trigger the initial qgroup rescan when quota is enabled initially. This condition is detected by checking that BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set. Nothing really requires from the code to be structured (and scattered) the way it is so let's streamline things. First move the quota rescan code into btrfs_quota_enable, where its invocation is closer to the use. This also makes the FS_QUOTA_ENABLING flag redundant so let's remove it as well. This has been tested with a full xfstest run with qgroups enabled on the scratch device of every xfstest and no regressions were observed. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:38 +02:00
Gu JinXiang	7ce311d552	btrfs: use reada direction enum instead of constant value in load_free_space_tree load_free_space_tree calls either function load_free_space_bitmaps or load_free_space_extents. And either of those two will lead to call btrfs_next_item. So in function load_free_space_tree, use READA_FORWARD to read forward ahead. This also changes the value from READA_BACK to READA_FORWARD, since according to the logic, it should reada_for_search forward, not backward. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Gu Jinxiang	019599ada7	btrfs: use reada direction enum instead of constant value in populate_free_space_tree populate_free_space_tree calls function btrfs_search_slot_for_read with parameter int find_higher = 1, it means that, if no exact match is found, then use the next higher item. So in function populate_free_space_tree, use READA_FORWARD to read forward ahead. This also changes the value from READA_BACK to READA_FORWARD, since according to the logic, it should reada_for_search forward, not backward. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Nikolay Borisov	c1c3fac2a9	btrfs: Remove btrfs_inode::delayed_iput_count delayed_iput_count wa supposed to be used to implement, well, delayed iput. The idea is that we keep accumulating the number of iputs we do until eventually the inode is deleted. Turns out we never really switched the delayed_iput_count from 0 to 1, hence all conditional code relying on the value of that member being different than 0 was never executed. This, as it turns out, didn't cause any problem due to the simple fact that the generic inode's i_count member was always used to count the number of iputs. So let's just remove the unused member and all unused code. This patch essentially provides no functional changes. While at it, also add proper documentation for btrfs_add_delayed_iput Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Qu Wenruo	793ff2c88c	btrfs: volumes: Cleanup stripe size calculation Cleanup the following things: 1) open coded SZ_16M round up 2) use min() to replace open-coded size comparison 3) code style Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Nikolay Borisov	da07d4ab33	btrfs: Streamline btrfs_delalloc_reserve_metadata initial operations The behavior of btrfs_delalloc_reserve_metadata depends on whether the inode we are allocating for is the freespace inode or not. As it stands if we are the free node we set 'flush' and 'delalloc_lock' variable to certain values. Subsequently we check the values of those vars and act accordingly. Instead, simplify things by having 1 if which checks whether we are the freespace inode or not and do any specific operation in either branches of that if. This makes the code a bit easier to understand, as an added bonus it also shrinks the compiled size: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17) Function old new delta btrfs_delalloc_reserve_metadata 1876 1859 -17 Total: Before=85966, After=85949, chg -0.02% No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Edmund Nadolski <enadolski@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Anand Jain	b1b8e38622	btrfs: insert newly opened device to the end of the list Add opened device to the tail of dev_alloc_list instead of head, so that it maintains the same order as dev_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:37 +02:00
Anand Jain	f8e10cd3f8	btrfs: keep device list sorted By maintaining the device list sorted lets us reproduce the problems related to missing chunk in the degraded mode much more consistent. So fix this by sorting the devices by devid within the kernel. So that we know which device is assigned to the struct fs_info::latest_bdev when all the devices are having and same SB generation. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Liu Bo	3d5addafd0	Btrfs: do not check inode's runtime flags under root->orphan_lock It's not necessary to hold ->orphan_lock when checking inode's runtime flags. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Nikolay Borisov	bc5511d0ed	btrfs: Use schedule_timeout_interruptible Instead of manually fiddling with the state of the task (RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible which adjusts the task state as needed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Nikolay Borisov	f9cacae314	btrfs: Move error handling of btrfs_start_dirty_block_groups closer to call site Even though btrfs_start_dirty_block_groups is fairly in the beginning of btrfs_commit_transaction outside of the critical section defined by the transaction states it can only be run by a single comitter. In other words it defines its own critical section thanks to the BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its error handling is outside of this critical section which is a bit counter-intuitive. So move the error handling righ after the function is executed and let the sole runner of dirty block groups handle the return value. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Anand Jain	7ef2d6a722	btrfs: not a disk error if the bio_add_page fails bio_add_page() can fail for logical reasons as from the bio_add_page() comments: /* * This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or * it's a cloned bio. */ Here we have just allocated the bio, so both of those failures can't occur. So drop the check. We can also drop the error stats for write error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:36 +02:00
Qu Wenruo	4117f207d4	btrfs: Add chunk allocation ENOSPC debug message for enospc_debug mount option Enospc_debug makes extent allocator print more debug messages, however for chunk allocation, there is no debug message for enospc_debug at all. This patch will add message for the following parts of chunk allocator: 1) No rw device at all Quite rare, but at least output one message for this case. 2) Not enough space for some device This debug message is quite handy for unbalanced disks with stripe based profiles (RAID0/10/5/6). 3) Not enough free devices This debug message should tell us if current chunk allocator is working correctly under minimal device requirements. Although in most cases, we will hit other ENOSPC before we even hit a chunk allocator ENOSPC, but such debug info won't help. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Anand Jain	566b1760b4	btrfs: use ASSERT to report logical error in cow_file_range() Use ASSERT to report logical error in cow_file_range(), also move it a bit closer to when the num_bytes is derived. The extent start could be (u64)-1 in some cases, the assert should catch that we do not accidentally pass it to cow_file_range. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Anand Jain	3752d22fce	btrfs: cow_file_range() num_bytes and disk_num_bytes are same This patch deletes local variable disk_num_bytes as its value is same as num_bytes in the function cow_file_range(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Anand Jain	2afb9653bf	btrfs: remove unused function btrfs_async_submit_limit() Commit [1] removed the need to use btrfs_async_submit_limit(), so delete it. [1] commit `736cd52e0c` Btrfs: remove nr_async_submits and async_submit_draining Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Yang Shi	86d750a48a	btrfs: remove unused hardirq.h Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by btrfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Nikolay Borisov	9a3daff320	btrfs: Add enospc_debug printing in metadata_reserve_bytes Currently when enospc_debug mount option is turned on we do not print any debug info in case metadata reservation failures happen. Fix this by adding the necessary hook in reserve_metadata_bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:35 +02:00
Nikolay Borisov	45ae2c1841	btrfs: Document consistency of transaction->io_bgs list The reason why io_bgs can be modified without holding any lock is non-obvious. Document it and reference that documentation from the respective call sites. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:34 +02:00
Nikolay Borisov	bf6d7d4900	btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgs list_first_entry is essentially a wrapper over cotnainer_of. The latter can never return null even if it's working on inconsistent list since it will either crash or return some offset in the wrong struct. Additionally, for the dirty_bgs list the iteration is done under dirty_bgs_lock which ensures consistency of the list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:34 +02:00
Anand Jain	8f2ceaa7b4	btrfs: log, when replace, is canceled by the user For debugging or administration purposes, we would want to know if and when the user cancels the replace, to complement the existing messages when dev-replace starts or finishes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, fold fix for RCU warning from Nikolay ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:34 +02:00
Anand Jain	acf18c56fd	btrfs: fix null pointer deref when target device is missing The replace target device can be missing when mounted with -o degraded, but we wont allocate a missing btrfs_device to it. So check the device before accessing. BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs] Call Trace: btrfs_dev_replace_cancel+0x15f/0x180 [btrfs] btrfs_ioctl+0x2216/0x2590 [btrfs] do_vfs_ioctl+0x625/0x650 SyS_ioctl+0x4e/0x80 do_syscall_64+0x5d/0x160 entry_SYSCALL64_slow_path+0x25/0x25 This patch has been moved in front of patch "btrfs: log, when replace, is canceled by the user" that could reproduce the crash if the system reboots inside btrfs_dev_replace_start before the btrfs_dev_replace_finishing call. $ mkfs /dev/sda $ mount /dev/sda mnt $ btrfs replace start /dev/sda /dev/sdb <insert reboot> $ mount po degraded /dev/sdb mnt <crash> Signed-off-by: Anand Jain <anand.jain@oracle.com> [ added reproducer description from mail ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:34 +02:00
Anand Jain	eceff22a80	btrfs: add a comment to mark the deprecated mount option The options alloc_start and subvolrootid are deprecated, comment them in the tokens list. And leave them as it is. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:34 +02:00
Anand Jain	d374060864	btrfs: manage commit mount option as %u As the commit mount option is unsigned so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Anand Jain	02453bdeb0	btrfs: manage check_int_print_mask mount option as %u As check_int_print_mask mount option is unsigned so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Anand Jain	764cb8b43d	btrfs: manage metadata_ratio mount option as %u As metadata_ratio mount option is unsinged so manage it as %u for token verifications, instead of %d. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Anand Jain	f7b885befd	btrfs: manage thread_pool mount option as %u The mount option thread_pool is always unsigned. Manage it that way all around. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Anand Jain	ba020491c8	btrfs: extent_buffer_uptodate() make it static and inline extent_buffer_uptodate() is a trivial wrapper around test_bit() and nothing else. So make it static and inline, save on code space and call indirection. Before: text data bss dec hex filename 1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko After: text data bss dec hex filename `1131090` 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Nikolay Borisov	70458a5819	btrfs: Remove fs_info argument of btrfs_write_and_wait_transaction We already pass btrfs_trans_handle which contains a reference to the fs_info so use that. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:33 +02:00
Nikolay Borisov	e9b919b1f7	btrfs: Remove fs_info argument from btrfs_update_commit_device_bytes_used We already pass the btrfs_transaction which references fs_info so no need to pass the later as an argument. Also use the opportunity to shorten transaction->trans. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	08d50ca32c	btrfs: Remove fs_info argument from create_pending_snapshots/create_pending_snapshot We already pass the trans handle which has a reference to fs_info to create_pending_snapshot so we can refer to it directly. Doing this obviates the need to pass the fs_info to create_pending_snapshots as well. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	16916a88d4	btrfs: Remove fs_info argument from switch_commit_roots We already have the fs_info from the passed transaction so use it directly. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	97cb39bb91	btrfs: Remove root argument of cleanup_transaction The only thing the passed root is used for is: 1. get a reference to the fs_info and to 2. call trace_btrfs_transaction_commit. We can achieve 1) by simply referring to the fs_info from passed trans object. As far as 2) is concerned cleanup_transaction is called from only one place and the 'root' argument passed is the one from the trans handle. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	9386d8bc58	btrfs: Don't pass fs_info to commit_cowonly_roots We already pass a transaction handle which refrences the fs_info so we can grab it from there. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	7e4443d9eb	btrfs: Don't pass fs_info to commit_fs_roots We already pass the transaction handle which has a reference to the fs_info. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:32 +02:00
Nikolay Borisov	e5c304e651	btrfs: Don't pass fs_info to btrfs_run_delayed_items/_nr We already pass the transaction which has a reference to the fs_info, so use that. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:31 +02:00
Nikolay Borisov	b84acab38f	btrfs: Don't pass fs_info to __btrfs_run_delayed_items We already pass the transaction handle, which contains a refrence to the fs_info so grab it from there. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:31 +02:00
Nikolay Borisov	2121705420	btrfs: Don't pass fs_info arg to btrfs_start_dirty_block_groups It can be referenced from the passed transaction so no point in passing it as a function argument. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:31 +02:00
Nikolay Borisov	6c686b359a	btrfs: Remove fs_info argument from btrfs_create_pending_block_groups It can be referenced from the passed transaciton so no point in passing it as function argument. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:31 +02:00
Nikolay Borisov	dc60c525cf	btrfs: Remove fs_info argument from btrfs_trans_release_metadata All current callers of this function just get a reference to the trans->fs_info member and pass it as the second argument. Collapse this into the function itself. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:31 +02:00
Nikolay Borisov	c9b577c01a	btrfs: Open code btrfs_write_and_wait_marked_extents btrfs_write_and_wait_transaction is essentially a wrapper of btrfs_write_and_wait_marked_extents with the addition of calling clear_btree_io_tree. Having the code split doesn't really bring any benefit. Open code the later into the former and add proper documentation header. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Nikolay Borisov	0e34693f7b	btrfs: Make btrfs_trans_release_metadata private to transaction.c This function is only ever used in __btrfs_end_transaction and btrfs_commit_transaction so there is no need to export it via header. Let's move it closer to where it's used, make it static and remove it from the header. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Anand Jain	15fc1283f6	btrfs: open code btrfs_init_dev_replace_tgtdev_for_resume() btrfs_init_dev_replace_tgtdev_for_resume() initializes replace target device in a few simple steps, so do it at the parent function. Moreover, there isn't any other caller so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Anand Jain	18e67c73dc	btrfs: btrfs_dev_replace_cancel() can return int Current u64 return from btrfs_dev_replace_cancel() was probably done to match the btrfs_ioctl_dev_replace_args::result. However as our actual return value fits in int, and it further gets typecast to u64, so just return int. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Anand Jain	17d202b973	btrfs: rename __btrfs_dev_replace_cancel() Remove __ which is for the special functions. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:30 +02:00
Anand Jain	97282031a6	btrfs: open code btrfs_dev_replace_cancel() btrfs_dev_replace_cancel() calls __btrfs_dev_replace_cancel() for the actual cancel so just open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Nikolay Borisov	af89e0dc2c	btrfs: Don't hardcode the csum size in btrfs_ordered_sum_size Currently the function uses a hardcoded value for the checksum size of a sector. This is fine, given that we currently support only a single algorithm, whose checksum is 4 bytes == sizeof(u32). Despite not having other algorithms, btrfs' design supports using a different algorithm whith different space requirements. To future-proof the code query the size of the currently used algorithm from the in-memory copy of the super block. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Colin Ian King	97dc231e89	Btrfs: extent map selftest: add missing void parameter to btrfs_test_extent_map Add a missing void parameter to function btrfs_test_extent_map, fixes sparse warning: warning: non-ANSI function declaration of function 'btrfs_test_extent_map' Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Colin Ian King	7a61f88088	btrfs: remove redundant check on ret and goto The check for a non-zero ret is redundant as the goto will jump to the very next statement anyway. Remove this extraneous code. Detected by CoverityScan, CID#1463784 ("Identical code for different branches") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Nikolay Borisov	7806c6eb15	btrfs: Remove unused btrfs_start_transaction_lflush function Commit `0e8c36a9fd` ("Btrfs: fix lots of orphan inodes when the space is not enough") changed the way transaction reservation is made in btrfs_evict_node and as a result this function became unused. This has been the status quo for 5 years in which time no one noticed, so I'd say it's safe to assume it's unlikely it will ever be used again. Historical note: there were more attempts to remove the function, the reasoning was missing and only based on some static analysis tool reports. Other reason for rejection was that there seemed to be connection to BTRFS_RESERVE_FLUSH_LIMIT and that would need to be removeed to. This was not correct so removing the function is all we can do. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ add the note ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Howard McLauchlan	b6a535faed	btrfs: print error if primary super block write fails Presently, failing a primary super block write but succeeding in at least one super block write in general will appear to users as if nothing important went wrong. However, upon unmounting and re-mounting, the file system will be in a rolled back state. This was discovered with a BCC program that uses bpf_override_return() to fail super block writes. This patch outputs an error clarifying that the primary super block write has failed, so users can expect potentially erroneous behaviour. It also forces wait_dev_supers() to return an error to its caller if the primary super block write fails. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:29 +02:00
Qu Wenruo	062d4d1f40	btrfs: Refactor parameter of BTRFS_MAX_DEVS() from root to fs_info Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:28 +02:00
Liu Bo	af2679e4a7	Btrfs: enhance leak debug checker for extent state and extent buffer This prints out eb->bflags since it contains some useful information, e.g. whether eb is dirty. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-26 15:09:28 +02:00
Joe Perches	447a5647c9	treewide: Align function definition open/close braces Some functions definitions have either the initial open brace and/or the closing brace outside of column 1. Move those braces to column 1. This allows various function analyzers like gnu complexity to work properly for these modified functions. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Nicolin Chen <nicoleotsuka@gmail.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2018-03-26 11:13:09 +02:00
zhenwei.pi	dcae058a8d	ext4: fix comments in ext4_swap_extents() "mark_unwritten" in comment and "unwritten" in the function arguments is mismatched. Signed-off-by: zhenwei.pi <zhenwei.pi@youruncloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-26 01:44:03 -04:00
Goldwyn Rodrigues	043d20d159	ext4: use generic_writepages instead of __writepage/write_cache_pages Code cleanup. Instead of writing an internal static function, use the available generic_writepages(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-26 01:32:50 -04:00
Dave Chinner	fa4493f0d9	xfs: remove dead inode version setting code We can only get into the branch if CRCs are enabled, so there's no need to check inside the branch for CRCs being enabled.... Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-23 18:05:09 -07:00
Dave Chinner	ee457001ed	xfs: catch inode allocation state mismatch corruption We recently came across a V4 filesystem causing memory corruption due to a newly allocated inode being setup twice and being added to the superblock inode list twice. From code inspection, the only way this could happen is if a newly allocated inode was not marked as free on disk (i.e. di_mode wasn't zero). Running the metadump on an upstream debug kernel fails during inode allocation like so: XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod= e.c, line: 838 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:114! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0= 1/2014 RIP: 0010:assfail+0x28/0x30 RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202 RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30 R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10 FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000= 000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0 Call Trace: xfs_ialloc+0x383/0x570 xfs_dir_ialloc+0x6a/0x2a0 xfs_create+0x412/0x670 xfs_generic_create+0x1f7/0x2c0 ? capable_wrt_inode_uidgid+0x3f/0x50 vfs_mkdir+0xfb/0x1b0 SyS_mkdir+0xcf/0xf0 do_syscall_64+0x73/0x1a0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Extracting the inode number we crashed on from an event trace and looking at it with xfs_db: xfs_db> inode 184452204 xfs_db> p core.magic = 0x494e core.mode = 0100644 core.version = 2 core.format = 2 (extents) core.nlinkv2 = 1 core.onlink = 0 ..... Confirms that it is not a free inode on disk. xfs_repair also trips over this inode: ..... zero length extent (off = 0, fsbno = 0) in ino 184452204 correcting nextents for inode 184452204 bad attribute fork in inode 184452204, would clear attr fork bad nblocks 1 for inode 184452204, would reset to 0 bad anextents 1 for inode 184452204, would reset to 0 imap claims in-use inode 184452204 is free, would correct imap would have cleared inode 184452204 ..... disconnected inode 184452204, would move to lost+found And so we have a situation where the directory structure and the inobt thinks the inode is free, but the inode on disk thinks it is still in use. Where this corruption came from is not possible to diagnose, but we can detect it and prevent the kernel from oopsing on lookup. The reproducer now results in: $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5} mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu= re needs cleaning mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o= utput error mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o= utput error .... And this corruption shutdown: [ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not= marked free on disk [ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 = of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670 [ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #= 443 [ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO= S 1.10.2-1 04/01/2014 [ 54.852859] Call Trace: [ 54.853531] dump_stack+0x85/0xc5 [ 54.854385] xfs_trans_cancel+0x197/0x1c0 [ 54.855421] xfs_create+0x425/0x670 [ 54.856314] xfs_generic_create+0x1f7/0x2c0 [ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50 [ 54.858586] vfs_mkdir+0xfb/0x1b0 [ 54.859458] SyS_mkdir+0xcf/0xf0 [ 54.860254] do_syscall_64+0x73/0x1a0 [ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 54.862492] RIP: 0033:0x7fb73bddf547 [ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000= 000000000053 [ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73= bddf547 [ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda= a55449a [ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a= 8670dd0 [ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000= 00001ff [ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda= a553500 [ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1= 024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050 [ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutt= ing down filesystem [ 54.884597] XFS (loop0): Please umount the filesystem and rectify the = problem(s) Note that this crash is only possible on v4 filesystemsi or v5 filesystems mounted with the ikeep mount option. For all other V5 filesystems, this problem cannot occur because we don't read inodes we are allocating from disk - we simply overwrite them with the new inode information. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Tested-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-23 18:05:09 -07:00
Darrick J. Wong	b83e4c3ced	xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we encounter discrepancies here so that we know that it's a cross referencing error, not necessarily a corruption in the inobt itself. The userspace xfs_scrub program will try to repair outright corruptions in the agi/inobt prior to phase 3 so that the inode scan will proceed. If only a cross-referencing error is noted, the repair program defers the repair attempt until it can check the other space metadata at least once. It is therefore essential that the inobt scrubber can correctly distinguish between corruptions and "unable to cross-reference something else with this inobt". The same reasoning applies to "xfs: record inode buf errors as a xref error in inobt scrubber". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:09 -07:00
Darrick J. Wong	5927268f5a	xfs: flag inode corruption if parent ptr doesn't get us a real inode If a directory's parent inode pointer doesn't point to an inode, the directory should be flagged as corrupt. Enable IGET_UNTRUSTED here so that _iget will return -EINVAL if the inobt does not confirm that the inode is present and allocated and we can flag the directory corruption. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	6a96c56505	xfs: don't accept inode buffers with suspicious unlinked chains When we're verifying inode buffers, sanity-check the unlinked pointer. We don't want to run the risk of trying to purge something that's obviously broken. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	8bb82bc12a	xfs: move inode extent size hint validation to libxfs Extent size hint validation is used by scrub to decide if there's an error, and it will be used by repair to decide to remove the hint. Since these use the same validation functions, move them to libxfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	1b44a6aecc	xfs: record inode buf errors as a xref error in inobt scrubber During the inode btree scrubs we try to confirm the freemask bits against the inode records. If the inode buffer read fails, this is a cross-referencing error, not a corruption of the inode btree itself. Use the xref_process_error call here. Found via core.version middlebit fuzz in xfs/415. The userspace xfs_scrub program will try to repair outright corruptions in the agi/inobt prior to phase 3 so that the inode scan will proceed. If only a cross-referencing error is noted, the repair program defers the repair attempt until it can check the other space metadata at least once. It is therefore essential that the inobt scrubber can correctly distinguish between corruptions and "unable to cross-reference something else with this inobt". Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	7e56d9eaea	xfs: remove xfs_buf parameter from inode scrub methods Now that we no longer do raw inode buffer scrubbing, the bp parameter is no longer used anywhere we're dealing with an inode, so remove it and all the useless NULL parameters that go with it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	d0018ad889	xfs: inode scrubber shouldn't bother with raw checks The inode scrubber tries to _iget the inode prior to running checks. If that _iget call fails with corruption errors that's an automatic fail, regardless of whether it was the inode buffer read verifier, the ifork verifier, or the ifork formatter that errored out. Therefore, get rid of the raw mode scrub code because it's not needed. Found by trying to fix some test failures in xfs/379 and xfs/415. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:08 -07:00
Darrick J. Wong	5e777b62b0	xfs: bmap scrubber should do rmap xref with bmap for sparse files When we're scanning an extent mapping inode fork, ensure that every rmap record for this ifork has a corresponding bmbt record too. This (mostly) provides the ability to cross-reference rmap records with bmap data. The rmap scrubber cannot do the xref on its own because that requires taking an ilock with the agf lock held, which violates our locking order rules (inode, then agf). Note that we only do this for forks that are in btree format due to the increased complexity; or forks that should have data but suspiciously have zero extents because the inode could have just had its iforks zapped by the inode repair code and now we need to reclaim the old extents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:07 -07:00
Darrick J. Wong	6edb181053	xfs: refactor inode buffer verifier error logging When the inode buffer verifier encounters an error, it's much more helpful to print a buffer from the offending inode instead of just the start of the inode chunk buffer. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:07 -07:00
Darrick J. Wong	90a58f9571	xfs: refactor inode verifier error logging Refactor some of the inode verifier failure logging call sites to use the new xfs_inode_verifier_error method which dumps the offending buffer as well as the code location of the failed check. This trims the output, makes it clearer to the admin that repair must be run, and gives the developers more details to work from. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:07 -07:00
Darrick J. Wong	30b0984d91	xfs: refactor bmap record validation Refactor the bmap validator into a more complete helper that looks for extents that run off the end of the device, overflow into the next AG, or have invalid flag states. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:07 -07:00
Darrick J. Wong	6915ef35c0	xfs: sanity-check the unused space before trying to use it In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if it doesn't make sense. Since a carefully crafted fuzzed image can cause the kernel to crash after blowing a bunch of assertions, let's move those checks into a validator function and rig everything up to return EFSCORRUPTED to userspace. Found by lastbit fuzzing ltail.bestcount via xfs/391. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-23 18:05:07 -07:00
Brian Foster	a27ba2607e	xfs: detect agfl count corruption and reset agfl The struct xfs_agfl v5 header was originally introduced with unexpected padding that caused the AGFL to operate with one less slot than intended. The header has since been packed, but the fix left an incompatibility for users who upgrade from an old kernel with the unpacked header to a newer kernel with the packed header while the AGFL happens to wrap around the end. The newer kernel recognizes one extra slot at the physical end of the AGFL that the previous kernel did not. The new kernel will eventually attempt to allocate a block from that slot, which contains invalid data, and cause a crash. This condition can be detected by comparing the active range of the AGFL to the count. While this detects a padding mismatch, it can also trigger false positives for unrelated flcount corruption. Since we cannot distinguish a size mismatch due to padding from unrelated corruption, we can't trust the AGFL enough to simply repopulate the empty slot. Instead, avoid unnecessarily complex detection logic and and use a solution that can handle any form of flcount corruption that slips through read verifiers: distrust the entire AGFL and reset it to an empty state. Any valid blocks within the AGFL are intentionally leaked. This requires xfs_repair to rectify (which was already necessary based on the state the AGFL was found in). The reset mitigates the side effect of the padding mismatch problem from a filesystem crash to a free space accounting inconsistency. The generic approach also means that this patch can be safely backported to kernels with or without a packed struct xfs_agfl. Check the AGF for an invalid freelist count on initial read from disk. If detected, set a flag on the xfs_perag to indicate that a reset is required before the AGFL can be used. In the first transaction that attempts to use a flagged AGFL, reset it to empty, warn the user about the inconsistency and allow the freelist fixup code to repopulate the AGFL with new blocks. The xfs_perag flag is cleared to eliminate the need for repeated checks on each block allocation operation. This allows kernels that include the packing fix commit `96f859d52b` ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct") to handle older unpacked AGFL formats without a filesystem crash. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-23 18:05:06 -07:00
Christoph Hellwig	3e4da466bf	xfs: unwind the try_again loop in xfs_log_force Instead split out a __xfs_log_fore_lsn helper that gets called again with the already_slept flag set to true in case we had to sleep. This prepares for aio_fsync support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-23 18:05:06 -07:00
Christoph Hellwig	93806299b5	xfs: refactor xfs_log_force_lsn Use the the smallest possible loop as preable to find the correct iclog buffer, and then use gotos for unwinding to straighten the code. Also fix the top of function comment while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-23 18:05:06 -07:00
Andreas Gruenbacher	bb491ce67a	gfs2: Check for the end of metadata in punch_hole When punching a hole or truncating an inode down to a given size, also check if the truncate point / start of the hole is within the range we have metadata for. Otherwise, we can end up freeing blocks that shouldn't be freed, corrupting the inode, or crashing the machine when trying to punch a hole into the void. When growing an inode via truncate, we set the new size but we don't allocate additional levels of indirect blocks and grow the inode height. When shrinking that inode again, the new size may still point beyond the end of the inode's metadata. Fixes xfstest generic/476. Debugged-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-23 11:43:02 -07:00
David S. Miller	03fe2debbb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Fun set of conflict resolutions here... For the mac80211 stuff, these were fortunately just parallel adds. Trivially resolved. In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the function phy_disable_interrupts() earlier in the file, whilst in 'net-next' the phy_error() call from this function was removed. In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the 'rt_table_id' member of rtable collided with a bug fix in 'net' that added a new struct member "rt_mtu_locked" which needs to be copied over here. The mlxsw driver conflict consisted of net-next separating the span code and definitions into separate files, whilst a 'net' bug fix made some changes to that moved code. The mlx5 infiniband conflict resolution was quite non-trivial, the RDMA tree's merge commit was used as a guide here, and here are their notes: ==================== Due to bug fixes found by the syzkaller bot and taken into the for-rc branch after development for the 4.17 merge window had already started being taken into the for-next branch, there were fairly non-trivial merge issues that would need to be resolved between the for-rc branch and the for-next branch. This merge resolves those conflicts and provides a unified base upon which ongoing development for 4.17 can be based. Conflicts: drivers/infiniband/hw/mlx5/main.c - Commit `42cea83f95` (IB/mlx5: Fix cleanup order on unload) added to for-rc and commit `b5ca15ad7e` (IB/mlx5: Add proper representors support) add as part of the devel cycle both needed to modify the init/de-init functions used by mlx5. To support the new representors, the new functions added by the cleanup patch needed to be made non-static, and the init/de-init list added by the representors patch needed to be modified to match the init/de-init list changes made by the cleanup patch. Updates: drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function prototypes added by representors patch to reflect new function names as changed by cleanup patch drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init stage list to match new order from cleanup patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-23 11:31:58 -04:00
Mimi Zohar	0834136aea	fuse: define the filesystem as untrusted Files on FUSE can change at any point in time without IMA being able to detect it. The file data read for the file signature verification could be totally different from what is subsequently read, making the signature verification useless. FUSE can be mounted by unprivileged users either today with fusermount installed with setuid, or soon with the upcoming patches to allow FUSE mounts in a non-init user namespace. This patch sets the SB_I_IMA_UNVERIFIABLE_SIGNATURE flag and when appropriate sets the SB_I_UNTRUSTED_MOUNTER flag. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Dongsu Park <dongsu@kinvolk.io> Cc: Alban Crequy <alban@kinvolk.io> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-03-23 06:31:37 -04:00
Linus Torvalds	f36b7534b8	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, thp: do not cause memcg oom for thp mm/vmscan: wake up flushers for legacy cgroups too Revert "mm: page_alloc: skip over regions of invalid pfns where possible" mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() mm/thp: do not wait for lock_page() in deferred_split_scan() mm/khugepaged.c: convert VM_BUG_ON() to collapse fail x86/mm: implement free pmd/pte page interfaces mm/vmalloc: add interfaces to free unmapped page table h8300: remove extraneous __BIG_ENDIAN definition hugetlbfs: check for pgoff value overflow lockdep: fix fs_reclaim warning MAINTAINERS: update Mark Fasheh's e-mail mm/mempolicy.c: avoid use uninitialized preferred_node	2018-03-22 18:48:43 -07:00
Mike Kravetz	63489f8e82	hugetlbfs: check for pgoff value overflow A vma with vm_pgoff large enough to overflow a loff_t type when converted to a byte offset can be passed via the remap_file_pages system call. The hugetlbfs mmap routine uses the byte offset to calculate reservations and file size. A sequence such as: mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0); remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0); will result in the following when task exits/file closed, kernel BUG at mm/hugetlb.c:749! Call Trace: hugetlbfs_evict_inode+0x2f/0x40 evict+0xcb/0x190 __dentry_kill+0xcb/0x150 __fput+0x164/0x1e0 task_work_run+0x84/0xa0 exit_to_usermode_loop+0x7d/0x80 do_syscall_64+0x18b/0x190 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 The overflowed pgoff value causes hugetlbfs to try to set up a mapping with a negative range (end < start) that leaves invalid state which causes the BUG. The previous overflow fix to this code was incomplete and did not take the remap_file_pages system call into account. [mike.kravetz@oracle.com: v3] Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com [akpm@linux-foundation.org: include mmdebug.h] [akpm@linux-foundation.org: fix -ve left shift count on sh] Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com Fixes: `045c7a3f53` ("hugetlbfs: fix offset overflow in hugetlbfs mmap") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nic Losby <blurbdust@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-03-22 17:07:01 -07:00
James Morris	5893ed18a2	Linux 4.16-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqvCPYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOaAH/171cgZGFEXSONxK 3O1AAv61wN5K/ISMt6mnelWR6fZg195FarOx0Rnq7Ot8OWuVa8CGcyT4vX4Z7nb9 SVMQKNMPCVQE4WCDOv6S0njChmRC0BxBoVJtTN9fhywdYgX1KcaTS/drMRHACF5n rB9eouMQScfMzKGAW08gp5NvEGJ6W1SLX7La3/u0751dYisdJSP7+vFZNxUrGXEA yIPOQjFu0Tfo8GXz/BwC678RZVzVLN0sE6+/vM7zNnoDlsRVkdDIVMo3UiVqm/NK B37/TlZz8CYoapoKnRRB5giXnSPDSXtsikbGy3mcy0u5imGe+ZgdjrdYSaLk31cR NVZY08k= =pu3X -----END PGP SIGNATURE----- Merge tag 'v4.16-rc6' into next-general Merge to Linux 4.16-rc6 at the request of Jarkko, for his TPM updates.	2018-03-23 08:26:16 +11:00
Linus Torvalds	c4f4d2f917	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Always validate XFRM esn replay attribute, from Florian Westphal. 2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long. 3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul Triebitz. 4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel Axtens. 5) Fix some interrupt handling issues in e1000e driver, from Benjamin Poitier. 6) Use strlcpy() in several ethtool get_strings methods, from Florian Fainelli. 7) Fix rhlist dup insertion, from Paul Blakey. 8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev. 9) Fix driver unload crash when link is up in smsc911x, from Jeremy Linton. 10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric Dumazet. 11) Need to purge the write queue when TCP connections are aborted, otherwise userspace using MSG_ZEROCOPY can't close the fd. From Soheil Hassas Yeganeh. 12) Fix double free in error path of team driver, from Arkadi Sharshevsky. 13) Filter fixes for hv_netvsc driver, from Stephen Hemminger. 14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo Bianconi. 15) Properly filter out unsupported feature flags in macvlan driver, from Shannon Nelson. 16) Don't request loading the diag module for a protocol if the protocol itself is not even registered. From Xin Long. 17) If datagram connect fails in ipv6, make sure the socket state is consistent afterwards. From Paolo Abeni. 18) Use after free in qed driver, from Dan Carpenter. 19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the entry. From Sabrina Dubroca. 20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins. 21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita. 22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel. 23) Fix NULL derefs in error paths of tcf__init(), from Davide Caratti. 24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli. 25) Memory leak in gemini driver, from Igor Pylypiv. 26) IDR leaks in error paths of tcf__init() functions, from Davide Caratti. 27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun. 28) Missing dev_put() in error path of macsec_newlink(), from Dan Carpenter. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits) macsec: missing dev_put() on error in macsec_newlink() net: dsa: Fix functional dsa-loop dependency on FIXED_PHY hv_netvsc: common detach logic hv_netvsc: change GPAD teardown order on older versions hv_netvsc: use RCU to fix concurrent rx and queue changes hv_netvsc: disable NAPI before channel close net/ipv6: Handle onlink flag with multipath routes ppp: avoid loop in xmit recursion detection code ipv6: sr: fix NULL pointer dereference when setting encap source address ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state net: aquantia: driver version bump net: aquantia: Implement pci shutdown callback net: aquantia: Allow live mac address changes net: aquantia: Add tx clean budget and valid budget handling logic net: aquantia: Change inefficient wait loop on fw data reads net: aquantia: Fix a regression with reset on old firmware net: aquantia: Fix hardware reset when SPI may rarely hangup s390/qeth: on channel error, reject further cmd requests s390/qeth: lock read device while queueing next buffer s390/qeth: when thread completes, wake up all waiters ...	2018-03-22 14:10:29 -07:00
Eric Sandeen	0d9366d67b	ext4: don't complain about incorrect features when probing If mount is auto-probing for filesystem type, it will try various filesystems in order, with the MS_SILENT flag set. We get that flag as the silent arg to ext4_fill_super. If we're probing (silent==1) then don't complain about feature incompatibilities that are found if it looks like it's actually a different valid extN type - failed probes should be silent in this case. If the on-disk features are unknown even to ext4, then complain. Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com> Tested-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2018-03-22 11:59:00 -04:00
Nikolay Borisov	1d39834fba	ext4: remove EXT4_STATE_DIOREAD_LOCK flag Commit `16c5468859` ("ext4: Allow parallel DIO reads") reworked the way locking happens around parallel dio reads. This resulted in obviating the need for EXT4_STATE_DIOREAD_LOCK flag and accompanying logic. Currently this amounts to dead code so let's remove it. No functional changes Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2018-03-22 11:52:10 -04:00
Jiri Slaby	fe23cb65c2	ext4: fix offset overflow on 32-bit archs in ext4_iomap_begin() ext4_iomap_begin() has a bug where offset returned in the iomap structure will be truncated to unsigned long size. On 64-bit architectures this is fine but on 32-bit architectures obviously not. Not many places actually use the offset stored in the iomap structure but one of visible failures is in SEEK_HOLE / SEEK_DATA implementation. If we create a file like: dd if=/dev/urandom of=file bs=1k seek=8m count=1 then lseek64("file", 0x100000000ULL, SEEK_DATA) wrongly returns 0x100000000 on unfixed kernel while it should return 0x200000000. Avoid the overflow by proper type cast. Fixes: `545052e9e3` ("ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA") Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # v4.15	2018-03-22 11:50:26 -04:00
Eryu Guan	45d8ec4d9f	ext4: update i_disksize if direct write past ondisk size Currently in ext4 direct write path, we update i_disksize only when new eof is greater than i_size, and don't update it even when new eof is greater than i_disksize but less than i_size. This doesn't work well with delalloc buffer write, which updates i_size and i_disksize only when delalloc blocks are resolved (at writeback time), the i_disksize from direct write can be lost if a previous buffer write succeeded at write time but failed at writeback time, then results in corrupted ondisk inode size. Consider this case, first buffer write 4k data to a new file at offset 16k with delayed allocation, then direct write 4k data to the same file at offset 4k before delalloc blocks are resolved, which doesn't update i_disksize because it writes within i_size(20k), but the extent tree metadata has been committed in journal. Then writeback of the delalloc blocks fails (due to device error etc.), and i_size/i_disksize from buffer write can't be written to disk (still zero). A subsequent umount/mount cycle recovers journal and writes extent tree metadata from direct write to disk, but with i_disksize being zero. Fix it by updating i_disksize too in direct write path when new eof is greater than i_disksize but less than i_size, so i_disksize is always consistent with direct write. This fixes occasional i_size corruption in fstests generic/475. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2018-03-22 11:44:59 -04:00
Eryu Guan	73fdad00b2	ext4: protect i_disksize update by i_data_sem in direct write path i_disksize update should be protected by i_data_sem, by either taking the lock explicitly or by using ext4_update_i_disksize() helper. But the i_disksize updates in ext4_direct_IO_write() are not protected at all, which may be racing with i_disksize updates in writeback path in delalloc buffer write path. This is found by code inspection, and I didn't hit any i_disksize corruption due to this bug. Thanks to Jan Kara for catching this bug and suggesting the fix! Reported-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2018-03-22 11:41:25 -04:00
Richard Guy Briggs	ea841bafda	audit: add refused symlink to audit_names Audit link denied events for symlinks had duplicate PATH records rather than just updating the existing PATH record. Update the symlink's PATH record with the current dentry and inode information. See: https://github.com/linux-audit/audit-kernel/issues/21 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-03-21 11:31:03 -04:00
Richard Guy Briggs	94b9d9b7a1	audit: remove path param from link denied function In commit `45b578fe4c` ("audit: link denied should not directly generate PATH record") the need for the struct path *link parameter was removed. Remove the now useless struct path argument. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-03-21 11:17:41 -04:00
Linus Torvalds	645102eac1	Just one fix for an occasional panic from Jeff Layton. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJasXu4AAoJECebzXlCjuG+fNcP/3QYZwLdIYe0pZSIn/9V6sW1 PwmxhTkBOEkeP4RTArI/4spZk3mrTqv7SC+ACYKbkD910uDC8cY3ElcwKxi82NL7 9M4gNGnscKJtUye2LhoamZZLOLDIzr302pEv0IIh3v11sAu0rK1K0bqHSfI1ga/q r3S1sRtad+UktVH+jJ8MNE4NBfXC/xTqPz6Z+wd30CPmJKaA7Oqle7Wd3479zj0c A7l+EtleFo/bctHwY7yp+lVs1Yqr3ShS+qu+YAknBTnGco9JJFJsZM6x33DFMZC4 EvFPO0gh4PuN/SkgT4zBvBlSXQjuqUWTU98ohqP7ugTtd0dXEIpR4DLLwWScA2KV z+YIJIKUh8Ryh8KI+J50Ed9WseVlmACHOBTJjLW7KnSYtey3+mgh5kp6Vcd9IxHc Gju7YgDK5FOV2JykaHQ/qCaSyL6ao1firapSrtZK4N27LpJ05FreK6qnLJtHmiwU pqVHyfrH4kfho1T0Nr/bw4bnugBKZVXBbBQLKp2twgJhYkUkP+vxV8QMX3h531hB jsQV9OIVPkdTjr5L0BmQMJ4GugM1Zr4mIunY6Mw0z5ax1JeBUelWeQccqL1YCSGL TZr+0N9yOuzQHX+e+mZ8vFYqfUTRF/Xfkn1+ZV3ZQU/yY8oBkDbQddaLoo1hjLUx 2dpR3xIhzyDkP59jrK4M =oGAf -----END PGP SIGNATURE----- Merge tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux Pull nfsd fix from Bruce Fields: "Just one fix for an occasional panic from Jeff Layton" * tag 'nfsd-4.16-1' of git://linux-nfs.org/~bfields/linux: nfsd: remove blocked locks on client teardown	2018-03-20 16:10:26 -07:00
J. Bruce Fields	353601e7d3	nfsd: create a separate lease for each delegation Currently we only take one vfs-level delegation (lease) for each file, no matter how many clients hold delegations on that file. Let's instead keep a one-to-one mapping between NFSv4 delegations and VFS delegations. This turns out to be simpler. There is still a many-to-one mapping of NFS opens to NFS files, and the delegations on one file are all associated with one struct file. The VFS can still distinguish between these delegations since we're setting fl_owner to the struct nfs4_delegation now, not to the shared file. I'm replacing at least one complicated function wholesale, which I don't like to do, but I haven't figured out how to do this more incrementally. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:14 -04:00
J. Bruce Fields	86d29b10eb	nfsd: move sc_file assignment into alloc_init_deleg Take an easy chance to simplify the caller a little. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:13 -04:00
J. Bruce Fields	0af6e690f0	nfsd: factor out common delegation-destruction code Pull some duplicated code into a common helper. This changes the order in destroy_delegation a little, but it looks to me like that shouldn't matter. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:13 -04:00
J. Bruce Fields	68b18f5294	nfsd: make nfs4_get_existing_delegation less confusing This doesn't "get" anything. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:12 -04:00
J. Bruce Fields	0c911f5408	nfsd4: dp->dl_stid.sc_file doesn't need locking The delegation isn't visible to anyone yet. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:12 -04:00
J. Bruce Fields	653e514e9e	nfsd4: set fl_owner to delegation, not file pointer For now this makes no difference, as for files having delegations, there's a one-to-one relationship between an nfs4_file and its nfs4_delegation. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:11 -04:00
J. Bruce Fields	cba7b3d150	nfsd: simplify nfs4_put_deleg_lease calls Every single caller gets the file out of the delegation, so let's do that once in nfs4_put_deleg_lease. Plus we'll need it there for other reasons. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:11 -04:00
J. Bruce Fields	b8232d3315	nfsd: simplify put of fi_deleg_file fi_delegees is basically just a reference count on users of fi_deleg_file, which is cleared when fi_delegees goes to zero. The fi_deleg_file check here is redundant. Also add an assertion to make sure we don't have unbalanced puts. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2018-03-20 17:51:10 -04:00
Greg Kroah-Hartman	4958134df5	Merge 4.16-rc6 into tty-next We want the serial/tty fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-20 11:27:18 +01:00
Peter Zijlstra	e24e960c7f	sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Becker <jlbec@evilplan.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-20 08:23:22 +01:00
Peter Zijlstra	723c921e7d	sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-20 08:23:21 +01:00
Peter Zijlstra	dc5d4afbb0	sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-20 08:23:21 +01:00
Peter Zijlstra	4625956a4e	sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-20 08:23:20 +01:00
Peter Zijlstra	ab1fbe3247	sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-03-20 08:23:19 +01:00
Grygorii Strashko	2399ac42e7	sysfs: symlink: export sysfs_create_link_nowarn() The sysfs_create_link_nowarn() is going to be used in phylib framework in subsequent patch which can be built as module. Hence, export sysfs_create_link_nowarn() to avoid build errors. Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Fixes: `a399546049` ("net: phy: Relax error checking on sysfs_create_link()") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-03-19 21:14:26 -04:00
Jeff Layton	9258a2d5cd	nfsd: move nfs4_client allocation to dedicated slabcache On x86_64, it's 1152 bytes, so we can avoid wasting 896 bytes each. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:13 -04:00
J. Bruce Fields	9d7ed1355d	nfsd: don't require low ports for gss requests In a traditional NFS deployment using auth_unix, the clients are trusted to correctly report the credentials of their logged-in users. The server assumes that only root on client machines is allowed to send requests from low-numbered ports, so it can use the originating port number to distinguish "real" NFS clients from NFS clients run by ordinary users, to prevent ordinary users from spoofing credentials. The originating port number on a gss-authenticated request is less important. The authentication ties the request to a user, and we take it as proof that that user authorized the request. The low port number check no longer adds much. So, don't enforce low port numbers in the auth_gss case. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:13 -04:00
J. Bruce Fields	edcc8452a0	nfsd: remove unsused "cp_consecutive" field Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:13 -04:00
Colin Ian King	554faf2819	lockd: make nlm_ntf_refcnt and nlm_ntf_wq static The variables nlm_ntf_refcnt and nlm_ntf_wq are local to the source and do not need to be in global scope, so make them static. Cleans up sparse warnings: fs/lockd/svc.c:60:10: warning: symbol 'nlm_ntf_refcnt' was not declared. Should it be static? fs/lockd/svc.c:61:1: warning: symbol 'nlm_ntf_wq' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:13 -04:00
Jeff Layton	bd2decac5e	nfsd4: send the special close_stateid in v4.0 replies as well We already send it for v4.1, but RFC7530 also notes that the stateid in the close reply is bogus. Always send the special close stateid, even in v4.0 responses. No client should put any meaning on it whatsoever. For now, we continue to increment the stateid value, though that might not be necessary either. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:38:12 -04:00
Jeff Layton	68ef3bc316	nfsd: remove blocked locks on client teardown We had some reports of panics in nfsd4_lm_notify, and that showed a nfs4_lockowner that had outlived its so_client. Ensure that we walk any leftover lockowners after tearing down all of the stateids, and remove any blocked locks that they hold. With this change, we also don't need to walk the nbl_lru on nfsd_net shutdown, as that will happen naturally when we tear down the clients. Fixes: `76d348fadf` (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks) Reported-by: Frank Sorenson <fsorenso@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # 4.9 Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2018-03-19 16:37:21 -04:00
Tejun Heo	f729863a8c	fs/aio: Use rcu_work instead of explicit rcu and work item Workqueue now has rcu_work. Use it instead of open-coding rcu -> work item bouncing. Signed-off-by: Tejun Heo <tj@kernel.org>	2018-03-19 10:12:03 -07:00
Yunlei He	0833721ec3	f2fs: check blkaddr more accuratly before issue a bio This patch check blkaddr more accuratly before issue a write or read bio. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-18 23:33:50 -07:00
Ritesh Harjani	02117b8ae9	f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read Quota code itself is serializing the operations by taking mutex_lock. It seems a below deadlock can happen if GF_NOFS is not used in f2fs_quota_read __switch_to+0x88 __schedule+0x5b0 schedule+0x78 schedule_preempt_disabled+0x20 __mutex_lock_slowpath+0xdc //mutex owner is itself mutex_lock+0x2c dquot_commit+0x30 //mutex_lock(&dqopt->dqio_mutex); dqput+0xe0 __dquot_drop+0x80 dquot_drop+0x48 f2fs_evict_inode+0x218 evict+0xa8 dispose_list+0x3c prune_icache_sb+0x58 super_cache_scan+0xf4 do_shrink_slab+0x208 shrink_slab.part.40+0xac shrink_zone+0x1b0 do_try_to_free_pages+0x25c try_to_free_pages+0x164 __alloc_pages_nodemask+0x534 do_read_cache_page+0x6c read_cache_page+0x14 f2fs_quota_read+0xa4 read_blk+0x54 find_tree_dqentry+0xe4 find_tree_dqentry+0xb8 find_tree_dqentry+0xb8 find_tree_dqentry+0xb8 qtree_read_dquot+0x68 v2_read_dquot+0x24 dquot_acquire+0x5c // mutex_lock(&dqopt->dqio_mutex); dqget+0x238 __dquot_initialize+0xd4 dquot_initialize+0x10 dquot_file_open+0x34 f2fs_file_open+0x6c do_dentry_open+0x1e4 vfs_open+0x6c path_openat+0xa20 do_filp_open+0x4c do_sys_open+0x178 Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-18 23:31:28 -07:00
Mateusz Guzik	08fdc8a013	buffer.c: call thaw_super during emergency thaw There are 2 distinct freezing mechanisms - one operates on block devices and another one directly on super blocks. Both end up with the same result, but thaw of only one of these does not thaw the other. In particular fsfreeze --freeze uses the ioctl variant going to the super block. Since prior to this patch emergency thaw was not doing a relevant thaw, filesystems frozen with this method remained unaffected. The patch is a hack which adds blind unfreezing. In order to keep the super block write-locked the whole time the code is shuffled around and the newly introduced __iterate_supers is employed. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-19 02:21:40 -04:00
Ingo Molnar	793057e1c7	vfs: Replace stray non-ASCII homoglyph characters with their ASCII equivalents Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-19 01:07:42 -04:00
Rasmus Villemoes	1c94984396	vfs: make sure struct filename->iname is word-aligned I noticed that offsetof(struct filename, iname) is actually 28 on 64 bit platforms, so we always pass an unaligned pointer to strncpy_from_user. This is mostly a problem for those 64 bit platforms without HAVE_EFFICIENT_UNALIGNED_ACCESS, but even on x86_64, unaligned accesses carry a penalty. A user-space microbenchmark doing nothing but strncpy_from_user from the same (aligned) source string runs about 5% faster when the destination is aligned. That number increases to 20% when the string is long enough (~32 bytes) that we cross a cache line boundary - that's for example the case for about half the files a "git status" in a kernel tree ends up stat'ing. This won't make any real-life workloads 5%, or even 1%, faster, but path lookup is common enough that cutting even a few cycles should be worthwhile. So ensure we always pass an aligned destination pointer to strncpy_from_user. Instead of explicit padding, simply swap the refcnt and aname members, as suggested by Al Viro. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-19 01:07:04 -04:00
Kees Cook	7bd698b3c0	exec: Set file unwritable before LSM check The LSM check should happen after the file has been confirmed to be unchanging. Without this, we could have a race between the Time of Check (the call to security_kernel_read_file() which could read the file and make access policy decisions) and the Time of Use (starting with kernel_read_file()'s reading of the file contents). In theory, file contents could change between the two. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: James Morris <james.morris@microsoft.com>	2018-03-19 15:49:32 +11:00
Sheng Yong	ff62af200b	f2fs: introduce a new mount option test_dummy_encryption This patch introduces a new mount option `test_dummy_encryption' to allow fscrypt to create a fake fscrypt context. This is used by xfstests. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 14:02:57 +09:00
Sheng Yong	b7c409deda	f2fs: introduce F2FS_FEATURE_LOST_FOUND feature This patch introduces a new feature, F2FS_FEATURE_LOST_FOUND, which is set by mkfs. mkfs creates a directory named lost+found, which saves unreachable files. If fsck finds a file which has no parent, or its parent is removed by fsck, the file will be placed under lost+found directory by fsck. lost+found directory could not be encrypted. As a result, the root directory cannot be encrypted too. So if LOST_FOUND feature is enabled, let's avoid to encrypt root directory. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 14:02:47 +09:00
Qiuyang Sun	da5ce874f8	f2fs: release locks before return in f2fs_ioc_gc_range() Currently, we will leave the kernel with locks still held when the gc_range is invalid. This patch fixes the bug. Signed-off-by: Qiuyang Sun <sunqiuyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:58:16 +09:00
Jaegeuk Kim	bb1105e479	f2fs: align memory boundary for bitops For example, in arm64, free_nid_bitmap should be aligned to word size in order to use bit operations. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:39 +09:00
Chao Yu	c56675750d	f2fs: remove unneeded set_cold_node() When setting COLD_BIT_SHIFT flag in node block, we only need to call set_cold_node() in new_node_page() and recover_inode_page() during node page initialization. So remove unneeded set_cold_node() in other places. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:33 +09:00
Hyunchul Lee	b91050a80c	f2fs: add nowait aio support This patch adds nowait aio support[1]. Return EAGAIN if any of the following checks fail for direct I/O: - i_rwsem is not lockable - Blocks are not allocated at the write location And xfstests generic/471 is passed. [1]: 6be96d "Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT" Signed-off-by: Hyunchul Lee <cheol.lee@lge.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:32 +09:00
Chao Yu	63189b7859	f2fs: wrap all options with f2fs_sb_info.mount_opt This patch merges miscellaneous mount options into struct f2fs_mount_info, After this patch, once we add new mount option, we don't need to worry about recovery of it in remount_fs(), since we will recover the f2fs_sb_info.mount_opt including all options. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:30 +09:00
Yunlei He	5d7881cadf	f2fs: Don't overwrite all types of node to keep node chain Currently, we enable node SSR by default, and mixed different types of node segment to do SSR more intensively. Although reuse warm node is not allowed, warm node chain will be destroyed by errors introduced by other types node chain. So we'd better forbid reusing all types of node to keep warm node chain. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:29 +09:00
Junling Zheng	93cf93f17c	f2fs: introduce mount option for fsync mode Commit "0a007b97aad6"(f2fs: recover directory operations by fsync) fixed xfstest generic/342 case, but it also increased the written data and caused the performance degradation. In most cases, there's no need to do so heavy fsync actually. So we introduce new mount option "fsync_mode={posix,strict}" to control the policy of fsync. "fsync_mode=posix" is set by default, and means that f2fs uses a light fsync, which follows POSIX semantics. And "fsync_mode=strict" means that it's a heavy fsync, which behaves in line with xfs, ext4 and btrfs, where generic/342 will pass, but the performance will regress. Signed-off-by: Junling Zheng <zhengjunling@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:28 +09:00
Chao Yu	eea52882d8	f2fs: fix to restore old mount option in ->remount_fs This patch fixes to restore old mount option once we encounter failure in ->remount_fs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:27 +09:00
Chao Yu	deeedd71e3	f2fs: wrap sb_rdonly with f2fs_readonly Use f2fs_readonly to wrap sb_rdonly for cleanup, and spread it in all places. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:57:26 +09:00
Jaegeuk Kim	162b27aec9	f2fs: avoid selinux denial on CAP_SYS_RESOURCE This fixes CAP_SYS_RESOURCE denial of selinux when using resgid, since it seems selinux reports it at the first place, but mostly we don't need to check this condition first. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-17 13:55:43 +09:00
Linus Torvalds	8f5fd927c3	for-4.16-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlqrzY4ACgkQxWXV+ddt WDtV4BAAiiv5XECNB3lIvcoFjst8E7xIJNeKKzFZh8HNm+L94zP00uqrpHkwQSUv tk9TGSYgbJJZMhw6I1rwYSKsoTkvP6NVMQZjpkJktWto9NXObKPvJzjKMdB+GYQR TWGLBaHsMhruMTOewhdVsxA5od5+LravygHaLmTDQmho8jZSSEe0qqpqlakRI9sQ +VZv1PDEI68IDMKMjOT7de7Ssq9c99BpGNXxvi5fy5kijyfgrs/BUfHBNV5r66MS y0Yw1Ab+nlLC7cE5Ql24/snDojcLoSl7f5eYt7ib+DAmsJucYQzTfUetFcrYf8IC EAmFs5Br6pbopi5M1IOxGZABNbr0qMS+zLvoF52cyka32nrmK/IMWl1s8bW0xfva gxkjouzJDOTMq28asSdyu73i9yFyLB+tEHpojECljo+K5ASzvDad34xqVRlxY6vN UJm4d76OqTfAyTRo7sQRWfLz/Cb1W+/v0mPotPWHWVyF0bUNsBgX5YKJhma3pU/z rHwhI/fqg0NTXBcNUMCosEwS2u5vnjIfiOHkDTl1Dv747S/jM0AaNEvpjDo1cOuk lsl4xKoDkQ6RpBCeFxRCW0wk1Nl/Su6RQ217xTtND+aIa4c6G7A9X8QAjj2XNAuK CktJJD7wHYSBTu1sQ+u2R/c2QEgkAxABr4NpmWTPCBNlIKEjd4A= =bXFe -----END PGP SIGNATURE----- Merge tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "There's an important revert in this pull request that needs to go to stable as it causes a corruption on big endian machines. The other fix is for FIEMAP incorrectly reporting shared extents before a sync and one fix for a crash in raid56. So far we got only one report about the BE corruption, the stable kernels were out for like a week, so hopefully the scope of the damage is low" * tag 'for-4.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: use proper endianness accessors for super_copy" btrfs: add missing initialization in btrfs_check_shared btrfs: Fix NULL pointer exception in find_bio_stripe	2018-03-16 13:37:42 -07:00
David Sterba	093e037ca8	Revert "btrfs: use proper endianness accessors for super_copy" This reverts commit `3c181c12c4`. The offending patch was merged in 4.16-rc4 and was promptly applied to stable kernels 4.14.25 and 4.15.8. The patch causes a corruption in several superblock items on big-endian machines because of messed up endianity conversions. The damage is manually repairable. A filesystem cannot be mounted again after it has been unmounted once. We do a full revert and not a fixup so stable can pick that patch ASAP. Fixes: `3c181c12c4` ("btrfs: use proper endianness accessors for super_copy") Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de CC: stable@vger.kernel.org # 4.14+ Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-16 14:49:44 +01:00
Arnd Bergmann	e05a959f4b	procfs: remove CONFIG_HARDWALL dependency Hardwall is a tile specific feature, and with the removal of the tile architecture, this has become dead code, so let's remove it. Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2018-03-16 10:56:08 +01:00
Linus Torvalds	df09348f78	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: - backport-friendly part of lock_parent() race fix - a fix for an assumption in the heurisic used by path_connected() that is not true on NFS - livelock fixes for d_alloc_parallel() * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Teach path_connected to handle nfs filesystems with multiple roots. fs: dcache: Use READ_ONCE when accessing i_dir_seq fs: dcache: Avoid livelock between d_alloc_parallel and __d_add lock_parent() needs to recheck if dentry got __dentry_kill'ed under it	2018-03-15 18:57:14 -07:00
Eric W. Biederman	95dd77580c	fs: Teach path_connected to handle nfs filesystems with multiple roots. On nfsv2 and nfsv3 the nfs server can export subsets of the same filesystem and report the same filesystem identifier, so that the nfs client can know they are the same filesystem. The subsets can be from disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no way to find the common root of all directory trees exported form the server with the same filesystem identifier. The practical result is that in struct super s_root for nfs s_root is not necessarily the root of the filesystem. The nfs mount code sets s_root to the root of the first subset of the nfs filesystem that the kernel mounts. This effects the dcache invalidation code in generic_shutdown_super currently called shrunk_dcache_for_umount and that code for years has gone through an additional list of dentries that might be dentry trees that need to be freed to accomodate nfs. When I wrote path_connected I did not realize nfs was so special, and it's hueristic for avoiding calling is_subdir can fail. The practical case where this fails is when there is a move of a directory from the subtree exposed by one nfs mount to the subtree exposed by another nfs mount. This move can happen either locally or remotely. With the remote case requiring that the move directory be cached before the move and that after the move someone walks the path to where the move directory now exists and in so doing causes the already cached directory to be moved in the dcache through the magic of d_splice_alias. If someone whose working directory is in the move directory or a subdirectory and now starts calling .. from the initial mount of nfs (where s_root == mnt_root), then path_connected as a heuristic will not bother with the is_subdir check. As s_root really is not the root of the nfs filesystem this heuristic is wrong, and the path may actually not be connected and path_connected can fail. The is_subdir function might be cheap enough that we can call it unconditionally. Verifying that will take some benchmarking and the result may not be the same on all kernels this fix needs to be backported to. So I am avoiding that for now. Filesystems with snapshots such as nilfs and btrfs do something similar. But as the directory tree of the snapshots are disjoint from one another and from the main directory tree rename won't move things between them and this problem will not occur. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Fixes: `397d425dc2` ("vfs: Test for and handle paths that are unreachable from their mnt_root") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-15 18:48:38 -04:00
Christoph Hellwig	df79b81b2e	xfs: minor cleanup for xfs_reflink_end_cow Use xfs_iext_prev_extent to skip to the previous extent instead of opencoding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Christoph Hellwig	1d4352de51	xfs: minor cleanup for xfs_get_blocks Simplify the control flow a bit in preparation for O_ATOMIC-related changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Christoph Hellwig	f5c54717bf	xfs: remove xfs_zero_range This helper doesn't add any real value over just calling iomap_zero_range directly, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Christoph Hellwig	c7dbe3f2c4	xfs: assert that xfs_reflink_allocate_cow is called with XFS_ILOCK_EXCL Now that we convert COW preallocations from unwritten to real on every call this function needs to be called with the ilock held exclusively. Fortunately we already do that, but update the assert to match. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Christoph Hellwig	7d9df3c163	xfs: don't use XFS_BMAPI_ENTRIRE in xfs_get_blocks There is no reason to get a mapping bigger than what we were asked for. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Christoph Hellwig	5bcffe300c	xfs: fix the check for COW extents in xfs_swap_extents i_cnextents does not include delayed allocated extents, so switch to the inode fork size check that we already use in other places instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-15 10:31:38 -07:00
Boris Brezillon	8f347c4232	mtd: Unconditionally update ->fail_addr and ->addr in part_erase() ->fail_addr and ->addr can be updated no matter the result of parent->_erase(), we just need to remove the code doing the same thing in mtd_erase_callback() to avoid adjusting those fields twice. Note that this can be done because all MTD users have been converted to not pass an erase_info->callback() and are thus only taking the ->addr_fail and ->addr fields into account after part_erase() has returned. While we're at it, get rid of the erase_info->mtd field which was only needed to let mtd_erase_callback() get the partition device back. Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com> Reviewed-by: Richard Weinberger <richard@nod.at>	2018-03-15 18:22:26 +01:00
Boris Brezillon	884cfd9023	mtd: Stop assuming mtd_erase() is asynchronous None of the mtd->_erase() implementations work in an asynchronous manner, so let's simplify MTD users that call mtd_erase(). All they need to do is check the value returned by mtd_erase() and assume that != 0 means failure. Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com> Reviewed-by: Richard Weinberger <richard@nod.at>	2018-03-15 18:21:07 +01:00
Arnd Bergmann	58eb5b6707	pstore: fix crypto dependencies The new crypto API use causes some problems with Kconfig dependencies, including this link error: fs/pstore/platform.o: In function `pstore_register': platform.c:(.text+0x248): undefined reference to `crypto_has_alg' platform.c:(.text+0x2a0): undefined reference to `crypto_alloc_base' fs/pstore/platform.o: In function `pstore_unregister': platform.c:(.text+0x498): undefined reference to `crypto_destroy_tfm' crypto/lz4hc.o: In function `lz4hc_sdecompress': lz4hc.c:(.text+0x1a): undefined reference to `LZ4_decompress_safe' crypto/lz4hc.o: In function `lz4hc_decompress_crypto': lz4hc.c:(.text+0x5a): undefined reference to `LZ4_decompress_safe' crypto/lz4hc.o: In function `lz4hc_scompress': lz4hc.c:(.text+0xaa): undefined reference to `LZ4_compress_HC' crypto/lz4hc.o: In function `lz4hc_mod_init': lz4hc.c:(.init.text+0xf): undefined reference to `crypto_register_alg' lz4hc.c:(.init.text+0x1f): undefined reference to `crypto_register_scomp' lz4hc.c:(.init.text+0x2f): undefined reference to `crypto_unregister_alg' The problem is that with CONFIG_CRYPTO=m, we must not 'select CRYPTO_LZ4' from a bool symbol, or call crypto API functions from a built-in module. This turns the sub-options into 'tristate' ones so the dependencies are honored, and makes the pstore itself select the crypto core if necessary. Fixes: `cb3bee0369` ("pstore: Use crypto compress API") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-15 10:06:06 -07:00
Srivatsa S. Bhat	f33ff110ef	block, char_dev: Use correct format specifier for unsigned ints register_blkdev() and __register_chrdev_region() treat the major number as an unsigned int. So print it the same way to avoid absurd error statements such as: "... major requested (-1) is greater than the maximum (511) ..." (and also fix off-by-one bugs in the error prints). While at it, also update the comment describing register_blkdev(). Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-15 17:59:24 +01:00
Srivatsa S. Bhat	652d703b21	char_dev: Fix off-by-one bugs in find_dynamic_major() CHRDEV_MAJOR_DYN_END and CHRDEV_MAJOR_DYN_EXT_END are valid major numbers. So fix the loop iteration to include them in the search for free major numbers. While at it, also remove a redundant if condition ("cd->major != i"), as it will never be true. Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-15 17:59:24 +01:00
Andreas Gruenbacher	ee6ed857c8	gfs2: gfs2_iomap_end tracepoint: log block address In the gfs2_iomap_end tracepoint, log the physical block address, just as in the gfs2_bmap tracepoint. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-15 07:17:17 -07:00
Edmund Nadolski	18bf591ba9	btrfs: add missing initialization in btrfs_check_shared This patch addresses an issue that causes fiemap to falsely report a shared extent. The test case is as follows: xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5 sync xfs_io -c "fiemap -v" /media/scratch/file5 which gives the resulting output: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x2001 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 This is because btrfs_check_shared calls find_parent_nodes repeatedly in a loop, passing a share_check struct to report the count of shared extent. But btrfs_check_shared does not re-initialize the count value to zero for subsequent calls from the loop, resulting in a false share count value. This is a regressive behavior from 4.13. With proper re-initialization the test result is as follows: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 which corrects the regression. Fixes: `3ec4d3238a` ("btrfs: allow backref search checks for shared extents") Signed-off-by: Edmund Nadolski <enadolski@suse.com> [ add text from cover letter to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-14 22:26:46 +01:00
Dmitriy Gorokh	047fdea634	btrfs: Fix NULL pointer exception in find_bio_stripe On detaching of a disk which is a part of a RAID6 filesystem, the following kernel OOPS may happen: [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0 [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0 [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0 [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 [63122.971202] Oops: 0000 [#1] SMP [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8 [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000 [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287 [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690 [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000 [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600 [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500 [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004 [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000 [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0 [63123.009969] Call Trace: [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] [63123.010251] bio_endio+0xa1/0x120 [63123.010378] generic_make_request+0x218/0x270 [63123.010921] submit_bio+0x66/0x130 [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] [63123.011759] ? ___cache_free+0x1c5/0x300 [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] [63123.012656] process_one_work+0x14d/0x350 [63123.012888] worker_thread+0x4d/0x3a0 [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 [63123.013192] kthread+0x109/0x140 [63123.013315] ? process_scheduled_works+0x40/0x40 [63123.013472] ? kthread_stop+0x110/0x110 [63123.013610] ret_from_fork+0x25/0x30 [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8 [63123.014678] CR2: 0000000000000080 [63123.016590] ---[ end trace a295ea7259c17880 ]— This is reproducible in a cycle, where a series of writes is followed by SCSI device delete command. The test may take up to few minutes. Fixes: `74d46992e0` ("block: replace bi_bdev with a gendisk pointer and partitions index") [ no signed-off-by provided ] Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-03-14 22:26:35 +01:00
Tejun Heo	d0264c01e7	fs/aio: Use RCU accessors for kioctx_table->table[] While converting ioctx index from a list to a table, `db446a08c2` ("aio: convert the ioctx list to table lookup v3") missed tagging kioctx_table->table[] as an array of RCU pointers and using the appropriate RCU accessors. This introduces a small window in the lookup path where init and access may race. Mark kioctx_table->table[] with __rcu and use the approriate RCU accessors when using the field. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: `db446a08c2` ("aio: convert the ioctx list to table lookup v3") Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.12+	2018-03-14 12:10:17 -07:00
Tejun Heo	a6d7cff472	fs/aio: Add explicit RCU grace period when freeing kioctx While fixing refcounting, `e34ecee2ae` ("aio: Fix a trinity splat") incorrectly removed explicit RCU grace period before freeing kioctx. The intention seems to be depending on the internal RCU grace periods of percpu_ref; however, percpu_ref uses a different flavor of RCU, sched-RCU. This can lead to kioctx being freed while RCU read protected dereferences are still in progress. Fix it by updating free_ioctx() to go through call_rcu() explicitly. v2: Comment added to explain double bouncing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Fixes: `e34ecee2ae` ("aio: Fix a trinity splat") Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.13+	2018-03-14 12:10:17 -07:00
Christoph Hellwig	e6b9657056	xfs: refactor xfs_log_force Streamline the conditionals so that it is more obvious which specific case form the top of the function comments is being handled. Use gotos only for early returns. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:52 -07:00
Christoph Hellwig	656de4ffaf	xfs: merge _xfs_log_force_lsn and xfs_log_force_lsn Switch to a single interface for flushing the log to a specific LSN, which gives consistent trace point coverage and a less confusing interface. The was only a single user of the previous xfs_log_force_lsn function, which now also passes a NULL log_flushed argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:52 -07:00
Christoph Hellwig	60e5bb7844	xfs: merge _xfs_log_force and xfs_log_force Switch to a single interface for flushing the whole log, which gives consistent trace point coverage, and removes the unused log_flushed argument for the previous _xfs_log_force callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:52 -07:00
Christoph Hellwig	2b56c2857f	xfs: remove the unused log_flushed variable in xfs_extent_busy_flush Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:52 -07:00
Christoph Hellwig	4f7aa2fd8c	xfs: remove an outdated comment for xfs_inode_item_committing The function now does something, and that something is central to our inode logging scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:51 -07:00
Christoph Hellwig	2022ab36fe	xfs: remove misleading comment text on xfs_inode_item_unlock Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-14 11:12:51 -07:00
Christian Brauner	4e15f760a4	devpts: comment devpts_mntget() Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-14 13:31:23 +01:00
Christian Brauner	a319b01d90	devpts: resolve devpts bind-mounts Most libcs will still look at /dev/ptmx when opening the master fd of a pty device. When /dev/ptmx is a bind-mount of /dev/pts/ptmx and the TIOCGPTPEER ioctl() is used to safely retrieve a file descriptor for the slave side of the pty based on the master fd, the /proc/self/fd/{0,1,2} symlinks will point to /. A very simply reproducer for this issue presupposing a libc that uses TIOCGPTPEER in its openpty() implementation is: unshare --mount mount --bind /dev/pts/ptmx /dev/ptmx chmod 666 /dev/ptmx script ls -al /proc/self/fd/0 Having bind-mounts of /dev/pts/ptmx to /dev/ptmx not working correctly is a regression. In addition, it is also a fairly common scenario in containers employing user namespaces. The reason for the current failure is that the kernel tries to verify the useability of the devpts filesystem without resolving the /dev/ptmx bind-mount first. This will lead it to detect that the dentry is escaping its bind-mount. The reason is that while the devpts filesystem mounted at /dev/pts has the devtmpfs mounted at /dev as its parent mount: 21 -- -- / /dev -- 21 -- / /dev/pts devtmpfs and devpts are on different devices -- -- 0:6 / /dev -- -- 0:20 / /dev/pts This has the consequence that the pathname of the parent directory of the devpts filesystem mount at /dev/pts is /. So if /dev/ptmx is a bind-mount of /dev/pts/ptmx then the /dev/ptmx bind-mount and the devpts mount at /dev/pts will end up being located on the same device which is recorded in the superblock of their vfsmount. This means the parent directory of the /dev/ptmx bind-mount will be /ptmx: -- -- ---- /ptmx /dev/ptmx Without the bind-mount resolution patch the kernel will now perform the bind-mount escape check directly on /dev/ptmx. The function responsible for this is devpts_ptmx_path() which calls pts_path() which in turn calls path_parent_directory(). Based on the above explanation, path_parent_directory() will yield / as the parent directory for the /dev/ptmx bind-mount and not the expected /dev. Thus, the kernel detects that /dev/ptmx is escaping its bind-mount and will set /proc/<pid>/fd/<nr> to /. This patch changes the logic to first resolve any bind-mounts. After the bind-mounts have been resolved (i.e. we have traced it back to the associated devpts mount) devpts_ptmx_path() can be called. In order to guarantee correct path generation for the slave file descriptor the kernel now requires that a pts directory is found in the parent directory of the ptmx bind-mount. This implies that when doing bind-mounts the ptmx bind-mount and the devpts mount should have a common parent directory. A valid example is: mount -t devpts devpts /dev/pts mount --bind /dev/pts/ptmx /dev/ptmx an invalid example is: mount -t devpts devpts /dev/pts mount --bind /dev/pts/ptmx /ptmx This allows us to support: - calling open on ptmx devices located inside non-standard devpts mounts: mount -t devpts devpts /mnt master = open("/mnt/ptmx", ...); slave = ioctl(master, TIOCGPTPEER, ...); - calling open on ptmx devices located outside the devpts mount with a common ancestor directory: mount -t devpts devpts /dev/pts mount --bind /dev/pts/ptmx /dev/ptmx master = open("/dev/ptmx", ...); slave = ioctl(master, TIOCGPTPEER, ...); while failing on ptmx devices located outside the devpts mount without a common ancestor directory: mount -t devpts devpts /dev/pts mount --bind /dev/pts/ptmx /ptmx master = open("/ptmx", ...); slave = ioctl(master, TIOCGPTPEER, ...); in which case save path generation cannot be guaranteed. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Eric Biederman <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-14 13:31:23 +01:00
Christian Brauner	7d71109df1	devpts: hoist out check for DEVPTS_SUPER_MAGIC Hoist the check whether we have already found a suitable devpts filesystem out of devpts_ptmx_path() in preparation for the devpts bind-mount resolution patch. This is a non-functional change. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-03-14 13:31:23 +01:00
Chao Yu	b6a06cbbb5	f2fs: support hot file extension This patch supports to recognize hot file extension in f2fs, so that we can allocate proper hot segment location for its data, which can lead to better hot/cold seperation in filesystem. In addition, we changes a bit on query/add/del operation method for extension_list sysfs entry as below: - Query: cat /sys/fs/f2fs/<disk>/extension_list - Add: echo 'extension' > /sys/fs/f2fs/<disk>/extension_list - Del: echo '!extension' > /sys/fs/f2fs/<disk>/extension_list - Add: echo '[h/c]extension' > /sys/fs/f2fs/<disk>/extension_list - Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list - [h] means add/del hot file extension - [c] means add/del cold file extension Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:57 +09:00
Chao Yu	1dc0f8991d	f2fs: fix to avoid race in between atomic write and background GC Sqlite user Background GC - move_data_block : move page #1 - f2fs_is_atomic_file - f2fs_ioc_start_atomic_write - f2fs_ioc_commit_atomic_write - commit_inmem_pages : commit page #1 & set node #2 dirty - f2fs_submit_page_write - f2fs_update_data_blkaddr - set_page_dirty : set node #2 dirty - f2fs_do_sync_file - fsync_node_pages : commit node #1 & node #2, then sudden power-cut In a race case, we may check FI_ATOMIC_FILE flag before starting atomic write flow, then we will commit meta data before data with reversed order, after a sudden pow-cut, database transaction will be inconsistent. So we'd better to exclude gc/atomic_write to each other by using lock instead of flag checking. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:55 +09:00
Jaegeuk Kim	b27bc8091c	f2fs: do gc in greedy mode for whole range if gc_urgent mode is set Otherwise, f2fs conducts GC on 8GB range only based on slow cost-benefit. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:54 +09:00
Jaegeuk Kim	dee02f0d62	f2fs: issue discard aggressively in the gc_urgent mode This patch avoids to skip discard commands when user sets gc_urgent mode. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:53 +09:00
Jaegeuk Kim	782911f491	f2fs: set readdir_ra by default It gives general readdir improvement. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:52 +09:00
Jaegeuk Kim	84b89e5d94	f2fs: add auto tuning for small devices If f2fs is running on top of very small devices, it's worth to avoid abusing free LBAs. In order to achieve that, this patch introduces some parameter tuning. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:51 +09:00
Jaegeuk Kim	079396270b	f2fs: add mount option for segment allocation policy This patch adds an mount option, "alloc_mode=%s" having two options, "default" and "reuse". In "alloc_mode=reuse" case, f2fs starts to allocate segments from 0'th segment all the time to reassign segments. It'd be useful for small-sized eMMC parts. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:50 +09:00
Jaegeuk Kim	69babac019	f2fs: don't stop GC if GC is contended Let's do GC as much as possible, while gc_urgent is set. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:49 +09:00
Chao Yu	846ae671ad	f2fs: expose extension_list sysfs entry This patch adds a sysfs entry 'extension_list' to support query/add/del item in extension list. Query: cat /sys/fs/f2fs/<device>/extension_list Add: echo 'extension' > /sys/fs/f2fs/<device>/extension_list Del: echo '!extension' > /sys/fs/f2fs/<device>/extension_list Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:48 +09:00
Chao Yu	17cd07ae95	f2fs: fix to set KEEP_SIZE bit in f2fs_zero_range As Jayashree Mohan reported: A simple workload to reproduce this would be : 1. create foo 2. Write (8K - 16K) // foo size = 16K now 3. fsync() 4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K 5. fdatasync() Crash now On recovery, we see that the file size is 4210688 and not 16K, which violates the semantics of keep_size flag. We have a test case to reproduce this using CrashMonkey on 4.15 kernel. Try this out by simply running : ./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400 -P -v tests/generic_468_zero.so The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this missing case. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:47 +09:00
Chao Yu	d0d3f1b329	f2fs: introduce sb_lock to make encrypt pwsalt update exclusive f2fs_super_block.encrypt_pw_salt can be udpated and persisted concurrently, result in getting different pwsalt in separated threads, so let's introduce sb_lock to exclude concurrent accessers. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:46 +09:00
Colin Ian King	8fe326cb99	f2fs: remove redundant initialization of pointer 'p' Pointer p is initialized with a value that is never read and is later re-assigned a new value, hence the initialization is redundant and can be removed. Cleans up clang warning: fs/f2fs/extent_cache.c:463:19: warning: Value stored to 'p' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:45 +09:00
Gao Xiang	46706d5917	f2fs: flush cp pack except cp pack 2 page at first Previously, we attempt to flush the whole cp pack in a single bio, however, when suddenly powering off at this time, we could get into an extreme scenario that cp pack 1 page and cp pack 2 page are updated and latest, but payload or current summaries are still partially outdated. (see reliable write in the UFS specification) This patch submits the whole cp pack except cp pack 2 page at first, and then writes the cp pack 2 page with an extra independent bio with pre-io barrier. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:43 +09:00
Sheng Yong	ccd31cb28f	f2fs: clean up f2fs_sb_has_xxx functions This patch introduces F2FS_FEATURE_FUNCS to clean up the definitions of different f2fs_sb_has_xxx functions. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:42 +09:00
Tiezhu Yang	3bb09a0e7e	f2fs: remove redundant check of page type when submit bio This patch removes redundant check of page type when submit bio to make the logic more clear. Signed-off-by: Tiezhu Yang <kernelpatch@126.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:41 +09:00
Chao Yu	fb0e72c8b9	f2fs: fix to handle looped node chain during recovery There is no checksum in node block now, so bit-transition from hardware can make node_footer.next_blkaddr being corrupted w/o any detection, result in node chain becoming looped one. For this condition, during recovery, in order to avoid running into dead loop, let's detect it and just skip out. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:40 +09:00
Jaegeuk Kim	0f9ec2a8f6	f2fs: handle quota for orphan inodes This is to detect dquot_initialize errors early from evict_inode for orphan inodes. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:38 +09:00
Hyunchul Lee	f2e703f9a3	f2fs: support passing down write hints to block layer with F2FS policy Add 'whint_mode=fs-based' mount option. In this mode, F2FS passes down write hints with its policy. * whint_mode=fs-based. F2FS passes down hints with its policy. User F2FS Block ---- ---- ----- META WRITE_LIFE_MEDIUM; HOT_NODE WRITE_LIFE_NOT_SET WARM_NODE " COLD_NODE WRITE_LIFE_NONE ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME extension list " " -- buffered io WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG WRITE_LIFE_NONE " " WRITE_LIFE_MEDIUM " " WRITE_LIFE_LONG " " -- direct io WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_LONG " WRITE_LIFE_LONG Many thanks to Chao Yu and Jaegeuk Kim for comments to implement this patch. Signed-off-by: Hyunchul Lee <cheol.lee@lge.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:36 +09:00
Hyunchul Lee	0cdd319539	f2fs: support passing down write hints given by users to block layer Add the 'whint_mode' mount option that controls which write hints are passed down to block layer. There are "off" and "user-based" mode. The default mode is "off". 1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET. 2) whint_mode=user-based. F2FS tries to pass down hints given by users. User F2FS Block ---- ---- ----- META WRITE_LIFE_NOT_SET HOT_NODE " WARM_NODE " COLD_NODE " ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME extension list " " -- buffered io WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET WRITE_LIFE_NONE " " WRITE_LIFE_MEDIUM " " WRITE_LIFE_LONG " " -- direct io WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET WRITE_LIFE_NONE " WRITE_LIFE_NONE WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM WRITE_LIFE_LONG " WRITE_LIFE_LONG Many thanks to Chao Yu and Jaegeuk Kim for comments to implement this patch. Signed-off-by: Hyunchul Lee <cheol.lee@lge.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: avoid build warning] [Chao Yu: fix to restore whint_mode in ->remount_fs] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:35 +09:00
Chao Yu	cd36d7a17f	f2fs: fix to clear CP_TRIMMED_FLAG Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard before LBA becomes invalid again, fix it by clearing the flag in checkpoint without CP_TRIMMED reason. Fixes: `1f43e2ad7b` ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:34 +09:00
Chao Yu	199bc3fef2	f2fs: support large nat bitmap Previously, we will store all nat version bitmap in checkpoint pack block, so our total node entry number has a limitation which caused total node number can not exceed (3900 * 8) block * 455 node/block = 14196000. So that once user wants to create more nodes in large size image, it becomes a bottleneck, that's unreasonable. This patch detects the new layout of nat/sit version bitmap in image in order to enable supporting large nat bitmap. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:33 +09:00
Chao Yu	bf617f7a92	f2fs: fix to check extent cache in f2fs_drop_extent_tree If noextent_cache mount option is on, we will never initialize extent tree in inode, but still we're going to access it in f2fs_drop_extent_tree, result in kernel panic as below: BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 IP: _raw_write_lock+0xc/0x30 Call Trace: ? f2fs_drop_extent_tree+0x41/0x70 [f2fs] f2fs_fallocate+0x5a0/0xdd0 [f2fs] ? common_file_perm+0x47/0xc0 ? apparmor_file_permission+0x1a/0x20 vfs_fallocate+0x15b/0x290 SyS_fallocate+0x44/0x70 do_syscall_64+0x6e/0x160 entry_SYSCALL64_slow_path+0x25/0x25 This patch fixes to check extent cache status before using in f2fs_drop_extent_tree. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:32 +09:00
Chao Yu	4d817ae099	f2fs: restrict inline_xattr_size configuration This patch limits to enable inline_xattr_size mount option only if both extra_attr and flexible_inline_xattr feature is on in current image. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:31 +09:00
Yunlong Song	b94929d975	f2fs: fix heap mode to reset it back Commit `7a20b8a61e` ("f2fs: allocate node and hot data in the beginning of partition") introduces another mount option, heap, to reset it back. But it does not do anything for heap mode, so fix it. Cc: stable@vger.kernel.org Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:30 +09:00
Sheng Yong	0964fc1a82	f2fs: fix potential corruption in area before F2FS_SUPER_OFFSET sb_getblk does not guarantee the buffer head is uptodate. If bh is not uptodate, the data (may be used as boot code) in area before F2FS_SUPER_OFFSET may get corrupted when super block is committed. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:29 +09:00
Yunlong Song	bdbc90fa55	f2fs: don't put dentry page in pagecache into highmem Previous dentry page uses highmem, which will cause panic in platforms using highmem (such as arm), since the address space of dentry pages from highmem directly goes into the decryption path via the function fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not from highmem, and then cause panic since it doesn't call kmap_high but kunmap_high is triggered at the end. To fix this problem in a simple way, this patch avoids to put dentry page in pagecache into highmem. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix coding style] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-03-13 08:05:03 +09:00
Linus Torvalds	fc6eabbbf8	NFS client bugfixes for Linux 4.16 Hightlights include the following stable fixes: - NFS: Fix an incorrect type in struct nfs_direct_req - pNFS: Prevent the layout header refcount going to zero in pnfs_roc() - NFS: Fix unstable write completion -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJaprb9AAoJEGcL54qWCgDysK0P/Rwyc7fhcs4CvxETQT0LhbbB P6XDAb5pBeU+ca8ZWzqlpe1uKc0z5ykuTLuxLt2QvSb9+6h+8T1gxOa5SibehnwY 6UizLDgHMuY2V4wqm4XLHiZX7J0Evdo4rgNCp7qq2/TinBPYKQsqIz34+s9BZDti 6LG6WiVhOtkhs6s/sRbAn+FfMbSiQn54iQIgFPeO+3zEzaaLzUZqlVHO5yjxVnxh 6oYGkEPVzP09URUO/HJ1VTUZJnso1/axEcH9qhuJzZ+pANCpuBjSMoFzzE6oI2mD dfD8UJZjwqLb7AsFhgSQUrDXzaI0Xg7ccrImtCp4QLlTjja3TlOYMYbK9kDRHC1/ kT93cSFSIeSGpqLVSdNyiz7pCh2Cps2U0JTbhwE0R1NEskWQPzSwY73jEwiA/bex JOZBxtjVZt8TgOKHxPO65rvQwaq3T31KrQSDLqv3JReI0epJeqng0h+pwiZluNAs TFjE1aP0l567A3qQnPrHrGZ4IIs8XYobZ+MBxf9x5PaRWnGoPUkhHXMP2j8JCnwa ofPBR4Mx5WhnsB7Z5vsYetHtmD99QbJ0+WH9ObilUFt6gUgXoXCrvexIUq+LD3DP HIIqfbSorAIlQxtCwEz7m85B7VEKByaCWj8OiFt5VVAreEhiazmsncsyh8R5fa88 wJDxRe3QmNu7BEJ9NJKi =IQGX -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Hightlights include the following stable fixes: - NFS: Fix an incorrect type in struct nfs_direct_req - pNFS: Prevent the layout header refcount going to zero in pnfs_roc() - NFS: Fix unstable write completion" * tag 'nfs-for-4.16-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix unstable write completion pNFS: Prevent the layout header refcount going to zero in pnfs_roc() NFS: Fix an incorrect type in struct nfs_direct_req	2018-03-12 10:47:03 -07:00
Nikolay Borisov	ce3077ee80	direct-io: Remove unused DIO_SKIP_DIO_COUNT logic This flag was added by `fe0f07d08e` ("direct-io: only inc/deci inode->i_dio_count for file systems") as means to optimise the atomic modificaiton of the variable for blockdevices. However with the advent of `542ff7bf18` ("block: new direct I/O implementation") it became unused. So let's remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-03-12 10:21:24 -06:00
Nikolay Borisov	c8f4c36f81	direct-io: Remove unused DIO_ASYNC_EXTEND flag This flag was added by `6039257378` ("direct-io: add flag to allow aio writes beyond i_size") to support XFS. However, with the rework of XFS' DIO's path to use iomap in `acdda3aae1` ("xfs: use iomap_dio_rw") it became redundant. So let's remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-03-12 10:21:22 -06:00
Al Viro	c19457f0ae	d_delete(): get rid of trylock loop just grab ->i_lock first; we have a positive dentry, nothing's going to happen to inode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-12 11:59:14 -04:00
John Ogness	c1d0c1a2b5	fs/dcache: Move dentry_kill() below lock_parent() A subsequent patch will modify dentry_kill() to call lock_parent(). Move the dentry_kill() implementation "as is" below lock_parent() first. This will help simplify the review of the subsequent patch with dentry_kill() changes. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-12 11:59:13 -04:00
John Ogness	06080d100d	fs/dcache: Remove stale comment from dentry_kill() Commit `0d98439ea3` ("vfs: use lockred "dead" flag to mark unrecoverably dead dentries") removed the `ref' parameter in dentry_kill() but its documentation remained. Remove it. Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-12 11:59:13 -04:00
Al Viro	0632a9ac7b	take write_seqcount_invalidate() into __d_drop() ... and reorder it with making d_unhashed() true. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-03-12 11:59:12 -04:00
Brian Foster	0ab32086d0	xfs: account only rmapbt-used blocks against rmapbt perag res The rmapbt perag metadata reservation reserves blocks for the reverse mapping btree (rmapbt). Since the rmapbt uses blocks from the agfl and perag accounting is updated as blocks are allocated from the allocation btrees, the reservation actually accounts blocks as they are allocated to (or freed from) the agfl rather than the rmapbt itself. While this works for blocks that are eventually used for the rmapbt, not all agfl blocks are destined for the rmapbt. Blocks that are allocated to the agfl (and thus "reserved" for the rmapbt) but then used by another structure leads to a growing inconsistency over time between the runtime tracking of rmapbt usage vs. actual rmapbt usage. Since the runtime tracking thinks all agfl blocks are rmapbt blocks, it essentially believes that less future reservation is required to satisfy the rmapbt than what is actually necessary. The inconsistency is rectified across mount cycles because the perag reservation is initialized based on the actual rmapbt usage at mount time. The problem, however, is that the excessive drain of the reservation at runtime opens a window to allocate blocks for other purposes that might be required for the rmapbt on a subsequent mount. This problem can be demonstrated by a simple test that runs an allocation workload to consume agfl blocks over time and then observe the difference in the agfl reservation requirement across an unmount/mount cycle: mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194 ... ... : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1 umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0 mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194 As the above tracepoints show, the reservation requirement reduces from 3194 blocks to 2956 blocks as the workload runs. Without any other changes in the filesystem, the same reservation requirement jumps from 2956 to 3052 blocks over a umount/mount cycle. To address this divergence, update the RMAPBT reservation to account blocks used for the rmapbt only rather than all blocks filled into the agfl. This patch makes several high-level changes toward that end: 1.) Reintroduce an AGFL reservation type to serve as an accounting no-op for blocks allocated to (or freed from) the AGFL. 2.) Invoke RMAPBT usage accounting from the actual rmapbt block allocation path rather than the AGFL allocation path. The first change is required because agfl blocks are considered free blocks throughout their lifetime. The perag reservation subsystem is invoked unconditionally by the allocation subsystem, so we need a way to tell the perag subsystem (via the allocation subsystem) to not make any accounting changes for blocks filled into the AGFL. The second change causes the in-core RMAPBT reservation usage accounting to remain consistent with the on-disk state at all times and eliminates the risk of leaving the rmapbt reservation underfilled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:57 -07:00
Brian Foster	2159286335	xfs: rename agfl perag res type to rmapbt The AGFL perag reservation type accounts all allocations that feed into (or are released from) the allocation group free list (agfl). The purpose of the reservation is to support worst case conditions for the reverse mapping btree (rmapbt). As such, the agfl reservation usage accounting only considers rmapbt usage when the in-core counters are initialized at mount time. This implementation inconsistency leads to divergence of the in-core and on-disk usage accounting over time. In preparation to resolve this inconsistency and adjust the AGFL reservation into an rmapbt specific reservation, rename the AGFL reservation type and associated accounting fields to something more rmapbt-specific. Also fix up a couple tracepoints that incorrectly use the AGFL reservation type to pass the agfl state of the associated extent where the raw reservation type is expected. Note that this patch does not change perag reservation behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:57 -07:00
Brian Foster	b3fed43482	xfs: account format bouncing into rmapbt swapext tx reservation The extent swap mechanism requires a unique implementation for rmapbt enabled filesystems. Because the rmapbt tracks extent owner information, extent swap must individually unmap and remap each extent between the two inodes. The rmapbt extent swap transaction block reservation currently accounts for the worst case bmapbt block and rmapbt block consumption based on the extent count of each inode. There is a corner case that exists due to the extent swap implementation that is not covered by this reservation, however. If one of the associated inodes is just over the max extent count used for extent format inodes (i.e., the inode is in btree format by a single extent), the unmap/remap cycle of the extent swap can bounce the inode between extent and btree format multiple times, almost as many times as there are extents in the inode (if the opposing inode happens to have one less, for example). Each back and forth cycle involves a block free and allocation, which isn't a problem except for that the initial transaction reservation must account for the total number of block allocations performed by the chain of deferred operations. If not, a block reservation overrun occurs and the filesystem shuts down. Update the rmapbt extent swap block reservation to check for this situation and add some block reservation slop to ensure the entire operation succeeds. We'd never likely require reservation for both inodes as fsr wouldn't defrag the file in that case, but the additional reservation is constrained by the data fork size so be cautious and check for both. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:57 -07:00
Brian Foster	3e78b9a468	xfs: shutdown if block allocation overruns tx reservation The ->t_blk_res_used field tracks how many blocks have been used in the current transaction. This should never exceed the block reservation (->t_blk_res) for a particular transaction. We currently assert this condition in the transaction block accounting code, but otherwise take no additional action should this situation occur. The overrun generally has no effect if space ends up being available and the associated transaction commits. If the transaction is duplicated, however, the current block usage is used to determine the remaining block reservation to be transferred to the new transaction. If usage exceeds reservation, this calculation underflows and creates a transaction with an invalid and excessive reservation. When the second transaction commits, the release of unused blocks corrupts the in-core free space counters. With lazy superblock accounting enabled, this inconsistency eventually trickles to the on-disk superblock and corrupts the filesystem. Replace the transaction block usage accounting assert with an explicit overrun check. If the transaction overruns the reservation, shutdown the filesystem immediately to prevent corruption. Add a new assert to xfs_trans_dup() to catch any callers that might induce this invalid state in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:57 -07:00
Matthew Wilcox	57e8095611	xfs: Rename xa_ elements to ail_ This is a simple rename, except that xa_ail becomes ail_head. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:56 -07:00
Dave Chinner	ae23395d88	inode: don't memset the inode address space twice Noticed when looking at why cycling 600k inodes/s through the inode cache was taking a total of 8% cpu in memset() during inode initialisation. There is no need to zero the inode.i_data structure twice. This increases single threaded bulkstat throughput from ~200,000 inodes/s to ~220,000 inodes/s, so we save a substantial amount of CPU time per inode init by doing this. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:56 -07:00
Dave Chinner	a78ee256c3	xfs: convert XFS_AGFL_SIZE to a helper function The AGFL size calculation is about to get more complex, so lets turn the macro into a function first and remove the macro. Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick: forward port to newer kernel, simplify the helper] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-11 20:27:56 -07:00
Darrick J. Wong	6231848c3a	xfs: check for cow blocks before trying to clear them There's no point in allocating a transaction and locking the inode in preparation to clear cow blocks if there actually are any cow fork extents. Therefore, move the xfs_reflink_cancel_cow_range hunk to xfs_inactive and check the cow ifp first. This makes inode reclamation run faster. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-11 20:27:56 -07:00
Darrick J. Wong	3f883f5bb1	xfs: convert a few more directory asserts to corruption Yet another round of playing whack-a-mole with directory code that asserts on corrupt on-disk metadata when it really should be returning -EFSCORRUPTED instead of ASSERTing. Found by a xfs/391 crash while lastbit fuzzing of ltail.bestcount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-11 20:27:56 -07:00
Darrick J. Wong	8241f7f983	xfs: don't iunlock the quota ip when quota block In xfs_qm_dqalloc, we join the locked quota inode to the transaction we use to allocate blocks. If the allocation or mapping fails, we're not allowed to unlock the inode because the transaction code is in charge of unlocking it for us. Therefore, remove the iunlock call to avoid blowing asserts about unbalanced locking + mount hang. Found by corrupting the AGF and allocating space in the filesystem (quotacheck) immediately after mount. The upcoming agfl wrapping fixup test will trigger this scenario. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-03-11 20:27:56 -07:00
Vratislav Bendel	19957a1816	xfs: Correctly invert xfs_buftarg LRU isolation logic Due to an inverted logic mistake in xfs_buftarg_isolate() the xfs_buffers with zero b_lru_ref will take another trip around LRU, while isolating buffers with non-zero b_lru_ref. Additionally those isolated buffers end up right back on the LRU once they are released, because b_lru_ref remains elevated. Fix that circuitous route by leaving them on the LRU as originally intended. Signed-off-by: Vratislav Bendel <vbendel@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:56 -07:00
Dave Chinner	4df0f7f145	xfs: fix transaction allocation deadlock in IO path xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it while holding pages locked for writeback in the ->writepages path. The memory allocation is allowed to wait on pages under writeback, and so can wait on pages that are tagged as writeback by the caller. This affects both pre-IO submission and post-IO submission paths. Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(), xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range(). xfs_iomap_write_unwritten() already does the right thing, but the others don't. Fix them. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Fixes: `281627df3e` ("xfs: log file size updates at I/O completion time") Fixes: `43caeb187d` ("xfs: move mappings from cow fork to data fork after copy-write)" Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Christoph Hellwig	c3b1b13190	xfs: implement the lazytime mount option Use the VFS dirty inode tracking for lazytime inodes only, and just log them in ->dirty_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Christoph Hellwig	0d07e5573f	fs: don't clear I_DIRTY_TIME before calling mark_inode_dirty_sync __mark_inode_dirty already takes care of that, and for the XFS lazytime implementation we need to know that ->dirty_inode was called because I_DIRTY_TIME was set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Nikolay Borisov	bcab2ebfa1	xfs: Remove dead code from inode recover function The memcpy is guarded by a check which is performed a right before we call xfs_log_dinode_to_disk. At this point we are sure this check will always be false otherwise we would have errored out. So let's remove this dead weight. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Carlos Maiolino	e157ebdcb3	Cleanup old XFS_BTREE_* traces Remove unused legacy btree traces from IRIX era. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Eric Sandeen	4603fa744c	xfs: remove unused m_dmevmask from xfs_mount struct The dmevmask structure member is a dmapi leftover; it's set here and there but never actually used. Remove it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Dave Chinner	cb0a8d2302	xfs: fall back to vmalloc when allocation log vector buffers When using large directory blocks, we regularly see memory allocations of >64k being made for the shadow log vector buffer. When we are under memory pressure, kmalloc() may not be able to find contiguous memory chunks large enough to satisfy these allocations easily, and if memory is fragmented we can potentially stall here. TO avoid this problem, switch the log vector buffer allocation to use kmem_alloc_large(). This will allow failed allocations to fall back to vmalloc and so remove the dependency on large contiguous regions of memory being available. This should prevent slowdowns and potential stalls when memory is low and/or fragmented. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-03-11 20:27:55 -07:00
Geliang Tang	cb3bee0369	pstore: Use crypto compress API In the pstore compression part, we use zlib/lzo/lz4/lz4hc/842 compression algorithm API to implement pstore compression backends. But there are many repeat codes in these implementations. This patch uses crypto compress API to simplify these codes. 1) rewrite allocate_buf_for_compression, free_buf_for_compression, pstore_compress, pstore_decompress functions using crypto compress API. 2) drop compress, decompress, allocate, free functions in pstore_zbackend, and add zbufsize function to get each different compress buffer size. 3) use late_initcall to call ramoops_init later, to make sure the crypto subsystem has already initialized. 4) use 'unsigned int' type instead of 'size_t' in pstore_compress, pstore_decompress functions' length arguments. 5) rename 'zlib' to 'deflate' to follow the crypto API's name convention. Signed-off-by: Geliang Tang <geliangtang@gmail.com> [kees: tweaked error messages on allocation failures and Kconfig help] Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-09 14:16:29 -08:00
Linus Torvalds	719ea86151	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "This fixes a corner case for NFS exporting (introduced in this cycle) as well as fixing miscellaneous bugs" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: update Kconfig texts ovl: redirect_dir=nofollow should not follow redirect for opaque lower ovl: fix ptr_ret.cocci warnings ovl: check ERR_PTR() return value from ovl_lookup_real() ovl: check lower ancestry on encode of lower dir file handle ovl: hash non-dir by lower inode for fsnotify	2018-03-09 09:46:14 -08:00
Linus Torvalds	2d9b1d69c3	Changes since last update: - Fix some iomap locking problems - Don't allocate cow blocks when we're zeroing file data -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCgAGBQJamePeAAoJEPh/dxk0SrTryPgP/0sh8xcAFUHR0ad7D74nSp3o 1gjBi6Yl+i+hn5SZ/gkKPa5YFjTMaefcTV8rpZm3uPYbTf75TTscjKDDczFNEqoh CDKOeyfOchVSjhRI5ExElBh5ToxDgh+HMzzlmxk+olfWA+BXjJj2J2iLh871v9Ym hQl7xOYUDPu1xZz+jLhDLKG2Gd/R97ez2KZM43gKMAUsQ7eUbz6CtV73cFT+tNfP jjIe3xp6ohbJYJalMThbNcI2mOOssnCM8BHUDBJib7N4CXucJYTMpDAcGvNoFKul +K2Icyip1/6CS4LkWDpP29ayiH+sTbZk8/ipioe27hXXHGqKG84HQOBefRZmo4o0 CC64SoomuxQX6V7qL6hf0Sz82QQbl+ToSliZ97xXP5o84/OxhggZGqVdykTR2+DB 4POaJzRPmYJKPO4Q0yT2worB4oepAUrridJnaQE0y9F80xnLTvJxpD53atLCHulj Ht9NlvnvqndDYaYjQwvucOmdR9vCos3OjZCaXnYgL+EpbgZUvMeQoxuyBUBTdQeQ CNGq3EDssI75gD67sWeyFc+gOST259XofANw3mvlu3KS2/huviORJkDYdtzi2/LL gkCN9OW9CmXOaAiylE8dSAu98SAR+c83wmahMeBA2mq5vYrQU4rIDGYZi/Mulne4 tD7eW5nlvps9Fsvf58fd =T9nM -----END PGP SIGNATURE----- Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Darrick Wong: - Fix some iomap locking problems - Don't allocate cow blocks when we're zeroing file data * tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't block on the ilock for RWF_NOWAIT xfs: don't start out with the exclusive ilock for direct I/O xfs: don't allocate COW blocks for zeroing holes or unwritten extents	2018-03-09 09:37:29 -08:00
Kyle Spiers	a9cee1764b	reiserfs: Remove VLA from fs/reiserfs/reiserfs.h Remove Variable Length Array from fs/reiserfs/reiserfs.h. EMPTY_DIR_SIZE is used as an array size and as it is using strlen() it need not be evaluated at compile time. Change it's definition to use sizeof() to force evaluation of array length at compile time. Signed-off-by: Kyle Spiers <kyle@spiers.me> Signed-off-by: Jan Kara <jack@suse.cz>	2018-03-09 13:04:20 +01:00
Trond Myklebust	c4f24df942	NFS: Fix unstable write completion We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to ensure that all outstanding COMMIT requests to the inode in question are complete. Currently we may exit early from both nfs_commit_inode() and nfs_write_inode() even if there are COMMIT requests in flight, or unstable writes on the commit list. In order to get the right semantics w.r.t. sync_inode(), we don't need to have nfs_commit_inode() reset the inode dirty flags when called from nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that nfs_write_inode() leaves them in the right state if there are outstanding commits, or stable pages. Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: `dc4fd9ab01` ("nfs: don't wait on commit in nfs_commit_inode()...") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-03-08 12:56:32 -05:00
Trond Myklebust	9c6376ebdd	pNFS: Prevent the layout header refcount going to zero in pnfs_roc() Ensure that we hold a reference to the layout header when processing the pNFS return-on-close so that the refcount value does not inadvertently go to zero. Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.10+ Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>	2018-03-08 12:56:31 -05:00
Trond Myklebust	d9ee65539d	NFS: Fix an incorrect type in struct nfs_direct_req The start offset needs to be of type loff_t. Fixed: `5fadeb47dc` ("nfs: count DIO good bytes correctly with mirroring") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2018-03-08 12:56:31 -05:00
Andreas Gruenbacher	d39d18e0ef	gfs2: Improve gfs2_block_map comment Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-08 09:26:20 -07:00
Bob Peterson	b9e03f1861	GFS2: Only set PageChecked for jdata pages Before this patch, GFS2 was setting the PageChecked flag for ordered write pages. This is unnecessary. The ext3 file system only does it for jdata, and it's only used in jdata circumstances. It only muddies the already murky waters of writing pages in the aops. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-08 09:26:20 -07:00
Bob Peterson	9bc980cdb9	GFS2: Make function gfs2_remove_from_ail static Function gfs2_remove_from_ail is only ever used from log.c, so there is no reason to declare it extern. This patch removes the extern and declares it static. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-08 09:26:20 -07:00
Andreas Gruenbacher	83998ccd9b	gfs2: Dirty source inode during rename Mark the source inode dirty during a rename instead of just updating the underlying buffer head. Otherwise, fsync may find the inode clean and will then skip flushing the journal. A subsequent power failure will cause the rename to be lost. This happens in command sequences like: xfs_io -f -c 'pwrite 0 4096' -c 'fsync' foo mv foo bar xfs_io -c 'fsync' bar # power failure Fixes xfstests generic/322, generic/376. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-08 09:26:20 -07:00
Andreas Gruenbacher	174d1232eb	gfs2: Fix fallocate chunk size The chunk size of allocations in __gfs2_fallocate is calculated incorrectly. The size can collapse, causing __gfs2_fallocate to allocate one block at a time, which is very inefficient. This needs fixing in two places: In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate that there is no quota limit. This fixes callers that rely on ap->allowed to be set even when quotas are off. In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the loop to make sure that allocation limits from one resource group won't spill over into another resource group. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-08 09:26:20 -07:00
Kees Cook	f2531f1976	pstore/ram: Do not use stack VLA for parity workspace Instead of using a stack VLA for the parity workspace, preallocate a memory region. The preallocation is done to keep from needing to perform allocations during crash dump writing, etc. This also fixes a missed release of librs on free. Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-07 12:47:06 -08:00
Kees Cook	fe1d475888	pstore: Select compression at runtime To allow for easier build test coverage and run-time testing, this allows multiple compression algorithms to be built into pstore. Still only one is supported to operate at a time (which can be selected at build time or at boot time, similar to how LSMs are selected). Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-07 12:43:35 -08:00
Andreas Gruenbacher	3b5da96e45	gfs2: Fixes to "Implement iomap for block_map" (2) It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for block_map"' introduced another bug in gfs2_iomap_begin that can cause gfs2_block_map to set bh->b_size of an actual buffer to 0. This can lead to arbitrary incorrect behavior including crashes or disk corruption. Revert the incorrect part of that commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2018-03-07 11:40:38 -07:00
Miklos Szeredi	36cd95dfa1	ovl: update Kconfig texts Add some hints about overlayfs kernel config options. Enabling NFS export by default is especially recommended against, as it incurs a performance penalty even if the filesystem is not actually exported. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-03-07 11:47:15 +01:00
Kees Cook	555974068e	pstore: Avoid size casts for 842 compression Instead of casting, make sure we don't end up with giant values and just perform regular assignments with unsigned int instead of re-cast size_t. Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-06 15:15:24 -08:00
Geliang Tang	239b716199	pstore: Add lz4hc and 842 compression support Currently, pstore has supported three compression algorithms: zlib, lzo and lz4. This patch added two more compression algorithms: lz4hc and 842. Signed-off-by: Geliang Tang <geliangtang@gmail.com> [kees: tweaked Kconfig help text slightly] Signed-off-by: Kees Cook <keescook@chromium.org>	2018-03-06 15:06:11 -08:00

... 7 8 9 10 11 ...

53508 Commits