linux

Commit Graph

Author	SHA1	Message	Date
youngjun yoo	193bea1dd8	fs: f2fs: changed variable type of offset "unsigned" to "loff_t" clean up checkpatch warning: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Signed-off-by: youngjun yoo <youngjun.willow@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Chao Yu	4d57b86dd8	f2fs: clean up symbol namespace As Ted reported: "Hi, I was looking at f2fs's sources recently, and I noticed that there is a very large number of non-static symbols which don't have a f2fs prefix. There's well over a hundred (see attached below). As one example, in fs/f2fs/dir.c there is: unsigned char get_de_type(struct f2fs_dir_entry *de) This function is clearly only useful for f2fs, but it has a generic name. This means that if any other file system tries to have the same symbol name, there will be a symbol conflict and the kernel would not successfully build. It also means that when someone is looking f2fs sources, it's not at all obvious whether a function such as read_data_page(), invalidate_blocks(), is a generic kernel function found in the fs, mm, or block layers, or a f2fs specific function. You might want to fix this at some point. Hopefully Kent's bcachefs isn't similarly using genericly named functions, since that might cause conflicts with f2fs's functions --- but just as this would be a problem that we would rightly insist that Kent fix, this is something that we should have rightly insisted that f2fs should have fixed before it was integrated into the mainline kernel. acquire_orphan_inode add_ino_entry add_orphan_inode allocate_data_block allocate_new_segments alloc_nid alloc_nid_done alloc_nid_failed available_free_memory ...." This patch adds "f2fs_" prefix for all non-static symbols in order to: a) avoid conflict with other kernel generic symbols; b) to indicate the function is f2fs specific one instead of generic one; Reported-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Chao Yu	2e79d951ff	f2fs: make set_de_type() static Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Chao Yu	fc99fe27b9	f2fs: make __f2fs_write_data_pages() static Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Chao Yu	9fd62605bb	f2fs: fix to avoid accessing cross the boundary Configure io_bits with 2 and enable LFS mode, generic/017 reports below dmesg: BUG: unable to handle kernel NULL pointer dereference at 00000039 pdpt = 000000002fcb2001 pde = 0000000000000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: crc32_generic zram f2fs(O) bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi pcbc snd_seq joydev aesni_intel aes_i586 snd_seq_device snd_timer crypto_simd cryptd snd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic usbhid psmouse hid e1000 CPU: 2 PID: 20779 Comm: xfs_io Tainted: G O 4.17.0-rc2 #38 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 EIP: is_checkpointed_data+0x84/0xd0 [f2fs] EFLAGS: 00010207 CPU: 2 EAX: 00000000 EBX: f5cd7000 ECX: fffffe32 EDX: 00000039 ESI: 000001cd EDI: ec95fb6c EBP: e264bd80 ESP: e264bd6c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: 00000039 CR3: 2fe55660 CR4: 000406f0 Call Trace: __exchange_data_block+0xb3f/0x1000 [f2fs] f2fs_fallocate+0xab9/0x16b0 [f2fs] vfs_fallocate+0x17c/0x2d0 ksys_fallocate+0x42/0x70 sys_fallocate+0x31/0x40 do_fast_syscall_32+0xaa/0x22c entry_SYSENTER_32+0x4c/0x7b EIP: 0xb7f98c51 EFLAGS: 00000293 CPU: 2 EAX: ffffffda EBX: 00000003 ECX: 00000008 EDX: 01001000 ESI: 00000000 EDI: 00001000 EBP: 00000000 ESP: bfc0357c DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b Code: 00 00 d3 e8 8b 4d ec 2b 02 8b 55 f0 6b c0 1c 03 41 70 29 d6 8b 93 d0 06 00 00 8b 40 0c 83 ea 01 21 d6 89 f2 89 f1 c1 ea 03 f7 d1 <0f> be 14 10 83 e1 07 b8 01 00 00 00 d3 e0 85 c2 89 f8 0f 95 c3 EIP: is_checkpointed_data+0x84/0xd0 [f2fs] SS:ESP: 0068:e264bd6c CR2: 0000000000000039 ---[ end trace 9a4d4087cce6080a ]--- This is because in recovery flow of __exchange_data_block, we didn't pass olen to __roll_back_blkaddrs, instead we passed len, which indicates wrong array size, result in copying 
random block address into dnode page. Later, once that random block address was accessed by is_checkpointed_data, it can cause NULL pointer dereference. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Chao Yu	fe16efe6a7	f2fs: fix to let caller retry allocating block address Configure io_bits with 2 and enable LFS mode, generic/013 reports below dmesg: BUG: unable to handle kernel NULL pointer dereference at 00000104 pdpt = 0000000029b7b001 pde = 0000000000000000 Oops: 0002 [#1] PREEMPT SMP Modules linked in: crc32_generic zram f2fs(O) rfcomm bnep bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev snd_seq_device aesni_intel snd_timer aes_i586 snd crypto_simd cryptd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 CPU: 0 PID: 11161 Comm: fsstress Tainted: G O 4.17.0-rc2 #38 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs] EFLAGS: 00010206 CPU: 0 EAX: e863dcd8 EBX: 00000000 ECX: 00000100 EDX: 00000200 ESI: e863dcf4 EDI: f6f82768 EBP: e863dbb0 ESP: e863db74 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: 00000104 CR3: 29a62020 CR4: 000406f0 Call Trace: do_write_page+0x6f/0xc0 [f2fs] write_data_page+0x4a/0xd0 [f2fs] do_write_data_page+0x327/0x630 [f2fs] __write_data_page+0x34b/0x820 [f2fs] __f2fs_write_data_pages+0x42d/0x8c0 [f2fs] f2fs_write_data_pages+0x27/0x30 [f2fs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0x94/0xd0 filemap_write_and_wait_range+0x3d/0xa0 __generic_file_write_iter+0x11a/0x1f0 f2fs_file_write_iter+0xdd/0x3b0 [f2fs] __vfs_write+0xd2/0x150 vfs_write+0x9b/0x190 ksys_write+0x45/0x90 sys_write+0x16/0x20 do_fast_syscall_32+0xaa/0x22c entry_SYSENTER_32+0x4c/0x7b EIP: 0xb7fc8c51 EFLAGS: 00000246 CPU: 0 EAX: ffffffda EBX: 00000003 ECX: 09cde000 EDX: 00001000 ESI: 00000003 EDI: 00001000 EBP: 00000000 ESP: bfbded38 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b Code: e8 f9 77 34 c9 8b 45 e0 8b 80 b8 00 00 00 39 45 d8 0f 84 bb 02 00 00 8b 45 e0 8b 80 b8 00 00 00 8d 50 d8 8b 08 89 55 f0 8b 50 04 <89> 51 04 89 0a c7 00 00 01 
00 00 c7 40 04 00 02 00 00 8b 45 dc EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs] SS:ESP: 0068:e863db74 CR2: 0000000000000104 ---[ end trace 4cac79c0d1305ee6 ]--- allocate_data_block will submit all sequential pending IOs sorted by a FIFO list, If we failed to submit other user's IO due to unaligned write, we will retry to allocate new block address for current IO, then it will initialize fio.list again, if fio was in the list before, it can break FIFO list, result in above panic. Thread A Thread B - do_write_page - allocate_data_block - list_add_tail : fioA cached in FIFO list. - do_write_page - allocate_data_block - list_add_tail : fioB cached in FIFO list. - f2fs_submit_page_write : fail to submit IO - allocate_data_block - INIT_LIST_HEAD - f2fs_submit_page_write - list_del <-- NULL pointer dereference This patch adds fio.retry parameter to indicate failure status for each IO, and avoid bailing out if there is still pending IO in FIFO list for fixing. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:53 -07:00
Anatoly Pugachev	4071e67cff	disable loading f2fs module on PAGE_SIZE > 4KB The following patch disables loading of f2fs module on architectures which have PAGE_SIZE > 4096 , since it is impossible to mount f2fs on such architectures , log messages are: mount: /mnt: wrong fs type, bad option, bad superblock on /dev/vdiskb1, missing codepage or helper program, or other error. /dev/vdiskb1: F2FS filesystem, UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name "" May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS filesystem in 1th superblock May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS filesystem in 2th superblock May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid page_cache_size (8192), supports only 4KB which was introduced by git commit `5c9b469295` tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425 with patch applied: modprobe: ERROR: could not insert 'f2fs': Invalid argument May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096 Signed-off-by: Anatoly Pugachev <matorola@gmail.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Chao Yu	14a28559f4	f2fs: fix error path of move_data_page This patch fixes error path of move_data_page: - clear cold data flag if it fails to write page. - redirty page for non-ENOMEM case. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Chao Yu	1174abfd83	f2fs: don't drop dentry pages after fs shutdown As described in commit "f2fs: don't drop any page on f2fs_cp_error() case": "We still provide readdir() after shutdown, so we should keep pages to avoid additional IOs." In order to provide the latest directory structure, let's keep dentry pages in cache after fs shutdown. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Chao Yu	250dbf5151	f2fs: fix to avoid race during access gc_thread pointer Thread A Thread B - f2fs_remount - stop_gc_thread - f2fs_sbi_store sbi->gc_thread = NULL; access sbi->gc_thread->gc_* Previously, we allocated memory for sbi->gc_thread based on the background gc thread mount option, and the memory can be released if we turn off that mount option, but there are still several places that access the gc_thread pointer without considering the race condition, resulting in a NULL pointer dereference. In order to fix this issue, use sb->s_umount to exclude those operations. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Chao Yu	aec2f729fc	f2fs: clean up with clear_radix_tree_dirty_tag Introduce clear_radix_tree_dirty_tag to include common codes for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Chao Yu	64c74a7ab5	f2fs: fix to don't trigger writeback during recovery - f2fs_fill_super - recover_fsync_data - recover_data - del_fsync_inode - iput - iput_final - write_inode_now - f2fs_write_inode - f2fs_balance_fs - f2fs_balance_fs_bg - sync_dirty_inodes With data_flush mount option, during recovery, in order to avoid entering above writeback flow, let's detect recovery status and do skip in f2fs_balance_fs_bg. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Sheng Yong	35a9a766a1	f2fs: clear discard_wake earlier If SBI_NEED_FSCK is set, discard_wake will never be cleared. As a result, the condition of wait_event_interruptible_timeout() is always true, which makes the discard thread run too frequently. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:52 -07:00
Yunlei He	f9d1dced75	f2fs: let discard thread wait a little longer if dev is busy This patch modify discard thread wait policy as below: issued io_interrupted wait time(ms) 1. 8 0 50 2. (0,8) 1 50 3. 0 1 500 (dev is busy) 4. 0 0 60000 (no candidates) Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
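The wait policy table in the commit above (f9d1dced75) can be read as a small decision function. The following is a minimal sketch only, not the kernel code: the helper name `discard_wait_ms` and the macro names are hypothetical, and treating every nonzero `issued` count as the 50 ms case is an assumption, since the table lists only some combinations.

```c
#include <assert.h>

/* Wait times in milliseconds, taken from the policy table in the commit. */
#define WAIT_MIN_MS   50u     /* rows 1 and 2: some discards were issued */
#define WAIT_BUSY_MS  500u    /* row 3: nothing issued, device is busy */
#define WAIT_IDLE_MS  60000u  /* row 4: nothing issued, no candidates */

/*
 * Hypothetical helper: choose the next wait time for the discard thread.
 * issued         - discard commands issued in this round (0..8 in the table)
 * io_interrupted - nonzero if issuing was interrupted by other I/O
 */
static unsigned int discard_wait_ms(unsigned int issued, int io_interrupted)
{
	assert(issued <= 8);

	if (issued > 0)
		return WAIT_MIN_MS;  /* assumption: any progress means a short wait */
	if (io_interrupted)
		return WAIT_BUSY_MS; /* device busy, back off moderately */
	return WAIT_IDLE_MS;         /* nothing to do, sleep long */
}
```

For example, `discard_wait_ms(0, 1)` corresponds to row 3 of the table, the "dev is busy" case.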
Chao Yu	2ef79ecb5e	f2fs: avoid stucking GC due to atomic write f2fs doesn't allow abuse of the atomic write class interface, so besides limiting in-mem pages' total memory usage, we need to limit atomic-write usage as well when the filesystem is seriously fragmented; otherwise we may run into an infinite loop during foreground GC because target blocks in the victim segment belong to a file that has been atomically opened for a long time. Now we detect failures due to atomic write in foreground GC; if the count exceeds a threshold, we drop all atomic written data in cache. By this, I expect it can keep our system running safely and prevent DoS attacks. In addition, this patch shows GC skip information in debugfs; for now it just shows the count of skips caused by atomic write. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
Jaegeuk Kim	5b0e95398e	f2fs: introduce sbi->gc_mode to determine the policy This is to avoid sbi->gc_thread pointer access. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
Chao Yu	107a805de8	f2fs: keep migration IO order in LFS mode For non-migration IO, we keep the submission order of data/node blocks the same as the allocation sequence by sorting IOs in the per-log io_list list, but migration IO could be out of order. In LFS mode, we should keep all IOs, including migration IO, ordered, so this patch adds an additional lock to keep the submission order. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
Chao Yu	e5e5732d81	f2fs: fix to wait page writeback during revoking atomic write After revoking atomic write, related LBA can be reused by others, so we need to wait page writeback before reusing the LBA, in order to avoid interference between old atomic written in-flight IO and new IO. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
Sahitya Tummala	60b2b4ee2b	f2fs: Fix deadlock in shutdown ioctl f2fs_ioc_shutdown() ioctl gets stuck in the below path when issued with F2FS_GOING_DOWN_FULLSYNC option. __switch_to+0x90/0xc4 percpu_down_write+0x8c/0xc0 freeze_super+0xec/0x1e4 freeze_bdev+0xc4/0xcc f2fs_ioctl+0xc0c/0x1ce0 f2fs_compat_ioctl+0x98/0x1f0 Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:51 -07:00
Chao Yu	f8de433122	f2fs: detect synchronous writeback more earlier This patch detects synchronous writeback earlier than before, in order to avoid unnecessary page writeback before exiting asynchronous writeback. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
Chao Yu	7b525dd013	f2fs: clean up with is_valid_blkaddr() - rename is_valid_blkaddr() to is_valid_meta_blkaddr() for readability. - introduce is_valid_blkaddr() for cleanup. No logic change in this patch. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
Chao Yu	5ad25442b6	f2fs: fix to initialize min_mtime with ULLONG_MAX Since sit_i.min_mtime's type is unsigned long long, so we should initialize it with max value of the type ULLONG_MAX instead of LLONG_MAX. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
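The reasoning in the commit above (5ad25442b6) can be illustrated with a running-minimum search. This is a sketch under assumptions: `min_of` is a hypothetical helper, not an f2fs function, and the seed value plays the role of the initial `min_mtime`.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: running minimum over n unsigned 64-bit values,
 * starting from the seed value 'init'.
 */
static unsigned long long min_of(const unsigned long long *vals,
				 unsigned int n, unsigned long long init)
{
	unsigned long long min = init;

	assert(n == 0 || vals);
	for (unsigned int i = 0; i < n; i++)
		if (vals[i] < min)
			min = vals[i];
	return min;
}
```

With values that all lie above `LLONG_MAX`, seeding with `LLONG_MAX` returns the seed rather than any real element, while seeding with `ULLONG_MAX` returns the true minimum. That is the reason to seed with the maximum of the variable's own type.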
Chao Yu	e7a4feb0ab	f2fs: fix to let checkpoint guarantee atomic page persistence 1. thread A: commit_inmem_pages submits data into the block layer, but hasn't waited for its writeback. 2. thread A: commit_inmem_pages updates the related node. 3. thread B: does a checkpoint, flushing all nodes to disk. 4. SPOR Then the atomic file becomes corrupted since nodes are flushed before data. This patch treats an atomic page as a checkpoint-guaranteed one, so that in checkpoint we can make sure all atomic pages are written back along with the metadata of the atomic file. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
Chao Yu	1c41e6808e	f2fs: fix to initialize i_current_depth according to inode type i_current_depth is used only for directory inode, but its space is shared with i_gc_failures field used for regular inode, in order to avoid affecting i_gc_failures' value, this patch fixes to initialize the union's fields according to inode type. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
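The shared-storage issue described in the commit above (1c41e6808e) can be sketched with a small union. The struct below is a simplified, hypothetical stand-in for the real inode info structure; the field widths and the initial values (depth 1 for a directory, 0 failures for a regular file) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical stand-in for the in-memory inode info. */
struct inode_info_sketch {
	bool is_dir;
	union {
		unsigned int i_current_depth; /* meaningful for directories only */
		unsigned int i_gc_failures;   /* meaningful for regular files only */
	};
};

/* Initialize only the union member that matches the inode type. */
static void init_new_inode(struct inode_info_sketch *fi, bool is_dir)
{
	assert(fi);
	fi->is_dir = is_dir;
	if (is_dir)
		fi->i_current_depth = 1; /* assumed initial depth of a directory */
	else
		fi->i_gc_failures = 0;   /* a regular file starts with no GC failures */
}
```

Writing `i_current_depth` unconditionally would silently store a nonzero GC failure count in every regular file, because both names refer to the same bytes.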
Chao Yu	299254d85d	Revert "f2fs: add ovp valid_blocks check for bg gc victim to fg_gc" For an extreme case: 10 sections, op = 10%, no_fggc_threshold = 90% All section usage: 85% 85% 85% 85% 90% 90% 95% 95% 95% 95% During foreground GC, if we skip selecting dirty sections whose usage is larger than no_fggc_threshold, we can only recycle 80% invalid space from the four 85% usage sections and two 90% usage sections, resulting in an out-of-space issue. This reverts commit `e93b986525` to fix this issue; besides, we keep the logic that scans all dirty sections when searching for a victim, so that GC can select the victim with the least valid blocks. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:50 -07:00
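The arithmetic behind the "80%" figure in the revert above (299254d85d) can be checked with a short sketch. `reclaimable_pct` is a hypothetical helper that sums invalid space, in percent of one section, over the sections whose usage is not larger than the threshold.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum the reclaimable (invalid) space, in percent of one section, over the
 * sections that foreground GC may select. A section is skipped when its
 * usage is larger than threshold_pct.
 */
static unsigned int reclaimable_pct(const unsigned int *usage_pct, size_t n,
				    unsigned int threshold_pct)
{
	unsigned int total = 0;

	for (size_t i = 0; i < n; i++) {
		assert(usage_pct[i] <= 100);
		if (usage_pct[i] > threshold_pct)
			continue;            /* skipped by the threshold check */
		total += 100 - usage_pct[i]; /* invalid share of this section */
	}
	return total;
}
```

With the ten usage values from the commit message and a threshold of 90, the eligible sections contribute 4 x 15 + 2 x 10 = 80. With the threshold check effectively disabled (threshold 100), the four 95% sections add another 4 x 5 = 20, for a total of 100, that is, one full section's worth of invalid space.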
Jaegeuk Kim	868de6135f	f2fs: don't drop any page on f2fs_cp_error() case We still provide readdir() after shutdown, so we should keep pages to avoid additional IOs. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
Colin Ian King	4580038e2d	f2fs: fix spelling mistake: "extenstion" -> "extension" Trivial fix to spelling mistake in extension list text Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
Jaegeuk Kim	0cfe75c5b0	f2fs: enhance sanity_check_raw_super() to avoid potential overflows In order to avoid the below overflow issue, we should have checked the boundaries in superblock before reaching out to allocation. As Linus suggested, the right place should be sanity_check_raw_super(). Dr Silvio Cesare of InfoSect reported: There are integer overflows with using the cp_payload superblock field in the f2fs filesystem potentially leading to memory corruption. include/linux/f2fs_fs.h struct f2fs_super_block { ... __le32 cp_payload; fs/f2fs/f2fs.h typedef u32 block_t; /* * should not change u32, since it is the on-disk block * address format, __le32. / ... static inline block_t __cp_payload(struct f2fs_sb_info sbi) { return le32_to_cpu(F2FS_RAW_SUPER(sbi)->cp_payload); } fs/f2fs/checkpoint.c block_t start_blk, orphan_blocks, i, j; ... start_blk = __start_cp_addr(sbi) + 1 + __cp_payload(sbi); orphan_blocks = __start_sum_addr(sbi) - 1 - __cp_payload(sbi); +++ integer overflows ... unsigned int cp_blks = 1 + __cp_payload(sbi); ... sbi->ckpt = kzalloc(cp_blks * blk_size, GFP_KERNEL); +++ integer overflow leading to incorrect heap allocation. int cp_payload_blks = __cp_payload(sbi); ... ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks + orphan_blocks); +++ sign bug and integer overflow ... for (i = 1; i < 1 + cp_payload_blks; i++) +++ integer overflow ... sbi->max_orphans = (sbi->blocks_per_seg - F2FS_CP_PACKS - NR_CURSEG_TYPE - __cp_payload(sbi)) * F2FS_ORPHANS_PER_BLOCK; +++ integer overflow Reported-by: Greg KH <greg@kroah.com> Reported-by: Silvio Cesare <silvio.cesare@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
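The report quoted in the commit above (0cfe75c5b0) points at `1 + __cp_payload(sbi)` and `cp_blks * blk_size` as overflow sites. The following is a generic sketch of validating such a value before it is used in an allocation size. It is not the check the patch adds to sanity_check_raw_super(); the helper name `cp_alloc_size_ok` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: compute (1 + cp_payload) * blk_size only if neither
 * the 32-bit addition nor the size_t multiplication can wrap around.
 * On success the size is stored in *out_size and true is returned.
 */
static bool cp_alloc_size_ok(uint32_t cp_payload, size_t blk_size,
			     size_t *out_size)
{
	uint32_t cp_blks;

	assert(out_size);
	if (blk_size == 0)
		return false;
	if (cp_payload == UINT32_MAX)
		return false;            /* 1 + cp_payload would wrap to 0 */
	cp_blks = 1 + cp_payload;
	if (cp_blks > SIZE_MAX / blk_size)
		return false;            /* product would not fit in size_t */
	*out_size = (size_t)cp_blks * blk_size;
	return true;
}
```

Rejecting the on-disk value up front keeps every later use of it, such as loop bounds and allocation sizes, within a known range, which matches the commit's stated goal of checking boundaries in the superblock before reaching allocation.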
Chao Yu	b4c3ca8ba9	f2fs: treat volatile file's data as hot one A volatile file's data will be updated often, so it is better to place its data into the hot data segment. In addition, for atomic files, we now check FI_ATOMIC_FILE instead of FI_HOT_DATA to improve code readability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
Chao Yu	af8ff65bb8	f2fs: introduce release_discard_addr() for cleanup Introduce release_discard_addr() to include common codes for cleanup. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Fengguang Wu: declare static function, reported by kbuild test robot] Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
Chao Yu	a9af3fdcc4	f2fs: fix potential overflow In build_sit_entries(), if valid_blocks in the SIT block is smaller than valid_blocks in the journal, the calculation below: sbi->discard_blks += old_valid_blocks - se->valid_blocks; can overflow twice: - old_valid_blocks - se->valid_blocks will overflow and become a very large number. - sbi->discard_blks += result will overflow again and accidentally come out with a correct result. Anyway, it should be fixed. Fixes: `d600af236d` ("f2fs: avoid unneeded loop in build_sit_entries") Fixes: `1f43e2ad7b` ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
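The double wraparound described in the commit above (a9af3fdcc4) can be reproduced with plain unsigned arithmetic. This is a sketch that assumes all three quantities are `unsigned int`; the two helper names are hypothetical, and the explicit form is one possible way to avoid the wraparound, not necessarily the one the patch uses.

```c
#include <assert.h>
#include <limits.h>

/* Original form: relies on two unsigned wraparounds cancelling out. */
static unsigned int add_delta_wrapping(unsigned int discard_blks,
				       unsigned int old_valid,
				       unsigned int new_valid)
{
	/* old_valid - new_valid wraps to a huge value when old < new */
	return discard_blks + (old_valid - new_valid);
}

/* Explicit form: branch on which count is larger, so nothing wraps. */
static unsigned int add_delta_explicit(unsigned int discard_blks,
				       unsigned int old_valid,
				       unsigned int new_valid)
{
	if (old_valid >= new_valid)
		return discard_blks + (old_valid - new_valid);
	assert(discard_blks >= new_valid - old_valid);
	return discard_blks - (new_valid - old_valid);
}
```

With `discard_blks = 10`, `old_valid = 3` and `new_valid = 5`, the intermediate difference in the first form is `UINT_MAX - 1`, yet both functions return 8. The first form reaches that value only because unsigned arithmetic is modular.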
Chao Yu	b2532c6940	f2fs: rename dio_rwsem to i_gc_rwsem The RW semaphore dio_rwsem in struct f2fs_inode_info was introduced to avoid a race between dio and data gc, but now it is more widely used to avoid races between foreground operations and data gc. So rename it to i_gc_rwsem to improve readability. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:49 -07:00
Yunlei He	b82f6e347b	f2fs: move mnt_want_write_file after range check This patch moves mnt_want_write_file after the range check, since there is no need to hold it while checking arguments. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Yunlei He	cba41be08c	f2fs: fix missing clear FI_NO_PREALLOC in some error case This patch fixes a missing clear of FI_NO_PREALLOC in some error cases. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Jaegeuk Kim	ade990f95e	f2fs: enforce fsync_mode=strict for renamed directory This is to give a option for user to be able to recover B/foo in the below case. mkdir A sync() rename(A, B) creat (B/foo) fsync (B/foo) ---crash--- Sugessted-by: Velayudhan Pillai <vijay@cs.utexas.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Jaegeuk Kim	8a29c1260e	f2fs: sanity check for total valid node blocks This patch enhances sanity check for SIT entries. syzbot hit the following crash on upstream commit `83beed7b2b` (Fri Apr 20 17:56:32 2018 +0000) Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=bf9253040425feb155ad syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5692130282438656 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5095924598571008 Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118 compiler: gcc (GCC) 8.0.1 20180413 (experimental) IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. F2FS-fs (loop0): invalid crc value F2FS-fs (loop0): Try to recover 1th superblock, ret: 0 F2FS-fs (loop0): Mounted with checkpoint version = d F2FS-fs (loop0): Bitmap was wrongly cleared, blk:9740 ------------[ cut here ]------------ kernel BUG at fs/f2fs/segment.c:1884! 
invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 4508 Comm: syz-executor0 Not tainted 4.17.0-rc1+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: 0018:ffff8801af526708 EFLAGS: 00010282 RAX: ffffed0035ea4cc0 RBX: ffff8801ad454f90 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff82eeb87e RDI: ffffed0035ea4cb6 RBP: ffff8801af526760 R08: ffff8801ad4a2480 R09: ffffed003b5e4f90 R10: ffffed003b5e4f90 R11: ffff8801daf27c87 R12: ffff8801adb8d380 R13: 0000000000000001 R14: 0000000000000008 R15: 00000000ffffffff FS: 00000000014af940(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f06bc223000 CR3: 00000001adb02000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: allocate_data_block+0x66f/0x2050 fs/f2fs/segment.c:2663 do_write_page+0x105/0x1b0 fs/f2fs/segment.c:2727 write_node_page+0x129/0x350 fs/f2fs/segment.c:2770 __write_node_page+0x7da/0x1370 fs/f2fs/node.c:1398 sync_node_pages+0x18cf/0x1eb0 fs/f2fs/node.c:1652 block_operations+0x429/0xa60 fs/f2fs/checkpoint.c:1088 write_checkpoint+0x3ba/0x5380 fs/f2fs/checkpoint.c:1405 f2fs_sync_fs+0x2fb/0x6a0 fs/f2fs/super.c:1077 __sync_filesystem fs/sync.c:39 [inline] sync_filesystem+0x265/0x310 fs/sync.c:67 generic_shutdown_super+0xd7/0x520 fs/super.c:429 kill_block_super+0xa4/0x100 fs/super.c:1191 kill_f2fs_super+0x9f/0xd0 fs/f2fs/super.c:3030 deactivate_locked_super+0x97/0x100 fs/super.c:316 deactivate_super+0x188/0x1b0 fs/super.c:347 cleanup_mnt+0xbf/0x160 fs/namespace.c:1174 __cleanup_mnt+0x16/0x20 fs/namespace.c:1181 task_work_run+0x1e4/0x290 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop+0x2bd/0x310 
arch/x86/entry/common.c:166 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457d97 RSP: 002b:00007ffd46f9c8e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000457d97 RDX: 00000000014b09a3 RSI: 0000000000000002 RDI: 00007ffd46f9da50 RBP: 00007ffd46f9da50 R08: 0000000000000000 R09: 0000000000000009 R10: 0000000000000005 R11: 0000000000000246 R12: 00000000014b0940 R13: 0000000000000000 R14: 0000000000000002 R15: 000000000000658e RIP: update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: ffff8801af526708 ---[ end trace f498328bb02610a2 ]--- Reported-and-tested-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+7d6d31d3bc702f566ce3@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+0a725420475916460f12@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Jaegeuk Kim	b2ca374f33	f2fs: sanity check on sit entry syzbot hit the following crash on upstream commit `87ef12027b` (Wed Apr 18 19:48:17 2018 +0000) Merge tag 'ceph-for-4.17-rc2' of git://github.com/ceph/ceph-client syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=83699adeb2d13579c31e C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5805208181407744 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=6005073343676416 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6555047731134464 Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118 compiler: gcc (GCC) 8.0.1 20180413 (experimental) IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0) F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock F2FS-fs (loop0): invalid crc value BUG: unable to handle kernel paging request at ffffed006b2a50c0 PGD 21ffee067 P4D 21ffee067 PUD 21fbeb067 PMD 0 Oops: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 4514 Comm: syzkaller989480 Not tainted 4.17.0-rc1+ #8 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:build_sit_entries fs/f2fs/segment.c:3653 [inline] RIP: 0010:build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852 RSP: 0018:ffff8801b102e5b0 EFLAGS: 00010a06 RAX: 1ffff1006b2a50c0 RBX: 0000000000000004 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801ac74243e RBP: ffff8801b102f410 R08: ffff8801acbd46c0 R09: fffffbfff14d9af8 R10: fffffbfff14d9af8 R11: ffff8801acbd46c0 R12: ffff8801ac742a80 R13: ffff8801d9519100 R14: dffffc0000000000 R15: ffff880359528600 FS: 
0000000001e04880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffed006b2a50c0 CR3: 00000001ac6ac000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: f2fs_fill_super+0x4095/0x7bf0 fs/f2fs/super.c:2803 mount_bdev+0x30c/0x3e0 fs/super.c:1165 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020 mount_fs+0xae/0x328 fs/super.c:1268 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0x564/0x3070 fs/namespace.c:2847 ksys_mount+0x12d/0x140 fs/namespace.c:3063 __do_sys_mount fs/namespace.c:3077 [inline] __se_sys_mount fs/namespace.c:3074 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x443d6a RSP: 002b:00007ffd312813c8 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443d6a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd312813d0 RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004 R13: 0000000000402c60 R14: 0000000000000000 R15: 0000000000000000 RIP: build_sit_entries fs/f2fs/segment.c:3653 [inline] RSP: ffff8801b102e5b0 RIP: build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852 RSP: ffff8801b102e5b0 CR2: ffffed006b2a50c0 ---[ end trace a2034989e196ff17 ]--- Reported-and-tested-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Jaegeuk Kim	5d64600d4f	f2fs: avoid bug_on on corrupted inode syzbot has tested the proposed patch but the reproducer still triggered crash: kernel BUG at fs/f2fs/inode.c:LINE! F2FS-fs (loop1): invalid crc value F2FS-fs (loop5): Magic Mismatch, valid(0xf2f52010) - read(0x0) F2FS-fs (loop5): Can't find valid F2FS filesystem in 1th superblock F2FS-fs (loop5): invalid crc value ------------[ cut here ]------------ kernel BUG at fs/f2fs/inode.c:238! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 4886 Comm: syz-executor1 Not tainted 4.17.0-rc1+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:do_read_inode fs/f2fs/inode.c:238 [inline] RIP: 0010:f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313 RSP: 0018:ffff8801c44a70e8 EFLAGS: 00010293 RAX: ffff8801ce208040 RBX: ffff8801b3621080 RCX: ffffffff82eace18 F2FS-fs (loop2): Magic Mismatch, valid(0xf2f52010) - read(0x0) RDX: 0000000000000000 RSI: ffffffff82eaf047 RDI: 0000000000000007 RBP: ffff8801c44a7410 R08: ffff8801ce208040 R09: ffffed0039ee4176 R10: ffffed0039ee4176 R11: ffff8801cf720bb7 R12: ffff8801c0efa000 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f753aa9d700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 ------------[ cut here ]------------ CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel BUG at fs/f2fs/inode.c:238! 
CR2: 0000000001b03018 CR3: 00000001c8b74000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: f2fs_fill_super+0x4377/0x7bf0 fs/f2fs/super.c:2842 mount_bdev+0x30c/0x3e0 fs/super.c:1165 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020 mount_fs+0xae/0x328 fs/super.c:1268 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0x564/0x3070 fs/namespace.c:2847 ksys_mount+0x12d/0x140 fs/namespace.c:3063 __do_sys_mount fs/namespace.c:3077 [inline] __se_sys_mount fs/namespace.c:3074 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457daa RSP: 002b:00007f753aa9cba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 0000000000457daa RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f753aa9cbf0 RBP: 0000000000000064 R08: 0000000020016a00 R09: 0000000020000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 0000000000000064 R14: 00000000006fcb80 R15: 0000000000000000 RIP: do_read_inode fs/f2fs/inode.c:238 [inline] RSP: ffff8801c44a70e8 RIP: f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313 RSP: ffff8801c44a70e8 invalid opcode: 0000 [#2] SMP KASAN ---[ end trace 1cbcbec2156680bc ]--- Reported-and-tested-by: syzbot+41a1b341571f0952badb@syzkaller.appspotmail.com Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Jaegeuk Kim	a4f843bd00	f2fs: give message and set need_fsck given broken node id syzbot hit the following crash on upstream commit `83beed7b2b` (Fri Apr 20 17:56:32 2018 +0000) Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887 C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5436660795834368 Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118 compiler: gcc (GCC) 8.0.1 20180413 (experimental) IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0) F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock F2FS-fs (loop0): invalid crc value ------------[ cut here ]------------ kernel BUG at fs/f2fs/node.c:1185! 
invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: 0018:ffff8801d960e820 EFLAGS: 00010293 RAX: ffff8801d88205c0 RBX: 0000000000000003 RCX: ffffffff82f6cc06 RDX: 0000000000000000 RSI: ffffffff82f6d5e8 RDI: 0000000000000004 RBP: ffff8801d960ec30 R08: ffff8801d88205c0 R09: ffffed003b5e46c2 R10: 0000000000000003 R11: 0000000000000003 R12: ffff8801a86e00c0 R13: 0000000000000001 R14: ffff8801a86e0530 R15: ffff8801d9745240 FS: 000000000072c880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3d403209b8 CR3: 00000001d8f3f000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: get_node_page fs/f2fs/node.c:1237 [inline] truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014 remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039 f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547 evict+0x4a6/0x960 fs/inode.c:557 iput_final fs/inode.c:1519 [inline] iput+0x62d/0xa80 fs/inode.c:1545 f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849 mount_bdev+0x30c/0x3e0 fs/super.c:1164 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020 mount_fs+0xae/0x328 fs/super.c:1267 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:1027 [inline] do_new_mount fs/namespace.c:2518 [inline] do_mount+0x564/0x3070 fs/namespace.c:2848 ksys_mount+0x12d/0x140 fs/namespace.c:3064 __do_sys_mount fs/namespace.c:3078 [inline] __se_sys_mount fs/namespace.c:3075 [inline] __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x443dea RSP: 002b:00007ffcc7882368 EFLAGS: 
00000297 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443dea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffcc7882370 RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004 R13: 0000000000402ce0 R14: 0000000000000000 R15: 0000000000000000 RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: ffff8801d960e820 ---[ end trace 4edbeb71f002bb76 ]--- Reported-and-tested-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:48 -07:00
Chao Yu	cf52b27a39	f2fs: clean up commit_inmem_pages() This patch moves error handling from commit_inmem_pages() into __commit_inmem_page() for cleanup, no logic change. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:47 -07:00
Sheng Yong	eff15c2aec	f2fs: do not check F2FS_INLINE_DOTS in recover Only dir may have F2FS_INLINE_DOTS flag, so there is no need to check the flag in recover flow. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:47 -07:00
Sheng Yong	a515d12f1b	f2fs: remove duplicated dquot_initialize and fix error handling This patch removes duplicated dquot_initialize in recover_orphan_inode(), and fix the error handling if dquot_initialize fails. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:47 -07:00
Chao Yu	c22aecd759	f2fs: fix to detect failure of dquot_initialize dquot_initialize() can fail due to any exception inside quota subsystem, f2fs needs to be aware of it, and return correct return value to caller. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:47 -07:00
Yunlei He	d618477473	f2fs: stop issue discard if something wrong with f2fs v4->v5: move the data corruption check to __submit_discard_cmd in order to control submitted discard IO more accurately; also increase the async thread wait time if data corruption is detected. This patch stops the async thread and the umount process from issuing discards if something is wrong with f2fs, which is similar to fstrim. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:47 -07:00
Chao Yu	b169c3c560	f2fs: fix return value in f2fs_ioc_commit_atomic_write In f2fs_ioc_commit_atomic_write, if file is volatile, return -EINVAL to indicate that commit failure. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Yunlei He	054afda999	f2fs: allocate hot_data for atomic write more strictly If a file that is not marked as hot has more dirty pages than the threshold (64) before an atomic write starts, it may lose the hot flag. v1->v2: also set the FI_ATOMIC_FILE flag only after flushing dirty pages, so that pages dirtied before the atomic write starts are not written back in atomic mode. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Sheng Yong	d0891e84e1	f2fs: check if inmem_pages list is empty correctly `cur' will never be NULL, we should check inmem_pages list instead. Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Chao Yu	27319ba404	f2fs: fix race in between GC and atomic open Thread GC thread - f2fs_ioc_start_atomic_write - get_dirty_pages - filemap_write_and_wait_range - f2fs_gc - do_garbage_collect - gc_data_segment - move_data_page - f2fs_is_atomic_file - set_page_dirty - set_inode_flag(, FI_ATOMIC_FILE) Dirty data page can still be generated by GC in race condition as above call stack. This patch adds fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write to avoid such race. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Souptick Joarder	ea4d479bb3	fs: f2fs: Adding new return type vm_fault_t Use new return type vm_fault_t for page_mkwrite and fault handler. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Zhikang Zhang	d6964949e4	f2fs: change le32 to le16 of f2fs_inode->i_extra_size In the structure of f2fs_inode, i_extra_size's type is __le16, so we should keep type consistent when using it. Fixes: `704956ecf5` ("f2fs: support inode checksum") Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com> Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Zhikang Zhang	56b07e7e65	f2fs: check cur_valid_map_mir & raw_sit block count when flush sit entries We should check valid_map_mir and block count to ensure the flushed raw_sit is correct. Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com> Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:46 -07:00
Chao Yu	3d165dc3ae	f2fs: correct return value of f2fs_trim_fs Correct return value in two cases: - return EINVAL if end boundary is out-of-range. - return EIO if fs needs off-line check. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Chao Yu	764811581d	f2fs: fix to show missing bits in FS_IOC_GETFLAGS This patch fixes to show missing encrypt/inline_data flag in FS_IOC_GETFLAGS like ext4 does. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Chao Yu	c807a7cb54	f2fs: remove unneeded F2FS_PROJINHERIT_FL Now F2FS_FL_USER_VISIBLE and F2FS_FL_USER_MODIFIABLE has included F2FS_PROJINHERIT_FL, so remove unneeded F2FS_PROJINHERIT_FL when using visible/modifiable flag macro. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Chao Yu	81114baa83	f2fs: don't use GFP_ZERO for page caches Related to https://lkml.org/lkml/2018/4/8/661 Sometimes, we need to write meta data to new allocated block address, then we will allocate a zeroed page in inner inode's address space, and fill partial data in it, and leave other place with zero value which means some fields are initial status. There are two inner inodes (meta inode and node inode) setting __GFP_ZERO, I have just checked them, for both of them, we can avoid using __GFP_ZERO, and do initialization by ourselves to avoid unneeded/redundant zeroing from mm. Cc: <stable@vger.kernel.org> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Yunlei He	241b493d8f	f2fs: issue all big range discards in umount process This patch modify max_requests to UINT_MAX, to issue all big range discards in umount. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Chao Yu	47cca3d62a	f2fs: remove redundant block plug For buffered IO, we don't need to use block plug to cache bio, for direct IO, generic f2fs_direct_IO has already added block plug, so let's remove redundant one in .write_iter. As Yunlei described in his patch: -f2fs_file_write_iter -blk_start_plug -__generic_file_write_iter ... -do_blockdev_direct_IO -blk_start_plug ... -blk_finish_plug ... -blk_finish_plug which may conduct performance decrease in our platform Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:45 -07:00
Yunlong Song	8045249829	f2fs: remove unmatched zero_user_segment when convert inline dentry Since the layout of a regular dentry block differs from that of an inline dentry block, calling zero_user_segment starting from MAX_INLINE_DATA(dir) is not correct for a regular dentry block. Besides, the bitmap has already been copied and used, so there is no need to zero the page at all; simply removing the zero_user_segment call is fine. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:44 -07:00
Chao Yu	59c844088b	f2fs: introduce private inode status mapping Previously, we use generic FS_*_FL defined by vfs to indicate inode status for each bit of i_flags, so f2fs's flag status definition is tied to vfs' one, it will be hard for f2fs to reuse bits f2fs never used to indicate new status.. In order to solve this issue, we introduce private inode status mapping, Note, for these bits have already been persisted into disk, we should never change their definition, for other ones, we can remap them for later new coming status. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 11:31:44 -07:00
Jaegeuk Kim	e555da9f31	f2fs: run fstrim asynchronously if runtime discard is on We don't need to wait for whole bunch of discard candidates in fstrim, since runtime discard will issue them in idle time. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-31 10:20:48 -07:00
Dave Jiang	80660f2025	dax: change bdev_dax_supported() to support boolean returns The function's return values are confusing given the way the function is named. We expect a true or false return value, but it actually returns 0/-errno. This makes the code very confusing. Change the function to return a bool: true if DAX is supported, false if it is not. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-31 08:58:34 -07:00
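The confusion described in the `bdev_dax_supported()` commit above can be illustrated with a small model. This is a Python sketch, not kernel code: both `dax_supported_*` helpers and `naive_caller` are hypothetical stand-ins, and `EOPNOTSUPP` is used only as an example error code.

```python
import errno


def dax_supported_errno(has_dax: bool) -> int:
    """Old convention (hypothetical model): 0 on success, -errno on failure."""
    return 0 if has_dax else -errno.EOPNOTSUPP


def dax_supported_bool(has_dax: bool) -> bool:
    """New convention (hypothetical model): True if DAX is supported."""
    return has_dax


def naive_caller(check, has_dax: bool) -> bool:
    """A caller that reads the function name literally: truthy means 'supported'."""
    return bool(check(has_dax))
```

With the 0/-errno convention, a caller that trusts the name gets the answer inverted (0 is falsy, a negative errno is truthy); with the bool convention the name and the result agree.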
Darrick J. Wong	ba23cba9b3	fs: allow per-device dax status checking for filesystems Change bdev_dax_supported so it takes a bdev parameter. This enables multi-device filesystems like xfs to check that a dax device can work for the particular filesystem. Once that's in place, actually fix all the parts of XFS where we need to be able to distinguish between datadev and rtdev. This patch fixes the problem where we screw up the dax support checking in xfs if the datadev and rtdev have different dax capabilities. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases] Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2018-05-31 08:58:33 -07:00
Adam Manzanares	087e566916	fs: iomap dio set bio prio from kiocb prio Now that kiocb has an ioprio field copy this over to the bio when it is created from the kiocb during direct IO. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-31 10:50:56 -04:00
Adam Manzanares	074111ca5f	fs: blkdev set bio prio from kiocb prio Now that kiocb has an ioprio field copy this over to the bio when it is created from the kiocb. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-31 10:50:55 -04:00
Adam Manzanares	d9a08a9e61	fs: Add aio iopriority support This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-31 10:50:55 -04:00
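The per-I/O priority selection described in the aio commit above can be sketched as a small model. This is Python rather than kernel C; `ioprio_value` and `kiocb_ioprio` are hypothetical helper names, the default priority used when the flag is absent is an assumption, and the capability check the commit depends on (`ioprio_check_cap`) is deliberately left out.

```python
IOCB_FLAG_IOPRIO = 1 << 1   # flag bit in the iocb aio_flags field
IOPRIO_CLASS_SHIFT = 13     # the priority class sits above the low 13 data bits

IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_IDLE = 3


def ioprio_value(prio_class: int, data: int) -> int:
    """Pack a class and a per-class level into a single ioprio value."""
    return (prio_class << IOPRIO_CLASS_SHIFT) | data


def kiocb_ioprio(aio_flags: int, aio_reqprio: int,
                 default: int = ioprio_value(IOPRIO_CLASS_NONE, 0)) -> int:
    """Model of choosing ki_ioprio for one request.

    If IOCB_FLAG_IOPRIO is set, the value from aio_reqprio is used;
    otherwise an assumed default priority applies.
    """
    if aio_flags & IOCB_FLAG_IOPRIO:
        return aio_reqprio
    return default
```

A request submitted with `aio_flags = IOCB_FLAG_IOPRIO` and `aio_reqprio = ioprio_value(IOPRIO_CLASS_BE, 4)` carries that priority; a request without the flag ignores `aio_reqprio` entirely.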
Adam Manzanares	fc28724d67	fs: Convert kiocb rw_hint from enum to u16 In order to avoid kiocb bloat for per command iopriority support, rw_hint is converted from enum to a u16. Added a guard around ki_hint assignment. Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-31 10:50:54 -04:00
Tetsuo Handa	543b8f8662	fuse: don't keep dead fuse_conn at fuse_fill_super(). syzbot is reporting use-after-free at fuse_kill_sb_blk() [1]. Since sb->s_fs_info field is not cleared after fc was released by fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds already released fc and tries to hold the lock. Fix this by clearing sb->s_fs_info field after calling fuse_conn_put(). [1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com> Fixes: `3b463ae0c6` ("fuse: invalidation reverse calls") Cc: John Muir <john@jmuir.com> Cc: Csaba Henk <csaba@gluster.com> Cc: Anand Avati <avati@redhat.com> Cc: <stable@vger.kernel.org> # v2.6.31 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 12:26:11 +02:00
Miklos Szeredi	6becdb601b	fuse: fix control dir setup and teardown syzbot is reporting NULL pointer dereference at fuse_ctl_remove_conn() [1]. Since fc->ctl_ndents is incremented by fuse_ctl_add_conn() when new_inode() failed, fuse_ctl_remove_conn() reaches an inode-less dentry and tries to clear d_inode(dentry)->i_private field. Fix by only adding the dentry to the array after being fully set up. When tearing down the control directory, do d_invalidate() on it to get rid of any mounts that might have been added. [1] https://syzkaller.appspot.com/bug?id=f396d863067238959c91c0b7cfc10b163638cac6 Reported-by: syzbot <syzbot+32c236387d66c4516827@syzkaller.appspotmail.com> Fixes: `bafa96541b` ("[PATCH] fuse: add control filesystem") Cc: <stable@vger.kernel.org> # v2.6.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 12:26:10 +02:00
Tejun Heo	8a301eb16d	fuse: fix congested state leak on aborted connections If a connection gets aborted while congested, FUSE can leave nr_wb_congested[] stuck until reboot causing wait_iff_congested() to wait spuriously which can lead to severe performance degradation. The leak is caused by gating congestion state clearing with fc->connected test in request_end(). This was added way back in 2009 by `26c3679101` ("fuse: destroy bdi on umount"). While the commit description doesn't explain why the test was added, it most likely was to avoid dereferencing bdi after it got destroyed. Since then, bdi lifetime rules have changed many times and now we're always guaranteed to have access to the bdi while the superblock is alive (fc->sb). Drop fc->connected conditional to avoid leaking congestion states. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joshua Miller <joshmiller@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org # v2.6.29+ Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 12:26:10 +02:00
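The congestion leak described in the commit above can be reproduced in a toy model. This is a simplified Python sketch under stated assumptions, not the FUSE code: the `FuseConn` class, its single `congested` flag, and the threshold comparison are illustrative stand-ins for the real bookkeeping, and `gate_on_connected` selects between the old (gated) and new (ungated) behaviour.

```python
class FuseConn:
    """Toy model of a connection with background requests (hypothetical)."""

    def __init__(self, congestion_threshold: int):
        self.connected = True
        self.num_background = 0
        self.congestion_threshold = congestion_threshold
        self.congested = False

    def send_background(self) -> None:
        self.num_background += 1
        if self.num_background == self.congestion_threshold:
            self.congested = True

    def request_end(self, gate_on_connected: bool) -> None:
        # Old behaviour: clearing is skipped once the connection is aborted.
        may_clear = self.connected or not gate_on_connected
        if self.num_background == self.congestion_threshold and may_clear:
            self.congested = False
        self.num_background -= 1


def congested_after_abort(gate_on_connected: bool) -> bool:
    """Become congested, abort the connection, then complete all requests."""
    fc = FuseConn(congestion_threshold=2)
    fc.send_background()
    fc.send_background()      # threshold reached: congested
    fc.connected = False      # connection aborted while congested
    fc.request_end(gate_on_connected)
    fc.request_end(gate_on_connected)
    return fc.congested
```

In this model the gated variant leaves the congested state set after every request has finished, which corresponds to the stuck state the commit describes; dropping the `connected` condition lets it clear.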
Eric W. Biederman	4ad769f3c3	fuse: Allow fully unprivileged mounts Now that the fuse and the vfs work is complete, allow the fuse filesystem to be mounted by the root user in a user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 12:26:10 +02:00
Eric W. Biederman	e45b2546e2	fuse: Ensure posix acls are translated outside of init_user_ns Ensure the translation happens by failing to read or write posix acls when the filesystem has not indicated it supports posix acls. This ensures that modern cached posix acl support is available and used when dealing with posix acls. This is important because only that path has the code to convert the uids and gids in posix acls into the user namespace of a fuse filesystem. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 12:26:10 +02:00
Tomohiro Misono	23d0b79dfa	btrfs: Add unprivileged version of ino_lookup ioctl Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to allow normal users to call "btrfs subvolume list/show" etc. in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF. This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different. This is because it always searches the fs/file tree corresponding to the fd with which this ioctl is called and also returns the name of the bottom subvolume. The main differences from the original ino_lookup ioctl are: 1. Read + Exec permission will be checked using inode_permission() during path construction. -EACCES will be returned in case of failure. 2. Path construction will be stopped at the inode number which corresponds to the fd with which this ioctl is called. If the constructed path does not exist under fd's inode, -EACCES will be returned. 3. The name of the bottom subvolume is also searched and filled. Note that the maximum path length is 256 (BTRFS_VOL_NAME_MAX+1) bytes shorter than with the ino_lookup ioctl because space is reserved for the subvolume's name. Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-31 11:35:24 +02:00
Tomohiro Misono	42e4b520c8	btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF information of the subvolume containing this inode, except the subvolume name (this is to prevent a potential name leak). The subvolume name can be obtained with the user version of the ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER), which also performs a permission check. The min id of root ref's subvolume to be searched is specified by @min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller can just call this ioctl again without changing the argument to continue the search. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes and struct item renames ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-31 11:35:23 +02:00
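The continue-on-overflow protocol described for BTRFS_IOC_GET_SUBVOL_ROOTREF above can be sketched as a small simulation. This is a Python model, not a real ioctl call: `get_subvol_rootref` is a hypothetical stand-in for the kernel side, the argument struct is modelled as a dict, and `MAX_ROOTREF_BUFFER_NUM = 3` is a deliberately small assumed value chosen so the demo needs several calls.

```python
import errno

MAX_ROOTREF_BUFFER_NUM = 3  # assumed small buffer size, for illustration only


def get_subvol_rootref(all_refs, args) -> int:
    """Model of one ioctl call.

    Fills args["refs"] with up to MAX_ROOTREF_BUFFER_NUM subvolume ids that
    are >= args["min_id"], advances args["min_id"] past the last one, and
    returns -EOVERFLOW if more entries remain, else 0.
    """
    candidates = sorted(r for r in all_refs if r >= args["min_id"])
    batch = candidates[:MAX_ROOTREF_BUFFER_NUM]
    args["refs"] = batch
    if batch:
        args["min_id"] = batch[-1] + 1
    if len(candidates) > MAX_ROOTREF_BUFFER_NUM:
        return -errno.EOVERFLOW
    return 0


def list_all_rootrefs(all_refs):
    """Caller loop: reissue the call with the same args while it overflows."""
    args = {"min_id": 0, "refs": []}
    found = []
    while True:
        ret = get_subvol_rootref(all_refs, args)
        found.extend(args["refs"])
        if ret != -errno.EOVERFLOW:
            return found
```

Because the call itself updates `min_id`, the caller never has to compute where to resume; it only checks for the overflow return value and calls again.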
Tomohiro Misono	b64ec075bd	btrfs: Add unprivileged ioctl which returns subvolume information Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the information of subvolume containing this inode. (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.) Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ minor style fixes, update struct comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-31 11:35:23 +02:00
Amir Goldstein	01b39dcc95	ovl: use inode_insert5() to hash a newly created inode Currently, there is a small window where ovl_obtain_alias() can race with ovl_instantiate() and create two different overlay inodes with the same underlying real non-dir non-hardlink inode. The race requires an adversary to guess the file handle of the yet to be created upper inode and decode the guessed file handle after ovl_creat_real(), but before ovl_instantiate(). This race does not affect overlay directory inodes, because those are decoded via ovl_lookup_real() and not with ovl_obtain_alias(). This patch fixes the race by using inode_insert5() to add a newly created inode to the cache. If the newly created inode appears to already exist in the cache (hashed by the same real upper inode), we instantiate the dentry with the old inode and drop the new inode, instead of silently not hashing the new inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:12 +02:00
Vivek Goyal	ac6a52eb65	ovl: Pass argument to ovl_get_inode() in a structure ovl_get_inode() right now has 5 parameters. Soon this patch series will add 2 more and suddenly argument list starts looking too long. Hence pass arguments to ovl_get_inode() in a structure and it looks little cleaner. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:12 +02:00
Miklos Szeredi	80ea09a002	vfs: factor out inode_insert5() Split out common helper for race free insertion of an already allocated inode into the cache. Use this from iget5_locked() and insert_inode_locked4(). Make iget5_locked() use new_inode()/iput() instead of alloc_inode()/destroy_inode() directly. Also export to modules for use by filesystems which want to preallocate an inode before file/directory creation. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	b148cba403	ovl: clean up copy-up error paths Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	dd8ac699ed	ovl: return EIO on internal error EIO better represents an internal error than ENOENT. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Al Viro	f73cc77c3a	ovl: make ovl_create_real() cope with vfs_mkdir() safely vfs_mkdir() may succeed and leave the dentry passed to it unhashed and negative. ovl_create_real() is the last caller breaking when that happens. [amir: split re-factoring of ovl_create_temp() to prep patch add comment about unhashed dir after mkdir add pr_warn() if mkdir succeeds and lookup fails] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	137ec526a2	ovl: create helper ovl_create_temp() Also used ovl_create_temp() in ovl_create_index() instead of calling ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed through ovl_create_real(), which paves the way for Al's fix for non-hashed result from vfs_mkdir(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Miklos Szeredi	95a1c8153a	ovl: return dentry from ovl_create_real() Al Viro suggested to simplify callers of ovl_create_real() by returning the created dentry (or ERR_PTR) from ovl_create_real(). Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:11 +02:00
Amir Goldstein	471ec5dcf4	ovl: struct cattr cleanups * Rename to ovl_cattr * Fold ovl_create_real() hardlink argument into struct ovl_cattr * Create macro OVL_CATTR() to initialize struct ovl_cattr from mode Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	6cf00764b0	ovl: strip debug argument from ovl_do_ helpers It did not prove to be useful. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Amir Goldstein	a8b9e0ceed	ovl: remove WARN_ON() real inode attributes mismatch Overlayfs should cope with online changes to underlying layer without crashing the kernel, which is what xfstest overlay/019 checks. This test may sometimes trigger WARN_ON() in ovl_create_or_link() when linking an overlay inode that has been changed on underlying layer. Remove those WARN_ON() to prevent the stress test from failing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Miklos Szeredi	4280f74a57	ovl: Kconfig documentation fixes Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2018-05-31 11:06:10 +02:00
Darrick J. Wong	829bc787c1	fs: clear writeback errors in inode_init_always In inode_init_always(), we clear the inode mapping flags, which clears any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not also clear wb_err, which means that old mapping errors can leak through to new inodes. This is crucial for the XFS inode allocation path because we recycle old in-core inodes and we do not want error state from an old file to leak into the new file. This bug was discovered by running generic/036 and generic/047 in a loop and noticing that the EIOs generated by the collision of direct and buffered writes in generic/036 would survive the remount between 036 and 047, and get reported to the fsyncs (on different files!) in generic/047. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-05-30 19:43:53 -07:00
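The leak described in the `inode_init_always` commit above can be illustrated with a toy model. This is a Python sketch under simplifying assumptions: `Mapping` and `Inode` are hypothetical stand-ins, `wb_err` is modelled as a plain integer rather than the kernel's error-sequence type, and the two `inode_init_always_*` functions only mirror the before/after behaviour as the commit message states it.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Mapping:
    flags: set = field(default_factory=set)  # models AS_EIO / AS_ENOSPC bits
    wb_err: int = 0                          # simplified writeback error record


@dataclass
class Inode:
    mapping: Mapping = field(default_factory=Mapping)


def inode_init_always_old(inode: Inode) -> None:
    """Before the fix: only the mapping flags are cleared."""
    inode.mapping.flags.clear()


def inode_init_always_fixed(inode: Inode) -> None:
    """After the fix: the writeback error record is cleared as well."""
    inode.mapping.flags.clear()
    inode.mapping.wb_err = 0


def fsync_reports_error(inode: Inode) -> bool:
    """Model of fsync noticing a recorded writeback error."""
    return inode.mapping.wb_err != 0


def recycle(init) -> bool:
    """Record an error on an old file, recycle the inode, then fsync the new file."""
    inode = Inode()
    inode.mapping.flags.add("AS_EIO")
    inode.mapping.wb_err = errno.EIO
    init(inode)  # the in-core inode is reused for a new file
    return fsync_reports_error(inode)
```

Here `recycle(inode_init_always_old)` still reports the old file's error on the new file, while `recycle(inode_init_always_fixed)` does not, matching the generic/036 and generic/047 observation in the commit message.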
Steve French	28d59363ae	smb3: add tracepoints for smb2/smb3 open add two tracepoints for open completion. One for error one for completion (open_done). Sample output below TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION \| \| \| \|\|\|\| \| \| bash-15348 [007] .... 42441.027492: smb3_enter: cifs_lookup: xid=45 bash-15348 [007] .... 42441.028214: smb3_cmd_err: sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=105 status=0xc0000034 rc=-2 bash-15348 [007] .... 42441.028219: smb3_open_err: xid=45 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2 bash-15348 [007] .... 42441.028225: smb3_exit_done: cifs_lookup: xid=45 fop777-24560 [002] .... 42442.627617: smb3_enter: cifs_revalidate_dentry_attr: xid=46 fop777-24560 [003] .... 42442.628301: smb3_cmd_err: sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=106 status=0xc0000034 rc=-2 fop777-24560 [003] .... 42442.628319: smb3_open_err: xid=46 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2 fop777-24560 [003] .... 42442.628335: smb3_enter: cifs_atomic_open: xid=47 fop777-24560 [003] .... 42442.629587: smb3_cmd_done: sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=107 fop777-24560 [003] .... 42442.629592: smb3_open_done: xid=47 sid=0x6173e4ce tid=0xa05150e6 fid=0xb8a0984d cr_opts=0x40 des_access=0x40000080 Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 21:42:34 -05:00
Steve French	5c5a41be89	cifs: add debug output to show nocase mount option For smb1 nocase can be specified on mount. Allow displaying it in debug data. Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 17:59:49 -05:00
Steve French	fe048402e8	smb3: add define for id for posix create context and corresponding struct Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 17:59:46 -05:00
Ronnie Sahlberg	98170fb535	cifs: update smb2_check_message to handle PDUs without a 4 byte length header Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-05-30 17:24:14 -05:00
Kent Overstreet	e292d7bc63	xfs: convert to bioset_init()/mempool_init() Convert XFS to embedded bio sets. Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-30 15:33:32 -06:00
Kent Overstreet	8ac9f7c1fd	btrfs: convert to bioset_init()/mempool_init() Convert btrfs to embedded bio sets. Acked-by: Chris Mason <clm@fb.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-30 15:33:32 -06:00
Kent Overstreet	52190f8abe	fs: convert block_dev.c to bioset_init() Convert block DIO code to embedded bio sets. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-30 15:33:32 -06:00
Steve French	b326614ea2	smb3: allow "posix" mount option to enable new SMB311 protocol extensions If "posix" (or synonym "unix" for backward compatibility) specified on mount, and server advertises support for SMB3.11 POSIX negotiate context, then enable the new posix extensions on the tcon. This can be viewed by looking for "posix" in the mount options displayed by /proc/mounts for that mount (ie if posix extensions allowed by server and the experimental POSIX extensions also requested on the mount by specifying "posix" at mount time). Also add check to warn user if conflicting unix/nounix or posix/noposix specified on mount. Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Steve French	fcef0db6d6	smb3: add support for posix negotiate context Unlike CIFS where UNIX/POSIX extensions had been negotiatable, SMB3 did not have POSIX extensions yet. Add the new SMB3.11 POSIX negotiate context to ask the server whether it can support POSIX (and thus whether we can send the new POSIX open context). Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Steve French	f92a720ee9	cifs: allow disabling less secure legacy dialects To improve security it may be helpful to have additional ways to restrict the ability to override the default dialects (SMB2.1, SMB3 and SMB3.02) on mount with old dialects (CIFS/SMB1 and SMB2) since vers=1.0 (CIFS/SMB1) and vers=2.0 are weaker and less secure. Add a module parameter "disable_legacy_dialects" (/sys/module/cifs/parameters/disable_legacy_dialects) which can be set to 1 (or equivalently Y) to forbid use of vers=1.0 or vers=2.0 on mount. Also cleans up a few build warnings about globals for various module parms. Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Steve French	11911b956f	cifs: make minor clarifications to module params for cifs.ko Note which ones of the module params are cifs dialect only (N/A for default dialect now that has moved to SMB2.1 or later) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-05-30 16:06:18 -05:00
Ronnie Sahlberg	6539e7f372	cifs: show the "w" bit for writeable /proc/fs/cifs/* files RHBZ: 1539612 Let's show the "w" bit for those files that have a .write interface to set/enable/... the feature. Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-05-30 16:06:18 -05:00
Steve French	49218b4f57	smb3: add module alias for smb3 to cifs.ko We really don't want to be encouraging people to use the old (less secure) cifs dialect (SMB1) and it can be confusing for them with SMB3 (or later) being recommended but the module name is cifs. Add a module alias for "smb3" to cifs.ko to make this less confusing. Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Ronnie Sahlberg	25ad1cbd02	cifs: return error on invalid value written to cifsFYI RHBZ: 1539617 Check that, if it is not a boolean, the value the user tries to write to /proc/fs/cifs/cifsFYI is valid and return an error if not. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reported-by: Xiaoli Feng <xifeng@redhat.com>	2018-05-30 16:06:18 -05:00
Ronnie Sahlberg	57c55cd7c7	cifs: invalidate cache when we truncate a file RHBZ: 1566345 When truncating a file we always do this synchronously to the server. Thus we need to make sure that the cached inode metadata is marked as stale so that on next getattr we will refresh the metadata. In this particular bug we want to ensure that both ctime and mtime are updated and become visible to the application after a truncate. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reported-by: Xiaoli Feng <xifeng@redhat.com>	2018-05-30 16:06:18 -05:00
Steve French	e0386e449a	smb3: print tree id in debugdata in proc to be able to help logging When looking at the logs for the new trace-cmd tracepoints for cifs, it would help to know which tid is for which share (UNC name) so update /proc/fs/cifs/DebugData to display the tid. Also display Maximal Access which was missing as well. Now a typical entry for a tcon (in proc/fs/cifs/) looks like: 1) \\localhost\test Mounts: 1 DevInfo: 0x20 Attributes: 0x1006f PathComponentMax: 255 Status: 1 type: DISK Share Capabilities: None Aligned, Partition Aligned, Share Flags: 0x0 tid: 0xe0632a55 Optimal sector size: 0x200 Maximal Access: 0x1f01ff Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Steve French	d683bcd3e5	smb3: add additional ftrace entry points for entry/exit to cifs.ko Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-30 16:06:18 -05:00
Steve French	cfe8909164	smb3: fix various xid leaks Fix a few cases where we were not freeing the xid which led to active requests being non-zero at unmount time. Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-05-30 16:06:18 -05:00
Long Li	57a929a66f	CIFS: Introduce offset for the 1st page in data transfer structures When direct I/O is used, the data buffer may not always align to page boundaries. Introduce a page offset in transport data structures to describe the location of the buffer within the page. Also change the function to pass the page offset when sending data to transport. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2018-05-30 16:06:12 -05:00
Omar Sandoval	ad7e1a740d	Btrfs: clean up error handling in btrfs_truncate() btrfs_truncate() uses two variables for error handling, ret and err (if this sounds familiar, it's because btrfs_truncate_inode_items() did something similar). This is error prone, as was made evident by "Btrfs: fix error handling in btrfs_truncate()". We only have err because we don't want to mask an error if we call btrfs_update_inode() and btrfs_end_transaction(), so let's make that its own scoped return variable and use ret everywhere else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 21:27:32 +02:00
Nikolay Borisov	c5794e5178	btrfs: Factor out write portion of btrfs_get_blocks_direct Now that the read side is extracted into its own function, do the same to the write side. This leaves btrfs_get_blocks_direct_write with the sole purpose of handling common locking required. Also flip the condition in btrfs_get_blocks_direct_write so that the write case comes first and we check for if (create) rather than if (!create). This is purely subjective but I believe makes reading a bit more "linear". No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 19:01:44 +02:00
Nikolay Borisov	1c8d0175df	btrfs: Factor out read portion of btrfs_get_blocks_direct Currently this function handles both the READ and WRITE dio cases. This is facilitated by a bunch of 'if' statements, a goto short-circuit statement and a very perverse aliasing of "!created"(READ) case by setting lockstart = lockend and checking for lockstart < lockend for detecting the write. Let's simplify this mess by extracting the READ-only code into a separate __btrfs_get_block_direct_read function. This is only the first step, the next one will be to factor out the write side as well. The end goal will be to have the common locking/unlocking code in btrfs_get_blocks_direct and then it will call either the read|write subvariants. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 19:01:43 +02:00
Chao Yu	cba608493d	f2fs: turn down IO priority of discard from background In order to avoid interfering with normal r/w IO, let's turn down the IO priority of discards issued from background. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-30 08:58:59 -07:00
Chao Yu	377224c471	f2fs: don't split checkpoint in fstrim Now, we issue discard asynchronously in separated thread instead of in checkpoint, after that, we won't encounter long latency in checkpoint due to huge number of synchronous discard command handling, so, we don't need to split checkpoint to do trim in batch, merge it and obsolete related sysfs entry. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-30 08:58:59 -07:00
Jaegeuk Kim	8bb4f2535c	f2fs: issue discard commands proactively in high fs utilization At high utilization, e.g. over 80%, we don't expect a huge number of large discard commands, but we do expect many small pending discards, which affect FTL GCs a lot. Let's issue them in that case. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-30 08:58:59 -07:00
Su Yue	9132c4ff6f	btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist The error code does not match the reason of failure and may confuse the callers. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 17:33:58 +02:00
Kees Cook	1389053e1b	btrfs: raid56: Remove VLA usage In the quest to remove all stack VLA usage from the kernel[1], this allocates the working buffers during regular init, instead of using stack space. This refactors the allocation code a bit to make it easier to review. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 17:15:43 +02:00
Darrick J. Wong	d25522f10c	xfs: repair superblocks If one of the backup superblocks is found to differ seriously from superblock 0, write out a fresh copy from the in-core sb. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:15 -07:00
Darrick J. Wong	7e85bc6c87	xfs: add helpers to attach quotas to inodes Add a helper routine to attach quota information to inodes that are about to undergo repair. If that fails, we need to schedule a quotacheck for the next mount but allow the corrupted metadata repair to continue. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:15 -07:00
Darrick J. Wong	04a2b7b254	xfs: recover AG btree roots from rmap data Add a helper function to help us recover btree roots from the rmap data. Callers pass in a list of rmap owner codes, buffer ops, and magic numbers. We iterate the rmap records looking for owner matches, and then read the matching blocks to see if the magic number & uuid match. If so, we then read-verify the block, and if that passes then we retain a pointer to the block with the highest level, assuming that by the end of the call we will have found the root. This will be used to reset the AGF/AGI btree root fields during their rebuild procedures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:14 -07:00
Darrick J. Wong	12c6510e2f	xfs: add helpers to dispose of old btree blocks after a repair Now that we've plumbed in the ability to construct a list of dead btree blocks following a repair, add more helpers to dispose of them. This is done by examining the rmapbt -- if the btree was the only owner we can free the block, otherwise it's crosslinked and we can only remove the rmapbt record. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:14 -07:00
Darrick J. Wong	64a39d876e	xfs: add helpers to collect and sift btree block pointers during repair Add some helpers to assemble a list of fs block extents. Generally, repair functions will iterate the rmapbt to make a list (1) of all extents owned by the nominal owner of the metadata structure; then they will iterate all other structures with the same rmap owner to make a list (2) of active blocks; and finally we have a subtraction function to subtract all the blocks in (2) from (1), with the result that (1) is now a list of blocks that were owned by the old btree and must be disposed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:14 -07:00
Darrick J. Wong	73d6b42aa4	xfs: add helpers to allocate and initialize fresh btree roots Add a pair of helper functions to allocate and initialize fresh btree roots. The repair functions will use these as part of recreating corrupted metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>	2018-05-30 08:03:14 -07:00
Darrick J. Wong	0a9633fa2f	xfs: add helpers to deal with transaction allocation and rolling For repairs, we need to reserve at least as many blocks as we think we're going to need to rebuild the data structure, and we're going to need some helpers to roll transactions while maintaining locks on the AG headers so that other threads cannot wander into the middle of a repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>	2018-05-30 08:03:14 -07:00
Darrick J. Wong	51863d7dd7	xfs: grab the per-ag structure whenever relevant Grab and hold the per-AG data across a scrub run whenever relevant. This helps us avoid repeated trips through rcu and the radix tree in the repair code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2018-05-30 08:03:14 -07:00
Su Yue	090a127afa	btrfs: return error value if create_io_em failed in cow_file_range In cow_file_range(), create_io_em() may fail, but its return value is not recorded. The return value may then be 0 even though it failed, which is wrong. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: `6f9994dbab` ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:51:08 +02:00
Gu JinXiang	6b0cb1f901	btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot Since there is no more use of qgroup_reserved member in struct btrfs_pending_snapshot, remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:54 +02:00
Gu JinXiang	c4c129db5d	btrfs: drop unused parameter qgroup_reserved Since commit `7775c8184e` ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:53 +02:00
Ethan Lien	e73e81b6d0	btrfs: balance dirty metadata pages in btrfs_finish_ordered_io [Problem description and how we fix it] We should balance dirty metadata pages at the end of btrfs_finish_ordered_io, since a small, unmergeable random write can potentially produce dirty metadata which is multiple times larger than the data itself. For example, a small, unmergeable 4KiB write may produce: 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree Although we do call balance dirty pages in write side, but in the buffered write path, most metadata are dirtied only after we reach the dirty background limit (which by far only counts dirty data pages) and wakeup the flusher thread. If there are many small, unmergeable random writes spread in a large btree, we'll find a burst of dirty pages exceeds the dirty_bytes limit after we wakeup the flusher thread - which is not what we expect. In our machine, it caused out-of-memory problem since a page cannot be dropped if it is marked dirty. Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay, but since we do btrfs_finish_ordered_io in a separate worker, it will not stop the flusher consuming dirty pages. Also, we use different worker for metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle the size of dirty metadata pages. [Reproduce steps] To reproduce the problem, we need to do 4KiB write randomly spread in a large btree. In our 2GiB RAM machine: 1) Create 4 subvolumes. 2) Run fio on each subvolume: [global] direct=0 rw=randwrite ioengine=libaio bs=4k iodepth=16 numjobs=1 group_reporting size=128G runtime=1800 norandommap time_based randrepeat=0 3) Take snapshot on each subvolume and repeat fio on existing files. 4) Repeat step (3) until we get large btrees. In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each subvolume tree and 12GiB of metadata in extent tree. 5) Stop all fio, take snapshot again, and wait until all delayed work is completed. 6) Start all fio. Few seconds later we hit OOM when the flusher starts to work. It can be reproduced even when using nocow write. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:53 +02:00
Ethan Lien	78d4295b1e	btrfs: lift some btrfs_cross_ref_exist checks in nocow path In nocow path, we check if the extent is snapshotted in btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an unnecessary search into the extent tree. A fio test on an Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows: [global] group_reporting time_based thread=1 ioengine=libaio bs=4k iodepth=32 size=64G runtime=180 numjobs=8 rw=randwrite [file1] filename=/mnt/nocow/testfile IOPS result (unpatched / patched): 1 fio round: 46670 / 46958; (snapshot); 2 fio round: 51826 / 54498; 3 fio round: 59767 / 61289. After the snapshot, the first fio run gets about 5% performance gain. As we continually write to the same file, all writes will resume to nocow mode and eventually we have no performance gain. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:53 +02:00
Lu Fengqi	d19577912d	btrfs: Remove fs_info argument from btrfs_uuid_tree_rem This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:53 +02:00
Lu Fengqi	cdb345a877	btrfs: Remove fs_info argument from btrfs_uuid_tree_add This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:52 +02:00
Liu Bo	f9ddfd0592	Btrfs: remove unused check of skip_locking The check is superfluous since all callers who set search_for_commit also have skip_locking set. ASSERT() is put in place to ensure skip_locking is set by new callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:52 +02:00
Liu Bo	d80bb3f905	Btrfs: remove always true check in unlock_up As unlock_up() is written as for () { if (!path->locks[i]) break; ... if (... && path->locks[i]) { } } Apparently, @path->locks[i] is always true at this 'if'. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:51 +02:00
Liu Bo	662c653bfd	Btrfs: grab write lock directly if write_lock_level is the max level Typically, when acquiring root node's lock, btrfs tries its best to get read lock and trade for write lock if @write_lock_level implies to do so. In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's write lock directly. In this particular case, the dance of acquiring read lock and then trading for write lock can be saved. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:51 +02:00
Liu Bo	1fc28d8e2e	Btrfs: move get root out of btrfs_search_slot to a helper It's good to have a helper instead of having all get-root details open-coded. The new helper locks (if necessary) and sets root node of the path. Also invert the checks to make the code flow easier to read. There is no functional change in this commit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:44 +02:00
Liu Bo	e6a1d6fd27	Btrfs: use more straightforward extent_buffer_uptodate check If parent_transid "0" is passed to btrfs_buffer_uptodate(), btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but extent_buffer_uptodate() is preferred since we don't have to look into verify_parent_transid(). Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:44 +02:00
Liu Bo	ca19b4a699	Btrfs: remove superfluous free_extent_buffer in read_block_for_search read_block_for_search() can be simplified as: tmp = find_extent_buffer(); if (tmp) return; ... free_extent_buffer(); read_tree_block(); Apparently, @tmp must be NULL at this point, free_extent_buffer() is not needed. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:44 +02:00
Lu Fengqi	4ca6168327	btrfs: drop unused space_info parameter from create_space_info Since commit `dc2d3005d2` ("btrfs: remove dead create_space_info calls"), there is only one caller btrfs_init_space_info. However, it doesn't need create_space_info to return space_info at all. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:43 +02:00
Liu Bo	ff76a864cc	Btrfs: add parent_transid parameter to verify_level_key As verify_level_key() is checked after verify_parent_transid(), i.e. if (verify_parent_transid()) ret = -EIO; else if (verify_level_key()) ret = -EUCLEAN; if parent_transid is 0, verify_parent_transid() skips verifying parent_transid and considers eb as valid, and if verify_level_key() reports something wrong, we're not going to know if it's caused by corrupted metadata or non-checked eb (e.g. stale eb). The stale eb can be from an outdated raid1 mirror after a degraded mount, see eg "btrfs: fix reading stale metadata blocks after degraded raid1 mounts" (`02a3307aa9`) for more details. @parent_transid is able to tell whether the eb's generation has been verified by the caller. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:43 +02:00
Qu Wenruo	9593bf4967	btrfs: qgroup: show more meaningful qgroup_rescan_init error message Error message from qgroup_rescan_init() mostly looks like: BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115 Which is far from meaningful, and sometimes confusing as for above -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes it can also indicate problem if the return value is -EINVAL. Change it to some more meaningful messages like: BTRFS info (device nvme0n1p1): qgroup rescan is already in progress And BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update the messages and level ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:43 +02:00
Omar Sandoval	fd4e994bd1	Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2() If we have invalid flags set, when we error out we must drop our writer counter and free the buffer we allocated for the arguments. This bug is trivially reproduced with the following program on 4.7+: #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/btrfs.h> #include <linux/btrfs_tree.h> int main(int argc, char **argv) { struct btrfs_ioctl_vol_args_v2 vol_args = { .flags = UINT64_MAX, }; int ret; int fd; if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } fd = open(argv[1], O_WRONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args); if (ret == -1) perror("ioctl"); close(fd); return EXIT_SUCCESS; } When unmounting the filesystem, we'll hit the WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the filesystem to be remounted read-only as the writer count will stay lifted. Fixes: `6b526ed70c` ("btrfs: introduce device delete by devid") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:43 +02:00
Qu Wenruo	de885e3ee2	btrfs: lzo: Harden inline lzo compressed extent decompression For inlined extent, we only have one segment, thus less things to check. And further more, inlined extent always has the csum in its leaf header, it's less probable to have corrupted data. Anyway, still check header and segment header. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:43 +02:00
Qu Wenruo	314bfa473b	btrfs: lzo: Add header length check to avoid potential out-of-bounds access James Harvey reported that some corrupted compressed extent data can lead to various kernel memory corruption. Such corrupted extent data belongs to inode with NODATASUM flags, thus data csum won't help us detecting such bug. If lucky enough, KASAN could catch it like: BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs] Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338 CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G O 4.17.0-rc5-custom+ #50 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Call Trace: dump_stack+0xc2/0x16b print_address_description+0x6a/0x270 kasan_report+0x260/0x380 memcpy+0x34/0x50 lzo_decompress_bio+0x384/0x7a0 [btrfs] end_compressed_bio_read+0x99f/0x10b0 [btrfs] bio_endio+0x32e/0x640 normal_work_helper+0x15a/0xea0 [btrfs] process_one_work+0x7e3/0x1470 worker_thread+0x1b0/0x1170 kthread+0x2db/0x390 ret_from_fork+0x22/0x40 ... The offending compressed data has the following info: Header: length 32768 (looks completely valid) Segment 0 Header: length 3472882419 (obviously out of bounds) Then when handling segment 0, since it's over the current page, we need the copy the compressed data to temporary buffer in workspace, then such large size would trigger out-of-bounds memory access, screwing up the whole kernel. Fix it by adding extra checks on header and segment headers to ensure we won't access out-of-bounds, and even checks the decompressed data won't be out-of-bounds. Reported-by: James Harvey <jamespharvey20@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated comments ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-30 16:46:38 +02:00
Al Viro	1da92779e2	aio: sanitize the limit checking in io_submit(2) as it is, the logics in native io_submit(2) is "if asked for more than LONG_MAX/sizeof(pointer) iocbs to submit, don't bother with more than LONG_MAX/sizeof(pointer)" (i.e. 512M requests on 32bit and 1E requests on 64bit) while compat io_submit(2) goes with "stop after the first PAGE_SIZE/sizeof(pointer) iocbs", i.e. 1K or so. Which is * inconsistent * way too much in native case * possibly too little in compat one and * wrong anyway, since the natural point where we ought to stop bothering is ctx->nr_events Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:20:17 -04:00
Al Viro	67ba049f94	aio: fold do_io_submit() into callers get rid of insane "copy array of 32bit pointers into an array of native ones" glue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:19:29 -04:00
Al Viro	95af8496ac	aio: shift copyin of iocb into io_submit_one() Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:18:31 -04:00
Al Viro	d2988bd412	aio_read_events_ring(): make a bit more readable The logics for 'avail' is * not past the tail of cyclic buffer * no more than asked * not past the end of buffer * not past the end of a page Unobfuscate the last part. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:18:17 -04:00
Al Viro	9061d14a8a	aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way ... so just make them return 0 when caller does not need to destroy iocb Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:17:40 -04:00
Al Viro	3c96c7f4ca	aio: take list removal to (some) callers of aio_complete() We really want iocb out of io_cancel(2) reach before we start tearing it down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 23:16:43 -04:00
Linus Torvalds	91fc957a61	Merge tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: - fix a BUG triggerable from faccessat() - fix the mounting of backup volumes * tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix mounting of backup volumes afs: Fix directory permissions check	2018-05-29 15:30:16 -05:00
Jaegeuk Kim	d6290814b0	f2fs: add fsync_mode=nobarrier for non-atomic files For non-atomic files, this patch adds an option to give nobarrier which doesn't issue flush commands to the device. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-29 12:04:08 -07:00
Jaegeuk Kim	9a997188ff	f2fs: let fstrim issue discard commands in lower priority The fstrim gathers huge number of large discard commands, and tries to issue without IO awareness, which results in long user-perceive IO latencies on READ, WRITE, and FLUSH in UFS. We've observed some of commands take several seconds due to long discard latency. This patch limits the maximum size to 2MB per candidate, and check IO congestion when issuing them to disk. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2018-05-29 12:04:08 -07:00
Souptick Joarder	05edd888d1	fs: xfs: Change return type to vm_fault_t Use new return type vm_fault_t for fault handlers. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-29 10:46:03 -07:00
Darrick J. Wong	2e050e648a	xfs: fix inobt magic number check In commit `a6a781a58b` ("xfs: have buffer verifier functions report failing address") the bad magic number return was ported incorrectly. Fixes: `a6a781a58b` Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2018-05-29 10:46:03 -07:00
Darrick J. Wong	aee9a4a555	fs: clear writeback errors in inode_init_always In inode_init_always(), we clear the inode mapping flags, which clears any retained error (AS_EIO, AS_ENOSPC) bits. Unfortunately, we do not also clear wb_err, which means that old mapping errors can leak through to new inodes. This is crucial for the XFS inode allocation path because we recycle old in-core inodes and we do not want error state from an old file to leak into the new file. This bug was discovered by running generic/036 and generic/047 in a loop and noticing that the EIOs generated by the collision of direct and buffered writes in generic/036 would survive the remount between 036 and 047, and get reported to the fsyncs (on different files!) in generic/047. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>	2018-05-29 10:46:03 -07:00
Chengguang Xu	eb91537575	vfs: delete unnecessary assignment in vfs_listxattr It seems the first error assignment in if branch is redundant. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-29 13:22:41 -04:00
Qu Wenruo	2a1f7c0cbd	btrfs: lzo: document the compressed data format Although it's not that complex, but such comment could still save several minutes for newer reader/reviewer instead of inferring that from the code. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor wording updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:13:00 +02:00
Qu Wenruo	d5c1d68fde	btrfs: compression: Add linux/sizes.h for compression.h Since compression.h is using the SZ_* macros, and if some file includes only compression.h without linux/sizes.h, it will cause compile error. One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED. Fix it by adding linux/sizes.h in compression.h Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:13:00 +02:00
Omar Sandoval	b5c40d598f	Btrfs: fix clone vs chattr NODATASUM race In btrfs_clone_files(), we must check the NODATASUM flag while the inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags() will change the flags after we check and we can end up with a party checksummed file. The race window is only a few instructions in size, between the if and the locks which is: 3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) 3835 return -EISDIR; where the setflags must be run and toggle the NODATASUM flag (provided the file size is 0). The clone will block on the inode lock, segflags takes the inode lock, changes flags, releases log and clone continues. Not impossible but still needs a lot of bad luck to hit unintentionally. Fixes: `0e7b824c4e` ("Btrfs: don't make a file partly checksummed through file clone") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:59 +02:00
Gu Jinxiang	b89311efe6	btrfs: propagate failures of __exclude_logged_extent to upper caller Function btrfs_exclude_logged_extents may call __exclude_logged_extent which may fail. Propagate the failures of __exclude_logged_extent to upper caller. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:58 +02:00
Nikolay Borisov	d4b20733d2	btrfs: Streamline shared ref check in alloc_reserved_tree_block Instead of setting "parent" to ref->parent only when dealing with a shared ref and subsequently performing another check to see if (parent > 0), check the "node->type" directly and act accordingly. This makes the code more streamline. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:57 +02:00
Nikolay Borisov	21ebfbe7e0	btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block Instead of taking only specific member of this structure, which results in 2 extra arguments, just take the delayed_extent_op struct and reference the arguments inside the functions. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:57 +02:00
Nikolay Borisov	4e6bd4e0aa	btrfs: Simplify alloc_reserved_tree_block interface This function currently takes 7 parameters, most of which are proxies for values from btrfs_delayed_ref_node struct which is not passed. This patch simplifies the interface of the function by simply passing said delayed ref node struct to the function. This enables us to: 1. Move locals variables and init code related to them from run_delayed_tree_ref which should only be used inside alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key, representing the extent being inserted. This removes the need for the "ins" argument. Instead, it's replaced by a local var with a more verbose name - extent_key. 2. Now that we have a reference to the node in alloc_reserved_tree_block the delayed_tree_ref struct can be referenced inside the function and this enable removing the "ref->level", "parent" and "ref_root" arguments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:53 +02:00
Nikolay Borisov	9dcdbe0144	btrfs: Remove fs_info argument from alloc_reserved_tree_block This function already takes a transaction handle which contains a reference to the fs_info. So use this and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:53 +02:00
David Sterba	315b76b462	btrfs: tests: drop newline from test_msg strings Now that test_err strings do not need the newline, remove them also from the test_msg. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:52 +02:00
David Sterba	3c7251f2f8	btrfs: tests: add helper for error messages and update them The test failures are not clearly visible in the system log as they're printed at INFO level. Add a new helper that is level ERROR. As this touches almost all strings, I took the opportunity to unify them: - decapitalize the first letter as there's a prefix and the text continues after ":" - glue strings split to more lines and un-indent so they fit to 80 columns - use %llu instead of %Lu - drop \n from the modified messages (test_msg is left untouched) Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-29 18:12:51 +02:00
Gang He	da3627c30d	dlm: remove O_NONBLOCK flag in sctp_connect_to_sock We should remove O_NONBLOCK flag when calling sock->ops->connect() in sctp_connect_to_sock() function. Why? 1. up to now, sctp socket connect() function ignores the flag argument, that means O_NONBLOCK flag does not take effect, then we should remove it to avoid the confusion (but is not urgent). 2. for the future, there will be a patch to fix this problem, then the flag argument will take effect, the patch has been queued at https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/sctp?id=644fbdeacf1d3edd366e44b8ba214de9d1dd66a9. But, the O_NONBLOCK flag will make sock->ops->connect() directly return without any wait time, then the connection will not be established, DLM kernel module will call sock->ops->connect() again and again, the bad results are, CPU usage is almost 100%, even trigger soft_lockup problem if the related configurations are enabled, DLM kernel module also prints lots of messages like, [Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592 [Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592 [Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592 [Fri Apr 27 11:23:43 2018] dlm: connecting to 172167592 The upper application (e.g. ocfs2 mount command) is hanged at new_lockspace(), the whole backtrace is as below, tb0307-nd2:~ # cat /proc/2935/stack [<0>] new_lockspace+0x957/0xac0 [dlm] [<0>] dlm_new_lockspace+0xae/0x140 [dlm] [<0>] user_cluster_connect+0xc3/0x3a0 [ocfs2_stack_user] [<0>] ocfs2_cluster_connect+0x144/0x220 [ocfs2_stackglue] [<0>] ocfs2_dlm_init+0x215/0x440 [ocfs2] [<0>] ocfs2_fill_super+0xcb0/0x1290 [ocfs2] [<0>] mount_bdev+0x173/0x1b0 [<0>] mount_fs+0x35/0x150 [<0>] vfs_kern_mount.part.23+0x54/0x100 [<0>] do_mount+0x59a/0xc40 [<0>] SyS_mount+0x80/0xd0 [<0>] do_syscall_64+0x76/0x140 [<0>] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [<0>] 0xffffffffffffffff So, I think we should remove O_NONBLOCK flag here, since DLM kernel module can not handle non-block sockect in connect() properly. Signed-off-by: Gang He <ghe@suse.com> Signed-off-by: David Teigland <teigland@redhat.com>	2018-05-29 10:48:35 -05:00
Christoph Hellwig	5afb78356c	block: don't print a message when the device went away The information about a size change in this case just creates confusion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	4163a03984	block: unexport check_disk_size_change Only used in block_dev.c and the partitions code, and it should remain that way.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-05-29 08:59:21 -06:00
Christoph Hellwig	ac060cbaa8	aio: add missing break for the IOCB_CMD_FDSYNC case Looks like this got lost in a merge. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-28 13:40:50 -04:00
Trond Myklebust	9f6d44d418	NFS: Optimise away lookups for rename targets We can optimise away any lookup for a rename target, unless we're being asked to revalidate a dentry that might be in use. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-05-28 13:29:19 -04:00
Trond Myklebust	73dd684a4d	NFS: If the VFS sets LOOKUP_REVAL then force a lookup of the dentry If nfs_lookup_revalidate() is called with LOOKUP_REVAL because a previous path lookup failed, then we ought to force a full lookup of the component name. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-05-28 13:29:19 -04:00
Trond Myklebust	479219218f	NFS: Optimise away the close-to-open GETATTR when we have NFSv4 OPEN NFSv4 should not need to perform an extra close-to-open GETATTR as part of the process of looking up a regular file, since the OPEN call will do that for us. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2018-05-28 13:29:19 -04:00
Arnd Bergmann	533d1daea8	IB: Revert "remove redundant INFINIBAND kconfig dependencies" Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a link error when another driver using it is built-in. The INFINIBAND_ADDR_TRANS dependency is insufficient here as this is a 'bool' symbol that does not force anything to be a module in turn. fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work': smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect' net/9p/trans_rdma.o: In function `rdma_request': trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect' net/9p/trans_rdma.o: In function `rdma_destroy_trans': trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp' trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd' Fixes: `9533b292a7` ("IB: remove redundant INFINIBAND kconfig dependencies") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Greg Thelen <gthelen@google.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>	2018-05-28 10:40:16 -06:00
Misono Tomohiro	ad1e3d5672	btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctl btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or ERR_PTR(-ENOMEM) and therefore search_ioctl() and btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT, which all other callers of btrfs_read_fs_root_no_name() do. Drop the error message as it would be confusing, the caller of ioctl will likely interpret the error code and not look into the syslog. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:14 +02:00
Omar Sandoval	37becec95a	Btrfs: allow empty subvol= again I got a report that after upgrading to 4.16, someone's filesystems weren't mounting: [ 23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol=' Before 4.16, this mounted the default subvolume. It turns out that this empty "subvol=" is actually an application bug, but it was causing the application to fail, so it's an ABI break if you squint. The generic parsing code we use for mount options (match_token()) doesn't match an empty string as "%s". Previously, setup_root_args() removed the "subvol=" string, but the mount path was cleaned up to not need that. Add a dummy Opt_subvol_empty to fix this. The simple workaround is to use / or . for the value of 'subvol=' . Fixes: `312c89fbca` ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") CC: stable@vger.kernel.org # 4.16+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:13 +02:00
Anand Jain	b78e2b78a8	btrfs: fix describe_relocation when printing unknown flags Looks like the original idea was to print the hex of the flags which is not coded with their flag name. So use the current buf pointer bp instead of buf. Reaching the uknown flags should never happen, it's there just in case. Fixes: `ebce0e01b9` ("btrfs: make block group flags in balance printks human-readable") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:11 +02:00
David Sterba	bf5091c8d6	btrfs: use kvzalloc for EXTENT_SAME temporary data The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the arrays can be 32KiB large. To avoid allocation failures due to fragmented memory, use the allocation with fallback to vmalloc. The arrays are allocated and freed only inside btrfs_extent_same and reused for all the ranges. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:09 +02:00
Timofey Titovets	67b07bd4be	Btrfs: reuse cmp workspace in EXTENT_SAME ioctl We support big dedup requests by splitting range to smaller parts, and call dedupe logic on each of them. Instead of repeated allocation and deallocation, allocate once at the beginning and reuse in the iteration. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:07 +02:00
Timofey Titovets	b672876826	Btrfs: dedupe_file_range ioctl: remove 16MiB restriction Currently btrfs_dedupe_file_range silently restricts the dedupe range to to 16MiB to limit locking and working memory size and is documented in manual page as implementation specific. Let's remove that restriction by iterating over the dedup range in 16MiB steps. This is backward compatible and will not change anything for requests smaller then 16MiB. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:04 +02:00
Timofey Titovets	3973909d92	Btrfs: split btrfs_extent_same Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME entry and a helper that can be repeatedly called on a range. This will be used in following patches. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:03 +02:00
Omar Sandoval	399b0bbf5f	Btrfs: reserve space for O_TMPFILE orphan item deletion btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but it doesn't reserve space to do so. Even before the removal of the orphan_block_rsv it wasn't using it. Fixes: `ef3b9af50b` ("Btrfs: implement inode_operations callback tmpfile") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:24:00 +02:00
Omar Sandoval	7efc3e349c	Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enums We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make them consecutive again. Signed-off-by: Omar Sandoval <osandov@fb.com> [ switch them enums so we don't have to do that again ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:59 +02:00
Omar Sandoval	a575ceeb13	Btrfs: get rid of unused orphan infrastructure Now that we don't keep long-standing reservations for orphan items, root->orphan_block_rsv isn't used. We can git rid of it, along with: - root->orphan_lock, which was used to protect root->orphan_block_rsv - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations in root->orphan_block_rsv - btrfs_orphan_commit_root(), which was the last user of any of these and does nothing else Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:57 +02:00
Omar Sandoval	27919067f1	Btrfs: fix ENOSPC caused by orphan items reservations Currently, we keep space reserved for all inode orphan items until the inode is evicted (i.e., all references to it are dropped). We hit an issue where an application would keep a bunch of deleted files open (by design) and thus keep a large amount of space reserved, causing ENOSPC errors when other operations tried to reserve space. This long-standing reservation isn't absolutely necessary for a couple of reasons: - We can almost always make the reservation we need or steal from the global reserve for the orphan item - If we can't, it's not the end of the world if we drop the orphan item on the floor and let the next mount clean it up So, get rid of persistent reservation and just reserve space in btrfs_evict_inode(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:54 +02:00
Omar Sandoval	4b9d7b59bf	Btrfs: refactor btrfs_evict_inode() reserve refill dance The truncate loop in btrfs_evict_inode() does two things at once: - It refills the temporary block reserve, potentially stealing from the global reserve or committing - It calls btrfs_truncate_inode_items() The tangle of continues hides the fact that these two steps are actually separate. Split the first step out into a separate function both for clarity and so that we can reuse it in a later patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:52 +02:00
Omar Sandoval	c08db7d8d2	Btrfs: don't return ino to ino cache if inode item removal fails In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode item will still be in the tree but we still return the ino to the ino cache. That will blow up later when someone tries to allocate that ino, so don't return it to the cache. Fixes: `581bb05094` ("Btrfs: Cache free inode numbers in memory") Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:51 +02:00
Omar Sandoval	05a5bd7c4d	Btrfs: delete dead code in btrfs_orphan_commit_root() btrfs_orphan_commit_root() tries to delete an orphan item for a subvolume in the tree root, but we don't actually insert that item in the first place. See commit `0a0d4415e3` ("Btrfs: delete dead code in btrfs_orphan_add()"). We can get rid of it. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:49 +02:00
Omar Sandoval	7b40b695b4	Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM Now that we don't add orphan items for truncate, there can't be races on adding or deleting an orphan item, so this bit is unnecessary. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:48 +02:00
Omar Sandoval	f7e9e8fc79	Btrfs: stop creating orphan items for truncate Currently, we insert an orphan item during a truncate so that if there's a crash, we don't leak extents past the on-disk i_size. However, since commit `7f4f6e0a3f` ("Btrfs: only update disk_i_size as we remove extents"), we keep disk_i_size in sync with the extent items as we truncate, so orphan cleanup will never have any extents to remove. Don't bother with the superfluous orphan item. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:46 +02:00
Omar Sandoval	0552210997	Btrfs: don't BUG_ON() in btrfs_truncate_inode_items() btrfs_free_extent() can fail because of ENOMEM. There's no reason to panic here, we can just abort the transaction. Fixes: `f4b9aa8d3b` ("btrfs_truncate") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:45 +02:00
Omar Sandoval	fd86a3a315	Btrfs: fix error handling in btrfs_truncate_inode_items() btrfs_truncate_inode_items() uses two variables for error handling, ret and err. These are not handled consistently, leading to a couple of bugs. - Errors from btrfs_del_items() are handled but not propagated to the caller - If btrfs_run_delayed_refs() fails and aborts the transaction, we continue running Just use ret everywhere and simplify things a bit, fixing both of these issues. Fixes: `79787eaab4` ("btrfs: replace many BUG_ONs with proper error handling") Fixes: `1262133b8d` ("Btrfs: account for crcs in delayed ref processing") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:44 +02:00
Omar Sandoval	d1342aadbd	Btrfs: update stale comments referencing vmtruncate() Commit `a41ad394a0` ("Btrfs: convert to the new truncate sequence") changed btrfs_setsize() to call truncate_setsize() instead of vmtruncate() but didn't update the comment above it. truncate_setsize() never fails (the IS_SWAPFILE() check happens elsewhere), so remove the comment. Additionally, the comment above btrfs_page_mkwrite() references vmtruncate(), but truncate_setsize() does the size write and page locking now. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:43 +02:00
Nikolay Borisov	c442793e67	btrfs: Remove stale comment about select_delayed_ref select_delayed_ref really just gets the next delayed ref which has to be processed - either an add ref or drop ref. We never go back for anything. So the comment is actually bogus, just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:42 +02:00
Misono Tomohiro	f902bd3a5e	btrfs: sysfs: Add entry which shows if rmdir can work on subvolumes Deletion of a subvolume by rmdir(2) has become allowed by the 'commit cd2decf640b1 ("btrfs: Allow rmdir(2) to delete an empty subvolume")'. It is a kind of new feature and this commits add a sysfs entry /sys/fs/btrfs/features/rmdir_subvol to indicate the availability of the feature so that a user program (e.g. fstests) can detect it. Prior to this commit, all entries in /sys/fs/btrfs/features are feature which depend on feature bits of superblock (i.e. each feature affects on-disk format) and managed by attribute_group "btrfs_feature_attr_group". For each fs, entries in /sys/fs/btrfs/UUID/features indicate which features are enabled (or can be changed online) for the fs. However, rmdir_subvol feature only depends on kernel module. Therefore new attribute_group "btrfs_static_feature_attr_group" is introduced and sysfs_merge_group() is used to share /sys/fs/btrfs/features directory. Features in "btrfs_static_feature_attr_group" won't be listed in each /sys/fs/btrfs/UUID/features. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:41 +02:00
Tomohiro Misono	6c52157fa9	btrfs: sysfs: Use enum/define value for feature array definitions Use existing named values instead of the raw numbers. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:39 +02:00
Anand Jain	6dac13f8e2	btrfs: add prefix "balance:" for log messages Kernel logs are very important for the forensic investigations of the issues in general make it easy to use it. This patch adds 'balance:' prefix so that it can be easily searched. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:38 +02:00
David Sterba	5c57b8b6a4	btrfs: unify naming of flags variables for SETFLAGS and XFLAGS * The simple 'flags' refer to the btrfs inode * ... that's in 'binode * the FS__FL variables are 'fsflags' the old copies of the variable are prefixed by 'old_' * Struct inode flags contain 'i_flags'. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:32 +02:00
David Sterba	025f212148	btrfs: add FS_IOC_FSSETXATTR ioctl The new ioctl is an extension to the FS_IOC_SETFLAGS and adds new flags and is extensible. Don't get fooled by the XATTR in the name, it does not have anything in common with the extended attributes, incidentally also abbreviated as XATTRs. This patch allows to set the xflags portion of the fsxattr structure, other items have no meaning and non-zero values will result in EOPNOTSUPP. Currently supported xflags: - APPEND - IMMUTABLE - NOATIME - NODUMP - SYNC The structure of btrfs_ioctl_fssetxattr copies btrfs_ioctl_setflags but is simpler on the flag setting side. The original patch was written by Chandan Jay Sharma but was incomplete and no further revision has been sent. Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:31 +02:00
David Sterba	e4202ac927	btrfs: add FS_IOC_FSGETXATTR ioctl The new ioctl is an extension to the FS_IOC_GETFLAGS and adds new flags and is extensible. This patch allows to return the xflags portion of the fsxattr structure, other items have no meaning for btrfs or can be added later. The original patch was written by Chandan Jay Sharma but was incomplete and no further revision has been sent. Several cleanups were necessary to avoid confusion with other ioctls, as we have another flavor of flags. Based-on-patches-by: Chandan Jay Sharma <chandansbg@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:29 +02:00
David Sterba	19f93b3cd8	btrfs: add helpers for FS_XFLAG_* conversion Preparatory work for the FS_IOC_FSGETXATTR ioctl, basic conversions and checking helpers. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:28 +02:00
David Sterba	a157d4fd81	btrfs: rename btrfs_flags_to_ioctl to reflect which flags it touches Converts btrfs_inode::flags to the FS_*_FL flags. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:27 +02:00
David Sterba	5ba76abfb2	btrfs: rename check_flags to reflect which flags it touches The FS_*_FL flags cannot be easily identified by a prefix but we still need to recognize them so the 'fsflags' should be closer to the naming scheme but again the 'fs' part sounds like it's a filesystem flag. I don't have a better idea for now. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:25 +02:00
David Sterba	1905a0f7c7	btrfs: rename btrfs_mask_flags to reflect which flags it touches The FS_*_FL flags cannot be easily identified by a variable name prefix but we still need to recognize them so the 'fsflags' should be closer to the naming scheme but again the 'fs' part sounds like it's a filesystem flag. I don't have a better idea for now. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:24 +02:00
David Sterba	7b6a221e5b	btrfs: rename btrfs_update_iflags to reflect which flags it touches The btrfs inode flag flavour is now simply called 'inode flags' and the vfs inode are i_flags. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:20 +02:00
Anand Jain	d9a071f008	btrfs: use common variable for fs_devices in btrfs_destroy_dev_replace_tgtdev Use a local btrfs_fs_devices variable to access the structure. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:18 +02:00
Anand Jain	ab5c2f65de	btrfs: drop uuid_mutex in btrfs_destroy_dev_replace_tgtdev Delete the uuid_mutex lock here as this thread accesses the btrfs_fs_devices::devices only (counters or called functions do a list traversal). And the device_list_mutex lock is already taken. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:17 +02:00
Anand Jain	b25e59e2b2	btrfs: drop uuid_mutex in btrfs_dev_replace_finishing btrfs_dev_replace_finishing updates devices (soruce and target) which are within the btrfs_fs_devices::devices or withint the cloned seed devices (btrfs_fs_devices::seed::devices), so we don't need the global uuid_mutex. The device replace context is also locked by its own locks. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:16 +02:00
Anand Jain	542c5908ab	btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices btrfs_open_devices() is using the uuid_mutex, but as btrfs_open_devices is just limited to opening all the devices for the given fsid, we don't need uuid_mutex. Instead it should hold the device_list_mutex as it updates the members of the btrfs_fs_devices and btrfs_device and not the whole fs_devs list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:15 +02:00
Anand Jain	3dd0f7a364	btrfs: document uuid_mutex usage in read_chunk_tree read_chunk_tree() calls read_one_dev(), but for seed device we have to search the fs_uuids list, so we need the uuid_mutex. Add a comment, so that we can improve this part. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:14 +02:00
Anand Jain	41a52a0f1b	btrfs: use existing cur_devices, cleanup btrfs_rm_device Instead of de-referencing the device->fs_devices use cur_devices which points to the same fs_devices and does not change. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:13 +02:00
Anand Jain	b6ed73bcb1	btrfs: reduce uuid_mutex critical section while scanning devices The generic block device lookup or cleanup does not need the uuid mutex, that's only for the device_list_add. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:12 +02:00
Nikolay Borisov	20a6800402	btrfs: Unexport and rename btrfs_invalidate_inodes This function is no longer used outside of inode.c so just make it static. At the same time give a more becoming name, since it's not really invalidating the inodes but just calling d_prune_alias. Last, but not least - move the function above the sole caller to avoid introducing yet-another-pointless forward declaration. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:10 +02:00
David Sterba	093258e6eb	btrfs: replace waitqueue_active with cond_wake_up Use the wrappers and reduce the amount of low-level details about the waitqueue management. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:09 +02:00
David Sterba	3d3a2e610e	btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups Currently the code assumes that there's an implied barrier by the sequence of code preceding the wakeup, namely the mutex unlock. As Nikolay pointed out: I think this is wrong (not your code) but the original assumption that the RELEASE semantics provided by mutex_unlock is sufficient. According to memory-barriers.txt: Section 'LOCK ACQUISITION FUNCTIONS' states: (2) RELEASE operation implication: Memory operations issued before the RELEASE will be completed before the RELEASE operation has completed. Memory operations issued after the RELEASE may be completed before the RELEASE operation has completed. (I've bolded the may portion) The example given there: As an example, consider the following: A = a; B = b; ACQUIRE C = c; D = d; RELEASE E = e; F = f; The following sequence of events is acceptable: ACQUIRE, {F,A}, E, {C,D}, B, RELEASE So if we assume that C is modifying the flag which the waitqueue is checking, and E is the actual wakeup, then those accesses can be re-ordered... IMHO this code should be considered broken... --- To be on the safe side, add the barriers. The synchronization logic around log using the mutexes and several other threads does not make it easy to reason for/against the barrier. CC: Nikolay Borisov <nborisov@suse.com> Link: https://lkml.kernel.org/r/6ee068d8-1a69-3728-00d1-d86293d43c9f@suse.com Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:06 +02:00
David Sterba	110a21feed	btrfs: introduce conditional wakeup helpers Add convenience wrappers for the waitqueue management that involves memory barriers to prevent deadlocks. The helpers will let us remove barriers and the necessary comments in several places. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:04 +02:00
Qu Wenruo	ff3d27a048	btrfs: qgroup: Finish rescan when hit the last leaf of extent tree Under the following case, qgroup rescan can double account cowed tree blocks: In this case, extent tree only has one tree block. - \| transid=5 last committed=4 \| btrfs_qgroup_rescan_worker() \| \|- btrfs_start_transaction() \| \| transid = 5 \| \|- qgroup_rescan_leaf() \| \|- btrfs_search_slot_for_read() on extent tree \| Get the only extent tree block from commit root (transid = 4). \| Scan it, set qgroup_rescan_progress to the last \| EXTENT/META_ITEM + 1 \| now qgroup_rescan_progress = A + 1. \| \| fs tree get CoWed, new tree block is at A + 16K \| transid 5 get committed - \| transid=6 last committed=5 \| btrfs_qgroup_rescan_worker() \| btrfs_qgroup_rescan_worker() \| \|- btrfs_start_transaction() \| \| transid = 5 \| \|- qgroup_rescan_leaf() \| \|- btrfs_search_slot_for_read() on extent tree \| Get the only extent tree block from commit root (transid = 5). \| scan it using qgroup_rescan_progress (A + 1). \| found new tree block beyond A, and it's fs tree block, \| account it to increase qgroup numbers. - In above case, tree block A, and tree block A + 16K get accounted twice, while qgroup rescan should stop when it already reach the last leaf, other than continue using its qgroup_rescan_progress. Such case could happen by just looping btrfs/017 and with some possibility it can hit such double qgroup accounting problem. Fix it by checking the path to determine if we should finish qgroup rescan, other than relying on next loop to exit. Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:23:03 +02:00
Qu Wenruo	b6debf15d4	btrfs: qgroup: Search commit root for rescan to avoid missing extent When doing qgroup rescan using the following script (modified from btrfs/017 test case), we can sometimes hit qgroup corruption. ------ umount $dev &> /dev/null umount $mnt &> /dev/null mkfs.btrfs -f -n 64k $dev mount $dev $mnt extent_size=8192 xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null btrfs subvolume snapshot $mnt $mnt/snap xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null btrfs quota enable $mnt # -W is the new option to only wait rescan while not starting new one btrfs quota rescan -W $mnt btrfs qgroup show -prce $mnt umount $mnt # Need to patch btrfs-progs to report qgroup mismatch as error btrfs check $dev \|\| _fail ------ For fast machine, we can hit some corruption which missed accounting tree blocks: ------ qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 8.00KiB 0.00B none none --- --- 0/257 8.00KiB 0.00B none none --- --- ------ This is due to the fact that we're always searching commit root for btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is from current transaction, not commit root. And if our tree blocks get modified in current transaction, we won't find any owner in commit root, thus causing the corruption. Fix it by searching commit root for extent tree for qgroup_rescan_leaf(). Reported-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:21:07 +02:00
Al Viro	7a1b1e7028	btrfs: take the last remnants of ->d_fsdata use out [spotted while going through ->d_fsdata handling around d_splice_alias(); don't really care which tree that goes through] The only thing even looking at ->d_fsdata in there (since 2012) had been kfree(dentry->d_fsdata) in btrfs_dentry_delete(). Which, incidentally, is all btrfs_dentry_delete() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:37 +02:00
Qu Wenruo	75cb857d26	btrfs: Do super block verification before writing it to disk There are already 2 reports about strangely corrupted super blocks, where csum still matches but extra garbage gets slipped into super block. The corruption would look like: ------ superblock: bytenr=65536, device=/dev/sdc1 --------------------------------------------------------- csum_type 41700 (INVALID) csum 0x3b252d3a [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] ... incompat_flags 0x5b22400000000169 ( MIXED_BACKREF \| COMPRESS_LZO \| BIG_METADATA \| EXTENDED_IREF \| SKINNY_METADATA \| unknown flag: 0x5b22400000000000 ) ... ------ Or ------ superblock: bytenr=65536, device=/dev/mapper/x --------------------------------------------------------- csum_type 35355 (INVALID) csum_size 32 csum 0xf0dbeddd [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] ... incompat_flags 0x176d200000000169 ( MIXED_BACKREF \| COMPRESS_LZO \| BIG_METADATA \| EXTENDED_IREF \| SKINNY_METADATA \| unknown flag: 0x176d200000000000 ) ------ Obviously, csum_type and incompat_flags get some garbage, but its csum still matches, which means kernel calculates the csum based on corrupted super block memory. And after manually fixing these values, the filesystem is completely healthy without any problem exposed by btrfs check. Although the cause is still unknown, at least detect it and prevent further corruption. Both reports have same symptoms, there's an overwrite on offset 192 of the superblock, by 4 bytes. The superblock structure is not allocated or freed and stays in the memory for the whole filesystem lifetime, so it's not a use-after-free kind of error on someone else's leaked page. A vague pointer to the probable cause is the mention of other system freezes related to graphics card drivers. Reported-by: Ken Swenson <flat@imo.uto.moe> Reported-by: Ben Parsons <9parsonsb@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add brief analysis of the reports ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:36 +02:00
Qu Wenruo	069ec957c3	btrfs: Refactor btrfs_check_super_valid Refactor btrfs_check_super_valid: 1) Rename it to btrfs_validate_mount_super() Now it's more obvious when the function should be called. 2) Extract core check routine into validate_super() Later write time check can reuse it, and if needed, we could also use validate_super() to check each super block. 3) Add more comments about btrfs_validate_mount_super() Mostly about what it doesn't check and when it should be called. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to validate_super ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:36 +02:00
Qu Wenruo	21a852b018	btrfs: Move btrfs_check_super_valid() to avoid forward declaration Move btrfs_check_super_valid() before its single caller to avoid forward declaration. Though such code motion is not recommended as it pollutes git history, in this case the following patches would need to add new forward declarations for static functions that we want to avoid. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:36 +02:00
Nikolay Borisov	ffa9a9ef2f	btrfs: Remove fs_info argument from populate_free_space_tree This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:36 +02:00
Nikolay Borisov	e7355e501d	btrfs: Remove fs_info argument from add_to_free_space_tree This function takes a transaction handle which already contains a reference to the fs_info. So use it and remove the extra function argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:36 +02:00
Nikolay Borisov	25a356d3f6	btrfs: Remove fs_info argument from remove_from_free_space_tree This function already takes a transaction handle which holds a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	c31683a6ef	btrfs: Remove fs_info argument from __remove_from_free_space_tree This function takes a transaction handle which holds a reference to fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	e581168d1f	btrfs: Remove fs_info argument from remove_free_space_extent This function takes a transaction handle which already has a reference to the fs_info. Use it and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	5cb1782213	btrfs: Remove fs_info argument from add_free_space_extent This function always takes a transaction handle which references the fs_info structure. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:35 +02:00
Nikolay Borisov	85a7ef130c	btrfs: Remove fs_info argument from modify_free_space_bitmap This function already takes a transaction which has a reference to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	690d76828a	btrfs: Remove fs_info argument from update_free_space_extent_count This function already takes a transaction handle which has a reference to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	5296c2bf51	btrfs: Remove fs_info parameter from convert_free_space_to_extents This function always takes a transaction handle which contains a reference to fs_info. So use that and kill the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	719fb4de55	btrfs: Remove fs_info argument from convert_free_space_to_bitmaps This function already takes a transaction handle which contains a reference to fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	f3f7277995	btrfs: Remove fs_info parameter from remove_block_group_free_space This function always takes a trans handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:34 +02:00
Nikolay Borisov	4457c1c702	btrfs: Remove fs_info argument from add_new_free_space This function also takes a btrfs_block_group_cache which contains a reference to the fs_info. So use that and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	66afee1848	btrfs: Remove fs_info parameter from add_new_free_space_info This function already takes trans handle from where fs_info can be referenced. Remove the redundant parameter. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	2d5cffa1b0	btrfs: Remove fs_info argument from __add_to_free_space_tree This function already takes a transaction handle which contains a reference to fs_info. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	9a7e0f9284	btrfs: Remove fs_info argument from __add_block_group_free_space This function already takes a transaction handle which has a reference to the fs_info. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	e4e0711cd9	btrfs: Remove fs_info argument from add_block_group_free_space We also pass in a transaction handle which has a reference to the fs_info. Just remove the extraneous argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:33 +02:00
Nikolay Borisov	483bce068e	btrfs: Make btrfs_init_dummy_trans initialize trans' fs_info field This will be necessary for future cleanups which remove the fs_info argument from some freespace tree functions. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	7c8a0d363a	btrfs: Add assert in __btrfs_del_delalloc_inode The invariant is that when nr_delalloc_inodes is 0 then the root mustn't have any inodes on its delalloc inodes list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Robbie Ko	0f96f517dc	btrfs: incremental send, improve rmdir performance for large directory Currently when checking if a directory can be deleted, we always check if all its children have been processed. Example: A directory with 2,000,000 files was deleted original: 1994m57.071s patch: 1m38.554s [FIX] Instead of checking all children on all calls to can_rmdir(), we keep track of the directory index offset of the child last checked in the last call to can_rmdir(), and then use it as the starting point for future calls to can_rmdir(). Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Robbie Ko	35c8eda12f	btrfs: incremental send, move allocation until it's needed in orphan_dir_info Move the allocation after the search when it's clear that the new entry will be added. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	2335efafa6	btrfs: split delayed ref head initialization and addition add_delayed_ref_head really performed 2 independent operations - initialising the ref head and adding it to a list. Now that the init part is in a separate function let's complete the separation between both operations. This results in a lot simpler interface for add_delayed_ref_head since the function now deals solely with either adding the newly initialised delayed ref head or merging it into an existing delayed ref head. This results in vastly simplified function signature since 5 arguments are dropped. The only other thing worth mentioning is that due to this split the WARN_ON catching reinit of existing. In this patch the condition is extended such that: qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved is added. This is done because the two qgroup_* prefixed members are set only if both ref_root and reserved are passed. So functionally it's equivalent to the old WARN_ON and allows to remove the two args from add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:32 +02:00
Nikolay Borisov	eb86ec73b9	btrfs: Use init_delayed_ref_head in add_delayed_ref_head Use the newly introduced function when initialising the head_ref in add_delayed_ref_head. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	a2e569b3f2	btrfs: Introduce init_delayed_ref_head add_delayed_ref_head implements the logic to both initialize a head_ref structure as well as perform the necessary operations to add it to the delayed ref machinery. This has resulted in a very cumbersome interface with loads of parameters and code, which at first glance, looks very unwieldy. Begin untangling it by first extracting the initialization only code in its own function. It's more or less verbatim copy of the first part of add_delayed_ref_head. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	cd7f9699b1	btrfs: Open-code add_delayed_data_ref Now that the initialization part and the critical section code have been split it's a lot easier to open code add_delayed_data_ref. Do so in the following manner: 1. The common init function is put immediately after memory-to-be-initialized is allocated, followed by the specific data ref initialization. 2. The only piece of code that remains in the critical section is insert_delayed_ref call. 3. Tracing and memory freeing code is moved outside of the critical section. No functional changes, just an overall shorter critical section. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	70d640004a	btrfs: Open-code add_delayed_tree_ref Now that the initialization part and the critical section code have been split it's a lot easier to open code add_delayed_tree_ref. Do so in the following manner: 1. The common init code is put immediately after memory-to-be-initialized is allocated, followed by the ref-specific member initialization. 2. The only piece of code that remains in the critical section is insert_delayed_ref call. 3. Tracing and memory freeing code is put outside of the critical section as well. The only real change here is an overall shorter critical section when dealing with delayed tree refs. From functional point of view - the code is unchanged. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	c812c8a857	btrfs: Use init_delayed_ref_common in add_delayed_data_ref Use the newly introduced helper and remove the duplicate code. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:31 +02:00
Nikolay Borisov	646f4dd76f	btrfs: Use init_delayed_ref_common in add_delayed_tree_ref Use the newly introduced common helper. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Nikolay Borisov	cb49a87b2a	btrfs: Factor out common delayed refs init code The majority of the init code for struct btrfs_delayed_ref_node is duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor out the common bits in init_delayed_ref_common. This function is going to be used in future patches to clean that up. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Chengguang Xu	891f41cb27	btrfs: return original error code when failing from option parsing It's not good to overwrite -ENOMEM using -EINVAL when failing from mount option parsing, so just return original error code. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
David Sterba	6fcf6e2bff	btrfs: remove redundant btrfs_balance_control::fs_info The fs_info is always available from the context so we don't need to store it in the structure. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Qu Wenruo	c9f6f3cd1c	btrfs: qgroup: Allow trace_btrfs_qgroup_account_extent() to record its transid When debugging quota rescan race, sometimes btrfs rescan could account some old (committed) leaf and then re-account newly committed leaf in next generation. This race needs extra transid to locate, so add @transid for trace_btrfs_qgroup_account_extent() for such debug. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:30 +02:00
Colin Ian King	f5686e3acd	btrfs: send: fix spelling mistake: "send_in_progres" -> "send_in_progress" Trivial fix to spelling mistake of function name in btrfs_err message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	63a9c7b9ce	btrfs: Remove devid parameter from btrfs_rmap_block This function is used in only one place and devid argument is always passed 0. So just remove it, similarly to how it was removed in the userspace code. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Qu Wenruo	8b317901da	btrfs: trace: Allow trace_qgroup_update_counters() to record old rfer/excl value The original trace_qgroup_update_counters() only records qgroup id and its reference count change. It's good enough to debug qgroup accounting change, but when rescan race is involved, it's pretty hard to distinguish which modification belongs to which rescan. So add old_rfer and old_excl trace output to help distinguish different rescan instances. (Different rescan instances should reset their qgroup->rfer to 0) For trace event parameter, it just changes from u64 qgroup_id to struct btrfs_qgroup *qgroup, so number of parameters is not changed at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	3a2f8c07e1	btrfs: Unexport btrfs_alloc_delalloc_work It's used only in inode.c so it makes no sense to have it exported. Also move the definition of btrfs_delalloc_work to inode.c since it's used only in this file. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	076da91cd9	btrfs: Remove delayed_iput member from btrfs_delalloc_work When allocating a delalloc work we are always setting the delayed_iput to 0. So remove the delay_iput member of btrfs_delalloc_work, as a result also remove it as a parameter from btrfs_alloc_delalloc_work since it's not used anymore. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:29 +02:00
Nikolay Borisov	4fbb514785	btrfs: Remove delay_iput parameter from __start_delalloc_inodes It's always set to 0 so remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ rename to start_delalloc_inodes ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Nikolay Borisov	76f32e240e	btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes It's always set to 0, so just remove it and collapse the constant value to the only function we are passing it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Nikolay Borisov	82b3e53b8d	btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots This parameter was introduced alongside the function in `eb73c1b7ce` ("Btrfs: introduce per-subvolume delalloc inode list") to avoid deadlocks since this function was used in the transaction commit path. However, commit `8d875f95da` ("btrfs: disable strict file flushes for renames and truncates") removed that usage, rendering the parameter obsolete. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Gu Jinxiang	0338dff6e0	btrfs: do reverse path readahead in btrfs_shrink_device In btrfs_shrink_device, before btrfs_search_slot, path->reada is set to READA_FORWARD. But I think READA_BACK is correct. Since: 1. key.offset is set to (u64)-1 2. after btrfs_search_slot, btrfs_previous_item is called So, for readahead previous items, READA_BACK is the correct one. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Qu Wenruo	4ed0a7a3b7	btrfs: trace: Add trace points for unused block groups This patch will add the following trace events: 1) btrfs_remove_block_group For btrfs_remove_block_group() function. Triggered when a block group is really removed. 2) btrfs_add_unused_block_group Triggered when a block group is added to unused_bgs list. 3) btrfs_skip_unused_block_group Triggered when an unused block group is not deleted. These trace events are pretty handy for debugging cases related to block group auto remove. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:28 +02:00
Qu Wenruo	3dca5c942d	btrfs: trace: Remove unnecessary fs_info parameter for btrfs__reserve_extent event class fs_info can be extracted from btrfs_block_group_cache, and all btrfs_block_group_cache is created by btrfs_create_block_group_cache() with fs_info initialized, no need to worry about NULL pointer dereference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Gu Jinxiang	9113493e3a	btrfs: remove unused fs_info parameter Since the commit `c6100a4b4e` ("Btrfs: replace tree->mapping with tree->private_data"), parameter fs_info in alloc_reloc_control is not used. So remove it. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	f9fbcaa2a3	btrfs: move btrfs_raid_mindev_error values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::mindev_error so that btrfs_raid_array can maintain the error code to return if the minimum number of devices condition is not met while trying to delete a device in the given raid. And so we can drop btrfs_raid_mindev_error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	41a6e8913c	btrfs: move btrfs_raid_group values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Anand Jain	ed23467b18	btrfs: move btrfs_raid_type_names values to btrfs_raid_attr table Add a new member struct btrfs_raid_attr::raid_name so that btrfs_raid_array can maintain the name of the raid type, and so we can drop btrfs_raid_type_names. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:27 +02:00
Qu Wenruo	b545993694	btrfs: print-tree: Add eb locking status output for debug build It's pretty handy if we can get the debug output for locking status of an extent buffer, especially for race condition related debugging. So add the following output for btrfs_print_tree() and btrfs_print_leaf(): - refs - write_locks (as w:%d) - read_locks (as r:%d) - blocking_writers (as bw:%d) - blocking_readers (as br:%d) - spinning_writers (as sw:%d) - spinning_readers (as sr:%d) - lock_owner - current->pid Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	833aae18fc	btrfs: open code set_balance_control The helper is quite simple and I'd like to see the locking in the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	1354e1a13e	btrfs: use mutex in btrfs_resume_balance_async While the spinlock does not cause problems, using the mutex is more correct and consistent with others. The global status of balance is eg. checked from btrfs_pause_balance or btrfs_cancel_balance with mutex. Resuming balance happens during mount or ro->rw remount. In the former case, no other user of the balance_ctl exists, in the latter, balance cannot run until the ro/rw transition is finished. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	008ef0969d	btrfs: drop lock parameter from update_ioctl_balance_args and rename The parameter controls locking of the stats part but we can lock it unconditionally, as this only happens once when balance starts. This is not performance critical. Add the prefix for an exported function. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	cf7d20f447	btrfs: move and comment read-only check in btrfs_cancel_balance Balance cannot be started on a read-only filesystem and will have to finish/exit before eg. going to read-only via remount. In case the filesystem is forcibly set to read-only after an error, balance will finish anyway and if the cancel call is too fast it will just wait for that to happen. The last case is when the balance is paused after mount but it's read-only and cancelling would want to delete the item. The test is moved after the check if balance is running at all, as it looks more logical to report "no balance running" instead of "read-only filesystem". Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:26 +02:00
David Sterba	3009a62f3b	btrfs: track running balance in a simpler way Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	dccdb07bc9	btrfs: kill btrfs_fs_info::volume_mutex Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit `5ac00addc7` ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot happen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses its own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	a0fecc2371	btrfs: remove wrong use of volume_mutex from btrfs_dev_replace_start The volume mutex does not protect against anything in this case, the comment about scrub is right but not related to locking and looks confusing. The comment in btrfs_find_device_missing_or_by_path is wrong and confusing too. The device_list_mutex is not held here to protect device lookup, but in this case device replace cannot run in parallel with device removal (due to exclusive op protection), so we don't need further locking here. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	149196a2ae	btrfs: cleanup helpers that reset balance state The function __cancel_balance name is confusing with the cancel operation of balance and it really resets the state of balance back to zero. The unset_balance_control helper is called only from one place and simple enough to be inlined. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	eee95e3fb0	btrfs: add sanity check when resuming balance after mount Replace a WARN_ON with a proper check and message in case something goes really wrong and resumed balance cannot set up its exclusive status. The check is a user friendly assertion, I don't expect to ever happen under normal circumstances. Also document that the paused balance starts here and owns the exclusive op status. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:25 +02:00
David Sterba	010a47bde9	btrfs: add proper safety check before resuming dev-replace The device replace is paused by unmount or read only remount, and resumed on next mount or write remount. The exclusive status should be checked properly as it's a global invariant and we must not allow 2 operations run. In this case, the balance can be also paused and resumed under same conditions. It's always checked first so dev-replace could see the EXCL_OP already taken, BUT, the ioctl would never let start both at the same time. Replace the WARN_ON with message and return 0, indicating no error as this is purely theoretical and the user will be informed. Resolving that manually should be possible by waiting for the other operation to finish or cancel the paused state. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	a17c95df4c	btrfs: move clearing of EXCL_OP out of __cancel_balance Make the clearing visible in the callers so we can pair it with the test_and_set part. Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	72b81abf95	btrfs: move volume_mutex to callers of btrfs_rm_device Move locking and unlocking next to the BTRFS_FS_EXCL_OP bit manipulation so it's obvious that the two happen at the same time. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	d48f39d5a5	btrfs: move btrfs_init_dev_replace_tgtdev to dev-replace.c and make static The function logically belongs there and there's only a single caller, no need to export it. No code changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:24 +02:00
David Sterba	a425f9d475	btrfs: export and rename free_device The function will be used outside of volumes.c, the allocation btrfs_alloc_device is also exported. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
David Sterba	6fc4749d25	btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear This is a preparatory cleanup that will make clear that the only successful way out of btrfs_init_dev_replace_tgtdev will also set the device_out to a valid pointer. With this guarantee, the callers can be simplified. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:23 +02:00
David Sterba	00251a527a	btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller The function is called once and is fairly small, we can merge it with the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	b518519713	btrfs: cleanup btrfs_rm_device() promote fs_devices pointer This function uses fs_info::fs_devices a number of times, however we declare and use it only at the end; instead, do it at the beginning of the function and use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	636d2c9d63	btrfs: cleanup find_device() drop list_head pointer find_device() declares struct list_head *head pointer and used only once, instead just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:22 +02:00
Anand Jain	897fb5734a	btrfs: rename __btrfs_open_devices to open_fs_devices __btrfs_open_devices() is un-exported drop __ prefix and rename it to open_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	0226e0eb65	btrfs: rename __btrfs_close_devices to close_fs_devices __btrfs_close_devices() is un-exported, drop the __ prefix and rename it to close_fs_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	f117e290e8	btrfs: cleanup __btrfs_open_devices() drop head pointer __btrfs_open_devices() declares struct list_head *head, however head is used only once, instead use btrfs_fs_devices::devices directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Anand Jain	c4babc5e38	btrfs: rename struct btrfs_fs_devices::list btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic name 'list' makes its search very difficult, rename it to fs_list. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:21 +02:00
Nikolay Borisov	be97f133b3	btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	f033798d12	btrfs: Drop fs_info parameter from add_delayed_data_ref It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	1acda0c289	btrfs: Drop add_delayed_ref_head fs_info parameter It's provided by the transaction handle. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	40012f96b6	btrfs: Remove btrfs_wait_and_free_delalloc_work This function is called from only 1 place and is effectively a wrapper over wait_completion/kfree. It doesn't really bring any value having those two calls in a separate function. Just open code it and remove it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	8ae225a8a4	btrfs: Remove tree argument from extent_writepages It can be directly referenced from the passed address_space so do that. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:20 +02:00
Nikolay Borisov	81f1d39035	btrfs: Use list_empty instead of list_empty_careful list_empty_careful usually is a signal of something tricky going on. Its usage in btrfs is actually not needed since both lists it's used on are local to a function and cannot be modified concurrently. So switch to plain list_empty. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	2a3ff0adc9	btrfs: Remove redundant tree argument from extent_readpages This function is called only from btrfs_readpage and is already passed the mapping. Simplify its signature by moving the code obtaining reference to the extent tree in the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	29c68b2de9	btrfs: Remove map argument from try_release_extent_state It's not used in the function so just remove it. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Nikolay Borisov	477a30ba5f	btrfs: Sink extent_tree arguments in try_release_extent_mapping This function already gets the page from which the two extent trees are referenced. Simplify its signature by moving the code getting the trees inside the function. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:19 +02:00
Misono Tomohiro	a79a464d56	btrfs: Allow rmdir(2) to delete an empty subvolume Change the behavior of rmdir(2) and allow it to delete an empty subvolume by using btrfs_delete_subvolume() which is used by btrfs_ioctl_snap_destroy(). This is a change in behaviour and has been requested by users. Deleting the subvolume by ioctl requires root permissions while the rmdir way does work with standard tools and syscalls for all users that can access the subvolume. The main usecase is to allow 'rm -rf /path/with/subvols' to simply work. We were not able to find any nasty usability surprises, the intention is to do the destructive rm. Without allowing rmdir, this would have to be followed by the ioctl subvolume deletion, which is more of an annoyance. Implementation details: The required lock for @dir and inode of @dentry is already acquired in vfs layer. We need some check before deleting a subvolume. Permission check is done in vfs layer, emptiness check is in btrfs_rmdir() and additional check (i.e. neither the subvolume is a default subvolume nor send is in progress) is in btrfs_delete_subvolume(). Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs layer later. Tested-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Misono Tomohiro	f60a2364a4	btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() Factor out the second half of btrfs_ioctl_snap_destroy() as btrfs_delete_subvolume(), which performs some subvolume specific checks before deletion: 1. send is not in progress 2. the subvolume is not the default subvolume 3. the subvolume does not contain other subvolumes and actual deletion process. btrfs_delete_subvolume() requires inode_lock for both @dir and inode of @dentry. The remaining part of btrfs_ioctl_snap_destroy() is mainly permission checks. Note that call of d_delete() is not included in btrfs_delete_subvolume() as this function will also be used by btrfs_rmdir() to delete an empty subvolume and in that case d_delete() is called in VFS layer. As a result, btrfs_unlink_subvol() and may_destroy_subvol() become static functions. No functional changes. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Misono Tomohiro	ec42f16734	btrfs: Move may_destroy_subvol() from ioctl.c to inode.c This is a preparation work to refactor btrfs_ioctl_snap_destroy() and to allow rmdir(2) to delete an empty subvolume. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of the function comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	3b079a919a	btrfs: remove unused le_test_bit() With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap conversion"), there are no more callers to le_test_bit(). This patch removes le_test_bit(). Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	a565971ff3	btrfs: optimize free space tree bitmap conversion Presently, convert_free_space_to_extents() does a linear scan of the bitmap. We can speed this up with find_next_{bit,zero_bit}_le(). This patch replaces the linear scan with find_next_{bit,zero_bit}_le(). Testing shows a 20-33% decrease in execution time for convert_free_space_to_extents(). Since we change bitmap to be unsigned long, we have to do some casting for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as we are doing bit operations. Everywhere else, we're just using it for pointer arithmetic and not directly accessing it, so char seems more appropriate. Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
Howard McLauchlan	6faa8f475e	btrfs: clean up le_bitmap_{set, clear}() le_bitmap_set() is only used by free-space-tree, so move it there and make it static. le_bitmap_clear() is not used, so remove it. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:18 +02:00
David Sterba	f46b24c945	btrfs: use fs_info for btrfs_handle_em_exist tracepoint We really want to know to which filesystem the extent map events belong, but as it cannot be reached from the extent_map pointers, we need to pass it down the callchain. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:17 +02:00
David Sterba	0e08eb9b1c	btrfs: tests: pass fs_info to extent_map tests Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can get a better tracepoint message. Extent maps do not need fs_info for anything so we only add a dummy one without any other initialization. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:17 +02:00
Nikolay Borisov	57f1642ec3	btrfs: Consolidate error checking for btrfs_alloc_chunk The second if is really a subcase of ret being less than 0. So introduce a generic if (ret < 0) check, and inside have another if which explicitly handles the -ENOSPC and any other errors. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:16 +02:00
Nikolay Borisov	1e7a14211b	btrfs: Fix lock release order Locks should generally be released in the opposite order they are acquired. Generally lock acquisition ordering is used to ensure deadlocks don't happen. However, as code becomes more complicated it's best to also maintain proper unlock order so as to avoid possible deadlocks. This was found by code inspection and doesn't necessarily lead to a deadlock scenario. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:16 +02:00
Nikolay Borisov	b25f0d0012	btrfs: Use while loop instead of labels in __endio_write_update_ordered Currently __endio_write_update_ordered uses labels to implement what is essentially a simple while loop. This makes the code more cumbersome to follow than it actually has to be. No functional changes. No xfstest regressions were found during testing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:15 +02:00
Anand Jain	89595e80de	btrfs: add comment about BTRFS_FS_EXCL_OP Adds comments about BTRFS_FS_EXCL_OP to existing comments about the device locks. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates ] Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 18:07:15 +02:00
Nikolay Borisov	41d0bd3b5e	btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq It's used to print its pointer in a debug statement but doesn't really bring any useful information to the error message. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Su Yue	c065f5b1cf	btrfs: rename btrfs_get_block_group_info and make it static The function btrfs_get_block_group_info() was introduced by the commit `5af3e8cce8` ("Btrfs: make filesystem read-only when submitting barrier fails") which used it in disk-io.c. However, the function is only called in ioctl.c now. Its parameter type btrfs_ioctl_space_info* is only for ioctl. So, make it static and rename it to be original name get_block_group_info. No functional change. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Nikolay Borisov	29d2b84cf9	btrfs: Replace owner argument in add_pinned_bytes with a boolean add_pinned_bytes really cares whether the bytes being pinned are either data or metadata. To that effect it checks whether the 'owner' argument is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because owner can really have 2 types of values: a) For metadata extents it holds the level at which the parent is in the btree. This amounts to owner having the values 0-7 b) In case of modifying data extents, owner is the inode number to which those extents belong. Let's make this more explicit by converting the owner parameter to a boolean value and either pass it directly when we know the type of extents we are working with (i.e. in btrfs_free_tree_block). In cases when the parent function can be called on both metadata/data extents perform the check in the caller. This hopefully makes the interface of add_pinned_bytes more intuitive. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 13:12:11 +02:00
Chengguang Xu	84ae6f829f	affs: fix potential memory leak when parsing option 'prefix' When specifying option 'prefix' multiple times, current option parsing will cause memory leak. Hence, call kfree for previous one in this case. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-28 12:36:41 +02:00
Steve French	eccb4422cf	smb3: Add ftrace tracepoints for improved SMB3 debugging Although dmesg logs and wireshark network traces can be helpful, being able to dynamically enable/disable tracepoints (in this case via the kernel ftrace mechanism) can also be helpful in more quickly debugging problems, and more selectively tracing the events related to the bug report. This patch adds 12 ftrace tracepoints to cifs.ko for SMB3 events in some obvious locations. Subsequent patches will add more as needed. Example use: trace-cmd record -e cifs <run test case> trace-cmd show Various trace events can be filtered. See: trace-cmd list \| grep cifs for the current list of cifs tracepoints. Sample output (from mount and writing to a file): root@smf:/sys/kernel/debug/tracing/events/cifs# trace-cmd show <snip> mount.cifs-6633 [006] .... 7246.936461: smb3_cmd_done: pid=6633 tid=0x0 sid=0x0 cmd=0 mid=0 mount.cifs-6633 [006] .... 7246.936701: smb3_cmd_err: pid=6633 tid=0x0 sid=0x3d9cf8e5 cmd=1 mid=1 status=0xc0000016 rc=-5 mount.cifs-6633 [006] .... 7246.943055: smb3_cmd_done: pid=6633 tid=0x0 sid=0x3d9cf8e5 cmd=1 mid=2 mount.cifs-6633 [006] .... 7246.943298: smb3_cmd_done: pid=6633 tid=0xf9447636 sid=0x3d9cf8e5 cmd=3 mid=3 mount.cifs-6633 [006] .... 7246.943446: smb3_cmd_done: pid=6633 tid=0xf9447636 sid=0x3d9cf8e5 cmd=11 mid=4 mount.cifs-6633 [006] .... 7246.943659: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=3 mid=5 mount.cifs-6633 [006] .... 7246.943766: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=11 mid=6 mount.cifs-6633 [006] .... 7246.943937: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=5 mid=7 mount.cifs-6633 [006] .... 7246.944020: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=16 mid=8 mount.cifs-6633 [006] .... 7246.944091: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=16 mid=9 mount.cifs-6633 [006] .... 7246.944163: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=16 mid=10 mount.cifs-6633 [006] .... 
7246.944218: smb3_cmd_err: pid=6633 tid=0xf9447636 sid=0x3d9cf8e5 cmd=11 mid=11 status=0xc0000225 rc=-2 mount.cifs-6633 [006] .... 7246.944219: smb3_fsctl_err: xid=0 fid=0xffffffffffffffff tid=0xf9447636 sid=0x3d9cf8e5 class=0 type=393620 rc=-2 mount.cifs-6633 [007] .... 7246.944353: smb3_cmd_done: pid=6633 tid=0xe1b781a sid=0x3d9cf8e5 cmd=16 mid=12 bash-2071 [000] .... 7256.903844: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=5 mid=13 bash-2071 [000] .... 7256.904172: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=16 mid=14 bash-2071 [000] .... 7256.904471: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=17 mid=15 bash-2071 [000] .... 7256.904950: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=5 mid=16 bash-2071 [000] .... 7256.905305: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=17 mid=17 bash-2071 [000] .... 7256.905688: smb3_cmd_done: pid=2071 tid=0xe1b781a sid=0x3d9cf8e5 cmd=6 mid=18 bash-2071 [000] .... 7256.905809: smb3_write_done: xid=0 fid=0xd628f511 tid=0xe1b781a sid=0x3d9cf8e5 offset=0x0 len=0x1b Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-05-27 17:56:35 -05:00
Steve French	5a77e75fed	smb3: rename encryption_required to smb3_encryption_required Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Steve French	3fa9a54061	cifs: update internal module version number for cifs.ko to 2.12 Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Ronnie Sahlberg	97ca176224	cifs: add a new SMB2_close_flags function And make SMB2_close just a wrapper for SMB2_close_flags. We need this as we will start to send SMB2_CLOSE pdus using special flags. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Ronnie Sahlberg	96164ab2d8	cifs: store the leaseKey in the fid on SMB2_open In SMB2_open(), if we got a lease we need to store this in the fid structure or else we will never be able to map a lease break back to which file/fid it applies to. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Steve French	71992e62b8	cifs: fix build break when CONFIG_CIFS_DEBUG2 enabled Previous patches "cifs: update calc_size to take a server argument" and "cifs: add server argument to the dump_detail method" were broken if CONFIG_CIFS_DEBUG2 enabled Signed-off-by: Steve French <smfrench@gmail.com> CC: Ronnie Sahlberg <lsahlber@redhat.com>	2018-05-27 17:56:35 -05:00
Ronnie Sahlberg	9ec672bd17	cifs: update calc_size to take a server argument and change the smb2 version to take header_preamble_size into account instead of hardcoding it as 4 bytes. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Ronnie Sahlberg	14547f7d74	cifs: add server argument to the dump_detail method We need a struct TCP_Server_Info *server to this method as it calls calc_size. The calc_size method will soon be changed to also take a server argument. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>	2018-05-27 17:56:35 -05:00
Steve French	3d4ef9a153	smb3: fix redundant opens on root In SMB2/SMB3 unlike in cifs we unnecessarily open the root of the share over and over again in various places during mount and path revalidation and also in statfs. This patch cuts redundant traffic (opens and closes) by simply keeping the directory handle for the root around (and reopening it as needed on reconnect), so query calls don't require three round trips to complete - just one, and eases load on network, client and server (on mount alone, cuts network traffic by more than a third). Also add a new cifs mount parm "nohandlecache" to allow users whose servers might have resource constraints (eg in case they have a server with so many users connecting to it that this extra handle per mount could possibly be a resource concern). Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>	2018-05-27 17:56:35 -05:00
Al Viro	8767712f26	rmdir(),rename(): do shrink_dcache_parent() only on success Once upon a time ->rmdir() instances used to check if victim inode had more than one (in-core) reference and failed with -EBUSY if it had. The reason was race avoidance - emptiness check is worthless if somebody could just go and create new objects in the victim directory afterwards. With introduction of dcache the checks had been replaced with checking the refcount of dentry. However, since a cached negative lookup leaves a negative child dentry, such check had led to false positives - with empty foo/ doing stat foo/bar before rmdir foo ended up with -EBUSY unless the negative dentry of foo/bar happened to be evicted by the time of rmdir(2). That had been fixed by doing shrink_dcache_parent() just before the refcount check. At the same time, ext2_rmdir() has grown a private solution that eliminated those -EBUSY - it did something (setting ->i_size to 0) which made any subsequent ext2_add_entry() fail. Unfortunately, even with shrink_dcache_parent() the check had been racy - after all, the victim itself could be found by dcache lookup just after we'd checked its refcount. That got fixed by a new helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed the sucker if its refcount ended up equal to 1. That got called before ->rmdir(), turning the checks in ->rmdir() instances into "if not unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but had an unpleasant side effect - now shrink_dcache_parent() had been done before the emptiness checks, leading to easily triggerable calls of shrink_dcache_parent() on arbitrary large subtrees, quite possibly nested into each other. Several years later the ext2-private trick had been generalized - (in-core) inodes of dead directories are flagged and calls of lookup, readdir and all directory-modifying methods were prevented in so marked directories. 
Remaining boilerplate in ->rmdir() instances became redundant and some instances got rid of it. In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances and then killed off in all of them. That has led to another problem, though - in case of successful rmdir we want any (negative) child dentries dropped and the victim itself made negative. There's no point keeping cached negative lookups in foo when we can get the negative lookup of foo itself cached. So shrink_dcache_parent() call had been restored; unfortunately, it went into the place where dentry_unhash() used to be, i.e. before the ->rmdir() call. Note that we don't unhash anymore, so any "is it busy" checks would be racy; fortunately, all of them are gone. We should've done that call right after successful ->rmdir(). That reduces contention caused by tree-walking in shrink_dcache_parent() and, especially, contention caused by evictions in two nested subtrees going on in parallel. The same goes for directory-overwriting rename() - the story there had been parallel to that of rmdir(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-27 16:23:51 -04:00
David S. Miller	5b79c2af66	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Lots of easy overlapping changes in the conflict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-05-26 19:46:15 -04:00
Al Viro	888e2b03ef	switch the rest of procfs lookups to d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-26 14:20:50 -04:00
Al Viro	0168b9e38c	procfs: switch instantiate_t to d_splice_alias() ... and get rid of pointless struct inode *dir argument of those, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-26 14:20:50 -04:00
Al Viro	9883638641	don't bother with tid_fd_revalidate() in lookups what we want it for is actually updating inode metadata; take _that_ into a separate helper and use it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-26 14:20:28 -04:00
Christoph Hellwig	652fe8e876	timerfd: convert to ->poll_mask Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	9e42f195f5	eventfd: switch to ->poll_mask Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	dd67081b36	pipe: convert to ->poll_mask Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	1962da0d21	aio: try to complete poll iocbs without context switch If we can acquire ctx_lock without spinning we can just remove our iocb from the active_reqs list, and thus complete the iocbs from the wakeup context. Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	2c14fa838c	aio: implement IOCB_CMD_POLL Simple one-shot poll through the io_submit() interface. To poll for a file descriptor the application should submit an iocb of type IOCB_CMD_POLL. It will poll the fd for the events specified in the first 32 bits of the aio_buf field of the iocb. Unlike poll or epoll without EPOLLONESHOT this interface always works in one shot mode, that is once the iocb is completed, it will have to be resubmitted. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	888933f8fd	aio: simplify cancellation With the current aio code there is no need for the magic KIOCB_CANCELLED value, as a cancelation just kicks the driver to queue the completion ASAP, with all actual completion handling done in another thread. Given that both the completion path and cancelation take the context lock there is no need for magic cmpxchg loops either. If we remove iocbs from the active list after calling ->ki_cancel (but with ctx_lock still held), we can also rely on the invariant that anything found on the list has a ->ki_cancel callback and can be cancelled, further simplifying the code. Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	f3a2752a43	aio: simplify KIOCB_KEY handling No need to pass the key field to lookup_iocb to compare it with KIOCB_KEY, as we can do that right after retrieving it from userspace. Also move the KIOCB_KEY definition to aio.c as it is an internal value not used by any other place in the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	3deb642f0d	fs: introduce new ->get_poll_head and ->poll_mask methods ->get_poll_head returns the waitqueue that the poll operation is going to sleep on. Note that this means we can only use a single waitqueue for the poll, unlike some current drivers that use two waitqueues for different events. But now that we have keyed wakeups and heavily use those for poll there aren't that many good reasons left to keep the multiple waitqueues, and if there are any, ->poll is still around, the driver just won't support aio poll. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	9965ed174e	fs: add new vfs_poll and file_can_poll helpers These abstract out calls to the poll method in preparation for changes in how we poll. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	a0f8dcfc60	fs: cleanup do_pollfd Use straightline code with failure handling gotos instead of a lot of nested conditionals. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	8f546ae1fc	fs: unexport poll_schedule_timeout No users outside of select.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>	2018-05-26 09:16:44 +02:00
Christoph Hellwig	ed0d523adb	Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into aio-base	2018-05-26 09:16:25 +02:00
Hugh Dickins	6c04ab0edd	proc: fix smaps and meminfo alignment The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit numbers (commonly 0) are misaligned. Remove seq_put_decimal_ull_width()'s leftover optimization for single digits: it's wrong now that num_to_str() takes care of the width. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils Fixes: `d1be35cb6f` ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-05-25 18:12:11 -07:00
Changwei Ge	3373de209c	ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" This reverts commit `ba16ddfbeb` ("ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"). In my testing, this patch introduces a problem that mkfs can't have slots more than 16 with 4k block size. And the original logic is safe actually with the situation it mentions so revert this commit. Attach test log: (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0 (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192 
(mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5 (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5 (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5 Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-05-25 18:12:10 -07:00
Arnd Bergmann	5ef03dbd91	xfs, proc: hide unused xfs procfs helpers These two functions now trigger a warning when CONFIG_PROC_FS is disabled: fs/xfs/xfs_stats.c:128:12: error: 'xqmstat_proc_show' defined but not used [-Werror=unused-function] static int xqmstat_proc_show(struct seq_file *m, void *v) ^~~~~~~~~~~~~~~~~ fs/xfs/xfs_stats.c:118:12: error: 'xqm_proc_show' defined but not used [-Werror=unused-function] static int xqm_proc_show(struct seq_file *m, void *v) ^~~~~~~~~~~~~ Previously, they were referenced from an unused 'static const' structure, which is silently dropped by gcc. We can address the warning by adding the same #ifdef around them that hides the reference. Fixes: `3f3942aca6` ("proc: introduce proc_create_single{,_data}") Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-25 20:43:08 -04:00
Deepa Dinamani	0220eddac6	udf: Simplify calls to udf_disk_stamp_to_time Subsequent patches in the series convert inode timestamps to use struct timespec64 instead of struct timespec as part of solving the y2038 problem. commit `fd3cfad374` ("udf: Convert udf_disk_stamp_to_time() to use mktime64()") eliminated the NULL return condition from udf_disk_stamp_to_time(). udf_time_to_disk_time() is always called with a valid dest pointer and the return value is ignored. Further, caller can as well check the dest pointer being passed in rather than return argument. Make both the functions return void. This will make the inode timestamp conversion simpler. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: jack@suse.com ---- Changes from v1: * fixed the pointer error pointed by Jan	2018-05-25 15:31:14 -07:00
Deepa Dinamani	0a2dfbecb3	fs: nfs: get rid of memcpys for inode times Subsequent patches in the series convert inode timestamps to use struct timespec64 instead of struct timespec as part of solving the y2038 problem. This will lead to type mismatch for memcpys. Use regular assignments instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: trond.myklebust@primarydata.com	2018-05-25 15:31:13 -07:00
Deepa Dinamani	13442b036a	ceph: make inode time prints to be long long Subsequent patches in the series convert inode timestamps to use struct timespec64 instead of struct timespec as part of solving the y2038 problem. Convert these print formats to use long long types to avoid warnings and errors on conversion. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: zyan@redhat.com Cc: ceph-devel@vger.kernel.org	2018-05-25 15:31:12 -07:00
Deepa Dinamani	8efd6894ff	fs: add timespec64_truncate() As vfs moves to using struct timespec64 to represent times, update the argument to timespec_truncate() to use struct timespec64. Also change the name of the function. The rest of the implementation logic is the same. Move this to fs/inode.c instead of kernel/time/time.c as all the users of this api are filesystems. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: <viro@zeniv.linux.org.uk>	2018-05-25 15:31:09 -07:00
Jan Kara	4f2f76f751	ext4: fix fencepost error in check for inode count overflow during resize ext4_resize_fs() has an off-by-one bug when checking whether growing of a filesystem will not overflow inode count. As a result it allows a filesystem with 8192 inodes per group to grow to 64TB which overflows inode count to 0 and makes filesystem unusable. Fix it. Cc: stable@vger.kernel.org Fixes: `3f8a6411fb` Reported-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>	2018-05-25 12:51:25 -04:00
Linus Torvalds	34b48b8789	Merge candidates for 4.17-rc - Remove bouncing addresses from the MAINTAINERS file - Kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers - Various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers - Two fixes for patches already sent during the merge window - A long standing bug related to not decreasing the pinned pages count in the right MM was found and fixed Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma fixes from Jason Gunthorpe: "This is pretty much just the usual array of smallish driver bugs. 
- remove bouncing addresses from the MAINTAINERS file - kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and hns drivers - various small LOC behavioral/operational bugs in mlx5, hns, qedr and i40iw drivers - two fixes for patches already sent during the merge window - a long-standing bug related to not decreasing the pinned pages count in the right MM was found and fixed" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits) RDMA/hns: Move the location for initializing tmp_len RDMA/hns: Bugfix for cq record db for kernel IB/uverbs: Fix uverbs_attr_get_obj RDMA/qedr: Fix doorbell bar mapping for dpi > 1 IB/umem: Use the correct mm during ib_umem_release iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()' RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint RDMA/i40iw: Avoid reference leaks when processing the AEQ RDMA/i40iw: Avoid panic when objects are being created and destroyed RDMA/hns: Fix the bug with NULL pointer RDMA/hns: Set NULL for __internal_mr RDMA/hns: Enable inner_pa_vld filed of mpt RDMA/hns: Set desc_dma_addr for zero when free cmq desc RDMA/hns: Fix the bug with rq sge RDMA/hns: Not support qp transition from reset to reset for hip06 RDMA/hns: Add return operation when configured global param fail RDMA/hns: Update convert function of endian format RDMA/hns: Load the RoCE dirver automatically RDMA/hns: Bugfix for rq record db for kernel RDMA/hns: Add rq inline flags judgement ...	2018-05-24 14:12:05 -07:00
Linus Torvalds	d7b66b4ab0	for-4.17-rc6-tag Merge tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A one-liner that prevents leaking an internal error value 1 out of the ftruncate syscall. This has been observed in practice. The steps to reproduce make a common pattern (open/write/fsync/ftruncate) but also need the application to not check only for negative values and happens only for compressed inlined files. The conditions are narrow but as this could break userspace I think it's better to merge it now and not wait for the merge window" * tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix error handling in btrfs_truncate()	2018-05-24 11:47:43 -07:00
Seth Forshee	f3f1a18330	fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems The user in control of a super block should be allowed to freeze and thaw it. Relax the restrictions on the FIFREEZE and FITHAW ioctls to require CAP_SYS_ADMIN in s_user_ns. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2018-05-24 12:04:28 -05:00
Eric W. Biederman	bc6155d132	fs: Allow superblock owner to access do_remount_sb() Superblock level remounts are currently restricted to global CAP_SYS_ADMIN, as is the path for changing the root mount to read only on umount. Loosen both of these permission checks to also allow CAP_SYS_ADMIN in any namespace which is privileged towards the userns which originally mounted the filesystem. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2018-05-24 12:02:25 -05:00
Eric W. Biederman	0031181c49	fs: Allow superblock owner to replace invalid owners of inodes Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to chown files when inode owner is invalid. Ordinarily the capable_wrt_inode_uidgid check is sufficient to allow access to files but when the underlying filesystem has uids or gids that don't map to the current user namespace it is not enough, so the chown permission checks need to be extended to allow this case. Calling chown on filesystem nodes whose uid or gid don't map is necessary if those nodes are going to be modified as writing back inodes which contain uids or gids that don't map is likely to cause filesystem corruption of the uid or gid fields. Once chown has been called the existing capable_wrt_inode_uidgid checks are sufficient to allow the owner of a superblock to do anything the global root user can do with an appropriate set of capabilities. An ordinary filesystem mountable by a userns root will limit all uids and gids in s_user_ns or the INVALID_UID and INVALID_GID to flag all others. So having this added permission limited to just INVALID_UID and INVALID_GID is sufficient to handle every case on an ordinary filesystem. Of the virtual filesystems at least proc is known to set s_user_ns to something other than &init_user_ns, while at the same time presenting some files owned by GLOBAL_ROOT_UID. Those files the mounter of proc in a user namespace should not be able to chown to get access to. Limiting the relaxation in permission to just the minimum of allowing changing INVALID_UID and INVALID_GID prevents problems with cases like that. The original version of this patch was written by: Seth Forshee. I have rewritten and rethought this patch enough so it's really not the same thing (certainly it needs a different description), but he deserves credit for getting out there and getting the conversation started, and finding the potential gotcha's and putting up with my semi-paranoid feedback. 
Inspired-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2018-05-24 11:57:18 -05:00
Eric W. Biederman	55956b59df	vfs: Allow userns root to call mknod on owned filesystems. These filesystems already always set SB_I_NODEV so mknod will not be useful for gaining control of any devices no matter their permissions. This will allow overlayfs and applications like fakeroot to use device nodes to represent things on disk. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-05-24 11:56:43 -05:00
Eric W. Biederman	593d1ce854	vfs: Don't allow changing the link count of an inode with an invalid uid or gid Changing the link count of an inode via unlink or link will cause a write back of that inode. If the uids or gids are invalid (aka not known to the kernel) writing the inode back may change the uid or gid in the filesystem. To prevent possible filesystem corruption and to avoid the need for filesystem maintainers to worry about it, don't allow operations on inodes with an invalid uid or gid. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-05-24 11:56:18 -05:00
Omar Sandoval	d50147381a	Btrfs: fix error handling in btrfs_truncate() Jun Wu at Facebook reported that an internal service was seeing a return value of 1 from ftruncate() on Btrfs in some cases. This is coming from the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items(). btrfs_truncate() uses two variables for error handling, ret and err. When btrfs_truncate_inode_items() returns non-zero, we set err to the return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we only set err if ret is an error (i.e., negative). To reproduce the issue: mount a filesystem with -o compress-force=zstd and the following program will encounter return value of 1 from ftruncate: int main(void) { char buf[256] = { 0 }; int ret; int fd; fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666); if (fd == -1) { perror("open"); return EXIT_FAILURE; } if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); close(fd); return EXIT_FAILURE; } if (fsync(fd) == -1) { perror("fsync"); close(fd); return EXIT_FAILURE; } ret = ftruncate(fd, 128); if (ret) { printf("ftruncate() returned %d\n", ret); close(fd); return EXIT_FAILURE; } close(fd); return EXIT_SUCCESS; } Fixes: `ddfae63cc8` ("btrfs: move btrfs_truncate_block out of trans handle") CC: stable@vger.kernel.org # 4.15+ Reported-by: Jun Wu <quark@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2018-05-24 11:56:57 +02:00
Al Viro	4faa99965e	fix io_destroy()/aio_complete() race If io_destroy() gets to cancelling everything that can be cancelled and gets to kiocb_cancel() calling the function driver has left in ->ki_cancel, it becomes vulnerable to a race with IO completion. At that point req is already taken off the list and aio_complete() does NOT spin until we (in free_ioctx_users()) release ->ctx_lock. As the result, it proceeds to kiocb_free(), freeing req just as it gets passed to ->ki_cancel(). Fix is simple - remove from the list after the call of kiocb_cancel(). All instances of ->ki_cancel() already have to cope with being called with iocb still on list - that's what happens in io_cancel(2). Cc: stable@kernel.org Fixes: `0460fef2a9` "aio: use cancellation list lazily" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-23 22:53:22 -04:00
Alexei Starovoitov	449325b52b	umh: introduce fork_usermode_blob() helper Introduce helper: int fork_usermode_blob(void *data, size_t len, struct umh_info *info); struct umh_info { struct file *pipe_to_umh; struct file *pipe_from_umh; pid_t pid; }; that GPLed kernel modules (signed or unsigned) can use it to execute part of its own data as swappable user mode process. The kernel will do: - allocate a unique file in tmpfs - populate that file with [data, data + len] bytes - user-mode-helper code will do_execve that file and, before the process starts, the kernel will create two unix pipes for bidirectional communication between kernel module and umh - close tmpfs file, effectively deleting it - the fork_usermode_blob will return zero on success and populate 'struct umh_info' with two unix pipes and the pid of the user process As the first step in the development of the bpfilter project the fork_usermode_blob() helper is introduced to allow user mode code to be invoked from a kernel module. The idea is that user mode code plus normal kernel module code are built as part of the kernel build and installed as traditional kernel module into distro specified location, such that from a distribution point of view, there is no difference between regular kernel modules and kernel modules + umh code. Such modules can be signed, modprobed, rmmod, etc. The use of this new helper by a kernel module doesn't make it any special from kernel and user space tooling point of view. Such approach enables kernel to delegate functionality traditionally done by the kernel modules into the user space processes (either root or !root) and reduces security attack surface of the new code. The buggy umh code would crash the user process, but not the kernel. Another advantage is that umh code of the kernel module can be debugged and tested out of user space (e.g. opening the possibility to run clang sanitizers, fuzzers or user space test suites on the umh code). 
In case of the bpfilter project such architecture allows complex control plane to be done in the user space while bpf based data plane stays in the kernel. Since umh can crash, can be oom-ed by the kernel, killed by the admin, the kernel module that uses them (like bpfilter) needs to manage life time of umh on its own via two unix pipes and the pid of umh. The exit code of such kernel module should kill the umh it started, so that rmmod of the kernel module will cleanup the corresponding umh. Just like if the kernel module does kmalloc() it should kfree() it in the exit code. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-05-23 13:23:39 -04:00
Theodore Ts'o	8a2b307c21	ext4: correctly handle a zero-length xattr with a non-zero e_value_offs Ext4 will always create ext4 extended attributes which do not have a value (where e_value_size is zero) with e_value_offs set to zero. In most places e_value_offs will not be used in a substantive way if e_value_size is zero. There was one exception to this, which is in ext4_xattr_set_entry(), where if there is a maliciously crafted file system where there is an extended attribute with e_value_offs is non-zero and e_value_size is 0, the attempt to remove this xattr will result in a negative value getting passed to memmove, leading to the following sadness: [ 41.225365] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) [ 44.538641] BUG: unable to handle kernel paging request at ffff9ec9a3000000 [ 44.538733] IP: __memmove+0x81/0x1a0 [ 44.538755] PGD 1249bd067 P4D 1249bd067 PUD 1249c1067 PMD 80000001230000e1 [ 44.538793] Oops: 0003 [#1] SMP PTI [ 44.539074] CPU: 0 PID: 1470 Comm: poc Not tainted 4.16.0-rc1+ #1 ... [ 44.539475] Call Trace: [ 44.539832] ext4_xattr_set_entry+0x9e7/0xf80 ... [ 44.539972] ext4_xattr_block_set+0x212/0xea0 ... [ 44.540041] ext4_xattr_set_handle+0x514/0x610 [ 44.540065] ext4_xattr_set+0x7f/0x120 [ 44.540090] __vfs_removexattr+0x4d/0x60 [ 44.540112] vfs_removexattr+0x75/0xe0 [ 44.540132] removexattr+0x4d/0x80 ... [ 44.540279] path_removexattr+0x91/0xb0 [ 44.540300] SyS_removexattr+0xf/0x20 [ 44.540322] do_syscall_64+0x71/0x120 [ 44.540344] entry_SYSCALL_64_after_hwframe+0x21/0x86 https://bugzilla.kernel.org/show_bug.cgi?id=199347 This addresses CVE-2018-10840. Reported-by: "Xu, Wen" <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Fixes: `dec214d00e` ("ext4: xattr inode deduplication")	2018-05-23 11:31:03 -04:00
David Howells	5b86d4ff5d	afs: Implement network namespacing Implement network namespacing within AFS, but don't yet let mounts occur outside the init namespace. An additional patch will be required to propagate the network namespace across automounts. Signed-off-by: David Howells <dhowells@redhat.com>	2018-05-23 12:01:15 +01:00
David Howells	1588def91d	afs: Mark afs_net::ws_cell as __rcu and set using rcu functions The afs_net::ws_cell member is sometimes used under RCU conditions from within a seq-readlock. It isn't, however, marked __rcu and it isn't set using the proper RCU barrier-imposing functions. Fix this by annotating it with __rcu and using appropriate barriers to make sure accesses are correctly ordered. Without this, the code can produce the following warning: >> fs/afs/proc.c:151:24: sparse: incompatible types in comparison expression (different address spaces) Fixes: `f044c8847b` ("afs: Lay the groundwork for supporting network namespaces") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>	2018-05-23 11:51:29 +01:00
David Howells	c875c76a06	afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus() Sparse doesn't appear able to handle the conditionally-taken locks in xdr_decode_AFSFetchStatus(), even though the lock and unlock are both contingent on the same unvarying function argument. Deal with this by interpolating a wrapper function that takes the lock if needed and calls xdr_decode_AFSFetchStatus() on two separate branches, one with the lock held and one without. This allows Sparse to work out the locking. Signed-off-by: David Howells <dhowells@redhat.com>	2018-05-23 11:32:06 +01:00
Dan Williams	a77d478642	dax: Report bytes remaining in dax_iomap_actor() In preparation for protecting the dax read(2) path from media errors with copy_to_iter_mcsafe() (via dax_copy_to_iter()), convert the implementation to report the bytes successfully transferred. Cc: <x86@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 23:18:31 -07:00
Dan Williams	b3a9a0c36e	dax: Introduce a ->copy_to_iter dax operation Similar to the ->copy_from_iter() operation, a platform may want to deploy an architecture or device specific routine for handling reads from a dax_device like /dev/pmemX. On x86 this routine will point to a machine check safe version of copy_to_iter(). For now, add the plumbing to device-mapper and the dax core. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 23:18:31 -07:00
Theodore Ts'o	eb9b5f01c3	ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget() If ext4_find_inline_data_nolock() returns an error it needs to get reflected up to ext4_iget(). In order to fix this, ext4_iget_extra_inode() needs to return an error (and not return void). This is related to "ext4: do not allow external inodes for inline data" (which fixes CVE-2018-11412) in that in the errors=continue case, it would be useful for userspace to receive an error indicating that the file system is corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org	2018-05-22 17:14:07 -04:00
Theodore Ts'o	117166efb1	ext4: do not allow external inodes for inline data The inline data feature was implemented before we added support for external inodes for xattrs. It makes no sense to support that combination, but the problem is that there are a number of extended attribute checks that are skipped if e_value_inum is non-zero. Unfortunately, the inline data code is completely e_value_inum unaware, and attempts to interpret the xattr fields as if it were an inline xattr --- at which point, Hilarity Ensues. This addresses CVE-2018-11412. https://bugzilla.kernel.org/show_bug.cgi?id=199803 Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: `e50e5129f3` ("ext4: xattr-in-inode support") Cc: stable@kernel.org	2018-05-22 16:15:24 -04:00
Al Viro	1ae9bd8b7e	proc_lookupfd_common(): don't bother with instantiate unless the file is open ... and take the "check if file is open, pick ->f_mode" into a helper; tid_fd_revalidate() can use it. The next patch will get rid of tid_fd_revalidate() calls in instantiate callbacks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:04 -04:00
Al Viro	1bbc55131e	procfs: get rid of ancient BS in pid_revalidate() uses First of all, calling pid_revalidate() in the end of <pid>/* lookups is not about closing any kind of races; that used to be true once upon a time, but these days those comments are actively misleading. Especially since pid_revalidate() doesn't even do d_drop() on failure anymore. It doesn't matter, anyway, since once pid_revalidate() starts returning false, ->d_delete() of those dentries starts saying "don't keep"; they won't get stuck in dcache any longer than they are pinned. These calls cannot be just removed, though - the side effect of pid_revalidate() (updating i_uid/i_gid/etc.) is what we are calling it for here. Let's separate the "update ownership" into a new helper (pid_update_inode()) and use it, both in lookups and in pid_revalidate() itself. The comments in pid_revalidate() are also out of date - they refer to the time when pid_revalidate() used to call d_drop() directly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:03 -04:00
Al Viro	11f17c9bd7	cifs_lookup(): switch to d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:03 -04:00
Al Viro	a8b75f663e	cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL not since 2004... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:02 -04:00
Al Viro	500e2ab6c3	9p: unify paths in v9fs_vfs_lookup() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:02 -04:00
Al Viro	293542d8e5	hfsplus: switch to d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:00 -04:00
Al Viro	0e5c56fd07	hfs: don't allow mounting over .../rsrc That's one case when unlink() destroys a subtree, thanks to "resource fork" idiocy. We might forcibly evict that shit on unlink(2), but for now let's just disallow overmounting; as it is, anything that plays games with those would leak mounts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:28:00 -04:00
Al Viro	6b9cceead0	hfs: use d_splice_alias() code is simpler that way Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:59 -04:00
Al Viro	18fbbfc2bf	omfs_lookup(): report IO errors, use d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:58 -04:00
Al Viro	04bb1ba141	orangefs_lookup: simplify d_splice_alias() can handle NULL and ERR_PTR() for inode just fine... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:58 -04:00
Al Viro	0ed883fd80	openpromfs: switch to d_splice_alias() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:57 -04:00
Al Viro	b113a6d3cf	xfs_vn_lookup: simplify a bit have all post-xfs_lookup() branches converge on d_splice_alias() Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:57 -04:00
Al Viro	9a7dddcaff	adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias() Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:56 -04:00
Al Viro	686bb96d1b	adfs_lookup_byname: .. is taken care of in fs/namei.c Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:55 -04:00
Al Viro	8130c15176	romfs_lookup: switch to d_splice_alias() ... and hash negative lookups Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:55 -04:00
Al Viro	c1481700f4	qnx6_lookup: switch to d_splice_alias() ... and hash negative lookups Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:54 -04:00
Al Viro	191ac107f9	ubifs_lookup: use d_splice_alias() code is simpler that way Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:54 -04:00
Al Viro	5bf3544970	sysv_lookup: use d_splice_alias() code is simpler that way Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:53 -04:00
Al Viro	b135dcea37	qnx4_lookup: use d_splice_alias() code is simpler that way Acked-by: Anders Larsen <al@alarsen.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:52 -04:00
Al Viro	b014951692	minix_lookup: use d_splice_alias() code is simpler that way Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:52 -04:00
Al Viro	72ff0b038d	freevxfs_lookup(): use d_splice_alias() code is simpler that way Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:51 -04:00
Al Viro	d023b3a19f	cramfs_lookup(): use d_splice_alias() simpler code that way, actually Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:51 -04:00
Al Viro	b455ecd4bb	bfs_add_entry: pass name/len as qstr pointer same story as with bfs_find_entry() Cc: "Tigran A. Aivazian" <aivazian.tigran@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:50 -04:00
Al Viro	33ebdebece	bfs_find_entry: pass name/len as qstr pointer all callers feed something->name/something->len anyway Cc: "Tigran A. Aivazian" <aivazian.tigran@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:49 -04:00
Al Viro	a596a23b9a	bfs_lookup(): use d_splice_alias() code is actually simpler that way. Acked-by: "Tigran A. Aivazian" <aivazian.tigran@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-22 14:27:38 -04:00
Dan Williams	d6dc57e251	xfs, dax: introduce xfs_break_dax_layouts() xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans for busy / pinned dax pages and waits for those pages to go idle before any potential extent unmap operation. dax_layout_busy_page() handles synchronizing against new page-busy events (get_user_pages). It invalidates all mappings to trigger the get_user_pages slow path which will eventually block on the xfs inode lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a busy page it returns it for xfs to wait for the page-idle event that will fire when the page reference count reaches 1 (recall ZONE_DEVICE pages are idle at count 1, see generic_dax_pagefree()). While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not deadlock the process that might be trying to elevate the page count of more pages before arranging for any of them to go idle. I.e. the typical case of submitting I/O is that iov_iter_get_pages() elevates the reference count of all pages in the I/O before starting I/O on the first page. The process of elevating the reference count of all pages involved in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL. Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is held while sleeping. We need this to prevent starvation of the truncate path as continuous submission of direct-I/O could starve the truncate path indefinitely if the lock is dropped. Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 07:19:08 -07:00
Dan Williams	69eb5fa10e	xfs: prepare xfs_break_layouts() for another layout type When xfs is operating as the back-end of a pNFS block server, it prevents collisions between local and remote operations by requiring a lease to be held for remotely accessed blocks. Local filesystem operations break those leases before writing or mutating the extent map of the file. A similar mechanism is needed to prevent operations on pinned dax mappings, like device-DMA, from colliding with extent unmap operations. BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of layout breaking. Layouts are broken in the BREAK_WRITE case to ensure that layout-holders do not collide with local writes. Additionally, layouts are broken in the BREAK_UNMAP case to make sure the layout-holder has a consistent view of the file's extent map. While BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL. After this refactoring xfs_break_layouts() becomes the entry point for coordinating both types of breaks. Finally, xfs_break_leased_layouts() becomes just the BREAK_WRITE handler. Note that the unlock tracking is needed in a follow on change. That will coordinate retrying either break handler until both successfully test for a lease break while maintaining the lock state. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 07:19:08 -07:00
Dan Williams	c63a8eae63	xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL In preparation for adding coordination between extent unmap operations and busy dax-pages, update xfs_break_layouts() to permit it to be called with the mmap lock held. This lock scheme will be required for coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE) pages mapped into the file's address space). Breaking dax layouts will be added to xfs_break_layouts() in a future patch, for now this preps the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to xfs_break_layouts(). Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 07:19:08 -07:00
Dan Williams	5fac7408d8	mm, fs, dax: handle layout changes to pinned dax mappings Background: get_user_pages() in the filesystem pins file backed memory pages for access by devices performing dma. However, it only pins the memory pages not the page-to-file offset association. If a file is truncated the pages are mapped out of the file and dma may continue indefinitely into a page that is owned by a device driver. This breaks coherency of the file vs dma, but the assumption is that if userspace wants the file-space truncated it does not matter what data is inbound from the device, it is not relevant anymore. The only expectation is that dma can safely continue while the filesystem reallocates the block(s). Problem: This expectation that dma can safely continue while the filesystem changes the block map is broken by dax. With dax the target dma page is the filesystem block. The model of leaving the page pinned for dma, but truncating the file block out of the file, means that the filesystem is free to reallocate a block under active dma to another file and now the expected data-incoherency situation has turned into active data-corruption. Solution: Defer all filesystem operations (fallocate(), truncate()) on a dax mode file while any page/block in the file is under active dma. This solution assumes that dma is transient. Cases where dma operations are known to not be transient, like RDMA, have been explicitly disabled via commits like `5f1d43de54` "IB/core: disable memory registration of filesystem-dax vmas". The dax_layout_busy_page() routine is called by filesystems with a lock held against mm faults (i_mmap_lock) to find pinned / busy dax pages. The process of looking up a busy page invalidates all mappings to trigger any subsequent get_user_pages() to block on i_mmap_lock. The filesystem continues to call dax_layout_busy_page() until it finally returns no more active pages. 
This approach assumes that the page pinning is transient, if that assumption is violated the system would have likely hung from the uncompleted I/O. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 07:19:08 -07:00
Dan Williams	e763848843	mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS In preparation for fixing dax-dma-vs-unmap issues, filesystems need to be able to rely on the fact that they will get wakeups on dev_pagemap page-idle events. Introduce MEMORY_DEVICE_FS_DAX and generic_dax_page_free() as common indicator / infrastructure for dax filesystems to require. With this change there are no users of the MEMORY_DEVICE_HOST designation, so remove it. The HMM sub-system extended dev_pagemap to arrange a callback when a dev_pagemap managed page is freed. Since a dev_pagemap page is free / idle when its reference count is 1 it requires an additional branch to check the page-type at put_page() time. Given put_page() is a hot-path we do not want to incur that check if HMM is not in use, so a static branch is used to avoid that overhead when not necessary. Now, the FS_DAX implementation wants to reuse this mechanism for receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific static-key into a generic mechanism that either HMM or FS_DAX code paths can enable. For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support, care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure. However, we still need to support FS_DAX in the FS_DAX_LIMITED case implemented by the s390/dcssblk driver. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Reported-by: Dave Jiang <dave.jiang@intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2018-05-22 06:59:39 -07:00
Al Viro	837f3ec692	Merge branch 'work.misc' into work.lookup	2018-05-21 17:43:32 -04:00
David S. Miller	6f6e434aa2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net S390 bpf_jit.S is removed in net-next and had changes in 'net', since that code isn't used any more take the removal. TLS data structures split the TX and RX components in 'net-next', put the new struct members from the bug fix in 'net' into the RX part. The 'net-next' tree had some reworking of how the ERSPAN code works in the GRE tunneling code, overlapping with a one-line headroom calculation fix in 'net'. Overlapping changes in __sock_map_ctx_update_elem(), keep the bits that read the prog members via READ_ONCE() into local variables before using them. Signed-off-by: David S. Miller <davem@davemloft.net>	2018-05-21 16:01:54 -04:00
Linus Torvalds	5997aab0a1	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes all over the place" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: aio: fix io_destroy(2) vs. lookup_ioctx() race ext2: fix a block leak nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed unfuck sysfs_mount() kernfs: deal with kernfs_fill_super() failures cramfs: Fix IS_ENABLED typo befs_lookup(): use d_splice_alias() affs_lookup: switch to d_splice_alias() affs_lookup(): close a race with affs_remove_link() fix breakage caused by d_find_alias() semantics change fs: don't scan the inode cache before SB_BORN is set do d_instantiate/unlock_new_inode combinations safely iov_iter: fix memory leak in pipe_get_pages_alloc() iov_iter: fix return type of __pipe_get_pages()	2018-05-21 11:54:57 -07:00
Al Viro	baf10564fb	aio: fix io_destroy(2) vs. lookup_ioctx() race kill_ioctx() used to have an explicit RCU delay between removing the reference from ->ioctx_table and percpu_ref_kill() dropping the refcount. At some point that delay had been removed, on the theory that percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was the wrong kind of RCU delay and it didn't care about rcu_read_lock() used by lookup_ioctx(). As the result, we could get ctx freed right under lookup_ioctx(). Tejun has fixed that in `a6d7cff472` ("fs/aio: Add explicit RCU grace period when freeing kioctx"); however, that fix is not enough. Suppose io_destroy() from one thread races with e.g. io_setup() from another; CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2 has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the refcount, getting it to 0 and triggering a call of free_ioctx_users(), which proceeds to drop the secondary refcount and once that reaches zero calls free_ioctx_reqs(). That does INIT_RCU_WORK(&ctx->free_rwork, free_ioctx); queue_rcu_work(system_wq, &ctx->free_rwork); and schedules freeing the whole thing after RCU delay. In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the refcount from 0 to 1 and returned the reference to io_setup(). Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get freed until after percpu_ref_get(). Sure, we'd increment the counter before ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it has grabbed the reference, ctx is NOT going away until it gets around to dropping that reference. The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss. 
It's not costlier than what we currently do in normal case, it's safe to call since freeing is delayed and it closes the race window - either lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx() fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see the object in question at all. Cc: stable@kernel.org Fixes: `a6d7cff472` "fs/aio: Add explicit RCU grace period when freeing kioctx" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-05-21 14:30:11 -04:00
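
The last entry above (`baf10564fb`) turns on the difference between an unconditional reference grab and one that refuses once the object has been killed. The sketch below is a deliberately simplified, single-threaded model of that difference, not the kernel's implementation: `struct toy_ref` and every `toy_*` function are hypothetical names invented for illustration, and the model ignores the RCU and per-CPU machinery that the real `percpu_ref` depends on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded stand-in for a percpu_ref: a count plus a "killed"
 * flag. Hypothetical type; the real one uses per-CPU counters and RCU. */
struct toy_ref {
    long count;
    bool dead;
};

static void toy_ref_init(struct toy_ref *r)
{
    r->count = 1;      /* the reference held by the lookup table */
    r->dead = false;
}

/* Models a kill: mark the ref dead and drop the table's reference. */
static void toy_ref_kill(struct toy_ref *r)
{
    r->dead = true;
    r->count--;
}

/* Unconditional get: will bump 0 -> 1 on an object already being torn down,
 * which is the behaviour the commit message describes as the bug. */
static void toy_ref_get(struct toy_ref *r)
{
    r->count++;
}

/* Conditional get: refuses once the ref has been killed. */
static bool toy_ref_tryget_live(struct toy_ref *r)
{
    if (r->dead)
        return false;
    r->count++;
    return true;
}

/* Lookup in the fixed style: a failed tryget is treated as a miss. */
static struct toy_ref *toy_lookup(struct toy_ref *r)
{
    return toy_ref_tryget_live(r) ? r : NULL;
}
```

In this model a lookup that happens after `toy_ref_kill()` returns `NULL` and leaves the count untouched, whereas `toy_ref_get()` would resurrect a count that had already reached zero. That is the window the commit message says is closed by treating a failed `percpu_ref_tryget_live()` as a miss.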

54492 Commits