mirror of https://gitee.com/openkylin/linux.git
565 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Linus Torvalds | 31466f3ed7 |
for-4.16-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct 8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1 DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE= =Irul -----END PGP SIGNATURE----- Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features or user visible changes: - fallocate: implement zero range mode - avoid losing data raid profile when deleting a device - tree item checker: more checks for directory items and xattrs Notable fixes: - raid56 recovery: don't use cached stripes, that could be potentially changed and a later RMW or recovery would lead to corruptions or failures - let raid56 try harder to rebuild damaged data, reading from all stripes if necessary - fix scrub to repair raid56 in a similar way as in the case above Other: - cleanups: device freeing, removed some call indirections, redundant bio_put/_get, unused parameters, refactorings and renames - RCU list traversal fixups - simplify mount callchain, remove recursing back when mounting a subvolume - plug for fsync, may improve bio merging on multiple devices - compression heurisic: replace heap sort with radix sort, gains some performance - add extent map selftests, buffered write vs dio" * tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits) btrfs: drop devid as device_list_add() arg btrfs: get device pointer from device_list_add() btrfs: set the total_devices in device_list_add() btrfs: move pr_info into device_list_add btrfs: make btrfs_free_stale_devices() to match the path btrfs: rename btrfs_free_stale_devices() arg to skip_dev btrfs: make btrfs_free_stale_devices() argument optional btrfs: make btrfs_free_stale_device() to iterate all stales btrfs: no need to check for btrfs_fs_devices::seeding btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it Btrfs: noinline merge_extent_mapping Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping Btrfs: extent map selftest: dio write vs dio read Btrfs: extent map selftest: buffered write vs dio read Btrfs: add extent map selftests Btrfs: move extent map specific code to extent_map.c Btrfs: add helper for em merge logic Btrfs: fix unexpected EEXIST from btrfs_get_extent Btrfs: fix incorrect block_len in merge_extent_mapping btrfs: Remove unused readahead spinlock ... |
|
Jeff Layton | ae5e165d85 |
fs: new API for handling inode->i_version
Add a documentation blob that explains what the i_version field is, how it is expected to work, and how it is currently implemented by various filesystems. We already have inode_inc_iversion. Add several other functions for manipulating and accessing the i_version counter. For now, the implementation is trivial and basically works the way that all of the open-coded i_version accesses work today. Future patches will convert existing users of i_version to use the new API, and then convert the backend implementation to do things more efficiently. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> |
|
Filipe Manana | 81fdf6382b |
Btrfs: fix space leak after fallocate and zero range operations
If we do a buffered write after a zero range operation that has an unaligned (with the filesystem's sector size) end which also falls within an unwritten (prealloc) extent that is currently beyond the inode's i_size, and the zero range operation has the flag FALLOC_FL_KEEP_SIZE, we end up leaking data and metadata space. This happens because when zeroing a range we call btrfs_truncate_block(), which does delalloc (loads the page and partially zeroes its content), and in the buffered write path we only clear existing delalloc space reservation for the range we are writing into if that range starts at an offset smaller then the inode's i_size, which makes sense since we can not have delalloc extents beyond the i_size, only unwritten extents are allowed. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "falloc -k 428K 4K" /mnt/foobar $ xfs_io -c "fzero -k 0 430K" /mnt/foobar $ xfs_io -c "pwrite -S 0xaa 428K 4K" /mnt/foobar $ umount /mnt After the unmount we get the metadata and data space leaks reported in dmesg/syslog: [95794.602253] ------------[ cut here ]------------ [95794.603322] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9561 btrfs_destroy_inode+0x4e/0x206 [btrfs] [95794.605167] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.613000] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.614448] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.615972] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.617114] RIP: 0010:btrfs_destroy_inode+0x4e/0x206 [btrfs] [95794.618001] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202 [95794.618721] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.619645] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.620711] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.621932] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.623124] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.624188] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.625578] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.626522] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.627647] Call Trace: [95794.628128] destroy_inode+0x3d/0x55 [95794.628573] evict+0x177/0x17e [95794.629010] dispose_list+0x50/0x71 [95794.629478] evict_inodes+0x132/0x141 [95794.630289] generic_shutdown_super+0x3f/0x10b [95794.630864] kill_anon_super+0x12/0x1c [95794.631383] btrfs_kill_super+0x16/0x21 [btrfs] [95794.631930] deactivate_locked_super+0x30/0x68 [95794.632539] deactivate_super+0x36/0x39 [95794.633200] cleanup_mnt+0x49/0x67 [95794.633818] __cleanup_mnt+0x12/0x14 [95794.634416] task_work_run+0x82/0xa6 [95794.634902] prepare_exit_to_usermode+0xe1/0x10c [95794.635525] syscall_return_slowpath+0x18c/0x1af [95794.636122] entry_SYSCALL_64_fastpath+0xab/0xad [95794.636834] RIP: 0033:0x7fa678cb99a7 [95794.637370] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.638672] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.639596] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.640703] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.641773] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.643150] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.644249] Code: ff 4c 8b a8 80 06 00 00 48 8b 87 c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 <0f> ff 83 bb 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 [95794.646929] ---[ end trace e95877675c6ec007 ]--- [95794.647751] ------------[ cut here ]------------ [95794.648509] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9562 btrfs_destroy_inode+0x59/0x206 [btrfs] [95794.649842] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.654659] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.655894] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.657546] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.658433] RIP: 0010:btrfs_destroy_inode+0x59/0x206 [btrfs] [95794.659279] RSP: 0018:ffffc90001737d00 EFLAGS: 00010202 [95794.660054] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.660753] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.661513] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.662289] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.663393] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.664342] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.665673] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.666593] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.667629] Call Trace: [95794.668065] destroy_inode+0x3d/0x55 [95794.668637] evict+0x177/0x17e [95794.669179] dispose_list+0x50/0x71 [95794.669830] evict_inodes+0x132/0x141 [95794.670416] generic_shutdown_super+0x3f/0x10b [95794.671103] kill_anon_super+0x12/0x1c [95794.671786] btrfs_kill_super+0x16/0x21 [btrfs] [95794.672552] deactivate_locked_super+0x30/0x68 [95794.673393] deactivate_super+0x36/0x39 [95794.674107] cleanup_mnt+0x49/0x67 [95794.674706] __cleanup_mnt+0x12/0x14 [95794.675279] task_work_run+0x82/0xa6 [95794.675795] prepare_exit_to_usermode+0xe1/0x10c [95794.676507] syscall_return_slowpath+0x18c/0x1af [95794.677275] entry_SYSCALL_64_fastpath+0xab/0xad [95794.678006] RIP: 0033:0x7fa678cb99a7 [95794.678600] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.679739] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.680779] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.681837] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.682867] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.683891] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.684843] Code: c0 01 00 00 48 85 c0 74 02 0f ff 48 83 bb e0 02 00 00 00 74 02 0f ff 83 bb 3c ff ff ff 00 74 02 0f ff 83 bb 40 ff ff ff 00 74 02 <0f> ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff [95794.687156] ---[ end trace e95877675c6ec008 ]--- [95794.687876] ------------[ cut here ]------------ [95794.688579] WARNING: CPU: 0 PID: 31496 at fs/btrfs/inode.c:9565 btrfs_destroy_inode+0x7d/0x206 [btrfs] [95794.689735] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.695015] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.696396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.697956] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.698925] RIP: 0010:btrfs_destroy_inode+0x7d/0x206 [btrfs] [95794.699763] RSP: 0018:ffffc90001737d00 EFLAGS: 00010206 [95794.700434] RAX: 0000000000000000 RBX: ffff880070fa1418 RCX: ffffc90001737c7c [95794.701445] RDX: 0000000175aa0240 RSI: 0000000000000001 RDI: ffff880070fa1418 [95794.702448] RBP: ffffc90001737d38 R08: 0000000000000000 R09: 0000000000000000 [95794.703557] R10: ffffc90001737c48 R11: ffff88007123e158 R12: ffff880075b6a000 [95794.704441] R13: ffff88006145c000 R14: ffff880070fa1418 R15: ffff880070c3b4a0 [95794.705270] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.706341] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.707001] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.708030] Call Trace: [95794.708466] destroy_inode+0x3d/0x55 [95794.709071] evict+0x177/0x17e [95794.709497] dispose_list+0x50/0x71 [95794.709973] evict_inodes+0x132/0x141 [95794.710564] generic_shutdown_super+0x3f/0x10b [95794.711200] kill_anon_super+0x12/0x1c [95794.711633] btrfs_kill_super+0x16/0x21 [btrfs] [95794.712139] deactivate_locked_super+0x30/0x68 [95794.712608] deactivate_super+0x36/0x39 [95794.713093] cleanup_mnt+0x49/0x67 [95794.713514] __cleanup_mnt+0x12/0x14 [95794.713933] task_work_run+0x82/0xa6 [95794.714543] prepare_exit_to_usermode+0xe1/0x10c [95794.715247] syscall_return_slowpath+0x18c/0x1af [95794.715952] entry_SYSCALL_64_fastpath+0xab/0xad [95794.716653] RIP: 0033:0x7fa678cb99a7 [95794.721100] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.722052] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.722856] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.723698] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.724736] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.725928] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.726728] Code: 40 ff ff ff 00 74 02 0f ff 48 83 bb f8 fe ff ff 00 74 02 0f ff 48 83 bb 00 ff ff ff 00 74 02 0f ff 48 83 bb 30 ff ff ff 00 74 02 <0f> ff 48 83 bb 08 ff ff ff 00 74 02 0f ff 4d 85 e4 0f 84 52 01 [95794.729203] ---[ end trace e95877675c6ec009 ]--- [95794.841054] ------------[ cut here ]------------ [95794.841829] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5831 btrfs_free_block_groups+0x235/0x36a [btrfs] [95794.843425] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.850658] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.852590] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.854752] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.855812] RIP: 0010:btrfs_free_block_groups+0x235/0x36a [btrfs] [95794.856811] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.857805] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.859014] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.860270] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.861525] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000 [95794.862700] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100 [95794.863810] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.865149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.866099] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.867198] Call Trace: [95794.867626] close_ctree+0x1db/0x2b8 [btrfs] [95794.868188] ? evict_inodes+0x132/0x141 [95794.869037] btrfs_put_super+0x15/0x17 [btrfs] [95794.870400] generic_shutdown_super+0x6a/0x10b [95794.871262] kill_anon_super+0x12/0x1c [95794.872046] btrfs_kill_super+0x16/0x21 [btrfs] [95794.872746] deactivate_locked_super+0x30/0x68 [95794.873687] deactivate_super+0x36/0x39 [95794.874639] cleanup_mnt+0x49/0x67 [95794.875504] __cleanup_mnt+0x12/0x14 [95794.876126] task_work_run+0x82/0xa6 [95794.876788] prepare_exit_to_usermode+0xe1/0x10c [95794.877777] syscall_return_slowpath+0x18c/0x1af [95794.878381] entry_SYSCALL_64_fastpath+0xab/0xad [95794.878888] RIP: 0033:0x7fa678cb99a7 [95794.879307] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.880204] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.881640] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.882690] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.883538] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.884562] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.885664] Code: 89 ef e8 07 ec 32 e1 e8 9d c0 ea e0 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 <0f> ff 48 83 bb 88 02 00 00 00 74 02 0f ff 48 83 bb d8 02 00 00 [95794.887980] ---[ end trace e95877675c6ec00a ]--- [95794.888739] ------------[ cut here ]------------ [95794.889405] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:5832 btrfs_free_block_groups+0x241/0x36a [btrfs] [95794.891020] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.897551] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.898509] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.899685] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.900592] RIP: 0010:btrfs_free_block_groups+0x241/0x36a [btrfs] [95794.901387] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.902300] RAX: 0000000080000000 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.903260] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.904332] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.905300] R10: ffffc90001737c80 R11: 00000000000337fd R12: 0000000000000000 [95794.906439] R13: ffff88006145c0c0 R14: ffff88021b61a800 R15: ffff88006145c100 [95794.907459] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.908625] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.909511] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.910630] Call Trace: [95794.911153] close_ctree+0x1db/0x2b8 [btrfs] [95794.911837] ? evict_inodes+0x132/0x141 [95794.912344] btrfs_put_super+0x15/0x17 [btrfs] [95794.912975] generic_shutdown_super+0x6a/0x10b [95794.913788] kill_anon_super+0x12/0x1c [95794.914424] btrfs_kill_super+0x16/0x21 [btrfs] [95794.915142] deactivate_locked_super+0x30/0x68 [95794.915831] deactivate_super+0x36/0x39 [95794.916433] cleanup_mnt+0x49/0x67 [95794.917045] __cleanup_mnt+0x12/0x14 [95794.917665] task_work_run+0x82/0xa6 [95794.918309] prepare_exit_to_usermode+0xe1/0x10c [95794.919021] syscall_return_slowpath+0x18c/0x1af [95794.919722] entry_SYSCALL_64_fastpath+0xab/0xad [95794.920426] RIP: 0033:0x7fa678cb99a7 [95794.921039] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.922303] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.923335] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.924364] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.925435] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.926533] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.927557] Code: 48 8d b3 28 02 00 00 48 83 c9 ff 31 d2 48 89 df e8 29 c5 ff ff 48 83 bb 80 02 00 00 00 74 02 0f ff 48 83 bb 88 02 00 00 00 74 02 <0f> ff 48 83 bb d8 02 00 00 00 74 02 0f ff 48 83 bb e0 02 00 00 [95794.930166] ---[ end trace e95877675c6ec00b ]--- [95794.930961] ------------[ cut here ]------------ [95794.931727] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.932729] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.938394] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.939842] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.941455] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.942336] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.943268] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.944127] RAX: ffff8802004fd0e8 RBX: ffff88006145c000 RCX: 0000000000000001 [95794.945211] RDX: 00000001810af668 RSI: 0000000000000002 RDI: 00000000ffffffff [95794.946316] RBP: ffffc90001737d98 R08: 0000000000000000 R09: ffffffff817e22b9 [95794.947271] R10: ffffc90001737c80 R11: 00000000000337fd R12: ffff8802004fd0e8 [95794.948219] R13: ffff88006145c0c0 R14: ffff88006145e598 R15: ffff88006145c100 [95794.949193] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.950495] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.951338] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95794.952361] Call Trace: [95794.952811] close_ctree+0x1db/0x2b8 [btrfs] [95794.953522] ? evict_inodes+0x132/0x141 [95794.954543] btrfs_put_super+0x15/0x17 [btrfs] [95794.955231] generic_shutdown_super+0x6a/0x10b [95794.955916] kill_anon_super+0x12/0x1c [95794.956414] btrfs_kill_super+0x16/0x21 [btrfs] [95794.956953] deactivate_locked_super+0x30/0x68 [95794.957635] deactivate_super+0x36/0x39 [95794.958256] cleanup_mnt+0x49/0x67 [95794.958701] __cleanup_mnt+0x12/0x14 [95794.959181] task_work_run+0x82/0xa6 [95794.959635] prepare_exit_to_usermode+0xe1/0x10c [95794.960182] syscall_return_slowpath+0x18c/0x1af [95794.960731] entry_SYSCALL_64_fastpath+0xab/0xad [95794.961438] RIP: 0033:0x7fa678cb99a7 [95794.961990] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95794.963111] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95794.963975] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95794.964680] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95794.965763] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95794.966868] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95794.967800] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff [95794.970629] ---[ end trace e95877675c6ec00c ]--- [95794.971451] BTRFS info (device sdi): space_info 1 has 7680000 free, is not full [95794.972351] BTRFS info (device sdi): space_info total=8388608, used=704512, pinned=0, reserved=0, may_use=4096, readonly=0 [95794.973595] ------------[ cut here ]------------ [95794.974353] WARNING: CPU: 0 PID: 31496 at fs/btrfs/extent-tree.c:9953 btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.980163] Modules linked in: btrfs xfs ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper parport_pc psmouse sg i2c_piix4 parport i2c_core evdev pcspkr button serio_raw sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod virtio_scsi ata_generic crc32c_intel ata_piix floppy virtio_pci virtio_ring virtio libata scsi_mod e1000 [last unloaded: btrfs] [95794.986461] CPU: 0 PID: 31496 Comm: umount Tainted: G W 4.14.0-rc6-btrfs-next-54+ #1 [95794.987591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [95794.988929] task: ffff880075aa0240 task.stack: ffffc90001734000 [95794.989922] RIP: 0010:btrfs_free_block_groups+0x2bc/0x36a [btrfs] [95794.990715] RSP: 0018:ffffc90001737d70 EFLAGS: 00010206 [95794.991431] RAX: ffff88020f6e70e8 RBX: ffff88006145c000 RCX: ffffffff8115a906 [95794.992455] RDX: ffffffff8115a902 RSI: ffff880075aa0b40 RDI: ffff880075aa0b40 [95794.993535] RBP: ffffc90001737d98 R08: 0000000000000020 R09: fffffffffffffff7 [95794.994573] R10: 00000000ffffffc4 R11: ffff8800633b1bc0 R12: ffff88020f6e70e8 [95794.996250] R13: 0000000000000038 R14: ffff88006145e598 R15: 0000000000000000 [95794.997233] FS: 00007fa6793c92c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [95794.998592] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [95794.999484] CR2: 000056338670d048 CR3: 00000000610dc005 CR4: 00000000001606f0 [95795.000542] Call Trace: [95795.001138] close_ctree+0x1db/0x2b8 [btrfs] [95795.001885] ? evict_inodes+0x132/0x141 [95795.002407] btrfs_put_super+0x15/0x17 [btrfs] [95795.003093] generic_shutdown_super+0x6a/0x10b [95795.003720] kill_anon_super+0x12/0x1c [95795.004353] btrfs_kill_super+0x16/0x21 [btrfs] [95795.005095] deactivate_locked_super+0x30/0x68 [95795.005716] deactivate_super+0x36/0x39 [95795.006388] cleanup_mnt+0x49/0x67 [95795.006939] __cleanup_mnt+0x12/0x14 [95795.007512] task_work_run+0x82/0xa6 [95795.008124] prepare_exit_to_usermode+0xe1/0x10c [95795.008994] syscall_return_slowpath+0x18c/0x1af [95795.009831] entry_SYSCALL_64_fastpath+0xab/0xad [95795.010610] RIP: 0033:0x7fa678cb99a7 [95795.011193] RSP: 002b:00007ffccf0aaed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [95795.012327] RAX: 0000000000000000 RBX: 0000563386706030 RCX: 00007fa678cb99a7 [95795.013432] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000056338670ca90 [95795.014558] RBP: 000056338670ca90 R08: 000056338670c740 R09: 0000000000000015 [95795.015577] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fa6791bae64 [95795.016569] R13: 0000000000000000 R14: 0000563386706210 R15: 00007ffccf0ab160 [95795.017662] Code: 00 00 00 4c 8b a3 98 25 00 00 49 83 bc 24 60 ff ff ff 00 75 16 49 83 bc 24 68 ff ff ff 00 75 0b 49 83 bc 24 70 ff ff ff 00 74 16 <0f> ff 49 8d b4 24 18 ff ff ff 31 c9 31 d2 48 89 df e8 93 7a ff [95795.020538] ---[ end trace e95877675c6ec00d ]--- [95795.021259] BTRFS info (device sdi): space_info 4 has 1072775168 free, is not full [95795.022390] BTRFS info (device sdi): space_info total=1073741824, used=114688, pinned=0, reserved=0, may_use=786432, readonly=65536 Fix this by ensuring the zero range operation does not call btrfs_truncate_block() if the corresponding extent is an unwritten one (it's pointless anyway, since reading from an unwritten extent yields zeroes). Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | 9f13ce743b |
Btrfs: fix missing inode i_size update after zero range operation
For a fallocate's zero range operation that targets a range with an end that is not aligned to the sector size, we can end up not updating the inode's i_size. This happens when the last page of the range maps to an unwritten (prealloc) extent and before that last page we have either a hole or a written extent. This is because in this scenario we relied on a call to btrfs_prealloc_file_range() to update the inode's i_size, however it can only update the i_size to the "down aligned" end of the range. Example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar $ xfs_io -c "falloc -k 428K 4K" /mnt/foobar $ xfs_io -c "fzero 0 430K" /mnt/foobar $ du --bytes /mnt/foobar 438272 /mnt/foobar The inode's i_size was left as 428Kb (438272 bytes) when it should have been updated to 430Kb (440320 bytes). Fix this by always updating the inode's i_size explicitly after zeroing the range. Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | 94f450712a |
Btrfs: use cached state when dirtying pages during buffered write
During a buffered IO write, we can have an extent state that we got when we locked the range (if the range starts at an offset lower than eof), so always pass it to btrfs_dirty_pages() so that setting the delalloc bit in the range does not need to do a full search in the inode's io tree, saving time and reducing the amount of time we hold the io tree's lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | f27451f229 |
Btrfs: add support for fallocate's zero range operation
This implements support the zero range operation of fallocate. For now at least it's as simple as possible while reusing most of the existing fallocate and hole punching infrastructure. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | e43bbe5e16 |
btrfs: sink unlock_extent parameter gfp_flags
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com> |
|
Liu Bo | 343e4fc1c6 |
Btrfs: set plug for fsync
Setting plug can merge adjacent IOs before dispatching IOs to the disk driver. Without plug, it'd not be a problem for single disk usecases, but for multiple disks using raid profile, a large IO can be split to several IOs of stripe length, and plug can be helpful to bring them together for each disk so that we can save several disk access. Moreover, fsync issues synchronous writes, so plug can really take effect. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | ae0f162534 |
btrfs: sink gfp parameter to clear_extent_bit
All callers use GFP_NOFS, we don't have to pass it as an argument. The built-in tests pass GFP_KERNEL, but they run only at module load time and NOFS works there as well. Signed-off-by: David Sterba <dsterba@suse.com> |
|
Liu Bo | f5c29bd9db |
Btrfs: add __init macro to btrfs init functions
Adding __init macro gives kernel a hint that this function is only used during the initialization phase and its memory resources can be freed up after. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 96b09dde92 |
btrfs: Use locked_end rather than open coding it
Right before we go into this loop locked_end is set to alloc_end - 1 and is being used in nearby functions, no need to have exceptions. This just makes the code consistent, no functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 6b7d6e9334 |
btrfs: Move loop termination condition in while()
Fallocating a file in btrfs goes through several stages. The one before actually inserting the fallocated extents is to create a qgroup reservation, covering the desired range. To this end there is a loop in btrfs_fallocate which checks to see if there are holes in the fallocated range or !PREALLOC extents past EOF and if so create qgroup reservations for them. Unfortunately, the main condition of the loop is burried right at the end of its body rather than in the actual while statement which makes it non-obvious. Fix this by moving the condition in the while statement where it belongs. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Liu Bo | ebb70442cd |
Btrfs: fix list_add corruption and soft lockups in fsync
Xfstests btrfs/146 revealed this corruption,
[ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[ 58.153518] ------------[ cut here ]------------
[ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[ 58.161956] Call Trace:
[ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[ 58.164393] vfs_fsync_range+0x5f/0xd0
[ 58.164898] do_fsync+0x5a/0x90
[ 58.165170] SyS_fsync+0x10/0x20
[ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe
...
It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.
This returns the correct error in the above case.
Jeff also reported this while testing against his fsync error
patch set[1].
[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"
Fixes:
|
|
Filipe Manana | e3b8a48585 |
Btrfs: fix reported number of inode blocks after buffered append writes
The patch from commit |
|
Filipe Manana | f48bf66b66 |
Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Move the definition of the function btrfs_find_new_delalloc_bytes() closer to the function btrfs_dirty_pages(), because in a future commit it will be used exclusively by btrfs_dirty_pages(). This just moves the function's definition, with no functional changes at all. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Josef Bacik | 8b62f87bad |
Btrfs: rework outstanding_extents
Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents. Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1 for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1 In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0 This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Goldwyn Rodrigues | 79f015f216 |
btrfs: cleanup extent locking sequence
Code cleanup for better understanding: Variable needs_unlock to be called extent_locked to show state as opposed to action. Changed the type to int, to reduce code in the critical path. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Josef Bacik | 84f7d8e624 |
btrfs: pass root to various extent ref mod functions
We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Thomas Meyer | 897ca8194c |
btrfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Linus Torvalds | e253d98f5b |
Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nowait read support from Al Viro: "Support IOCB_NOWAIT for buffered reads and block devices" * 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: block_dev: support RFW_NOWAIT on block device nodes fs: support RWF_NOWAIT for buffered reads fs: support IOCB_NOWAIT in generic_file_buffered_read fs: pass iocb to do_generic_file_read |
|
Christoph Hellwig | 91f9943e1c |
fs: support RWF_NOWAIT for buffered reads
This is based on the old idea and code from Milosz Tanski. With the aio nowait code it becomes mostly trivial now. Buffered writes continue to return -EOPNOTSUPP if RWF_NOWAIT is passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Josef Bacik | 23b5ec7494 |
btrfs: fix readdir deadlock with pagefault
Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | ea14b57fd1 |
btrfs: fix spelling of snapshotting
Signed-off-by: David Sterba <dsterba@suse.com> |
|
Linus Torvalds | 6618a24ab2 |
Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba: "This fixes a user-visible bug introduced by the nowait-aio patches merged in this cycle" * 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: nowait aio: Correct assignment of pos |
|
Goldwyn Rodrigues | ff0fa73247 |
btrfs: nowait aio: Correct assignment of pos
Assigning pos for usage early messes up in append mode, where the pos is
re-assigned in generic_write_checks(). Assign pos later to get the
correct position to write from iocb->ki_pos.
Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
are present in generic_write_checks(), so checking for IOCB_NOWAIT is
enough.
Also, put locking sequence in the fast path.
This fixes a user visible bug, as reported:
"apparently breaks several shell related features on my system.
In zsh history stopped working, because no new entries are added
anymore.
I fist noticed the issue when I tried to build mplayer. It uses a shell
script to generate a help_mp.h file:
[...]
Here is a simple testcase:
% echo "foo" >> test
% echo "foo" >> test
% cat test
foo
%
"
Fixes:
|
|
Linus Torvalds | 088737f44b |
Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0 yKWjj9wfYRQ0vSUqhsI5 =3Z93 -----END PGP SIGNATURE----- Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty |
|
Jeff Layton | 333427a505 |
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
Just check and advance the errseq_t in the file before returning, and use an errseq_t based check for writeback errors. Other internal callers of filemap_* functions are left as-is. Signed-off-by: Jeff Layton <jlayton@redhat.com> |
|
Linus Torvalds | 8c27cb3566 |
Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ... |
|
Qu Wenruo | bc42bda223 |
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
[BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Qu Wenruo | 364ecf3651 |
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | 609805d809 |
Btrfs: fix invalid extent maps due to hole punching
While punching a hole in a range that is not aligned with the sector size (currently the same as the page size) we can end up leaving an extent map in memory with a length that is smaller then the sector size or with a start offset that is not aligned to the sector size. Both cases are not expected and can lead to problems. This issue is easily detected after the patch from commit |
|
Goldwyn Rodrigues | edf064e7c6 |
btrfs: nowait aio support
Return EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Chris Mason | bce19f9d23 | Merge branch 'for-chris-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12 | |
Filipe Manana | a7e3b975a0 |
Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | be2d253cc9 |
Btrfs: fix extent map leak during fallocate error path
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an extent map structure. The failure can happen either due to an -ENOMEM condition or, when quotas are enabled, due to -EDQUOT for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> |
|
Dan Carpenter | 9986277e0e |
Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs_get_extent() never returns NULL pointers, so this code introduces a static checker warning. The btrfs_get_extent() is a bit complex, but trust me that it doesn't return NULLs and also if it did we would trigger the BUG_ON(!em) before the last return statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [ updated subject ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
Linus Torvalds | bbe08c0a43 |
Merge branch 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason: "Btrfs round two. These are mostly a continuation of Dave Sterba's collection of cleanups, but Filipe also has some bug fixes and performance improvements" * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits) btrfs: add dummy callback for readpage_io_failed and drop checks btrfs: drop checks for mandatory extent_io_ops callbacks btrfs: document existence of extent_io ops callbacks btrfs: let writepage_end_io_hook return void btrfs: do proper error handling in btrfs_insert_xattr_item btrfs: handle allocation error in update_dev_stat_item btrfs: remove BUG_ON from __tree_mod_log_insert btrfs: derive maximum output size in the compression implementation btrfs: use predefined limits for calculating maximum number of pages for compression btrfs: export compression buffer limits in a header btrfs: merge nr_pages input and output parameter in compress_pages btrfs: merge length input and output parameter in compress_pages btrfs: constify name of subvolume in creation helpers btrfs: constify buffers used by compression helpers btrfs: constify input buffer of btrfs_csum_data btrfs: constify device path passed to relevant helpers btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode btrfs: Make btrfs_add_nondir take btrfs_inode btrfs: Make btrfs_add_link take btrfs_inode ... |
|
Nikolay Borisov | fc4f21b1d8 |
btrfs: Make get_extent_t take btrfs_inode
In addition to changing the signature, this patch also switches all the functions which are used as an argument to also take btrfs_inode. Namely those are: btrfs_get_extent and btrfs_get_extent_filemap. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 2cff578cfc |
btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 85b7ab6705 |
btrfs: Make check_can_nocow take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | a776c6fa1f |
btrfs: Make btrfs_lookup_ordered_range take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 7a6d706795 |
btrfs: Make btrfs_mark_extent_written take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | a012a74e78 |
btrfs: Make fill_holes take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 35339c245b |
btrfs: Make hole_mergeable take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | dcdbc059f0 |
btrfs: Make btrfs_drop_extent_cache take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 46e5979183 |
btrfs: Make btrfs_requeue_inode_defrag take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 6158e1ce1c |
btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 691fa05967 |
btrfs: all btrfs_delalloc_release_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 9f3db423f9 |
btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Nikolay Borisov | 04f4f91653 |
btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |