linux/fs
Filipe Manana 885f46d87f btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).

That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.

This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.

Scenario 1 - logging overlapping extents

The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:

1) Task A starts an fsync on inode X, which has the full sync runtime flag
   set. First it starts by flushing all delalloc for the inode;

2) Task A then locks the inode and flushes any other delalloc that might
   have been created after the previous flush and waits for all ordered
   extents to complete;

3) In the inode's root we have the following leaf:

   Leaf N, generation == current transaction id:

   ---------------------------------------------------------
   | (...)  [ file extent item, offset 640K, length 128K ] |
   ---------------------------------------------------------

   The last file extent item in leaf N covers the file range from 640K to
   768K;

4) Task B does a memory mapped write for the page corresponding to the
   file range from 764K to 768K;

5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
   btrfs_search_forward() to search for leafs modified in the current
   transaction that contain items for the inode. It finds leaf N and copies
   all the inode items from that leaf into the log tree.

   Now the log tree has a copy of the last file extent item from leaf N.

   At the end of the while loop at copy_inode_items_to_log(), we have the
   minimum key set to:

   min_key.objectid = <inode X number>
   min_key.type = BTRFS_EXTENT_DATA_KEY
   min_key.offset = 640K

   Then we increment the key's offset by 1 so that the next call to
   btrfs_search_forward() leaves us at the first key greater than the key
   we just processed.

   But before btrfs_search_forward() is called again...

6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
   It can be started for several reasons:

     - The async reclaim task is attempting to satisfy metadata or data
       reservation requests, and it has reached a point where it decided
       to flush delalloc;
     - Due to memory pressure the VMM triggers writeback of dirty pages;
     - The system call sync_file_range(2) is called from user space.

7) When the respective ordered extent completes, it trims the length of
   the existing file extent item for file offset 640K from 128K to 124K,
   and a new file extent item is added with a key offset of 764K and a
   length of 4K;

8) Task A calls btrfs_search_forward(), which returns us a path pointing
   to the leaf (can be leaf N or some other) containing the new file extent
   item for file offset 764K.

   We end up copying this item to the log tree, which overlaps with the
   last copied file extent item, which covers the file range from 640K to
   768K.

   When writeback is triggered for log tree's extent buffers, the issue
   will be detected by the tree checker which will dump a trace and an
   error message on dmesg/syslog. If the writeback is triggered when
   syncing the log, which typically is, then we also end up aborting the
   current transaction.

This is the same type of problem fixed in 0c713cbab6 ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").

Scenario 2 - logging a version of the file that never existed

This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:

1) We have an inode I with a size of 1M and two file extent items, one
   covering an extent with disk_bytenr == X for the file range [0, 512K)
   and another one covering another extent with disk_bytenr == Y for the
   file range [512K, 1M);

2) A hole is punched for the file range [512K, 1M);

3) Task A starts an fsync of inode I, which has the full sync runtime flag
   set. It starts by flushing all existing delalloc, locks the inode (VFS
   lock), starts any new delalloc that might have been created before
   taking the lock and waits for all ordered extents to complete;

4) Some other task does a memory mapped write for the page corresponding to
   the file range [640K, 644K) for example;

5) Task A then logs all items of the inode with the call to
   copy_inode_items_to_log();

6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
   be started for several reasons:

     - The async reclaim task is attempting to satisfy metadata or data
       reservation requests, and it has reached a point where it decided
       to flush delalloc;
     - Due to memory pressure the VMM triggers writeback of dirty pages;
     - The system call sync_file_range(2) is called from user space.

7) The ordered extent for the range [640K, 644K) completes and a file
   extent item for that range is added to the subvolume tree, pointing
   to a 4K extent with a disk_bytenr == Z;

8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
   the subvolume tree. It finds two implicit holes:

   - one for the file range [512K, 640K)
   - one for the file range [644K, 1M)

   As a result we end up neither logging a hole for the range [640K, 644K)
   nor logging the file extent item with a disk_bytenr == Z.
   This means that if we have a power failure and replay the log tree we
   end up getting the following file extent layout:

   [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Y ]    [  hole  ]
   0             512K  512K      640K  640K           644K  644K     1M

   Which does not corresponding to any layout the file ever had before
   the power failure. The only two valid layouts would be:

   [ disk_bytenr X ]    [   hole   ]
   0             512K  512K        1M

   and

   [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Z ]    [  hole  ]
   0             512K  512K      640K  640K           644K  644K     1M

This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:

1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
   in the inode's io tree. This prevents the race but also blocks any reads
   during the duration of the fsync, which has a negative impact for many
   common workloads;

2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
   semaphore was recently added by Josef's patch set:

   btrfs: add a i_mmap_lock to our inode
   btrfs: cleanup inode_lock/inode_unlock uses
   btrfs: exclude mmaps while doing remap
   btrfs: exclude mmap from happening during all fallocate operations

   and is used to solve races between memory mapped writes and
   clone/dedupe/fallocate. This also makes us have the same behaviour we
   have regarding other writes (buffered and direct IO) and fsync - block
   them while the inode logging is in progress.

This change uses the second approach due to the performance impact of the
first one.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
..
9p Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
adfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
affs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
afs afs: Use wait_on_page_writeback_killable 2021-03-23 20:54:37 +00:00
autofs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
befs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
bfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
btrfs btrfs: fix race between memory mapped writes and fsync 2021-04-19 17:25:15 +02:00
cachefiles cachefiles, afs: mm wait fixes 2021-03-24 10:22:00 -07:00
ceph idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
cifs cifs: escape spaces in share names 2021-04-07 21:30:27 -05:00
coda fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
configfs configfs: fix a use-after-free in __configfs_open_file 2021-03-11 12:13:48 +01:00
cramfs cramfs: use %pD instead of messing with file_dentry()->d_name 2021-01-05 23:02:47 -05:00
crypto block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
debugfs Driver core / debugfs update for 5.12-rc1 2021-02-24 10:13:55 -08:00
devpts
dlm fs: dlm: check on existing node address 2020-11-10 12:14:20 -06:00
ecryptfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
efivarfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
efs [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
erofs Change since last update: 2021-03-13 12:26:22 -08:00
exfat idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
exportfs exportfs: Add a function to return the raw output from fh_to_dentry() 2020-12-09 09:39:38 -05:00
ext2 fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
ext4 Miscellaneous ext4 bug fixes for v5.12. 2021-03-21 14:06:10 -07:00
f2fs block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
fat idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
freevxfs
fscache
fuse fuse: 32-bit user space ioctl compat for fuse device 2021-03-16 15:20:16 +01:00
gfs2 Two more gfs2 fixes 2021-04-03 12:15:01 -07:00
hfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
hfsplus idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
hostfs hostfs: fix memory handling in follow_link() 2021-03-25 18:57:42 -04:00
hpfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
hugetlbfs hugetlbfs: remove unneeded return value of hugetlb_vmtruncate() 2021-02-24 13:38:35 -08:00
iomap Merge branch 'iomap-5.12-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux 2021-03-18 10:37:30 -07:00
isofs isofs: handle large user and group ID 2021-02-03 19:05:52 +01:00
jbd2 block: use an on-stack bio in blkdev_issue_flush 2021-01-27 09:51:48 -07:00
jffs2 idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
jfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
kernfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
lockd SUNRPC: Make trace_svc_process() display the RPC procedure symbolically 2021-01-25 09:36:23 -05:00
minix fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
nfs nfs: we don't support removing system.nfs4_acl 2021-03-11 13:17:42 -05:00
nfs_common NFSv4_2: SSC helper should use its own config. 2021-01-28 10:55:37 -05:00
nfsd NFSD: fix error handling in NFSv4.0 callbacks 2021-03-11 10:58:49 -05:00
nilfs2 block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
nls
notify idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
ntfs ntfs: check for valid standard information attribute 2021-02-24 13:38:26 -08:00
ocfs2 ocfs2: fix deadlock between setattr and dio_end_io_write 2021-04-09 14:54:23 -07:00
omfs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
openpromfs
orangefs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
overlayfs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
proc mm: use is_cow_mapping() across tree where proper 2021-03-13 11:27:30 -08:00
pstore pstore fixes for v5.12-rc2 2021-03-05 17:21:25 -08:00
qnx4 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
qnx6 [PATCH] reduce boilerplate in fsid handling 2020-09-18 16:45:50 -04:00
quota quota: Fix memory leak when handling corrupted quota file 2021-01-05 14:42:18 +01:00
ramfs ramfs: support O_TMPFILE 2021-02-24 13:38:26 -08:00
reiserfs reiserfs: update reiserfs_xattrs_initialized() condition 2021-03-30 14:27:32 -07:00
romfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-10-24 12:26:05 -07:00
squashfs squashfs: fix xattr id and id lookup sanity checks 2021-03-25 09:22:55 -07:00
sysfs sysfs: Support zapping of binary attr mmaps 2021-01-12 14:26:31 +01:00
sysv fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
tracefs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
ubifs idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
udf idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
ufs fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
unicode unicode: Add utf8_casefold_hash 2020-09-10 14:03:31 -07:00
vboxsf fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
verity idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
xfs xfs: also reject BULKSTAT_SINGLE in a mount user namespace 2021-03-15 08:50:41 -07:00
zonefs zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() 2021-03-17 08:56:50 +09:00
Kconfig s390,alpha: make TMPFS_INODE64 available again 2021-03-08 10:46:30 +01:00
Kconfig.binfmt Merge branch 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-21 09:29:23 -08:00
Makefile fs: Remove dcookies support 2021-01-29 10:06:46 +05:30
aio.c Merge branch 'akpm' (patches from Andrew) 2020-12-15 12:53:37 -08:00
anon_inodes.c fs: anon_inodes: rephrase to appropriate kernel-doc 2021-01-15 12:17:25 -05:00
attr.c ima: handle idmapped mounts 2021-01-24 14:27:20 +01:00
bad_inode.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
binfmt_aout.c
binfmt_elf.c Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2021-02-21 13:20:41 -08:00
binfmt_elf_fdpic.c Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2021-02-21 13:20:41 -08:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: revert "binfmt_flat: don't offset the data start" 2020-08-24 08:49:13 +10:00
binfmt_misc.c binfmt_misc: fix possible deadlock in bm_register_write 2021-03-13 11:27:30 -08:00
binfmt_script.c
block_dev.c block: don't ignore REQ_NOWAIT for direct IO 2021-04-02 08:34:30 -06:00
buffer.c fs: buffer: use raw page_memcg() on locked page 2021-02-24 13:38:30 -08:00
char_dev.c
compat_binfmt_elf.c get rid of COMPAT_ELF_EXEC_PAGESIZE 2021-01-06 08:42:51 -05:00
coredump.c fs/coredump: use kmap_local_page() 2021-02-26 09:41:05 -08:00
d_path.c fs: fix NULL dereference due to data race in prepend_path() 2020-10-14 14:54:45 -07:00
dax.c mm: provide a saner PTE walking API for modules 2021-02-09 07:05:44 -05:00
dcache.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c
eventfd.c eventfd: Export eventfd_ctx_do_read() 2020-11-15 09:49:10 -05:00
eventpoll.c kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE 2021-02-16 09:59:41 +01:00
exec.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
fcntl.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
fhandle.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
file.c file: fix close_range() for unshare+cloexec 2021-04-02 14:11:10 +02:00
file_table.c epoll: take epitem list out of struct file 2020-10-25 20:02:08 -04:00
filesystems.c
fs-writeback.c fs: improve comments for writeback_single_inode() 2021-01-13 17:26:50 +01:00
fs_context.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
fs_parser.c fs_parse: mark fs_param_bad_value() as static 2020-10-13 18:38:27 -07:00
fs_pin.c
fs_struct.c vfs: Use sequence counter with associated spinlock 2020-07-29 16:14:27 +02:00
fs_types.c
fsopen.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
init.c init: handle idmapped mounts 2021-01-24 14:27:19 +01:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
internal.h idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
io-wq.c io-wq: cancel unbounded works on io-wq destroy 2021-04-08 13:33:17 -06:00
io-wq.h io_uring: remove structures from include/linux/io_uring.h 2021-03-18 09:44:35 -06:00
io_uring.c io_uring: fix early sqd_list removal sqpoll hangs 2021-04-14 13:07:27 -06:00
ioctl.c fs: remove ksys_ioctl 2020-07-31 08:16:01 +02:00
kernel_read_file.c fs/kernel_file_read: Add "offset" arg for partial reads 2020-10-05 13:37:04 +02:00
libfs.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
locks.c Revert "nfsd4: a client's own opens needn't prevent delegations" 2021-03-09 10:37:34 -05:00
mbcache.c
mount.h mount: make {lock,unlock}_mount_hash() static 2021-01-24 14:29:34 +01:00
mpage.c block: rename BIO_MAX_PAGES to BIO_MAX_VECS 2021-03-11 07:47:48 -07:00
namei.c LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late 2021-04-06 20:33:00 -04:00
namespace.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-02-27 08:07:12 -08:00
no-block.c
nsfs.c
open.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
pipe.c fs: delete repeated words in comments 2021-02-24 13:38:26 -08:00
pnode.c
pnode.h mount: fix mounting of detached mounts onto targets that reside on shared mounts 2021-03-08 15:18:43 +01:00
posix_acl.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
proc_namespace.c fs: introduce MOUNT_ATTR_IDMAP 2021-01-24 14:43:45 +01:00
read_write.c teach sendfile(2) to handle send-to-pipe directly 2021-01-25 23:29:36 -05:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c ioctl: handle idmapped mounts 2021-01-24 14:27:19 +01:00
select.c kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() 2021-03-16 22:13:10 +01:00
seq_file.c fs: fix kernel-doc markups 2021-01-21 14:06:00 -07:00
signalfd.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
splice.c for-5.12/block-2021-02-17 2021-02-21 11:02:48 -08:00
stack.c
stat.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
statfs.c s390,alpha: switch to 64-bit ino_t 2021-02-13 17:17:53 +01:00
super.c It has been a relatively quiet cycle in docsland. 2021-02-22 10:57:46 -08:00
sync.c
timerfd.c
userfaultfd.c userfaultfd: use secure anon inodes for userfaultfd 2021-01-14 17:40:57 -05:00
utimes.c utimes: handle idmapped mounts 2021-01-24 14:27:18 +01:00
xattr.c namei: handle idmapped mounts in may_*() helpers 2021-01-24 14:27:17 +01:00