mirror of https://gitee.com/openkylin/linux.git
186 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Dan Carpenter | 457ae7268b |
Btrfs: fix an integer overflow check
This isn't super serious because you need CAP_ADMIN to run this code.
I added this integer overflow check last year but apparently I am
rubbish at writing integer overflow checks... There are two issues.
First, access_ok() works on unsigned long type and not u64 so on 32 bit
systems the access_ok() could be checking a truncated size. The other
issue is that we should be using a stricter limit so we don't overflow
the kzalloc() setting ctx->clone_roots later in the function after the
access_ok():
alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
Fixes:
|
|
Filipe Manana | 82bfb2e7b6 |
Btrfs: incremental send, fix unnecessary hole writes for sparse files
When using the NO_HOLES feature, during an incremental send we often issue write operations for holes when we should not, because that range is already a hole in the destination snapshot. While that does not change the contents of the file at the receiver, it avoids preservation of file holes, leading to wasted disk space and extra IO during send/receive. A couple examples where the holes are not preserved follows. $ mkfs.btrfs -O no-holes -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 # Now add one new extent to our first test file, increasing its size and # leaving a 1Mb hole between the first extent and this new extent. $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo # Now overwrite the last extent of our second test file. $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25088..25095 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24576..24583 8 0x2001 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25096..25103 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24584..24591 8 0x2001 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt # It's not relevant to enable no-holes in the new filesystem. $ mkfs.btrfs -O no-holes -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive /mnt -f /tmp/2.snap $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24576..24583 8 0x2000 1: [8..2063]: 25624..27679 2056 0x1 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24584..24591 8 0x2000 1: [8..2063]: 27680..29735 2056 0x1 The holes do not exist in the second filesystem and they were replaced with extents filled with the byte 0x00, making each file take 1032Kb of space instead of 8Kb. So fix this by not issuing the write operations consisting of buffers filled with the byte 0x00 when the destination snapshot already has a hole for the respective range. A test case for fstests will follow soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robbie Ko | 0191410158 |
Btrfs: incremental send, do not issue invalid rmdir operations
When both the parent and send snapshots have a directory inode with the same number but different generations (therefore they are different inodes) and both have an entry with the same name, an incremental send stream will contain an invalid rmdir operation that refers to the orphanized name of the inode from the parent snapshot. The following example scenario shows how this happens. Parent snapshot: . |---- d259_old/ (ino 259, gen 9) | |---- d1/ (ino 258, gen 9) | |---- f (ino 257, gen 9) Send snapshot: . |---- d258/ (ino 258, gen 7) |---- d259/ (ino 259, gen 7) |---- d1/ (ino 257, gen 7) When the kernel is processing inode 258 it notices that in both snapshots there is an inode numbered 259 that is a parent of an inode 258. However it ignores the fact that the inodes numbered 259 have different generations in both snapshots, which means they are effectively different inodes. Then it checks that both inodes 259 have a dentry named "d1" and because of that it issues a rmdir operation with orphanized name of the inode 258 from the parent snapshot. This happens at send.c:process_record_refs(), which calls send.c:did_overwrite_first_ref() that returns true and because of that later on at process_recorded_refs() such rmdir operation is issued because the inode being currently processed (258) is a directory and it was deleted in the send snapshot (and replaced with another inode that has the same number and is a directory too). Fix this issue by comparing the generations of parent directory inodes that have the same number and make send.c:did_overwrite_first_ref() when the generations are different. The following steps reproduce the problem. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/f $ mkdir /mnt/d1 $ mkdir /mnt/d259_old $ mv /mnt/d1 /mnt/d259_old/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/d1 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/d1 /mnt/dir259/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt/ -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inodes from the second snapshot all have # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. This means all the inodes # in the first snapshot (snap1) have a generation value of 9. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive -vv /mnt -f /tmp/2.snap receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9 utimes unlink f mkdir o257-7-0 mkdir o259-7-0 rename o257-7-0 -> o259-7-0/d1 chown o259-7-0/d1 - uid=0, gid=0 chmod o259-7-0/d1 - mode=0755 utimes o259-7-0/d1 rmdir o258-9-0 ERROR: rmdir o258-9-0 failed: No such file or directory Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | fe9c798dbf |
Btrfs: incremental send, do not delay rename when parent inode is new
When we are checking if we need to delay the rename operation for an inode we not checking if a parent inode that exists in the send and parent snapshots is really the same inode or not, that is, we are not comparing the generation number of the parent inode in the send and parent snapshots. Not only this results in unnecessarily delaying a rename operation but also can later on make us generate an incorrect name for a new inode in the send snapshot that has the same number as another inode in the parent snapshot but a different generation. Here follows an example where this happens. Parent snapshot: . (ino 256, gen 3) |--- dir258/ (ino 258, gen 7) | |--- dir257/ (ino 257, gen 7) | |--- dir259/ (ino 259, gen 7) Send snapshot: . (ino 256, gen 3) |--- file258 (ino 258, gen 10) | |--- new_dir259/ (ino 259, gen 10) |--- dir257/ (ino 257, gen 7) The following steps happen when computing the incremental send stream: 1) When processing inode 257, its new parent is created using its orphan name (o257-21-0), and the rename operation for inode 257 is delayed because its new parent (inode 259) was not yet processed - this decision to delay the rename operation does not make much sense because the inode 259 in the send snapshot is a new inode, it's not the same as inode 259 in the parent snapshot. 2) When processing inode 258 we end up delaying its rmdir operation, because inode 257 was not yet renamed (moved away from the directory inode 258 represents). We also create the new inode 258 using its orphan name "o258-10-0", then rename it to its final name of "file258" and then issue a truncate operation for it. However this truncate operation contains an incorrect name, which corresponds to the orphan name and not to the final name, which makes the receiver fail. This happens because when we attempt to compute the inode's current name we verify that there's another inode with the same number (258) that has its rmdir operation pending and because of that we generate an orphan name for the new inode 258 (we do this in the function get_cur_path()). Fix this by not delayed the rename operation of an inode if it has parents with the same number but different generations in both snapshots. The following steps reproduce this example scenario. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir257 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/dir257 /mnt/dir258/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/dir258/dir257 /mnt/dir257 $ rmdir /mnt/dir258 $ rmdir /mnt/dir259 # Remount the filesystem so that the next created inodes will have the # numbers 258 and 259. This is because when a filesystem is mounted, # btrfs sets the subvolume's inode counter to a value corresponding to # the highest inode number in the subvolume plus 1. This inode counter # is used to assign a unique number to each new inode and it's # incremented by 1 after very inode creation. # Note: we unmount and then mount instead of doing a mount with # "-o remount" because otherwise the inode counter remains at value 260. $ umount /mnt $ mount /dev/sdb /mnt $ touch /mnt/file258 $ mkdir /mnt/new_dir259 $ mv /mnt/dir257 /mnt/new_dir259/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmo/1.snap $ btrfs receive /mnt -f /tmo/2.snap -vv receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7 utimes mkdir o259-10-0 rename dir258 -> o258-7-0 utimes mkfile o258-10-0 rename o258-10-0 -> file258 utimes truncate o258-10-0 size=0 ERROR: truncate o258-10-0 failed: No such file or directory Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robbie Ko | 4dd9920d99 |
Btrfs: send, fix failure to rename top level inode due to name collision
Under certain situations, an incremental send operation can fail due to a premature attempt to create a new top level inode (a direct child of the subvolume/snapshot root) whose name collides with another inode that was removed from the send snapshot. Consider the following example scenario. Parent snapshot: . (ino 256, gen 8) |---- a1/ (ino 257, gen 9) |---- a2/ (ino 258, gen 9) Send snapshot: . (ino 256, gen 3) |---- a2/ (ino 257, gen 7) In this scenario, when receiving the incremental send stream, the btrfs receive command fails like this (ran in verbose mode, -vv argument): rmdir a1 mkfile o257-7-0 rename o257-7-0 -> a2 ERROR: rename o257-7-0 -> a2 failed: Is a directory What happens when computing the incremental send stream is: 1) An operation to remove the directory with inode number 257 and generation 9 is issued. 2) An operation to create the inode with number 257 and generation 7 is issued. This creates the inode with an orphanized name of "o257-7-0". 3) An operation rename the new inode 257 to its final name, "a2", is issued. This is incorrect because inode 258, which has the same name and it's a child of the same parent (root inode 256), was not yet processed and therefore no rmdir operation for it was yet issued. The rename operation is issued because we fail to detect that the name of the new inode 257 collides with inode 258, because their parent, a subvolume/snapshot root (inode 256) has a different generation in both snapshots. So fix this by ignoring the generation value of a parent directory that matches a root inode (number 256) when we are checking if the name of the inode currently being processed collides with the name of some other inode that was not yet processed. We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. $ btrfs receive /mnt /tmp/2.snap Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Jeff Mahoney | 3a45bb207e |
btrfs: remove root parameter from transaction commit/end routines
Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Jeff Mahoney | 2ff7e61e0d |
btrfs: take an fs_info directly when the root is not used otherwise
There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Jeff Mahoney | 0b246afa62 |
btrfs: root->fs_info cleanup, add fs_info convenience variables
In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Jeff Mahoney | da17066c40 |
btrfs: pull node/sector/stripe sizes out of root and into fs_info
We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Linus Torvalds | f6167514c8 |
Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason: "My patch fixes the btrfs list_head abuse that we tracked down during Dave Jones' memory corruption investigation. With both Jens and my patches in place, I'm no longer able to trigger problems. Filipe is fixing a difficult old bug between snapshots, balance and send. Dave is cooking a few more for the next rc, but these are tested and ready" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix races on root_log_ctx lists btrfs: fix incremental send failure caused by balance |
|
Filipe Manana | d5e84fd8d0 |
Btrfs: fix incremental send failure caused by balance
Commit
|
|
Linus Torvalds | f29135b54b |
Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason: "This is a big variety of fixes and cleanups. Liu Bo continues to fixup fuzzer related problems, and some of Josef's cleanups are prep for his bigger extent buffer changes (slated for v4.10)" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits) Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf Btrfs: don't BUG() during drop snapshot btrfs: fix btrfs_no_printk stub helper Btrfs: memset to avoid stale content in btree leaf btrfs: parent_start initialization cleanup btrfs: Remove already completed TODO comment btrfs: Do not reassign count in btrfs_run_delayed_refs btrfs: fix a possible umount deadlock Btrfs: fix memory leak in do_walk_down btrfs: btrfs_debug should consume fs_info when DEBUG is not defined btrfs: convert send's verbose_printk to btrfs_debug btrfs: convert pr_* to btrfs_* where possible btrfs: convert printk(KERN_* to use pr_* calls btrfs: unsplit printed strings btrfs: clean the old superblocks before freeing the device Btrfs: kill BUG_ON in run_delayed_tree_ref Btrfs: don't leak reloc root nodes on error btrfs: squash lines for simple wrapper functions Btrfs: improve check_node to avoid reading corrupted nodes ... |
|
Andreas Gruenbacher | 2211d5ba5c |
posix_acl: xattr representation cleanups
Remove the unnecessary typedefs and the zero-length a_entries array in struct posix_acl_xattr_header. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Jeff Mahoney | 04ab956ee6 |
btrfs: convert send's verbose_printk to btrfs_debug
This was basically an open-coded, less flexible dynamic printk. We can just use btrfs_debug instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Jeff Mahoney | 5d163e0e68 |
btrfs: unsplit printed strings
CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Masahiro Yamada | e2c8990734 |
btrfs: squash lines for simple wrapper functions
Remove unneeded variables and assignments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Josef Bacik | 3dc09ec895 |
Btrfs: kill invalid ASSERT() in process_all_refs()
Suppose you have the following tree in snap1 on a file system mounted with -o inode_cache so that inode numbers are recycled └── [ 258] a └── [ 257] b and then you remove b, rename a to c, and then re-create b in c so you have the following tree └── [ 258] c └── [ 257] b and then you try to do an incremental send you will hit ASSERT(pending_move == 0); in process_all_refs(). This is because we assume that any recycling of inodes will not have a pending change in our path, which isn't the case. This is the case for the DELETE side, since we want to remove the old file using the old path, but on the create side we could have a pending move and need to do the normal pending rename dance. So remove this ASSERT() and put a comment about why we ignore pending_move. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | 951555856b |
Btrfs: send, don't bug on inconsistent snapshots
When doing an incremental send, if we find a new/modified/deleted extent, reference or xattr without having previously processed the corresponding inode item we end up exexuting a BUG_ON(). This is because whenever an extent, xattr or reference is added, modified or deleted, we always expect to have the corresponding inode item updated. However there are situations where this will not happen due to transient -ENOMEM or -ENOSPC errors when doing delayed inode updates. For example, when punching holes we can succeed in deleting and modifying (shrinking) extents but later fail to do the delayed inode update. So after such failure we close our transaction handle and right after a snapshot of the fs/subvol tree can be made and used later for a send operation. The same thing can happen during truncate, link, unlink, and xattr related operations. So instead of executing a BUG_ON, make send return an -EIO error and print an informative error message do dmesg/syslog. Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | 15b253eace |
Btrfs: send, avoid incorrect leaf accesses when sending utimes operations
The caller of send_utimes() is supposed to be sure that the inode number it passes to this function does actually exists in the send snapshot. However due to logic/algorithm bugs (such as the one fixed by the patch titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations"), this might not be the case and when that happens it makes send_utimes() access use an unrelated leaf item as the target inode item or access beyond a leaf's boundaries (when the leaf is full and path->slots[0] matches the number of items in the leaf). So if the call to btrfs_search_slot() done by send_utimes() does not find the inode item, just make sure send_utimes() returns -ENOENT and does not silently accesses unrelated leaf items or does invalid leaf accesses, also allowing us to easialy and deterministically catch such algorithmic/logic bugs. Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robbie Ko | 764433a12e |
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations
During an incremental send, if we have delayed rename operations for inodes that were children of directories which were removed in the send snapshot, we can end up accessing incorrect items in a leaf or accessing beyond the last item of the leaf due to issuing utimes operations for the removed inodes. Consider the following example: Parent snapshot: . (ino 256) |--- a/ (ino 257) | |--- c/ (ino 262) | |--- b/ (ino 258) | |--- d/ (ino 263) | |--- del/ (ino 261) |--- x/ (ino 259) |--- y/ (ino 260) Send snapshot: . (ino 256) |--- a/ (ino 257) | |--- b/ (ino 258) | |--- c/ (ino 262) | |--- y/ (ino 260) | |--- d/ (ino 263) |--- x/ (ino 259) 1) When processing inodes 259 and 260, we end up delaying their rename operations because their parents, inodes 263 and 262 respectively, were not yet processed and therefore not yet renamed; 2) When processing inode 262, its rename operation is issued and right after the rename operation for inode 260 is issued. However right after issuing the rename operation for inode 260, at send.c:apply_dir_move(), we issue utimes operations for all current and past parents of inode 260. This means we try to send a utimes operation for its old parent, inode 261 (deleted in the send snapshot), which does not cause any immediate and deterministic failure, because when the target inode is not found in the send snapshot, the send.c:send_utimes() function ignores it and uses the leaf region pointed to by path->slots[0], which can be any unrelated item (belonging to other inode) or it can be a region outside the leaf boundaries, if the leaf is full and path->slots[0] matches the number of items in the leaf. So we end up either successfully sending a utimes operation, which is fine and irrelevant because the old parent (inode 261) will end up being deleted later, or we end up doing an invalid memory access tha crashes the kernel. So fix this by making apply_dir_move() issue utimes operations only for parents that still exist in the send snapshot. In a separate patch we will make send_utimes() return an error (-ENOENT) if the given inode does not exists in the send snapshot. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Rewrote change log to be more detailed and better organized] Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robbie Ko | 443f9d266c |
Btrfs: send, fix warning due to late freeing of orphan_dir_info structures
Under certain situations, when doing an incremental send, we can end up not freeing orphan_dir_info structures as soon as they are no longer needed. Instead we end up freeing them only after finishing the send stream, which causes a warning to be emitted: [282735.229200] ------------[ cut here ]------------ [282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs] [282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1 [282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [282735.240160] 0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000 [282735.240160] 0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8 [282735.240160] ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000 [282735.240160] Call Trace: [282735.240160] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [282735.240160] [<ffffffff81052b14>] __warn+0xc2/0xdd [282735.240160] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f [282735.240160] [<ffffffffa03c99d5>] btrfs_ioctl_send+0xe2f/0xe51 [btrfs] [282735.240160] [<ffffffffa0398358>] btrfs_ioctl+0x14f/0x1f81 [btrfs] [282735.240160] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [282735.240160] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34 [282735.240160] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be [282735.240160] [<ffffffff81196f0c>] ? __fget+0x6b/0x77 [282735.240160] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71 [282735.240160] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79 [282735.240160] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [282735.240160] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14 [282735.240160] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa [282735.256343] ---[ end trace a4539270c8056f93 ]--- Consider the following example: Parent snapshot: . (ino 256) |--- a/ (ino 257) | |--- c/ (ino 260) | |--- del/ (ino 259) |--- tmp/ (ino 258) |--- x/ (ino 261) |--- y/ (ino 262) Send snapshot: . (ino 256) |--- a/ (ino 257) | |--- x/ (ino 261) | |--- y/ (ino 262) | |--- c/ (ino 260) |--- tmp/ (ino 258) 1) When processing inode 258, we end up delaying its rename operation because it has an ancestor (in the send snapshot) that has a higher inode number (inode 260) which was also renamed in the send snapshot, therefore we delay the rename of inode 258 so that it happens after inode 260 is renamed; 2) When processing inode 259, we end up delaying its deletion (rmdir operation) because it has a child inode (258) that has its rename operation delayed. At this point we allocate an orphan_dir_info structure and tag inode 258 so that we later attempt to see if we can delete (rmdir) inode 259 once inode 258 is renamed; 3) When we process inode 260, after renaming it we finally do the rename operation for inode 258. Once we issue the rename operation for inode 258 we notice that this inode was tagged so that we attempt to see if at this point we can delete (rmdir) inode 259. But at this point we can not still delete inode 259 because it has 2 children, inodes 261 and 262, that were not yet processed and therefore not yet moved (renamed) away from inode 259. We end up not freeing the orphan_dir_info structure allocated in step 2; 4) We process inodes 261 and 262, and once we move/rename inode 262 we issue the rmdir operation for inode 260; 5) We finish the send stream and notice that red black tree that contains orphan_dir_info structures is not empty, so we emit a warning and then free any orphan_dir_structures left. So fix this by freeing an orphan_dir_info structure once we try to apply a pending rename operation if we can not delete yet the tagged directory. A test case for fstests follows soon. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog to be more detailed and easier to understand] |
|
Robbie Ko | 99ea42ddb1 |
Btrfs: incremental send, fix premature rmdir operations
Under certain situations, an incremental send operation can contain a rmdir operation that will make the receiving end fail when attempting to execute it, because the target directory is not yet empty. Consider the following example: Parent snapshot: . (ino 256) |--- a/ (ino 257) | |--- c/ (ino 260) | |--- del/ (ino 259) |--- tmp/ (ino 258) |--- x/ (ino 261) Send snapshot: . (ino 256) |--- a/ (ino 257) | |--- x/ (ino 261) | |--- c/ (ino 260) |--- tmp/ (ino 258) 1) When processing inode 258, we delay its rename operation because inode 260 is its new parent in the send snapshot and it was not yet renamed (since 260 > 258, that is, beyond the current progress); 2) When processing inode 259, we realize we can not yet send an rmdir operation (against inode 259) because inode 258 was still not yet renamed/moved away from inode 259. Therefore we update data structures so that after inode 258 is renamed, we try again to see if we can finally send an rmdir operation for inode 259; 3) When we process inode 260, we send a rename operation for it followed by a rename operation for inode 258. Once we send the rename operation for inode 258 we then check if we can finally issue an rmdir for its previous parent, inode 259, by calling the can_rmdir() function with a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress" argument. This makes can_rmdir() return true (value 1) because even though there's still a child inode of inode 259 that was not yet renamed/moved, which is inode 261, the given value of progress (261) is not lower then 261 (that is, not lower than the inode number of some child of inode 259). So we end up sending a rmdir operation for inode 259 before its child inode 261 is processed and renamed. So fix this by passing the correct progress value to the call to can_rmdir() from within apply_dir_move() (where we issue delayed rename operations), which should match stcx->cur_ino (the number of the inode currently being processed) and not sctx->cur_ino + 1. A test case for fstests follows soon. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Rewrote change log to be more detailed, clear and well formatted] Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | 4122ea64f8 |
Btrfs: incremental send, fix invalid paths for rename operations
Example scenario: Parent snapshot: . (ino 277) |---- tmp/ (ino 278) |---- pre/ (ino 280) | |---- wait_dir/ (ino 281) | |---- desc/ (ino 282) |---- ance/ (ino 283) | |---- below_ance/ (ino 279) | |---- other_dir/ (ino 284) Send snapshot: . (ino 277) |---- tmp/ (ino 278) |---- other_dir/ (ino 284) |---- below_ance/ (ino 279) | |---- pre/ (ino 280) | |---- wait_dir/ (ino 281) |---- desc/ (ino 282) |---- ance/ (ino 283) While computing the send stream the following steps happen: 1) While processing inode 279 we end up delaying its rename operation because its new parent in the send snapshot, inode 284, was not yet processed and therefore not yet renamed; 2) Later when processing inode 280 we end up renaming it immediately to "ance/below_once/pre" and not delay its rename operation because its new parent (inode 279 in the send snapshot) has its rename operation delayed and inode 280 is not an encestor of inode 279 (its parent in the send snapshot) in the parent snapshot; 3) When processing inode 281 we end up delaying its rename operation because its new parent in the send snapshot, inode 284, was not yet processed and therefore not yet renamed; 4) When processing inode 282 we do not delay its rename operation because its parent in the send snapshot, inode 281, already has its own rename operation delayed and our current inode (282) is not an ancestor of inode 281 in the parent snapshot. Therefore inode 282 is renamed to "ance/below_ance/pre/wait_dir"; 5) When processing inode 283 we realize that we can rename it because one of its ancestors in the send snapshot, inode 281, has its rename operation delayed and inode 283 is not an ancestor of inode 281 in the parent snapshot. So a rename operation to rename inode 283 to "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is invalid due to a missing path building loop that was undetected by the incremental send implementation, as inode 283 ends up getting included twice in the path (once with its path in the parent snapshot). Therefore its rename operation must wait before the ancestor inode 284 is renamed. Fix this by not terminating the rename dependency checks when we find an ancestor, in the send snapshot, that has its rename operation delayed. So that we continue doing the same checks if the current inode is not an ancestor, in the parent snapshot, of an ancestor in the send snapshot we are processing in the loop. The problem and reproducer were reported by Robbie Ko, as part of a patch titled "Btrfs: incremental send, avoid ancestor rename to descendant". However the fix was unnecessarily complicated and can be addressed with much less code and effort. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | 7969e77a73 |
Btrfs: send, add missing error check for calls to path_loop()
The function path_loop() can return a negative integer, signaling an error, 0 if there's no path loop and 1 if there's a path loop. We were treating any non zero values as meaning that a path loop exists. Fix this by explicitly checking for errors and gracefully return them to user space. Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robbie Ko | 801bec365e |
Btrfs: send, fix failure to move directories with the same name around
When doing an incremental send we can end up not moving directories that have the same name. This happens when the same parent directory has different child directories with the same name in the parent and send snapshots. For example, consider the following scenario: Parent snapshot: . (ino 256) |---- d/ (ino 257) | |--- p1/ (ino 258) | |---- p1/ (ino 259) Send snapshot: . (ino 256) |--- d/ (ino 257) |--- p1/ (ino 259) |--- p1/ (ino 258) The directory named "d" (inode 257) has in both snapshots an entry with the name "p1" but it refers to different inodes in both snapshots (inode 258 in the parent snapshot and inode 259 in the send snapshot). When attempting to move inode 258, the operation is delayed because its new parent, inode 259, was not yet moved/renamed (as the stream is currently processing inode 258). Then when processing inode 259, we also end up delaying its move/rename operation so that it happens after inode 258 is moved/renamed. This decision to delay the move/rename rename operation of inode 259 is due to the fact that the new parent inode (257) still has inode 258 as its child, which has the same name has inode 259. So we end up with inode 258 move/rename operation waiting for inode's 259 move/rename operation, which in turn it waiting for inode's 258 move/rename. This results in ending the send stream without issuing move/rename operations for inodes 258 and 259 and generating the following warnings in syslog/dmesg: [148402.979747] ------------[ cut here ]------------ [148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs] [148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1 [148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [148402.988136] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000 [148402.988136] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8 [148402.988136] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000 [148402.988136] Call Trace: [148402.988136] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [148402.988136] [<ffffffff81052b14>] __warn+0xc2/0xdd [148402.988136] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f [148402.988136] [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs] [148402.988136] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs] [148402.988136] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [148402.988136] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57 [148402.988136] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34 [148402.988136] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be [148402.988136] [<ffffffff81196f0c>] ? __fget+0x6b/0x77 [148402.988136] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71 [148402.988136] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79 [148402.988136] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [148402.988136] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa [148403.011373] ---[ end trace a4539270c8056f8b ]--- [148403.012296] ------------[ cut here ]------------ [148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs] [148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G W 4.6.0-rc7-btrfs-next-31+ #1 [148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [148403.020104] 0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000 [148403.020104] 0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8 [148403.020104] ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000 [148403.020104] Call Trace: [148403.020104] [<ffffffff8126b42c>] dump_stack+0x67/0x90 [148403.020104] [<ffffffff81052b14>] __warn+0xc2/0xdd [148403.020104] [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f [148403.020104] [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs] [148403.020104] [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs] [148403.020104] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc [148403.020104] [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57 [148403.020104] [<ffffffff8118da05>] vfs_ioctl+0x18/0x34 [148403.020104] [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be [148403.020104] [<ffffffff81196f0c>] ? __fget+0x6b/0x77 [148403.020104] [<ffffffff81196fa1>] ? __fget_light+0x62/0x71 [148403.020104] [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79 [148403.020104] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [148403.020104] [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa [148403.038981] ---[ end trace a4539270c8056f8c ]--- There's another issue caused by similar (but more complex) changes in the directory hierarchy that makes move/rename operations fail, described with the following example: Parent snapshot: . |---- a/ (ino 262) | |---- c/ (ino 268) | |---- d/ (ino 263) |---- ance/ (ino 267) |---- e/ (ino 264) |---- f/ (ino 265) |---- ance/ (ino 266) Send snapshot: . |---- a/ (ino 262) |---- c/ (ino 268) | |---- ance/ (ino 267) | |---- d/ (ino 263) | |---- ance/ (ino 266) | |---- f/ (ino 265) |---- e/ (ino 264) When the inode 265 is processed, the path for inode 267 is computed, which at that time corresponds to "d/ance", and it's stored in the names cache. Later on when processing inode 266, we end up orphanizing (renaming to a name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has the same name as inode 266 and it's currently a child of the new parent directory (inode 263) for inode 266. After the orphanization and while we are still processing inode 266, a rename operation for inode 266 is generated. However the source path for that rename operation is incorrect because it ends up using the old, pre-orphanization, name of inode 267. The no longer valid name for inode 267 was previously cached when processing inode 265 and it remains usable and considered valid until the inode currently being processed has a number greater than 267. This resulted in the receiving side failing with the following error: ERROR: rename d/ance/ance -> d/ance failed: No such file or directory So fix these issues by detecting such circular dependencies for rename operations and by clearing the cached name of an inode once the inode is orphanized. A test case for fstests will follow soon. Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Rewrote change log to be more detailed and organized, and improved comments] Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
David Sterba | 42f31734eb | Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 | |
Nicholas D Steeves | 0132761017 |
btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | 2f91306a37 |
btrfs: send: use vmalloc only as fallback for clone_sources_tmp
Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | c03d01f340 |
btrfs: send: use vmalloc only as fallback for clone_roots
Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | e55d1153db |
btrfs: send: use temporary variable to store allocation size
We're going to use the argument multiple times later. Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | eb5b75fe2e |
btrfs: send: use vmalloc only as fallback for read_buf
Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | 6ff48ce06b |
btrfs: send: use vmalloc only as fallback for send_buf
Signed-off-by: David Sterba <dsterba@suse.com> |
|
Dan Carpenter | f5ecec3ce2 |
btrfs: send: silence an integer overflow warning
The "sizeof(*arg->clone_sources) * arg->clone_sources_count" expression can overflow. It causes several static checker warnings. It's all under CAP_SYS_ADMIN so it's not that serious but lets silence the warnings. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
Kirill A. Shutemov | 09cbfeaf1a |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Anand Jain | ebb8765b2d |
btrfs: move btrfs_compression_type to compression.h
So that its better organized. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
David Sterba | e780b0d1c1 |
btrfs: send: use GFP_KERNEL everywhere
The send operation is not on the critical writeback path we don't need to use GFP_NOFS for allocations. All error paths are handled and the whole operation is restartable. Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | a879719b8c |
Btrfs: send, don't BUG_ON() when an empty symlink is found
When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the later can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Robin Ruede | b96b1db039 |
btrfs: fix resending received snapshot with parent
This fixes a regression introduced by |
|
Filipe Manana | d906d49fc5 |
Btrfs: send, fix file corruption due to incorrect cloning operations
If we have a file that shares an extent with other files, when processing the extent item relative to a shared extent, we blindly issue a clone operation that will target a length matching the length in the extent item and uses as a source some other file the receiver already has and points to the same extent. However that range in the other file might not exclusively point only to the shared extent, and so using that length will result in the receiver getting a file with different data from the one in the send snapshot. This issue happened both for incremental and full send operations. So fix this by issuing clone operations with lengths that don't cover regions of the source file that point to different extents (or have holes). The following test case for fstests reproduces the problem. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $send_files_dir rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _need_to_be_root _require_cp_reflink _require_xfs_io_command "fpunch" send_files_dir=$TEST_DIR/btrfs-test-$seq rm -f $seqres.full rm -fr $send_files_dir mkdir $send_files_dir _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a single 100K extent. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Clone our file into a new file named bar. cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now overwrite parts of our foo file. $XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \ -c "pwrite -S 0xcc 90K 10K" \ -c "fpunch 70K 10k" \ $SCRATCH_MNT/foo | _filter_xfs_io _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/snap echo "File digests in the original filesystem:" md5sum $SCRATCH_MNT/snap/foo | _filter_scratch md5sum $SCRATCH_MNT/snap/bar | _filter_scratch _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap # Now recreate the filesystem by receiving the send stream and verify # we get the same file contents that the original filesystem had. _scratch_unmount _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap # We expect the destination filesystem to have exactly the same file # data as the original filesystem. # The btrfs send implementation had a bug where it sent a clone # operation from file foo into file bar covering the whole [0, 100K[ # range after creating and writing the file foo. This was incorrect # because the file bar now included the updates done to file foo after # we cloned foo to bar, breaking the COW nature of reflink copies # (cloned extents). echo "File digests in the new filesystem:" md5sum $SCRATCH_MNT/snap/foo | _filter_scratch md5sum $SCRATCH_MNT/snap/bar | _filter_scratch status=0 exit Another test case that reproduces the problem when we have compressed extents: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $send_files_dir rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _need_to_be_root _require_cp_reflink send_files_dir=$TEST_DIR/btrfs-test-$seq rm -f $seqres.full rm -fr $send_files_dir mkdir $send_files_dir _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our file with an extent of 100K starting at file offset 0K. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io # Rewrite part of the previous extent (its first 40K) and write a new # 100K extent starting at file offset 100K. $XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K" \ -c "pwrite -S 0xcc 100K 100K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Our file foo now has 3 file extent items in its metadata: # # 1) One covering the file range 0 to 40K; # 2) One covering the file range 40K to 100K, which points to the first # extent we wrote to the file and has a data offset field with value # 40K (our file no longer uses the first 40K of data from that # extent); # 3) One covering the file range 100K to 200K. # Now clone our file foo into file bar. cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Create our snapshot for the send operation. _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \ $SCRATCH_MNT/snap echo "File digests in the original filesystem:" md5sum $SCRATCH_MNT/snap/foo | _filter_scratch md5sum $SCRATCH_MNT/snap/bar | _filter_scratch _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap # Now recreate the filesystem by receiving the send stream and verify we # get the same file contents that the original filesystem had. # Btrfs send used to issue a clone operation from foo's range # [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed # to by foo's second file extent item, this was incorrect because of bad # accounting of the file extent item's data offset field. The correct # range to clone from should have been [40K, 100K[. _scratch_unmount _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap echo "File digests in the new filesystem:" # Must match the digests we got in the original filesystem. md5sum $SCRATCH_MNT/snap/foo | _filter_scratch md5sum $SCRATCH_MNT/snap/bar | _filter_scratch status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Chris Mason | 640926ffdd | Merge branch 'cleanup/messages' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 | |
David Sterba | f14d104dbd |
btrfs: switch more printks to our helpers
Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com> |
|
Filipe Manana | b786f16ac3 |
Btrfs: send, fix corner case for reference overwrite detection
When the inode given to did_overwrite_ref() matches the current progress
and has a reference that collides with the reference of other inode that
has the same number as the current progress, we were always telling our
caller that the inode's reference was overwritten, which is incorrect
because the other inode might be a new inode (different generation number)
in which case we must return false from did_overwrite_ref() so that its
callers don't use an orphanized path for the inode (as it will never be
orphanized, instead it will be unlinked and the new inode created later).
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single extent of 64K.
mkdir -p $SCRATCH_MNT/foo
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
| _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/mysnap1
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
$SCRATCH_MNT/mysnap2
echo "File digest before being replaced:"
md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch
# Remove the file and then create a new one in the same location with
# the same name but with different content. This new file ends up
# getting the same inode number as the previous one, because that inode
# number was the highest inode number used by the snapshot's root and
# therefore when attempting to find the a new inode number for the new
# file, we end up reusing the same inode number. This happens because
# currently btrfs uses the highest inode number summed by 1 for the
# first inode created once a snapshot's root is loaded (done at
# fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
# tree).
# Having these two different files in the snapshots with the same inode
# number (but different generation numbers) caused the btrfs send code
# to emit an incorrect path for the file when issuing an unlink
# operation because it failed to realize they were different files.
rm -f $SCRATCH_MNT/mysnap2/foo/bar
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
$SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
$SCRATCH_MNT/mysnap2_ro
_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
$SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap
echo "File digest in the original filesystem after being replaced:"
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
# Now recreate the filesystem by receiving both send streams and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
_run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap
echo "File digest in the new filesystem:"
# Must match the digest from the new file.
md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
status=0
exit
Reported-by: Martin Raiber <martin@urbackup.org>
Fixes:
|
|
Josef Bacik | 37b8d27de5 |
Btrfs: use received_uuid of parent during send
Neil Horman pointed out a problem where if he did something like this receive A snap A B change B send -p A B and then on another box do recieve A receive B the receive B would fail because we use the UUID of A for the clone sources for B. This makes sense most of the time because normally you are sending from the original sources, not a received source. However when you use a recieved subvol its UUID is going to be something completely different, so if you then try to receive the diff on a different volume it won't find the UUID because the new A will be something else. The only constant is the received uuid. So instead check to see if we have received_uuid set on the root, and if so use that as the clone source, as btrfs receive looks for matches either in received_uuid or uuid. Thanks, Reported-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Chris Mason <clm@fb.com> |
|
Chris Mason | 1ab818b137 | Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2 | |
Filipe Manana | 619d8c4ef7 |
Btrfs: incremental send, fix clone operations for compressed extents
Marc reported a problem where the receiving end of an incremental send was performing clone operations that failed with -EINVAL. This happened because, unlike for uncompressed extents, we were not checking if the source clone offset and length, after summing the data offset, falls within the source file's boundaries. So make sure we do such checks when attempting to issue clone operations for compressed extents. Problem reproducible with the following steps: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt2 # Create the file with a single extent of 128K. This creates a metadata file # extent item with a data start offset of 0 and a logical length of 128K. $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo # Now rewrite the range 64K to 112K of our file. This will make the inode's # metadata continue to point to the 128K extent we created before, but now # with an extent item that points to the extent with a data start offset of # 112K and a logical length of 16K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 192K. $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo # Now rewrite the range 180K to 12K. This will make the inode's metadata # continue to point the the 128K extent we created earlier, with a single # extent item that points to it with a start offset of 112K and a logical # length of 4K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 180K. $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ touch /mnt/bar # Calls the btrfs clone ioctl. $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \ -l $((4 * 1024)) /mnt/foo /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 At subvol /mnt/snap1 At subvol snap1 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 At subvol /mnt/snap2 At snapshot snap2 ERROR: failed to clone extents to bar Invalid argument A test case for fstests follows soon. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.cz> Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> |
|
Filipe Manana | 8b191a6849 |
Btrfs: incremental send, check if orphanized dir inode needs delayed rename
If a directory inode is orphanized, because some inode previously processed has a new name that collides with the old name of the current inode, we need to check if it needs its rename operation delayed too, as its ancestor-descendent relationship with some other inode might have been reversed between the parent and send snapshots and therefore its rename operation needs to happen after that other inode is renamed. For example, for the following reproducer where this is needed (provided by Robbie Ko): $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir -p /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir -p /mnt/data/t6/t7 $ mkdir /mnt/data/t5 $ mkdir /mnt/data/t7 $ mkdir /mnt/data/n4/t2 $ mkdir /mnt/data/t4 $ mkdir /mnt/data/t3 $ mv /mnt/data/t7 /mnt/data/n4/t2 $ mv /mnt/data/t4 /mnt/data/n4/t2/t7 $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4 $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5 $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2 $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4 $ mv /mnt/data/n4/t2 /mnt/data/n4/n1 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6 $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3 $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory Where the parent snapshot directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- t2/ (ino 265) |-- t7/ (ino 264) |-- t4/ (ino 266) |-- t5/ (ino 263) |-- t6/ (ino 261) |-- n1/ (ino 258) |-- n2/ (ino 259) |-- t7/ (ino 262) |-- t3/ (ino 267) And the send snapshot's directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- n1/ (ino 258) |-- t2/ (ino 265) |-- n2/ (ino 259) |-- t3/ (ino 267) | |-- t7 (ino 264) | |-- t6/ (ino 261) | |-- t4/ (ino 266) | |-- t5/ (ino 263) | |-- t7/ (ino 262) While processing inode 262 we orphanize inode 264 and later attempt to rename inode 264 to its new name/location, which resulted in building an incorrect destination path string for the rename operation with the value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must have been done only after inode 267 is processed and renamed, as the ancestor-descendent relationship between inodes 264 and 267 was reversed between both snapshots, because otherwise it results in an infinite loop when building the path string for inode 264 when we are processing an inode with a number larger than 264. That loop is the following: start inode 264, send progress of 265 for example parent of 264 -> 267 parent of 267 -> 262 parent of 262 -> 259 parent of 259 -> 261 parent of 261 -> 263 parent of 263 -> 266 parent of 266 -> 264 |--> back to first iteration while current path string length is <= PATH_MAX, and fail with -ENOMEM otherwise So fix this by making the check if we need to delay a directory rename regardless of the current inode having been orphanized or not. A test case for fstests follows soon. Thanks to Robbie Ko for providing a reproducer for this problem. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | 80aa602756 |
Btrfs: incremental send, don't delay directory renames unnecessarily
Even though we delay the rename of directories when they become descendents of other directories that were also renamed in the send root to prevent infinite path build loops, we were doing it in cases where this was not needed and was actually harmful resulting in infinite path build loops as we ended up with a circular dependency of delayed directory renames. Consider the following reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir /mnt/data $ mkdir /mnt/data/n1 $ mkdir /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir /mnt/data/n1/n2/p1 $ mkdir /mnt/data/n1/n2/p1/p2 $ mkdir /mnt/data/t6 $ mkdir /mnt/data/t7 $ mkdir -p /mnt/data/t5/t7 $ mkdir /mnt/data/t2 $ mkdir /mnt/data/t4 $ mkdir -p /mnt/data/t1/t3 $ mkdir /mnt/data/p1 $ mv /mnt/data/t1 /mnt/data/p1 $ mkdir -p /mnt/data/p1/p2 $ mv /mnt/data/t4 /mnt/data/p1/p2/t1 $ mv /mnt/data/t5 /mnt/data/n4/t5 $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2 $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7 $ mv /mnt/data/t2 /mnt/data/n4/t1 $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1 $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2 $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1 $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7 $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1 $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5 $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1 $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/ $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2 $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1 $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1 $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2 $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7 $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1 $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5 $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory This reproducer resulted in an infinite path build loop when building the path for inode 266 because the following circular dependency of delayed directory renames was created: ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261 Where the notation "X <- Y" means the rename of inode X is delayed by the rename of inode Y (X will be renamed after Y is renamed). This resulted in an infinite path build loop of inode 266 because that inode has inode 261 as an ancestor in the send root and inode 261 is in the circular dependency of delayed renames listed above. Fix this by not delaying the rename of a directory inode if an ancestor of the inode in the send root, which has a delayed rename operation, is not also a descendent of the inode in the parent root. Thanks to Robbie Ko for sending the reproducer example. A test case for xfstests follows soon. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> |
|
Filipe Manana | 5f806c3ae2 |
Btrfs: incremental send, remove dead code
The logic to detect path loops when attempting to apply a pending
directory rename, introduced in commit
|
|
Filipe Manana | 8996a48c0a |
Btrfs: incremental send, clear name from cache after orphanization
If a directory's reference ends up being orphanized, because the inode currently being processed has a new path that matches that directory's path, make sure we evict the name of the directory from the name cache. This is because there might be descendent inodes (either directories or regular files) that will be orphanized later too, and therefore the orphan name of the ancestor must be used, otherwise we send issue rename operations with a wrong path in the send stream. Reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir -p /mnt/data/n1/n2/p1/p2 $ mkdir /mnt/data/n4 $ mkdir -p /mnt/data/p1/p2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/p1/p2 /mnt/data $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1 $ mv /mnt/data/p2 /mnt/data/n1/n2/p1 $ mv /mnt/data/n1/n2 /mnt/data/p1 $ mv /mnt/data/p1 /mnt/data/n4 $ mv /mnt/data/n4/p1/n2/p1 /mnt/data $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.send $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ btrfs receive /mnt2 -f /tmp/1.send $ btrfs receive /mnt2 -f /tmp/2.send ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent snapshot are both orphanized during the incremental send, and as soon as data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2 we use a source path of o263-6-o/p2 for the rename operation instead of the old path data/p1/p2 (the one before the orphanization of inode 263). A test case for xfstests follows soon. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> |
|
Filipe Manana | 2f1f465ae6 |
Btrfs: send, don't leave without decrementing clone root's send_progress
If the clone root was not readonly or the dead flag was set on it, we were leaving without decrementing the root's send_progress counter (and before we just incremented it). If a concurrent snapshot deletion was in progress and ended up being aborted, it would be impossible to later attempt to delete again the snapshot, since the root's send_in_progress counter could never go back to 0. We were also setting clone_sources_to_rollback to i + 1 too early - if we bailed out because the clone root we got is not readonly or flagged as dead we ended up later derreferencing a null pointer because we didn't assign the clone root to sctx->clone_roots[i].root: for (i = 0; sctx && i < clone_sources_to_rollback; i++) btrfs_root_dec_send_in_progress( sctx->clone_roots[i].root); So just don't increment the send_in_progress counter if the root is readonly or flagged as dead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> |