Commit Graph

4691 Commits

Author SHA1 Message Date
Al Viro 0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Omar Sandoval 22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval 6f67376318 direct_IO: use iov_iter_rw() instead of rw everywhere
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval 17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Al Viro 5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Chris Mason cdfb080e18 Btrfs: fix use after free when close_ctree frees the orphan_rsv
Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv.  The problem is this call updates the
space_info, which has already been freed.

This adds a new __ function that directly calls kfree instead of trying
to update the space infos.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:29 -07:00
Chris Mason 1bbc621ef2 Btrfs: allow block group cache writeout outside critical section in commit
We loop through all of the dirty block groups during commit and write
the free space cache.  In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit.  We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:22 -07:00
Chris Mason 2b10826800 Btrfs: don't use highmem for free space cache pages
In order to create the free space cache concurrently with FS modifications,
we need to take a few block group locks.

The cache code also does kmap, which would schedule with the locks held.
Instead of going through kmap_atomic, lets just use lowmem for the cache
pages.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:18 -07:00
Chris Mason c9dc4c6578 Btrfs: two stage dirty block group writeout
Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one.  This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:11 -07:00
Chris Mason 4c6d1d85ad btrfs: move struct io_ctl into ctree.h and rename it
We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:04 -07:00
Josef Bacik 3bce876fd5 Btrfs: don't steal from the global reserve if we don't have the space
btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv.  We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:59 -07:00
Josef Bacik 365c531377 Btrfs: don't commit the transaction in the async space flushing
We're triggering a huge number of commits from
btrfs_async_reclaim_metadata_space.  These aren't really requried,
because everyone calling the async reclaim code is going to end up
triggering a commit on their own.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:54 -07:00
Josef Bacik cb723e4919 Btrfs: reserve space for block groups
This changes our delayed refs calculations to include the space needed
to write back dirty block groups.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:48 -07:00
Chris Mason 28f75a0e6c Btrfs: refill block reserves during truncate
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents.  This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:34 -07:00
Josef Bacik 1262133b8d Btrfs: account for crcs in delayed ref processing
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs.  This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.

This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:04:47 -07:00
Chris Mason 28ed1345a5 btrfs: actively run the delayed refs while deleting large files
When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing.  Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.

The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:00:14 -07:00
Guenter Roeck 4a3d1caf8a fs: btrfs: Add missing include file
Building alpha:allmodconfig fails with

fs/btrfs/inode.c: In function 'check_direct_IO':
fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'

due to a missing include file.

Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-01 12:32:07 -07:00
Chris Mason dd82525956 Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent()
The error handling path for alloc_reserved_tree_block is calling
btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
might sleep as we allocate extent_state objects:

 BUG: sleeping function called from invalid context at mm/slub.c:1268
 in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
 5 locks held by kworker/u4:7/11093:
  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
  #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
  #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
 CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
 Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
 Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
  00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
  ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
  0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
 Call Trace:
  [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
  [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
  [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
  [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
  [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
  [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
  [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
  [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
  [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
  [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
  [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
  [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
  [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
  [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
  [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
  [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
  [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
  [<ffffffff81091d51>] ? process_one_work+0x151/0x520
  [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
  [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
  [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
  [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
  [<ffffffff81098686>] kthread+0xf6/0x120
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
  [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
 ------------[ cut here ]------------

This changes things to free the path first, which will also unlock the
extent buffer.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Sterba <dsterba@suse.cz>
Tested-by: Dave Sterba <dsterba@suse.cz>
2015-04-01 08:36:05 -07:00
Liu Bo e56a951e01 Btrfs: Remove the check for old-style mkfs
This was used to make sure that a fresh btrfs from an older mkfs.btrfs,
but it also allows us to mount a buggy btrfs if this btrfs has the right
superblock head part but has something wrong with chunk tree part[1], and
after that we can hit BUG_ON()s set in the code to prevent something
impossible.

Since David has released "Btrfs progs v3.19-rc2", just remove the check,
if anyone who wants to make a fresh btrfs, please use the latest one.

[1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:25 -07:00
Jeff Mahoney 727b9784b6 btrfs: cleanup orphans while looking up default subvolume
Orphans in the fs tree are cleaned up via open_ctree and subvolume
orphans are cleaned via btrfs_lookup_dentry -- except when a default
subvolume is in use.  The name for the default subvolume uses a manual
lookup that doesn't trigger orphan cleanup and needs to trigger it
manually as well. This doesn't apply to the remount case since the
subvolumes are cleaned up by walking the root radix tree.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:25 -07:00
Tom Van Braeckel d862095829 btrfs: explicitly set control file's private_data
The private_data member of the Btrfs control device file
(/dev/btrfs-control) is used to hold the current transaction and needs
to be initialized to NULL to signify that no transaction is in progress.

We explicitly set the control file's private_data to NULL to be
independent of whatever value the misc subsystem initializes it to.

Backstory:
----------

The misc subsystem (which is used by /dev/btrfs-control) initializes
a file's private_data to point to the misc device when a driver has
registered a custom open file operation and initializes it to NULL
when a custom open file operation has *not* been provided.

This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc
device structure.

And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.

To simplify things in the misc subsystem, a patch [1] has been proposed
to *always* set private_data to point to the misc device instead of
only doing this when a custom open file operation has been registered.

But before we can fix this in the misc subsystem itself, we need to
modify the (few) drivers that rely on this very subtle behavior.

[1] https://lkml.org/lkml/2014/12/4/939

Signed-off-by: Martin Kepplinger <martink@posteo.de>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
Chengyu Song 26e726afe0 btrfs: incorrect handling for fiemap_fill_next_extent return
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
David Sterba 3c3b04d10f btrfs: don't accept bare namespace as a valid xattr
Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
works:

 $ touch file
 $ setfattr -n user. -v 1 file
 $ getfattr -d file
user.="1"

ie. the missing attribute name after the namespace.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
Reported-by: William Douglas <william.douglas@intel.com>
CC: <stable@vger.kernel.org> # 2.6.29+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 18:10:24 -07:00
Filipe Manana dcc82f4783 Btrfs: fix log tree corruption when fs mounted with -o discard
While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f (Btrfs: fix extent pinning bugs in the tree log)
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:56:24 -07:00
Filipe Manana 2f2ff0ee5e Btrfs: fix metadata inconsistencies after directory fsync
We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.

Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.

This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file and directory.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
  mkdir $SCRATCH_MNT/mydir

  # Make sure all metadata and data are durably persisted.
  sync

  # Add a hard link to 'foo' inside our test directory and fsync only the
  # directory. The btrfs fsync implementation had a bug that caused the new
  # directory entry to be visible after the fsync log replay but, the inode
  # of our file remained with a link count of 1.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

  # Add a few more links and new files.
  # This is just to verify nothing breaks or gives incorrect results after the
  # fsync log is replayed.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
  $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
  ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

  # Add some subdirectories and new files and links to them. This is to verify
  # that after fsyncing our top level directory 'mydir', all the subdirectories
  # and their files/links are registered in the fsync log and exist after the
  # fsync log is replayed.
  mkdir -p $SCRATCH_MNT/mydir/x/y/z
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
  touch $SCRATCH_MNT/mydir/x/y/z/qwerty

  # Now fsync only our top directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

  # And fsync now our new file named 'hello', just to verify later that it has
  # the expected content and that the previous fsync on the directory 'mydir' had
  # no bad influence on this fsync.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
  # all with the value 0xaa.
  echo "File 'foo' content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  # Remove the first name of our inode. Because of the directory fsync bug, the
  # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
  # deleting the inode and the other names became stale directory entries (still
  # visible to applications). Attempting to remove or access the remaining
  # dentries pointing to that inode resulted in stale file handle errors and
  # made it impossible to remove the parent directories since it was impossible
  # for them to become empty.
  echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
  rm -f $SCRATCH_MNT/foo

  # Now verify that all files, links and directories created before fsyncing our
  # directory exist after the fsync log was replayed.
  [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
  [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
  [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
  [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
      echo "Link mydir/x/y/foo_y_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
      echo "Link mydir/x/y/z/foo_z_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
      echo "File mydir/x/y/z/qwerty is missing"

  # We expect our file here to have a size of 64Kb and all the bytes having the
  # value 0xff.
  echo "file 'hello' content after log replay:"
  od -t x1 $SCRATCH_MNT/hello

  # Now remove all files/links, under our test directory 'mydir', and verify we
  # can remove all the directories.
  rm -f $SCRATCH_MNT/mydir/x/y/z/*
  rmdir $SCRATCH_MNT/mydir/x/y/z
  rm -f $SCRATCH_MNT/mydir/x/y/*
  rmdir $SCRATCH_MNT/mydir/x/y
  rmdir $SCRATCH_MNT/mydir/x
  rm -f $SCRATCH_MNT/mydir/*
  rmdir $SCRATCH_MNT/mydir

  # An fsck, run by the fstests framework everytime a test finishes, also detected
  # the inconsistency and printed the following error message:
  #
  # root 5 inode 257 errors 2001, no inode item, link count wrong
  #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
  #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

  status=0
  exit

The expected golden output for the test is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 5
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000

Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 1
  Link mydir/foo_2 is missing
  Link mydir/foo_3 is missing
  Link mydir/x/y/foo_y_link is missing
  Link mydir/x/y/z/foo_z_link is missing
  File mydir/x/y/z/qwerty is missing
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

Fsck, without this fix, also complains about the wrong link count:

  root 5 inode 257 errors 2001, no inode item, link count wrong
      unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
      unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

So fix this by logging the inodes that the dentries point to when
fsyncing a directory.

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:56:23 -07:00
Filipe Manana bf69196045 Btrfs: change the insertion criteria for the qgroup operations rbtree
After looking at Liu Bo's recent patch (titled
"Btrfs: fix comp_oper to get right order") I realized the search made by
qgroup_oper_exists() was buggy because its rbtree navigation comparison
function, comp_oper_exist(), only looks at the fields bytenr and ref_root
of a tree node, ignoring the seq field completely. This was wrong because
when we insert a node into the rbtree we use comp_oper(), which takes a
decision based first on bytenr, then on seq and then on the ref_root field.
That means qgroup_oper_exists() could miss the fact that at least one
operation with given bytenr and ref_root exists.

Consider the following simple example of a 3 nodes qgroup operations
rbtree (created using comp_oper before this patch), where each node's key
is a tuple with the shape (bytenr, seq, ref_root, op):

                          [ (4096, 2, 20, op X) ]
                         /                       \
                        /                         \
   [ (4096, 1, 5, op Y) ]                         [ (4096, 3, 10, op Z) ]

qgroup_oper_exists() when called to search for an existing operation for
bytenr 4096 and ref root 10 wouldn't find anything because it would go to
the left subtree instead of the right subtree, since comp_oper_exits()
ignores the seq field completely.

Fix this by changing the insertion navigation function to use the ref_root
field right after using the bytenr field and before using the seq field,
so that qgroup_oper_exists() / comp_oper_exist() work as expected.

This patch applies on top of the patch mentioned above from Liu.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana 3d850dd448 Btrfs: add missing inode item update in fallocate()
If we fallocate(), without the keep size flag, into an area already covered
by an extent previously fallocated, we were updating the inode's i_size but
we weren't updating the inode item in the fs/subvol tree. A following umount
+ mount would result in a loss of the inode's size (and an fsync would miss
too the fact that the inode changed).

Reproducer:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ fallocate -n -l 1M /mnt/foobar
  $ fallocate -l 512K /mnt/foobar
  $ umount /mnt
  $ mount /dev/sdd /mnt
  $ od -t x1 /mnt/foobar
  0000000

The expected result is:

  $ od -t x1 /mnt/foobar
  0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  2000000

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana 5f806c3ae2 Btrfs: incremental send, remove dead code
The logic to detect path loops when attempting to apply a pending
directory rename, introduced in commit
f959492fc1 (Btrfs: send, fix more issues related to directory renames)
is no longer needed, and the respective fstests test case for that commit,
btrfs/045, now passes without this code (as well as all the other test
cases for send/receive).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:52 -07:00
Filipe Manana 8996a48c0a Btrfs: incremental send, clear name from cache after orphanization
If a directory's reference ends up being orphanized, because the inode
currently being processed has a new path that matches that directory's
path, make sure we evict the name of the directory from the name cache.
This is because there might be descendent inodes (either directories or
regular files) that will be orphanized later too, and therefore the
orphan name of the ancestor must be used, otherwise we send issue rename
operations with a wrong path in the send stream.

Reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir -p /mnt/data/n1/n2/p1/p2
  $ mkdir /mnt/data/n4
  $ mkdir -p /mnt/data/p1/p2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/p1/p2 /mnt/data
  $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
  $ mv /mnt/data/p2 /mnt/data/n1/n2/p1
  $ mv /mnt/data/n1/n2 /mnt/data/p1
  $ mv /mnt/data/p1 /mnt/data/n4
  $ mv /mnt/data/n4/p1/n2/p1 /mnt/data

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory

Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent
snapshot are both orphanized during the incremental send, and as soon as
data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2
we use a source path of o263-6-o/p2 for the rename operation instead of
the old path data/p1/p2 (the one before the orphanization of inode 263).

A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana 2f1f465ae6 Btrfs: send, don't leave without decrementing clone root's send_progress
If the clone root was not readonly or the dead flag was set on it, we were
leaving without decrementing the root's send_progress counter (and before
we just incremented it). If a concurrent snapshot deletion was in progress
and ended up being aborted, it would be impossible to later attempt to
delete again the snapshot, since the root's send_in_progress counter could
never go back to 0.

We were also setting clone_sources_to_rollback to i + 1 too early - if we
bailed out because the clone root we got is not readonly or flagged as dead
we ended up later derreferencing a null pointer because we didn't assign
the clone root to sctx->clone_roots[i].root:

		for (i = 0; sctx && i < clone_sources_to_rollback; i++)
			btrfs_root_dec_send_in_progress(
					sctx->clone_roots[i].root);

So just don't increment the send_in_progress counter if the root is readonly
or flagged as dead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana 5cc2b17e80 Btrfs: send, add missing check for dead clone root
After we locked the root's root item, a concurrent snapshot deletion
call might have set the dead flag on it. So check if the dead flag
is set and abort if it is, just like we do for the parent root.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Filipe Manana 4f764e5153 Btrfs: remove deleted xattrs on fsync log replay
If we deleted xattrs from a file and fsynced the file, after a log replay
the xattrs would remain associated to the file. This was an unexpected
behaviour and differs from what other filesystems do, such as for example
xfs and ext3/4.

Fix this by, on fsync log replay, check if every xattr in the fs/subvol
tree (that belongs to a logged inode) has a matching xattr in the log,
and if it does not, delete it from the fs/subvol tree. This is a similar
approach to what we do for dentries when we replay a directory from the
fsync log.

This issue is trivial to reproduce, and the following excerpt from my
test for xfstests triggers the issue:

  _crash_and_mount()
  {
       # Simulate a crash/power loss.
       _load_flakey_table $FLAKEY_DROP_WRITES
       _unmount_flakey
       _load_flakey_table $FLAKEY_ALLOW_WRITES
       _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create out test file and add 3 xattrs to it.
  touch $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar

  # Make sure everything is durably persisted.
  sync

  # Now delete the second xattr and fsync the inode.
  $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  _crash_and_mount

  # After the fsync log is replayed, the file should have only 2 xattrs, the ones
  # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
  # with the 3 xattrs that we had before deleting the second one and fsyncing the
  # file.
  echo "xattr names and values after first fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  # Now write some data to our file, fsync it, remove the first xattr, add a new
  # hard link to our file and commit the fsync log by fsyncing some other new
  # file. This is to verify that after log replay our first xattr does not exist
  # anymore.
  echo "hello world!" >> $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
  $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
  ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
  touch $SCRATCH_MNT/qwerty
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty

  _crash_and_mount

  # Now only the xattr with name user.attr3 should be set in our file.
  echo "xattr names and values after second fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  status=0
  exit

The expected golden output, which is produced with this patch applied or
when testing against xfs or ext3/4, is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr3="val3"

Without this patch applied, the output is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

A patch with a test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:55:51 -07:00
Christoph Hellwig e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Chris Mason fc4c3c872f Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/disk-io.c
2015-03-25 10:52:48 -07:00
Chris Mason 9deed229fa Merge branch 'cleanups-for-4.1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:43:16 -07:00
Linus Torvalds 521d474631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Most of these are fixing extent reservation accounting, or corners
  with tree writeback during commit.

  Josef's set does add a test, which isn't strictly a fix, but it'll
  keep us from making this same mistake again"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix outstanding_extents accounting in DIO
  Btrfs: add sanity test for outstanding_extents accounting
  Btrfs: just free dummy extent buffers
  Btrfs: account merges/splits properly
  Btrfs: prepare block group cache before writing
  Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
  Btrfs: account for the correct number of extents for delalloc reservations
  Btrfs: fix merge delalloc logic
  Btrfs: fix comp_oper to get right order
  Btrfs: catch transaction abortion after waiting for it
  btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-21 10:53:37 -07:00
Josef Bacik e1cbbfa5f5 Btrfs: fix outstanding_extents accounting in DIO
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested.  This isn't quite right since we will
be limited to our max extent size.  So instead lets do something horrible!  Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent.  If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly.  Thanks,

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:35 -04:00
Josef Bacik 6a3891c551 Btrfs: add sanity test for outstanding_extents accounting
I introduced a regression wrt outstanding_extents accounting.  These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time.  So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:31 -04:00
Josef Bacik bcb7e449ec Btrfs: just free dummy extent buffers
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU.  So check for
this case and just free the things directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:30:18 -04:00
Josef Bacik ba11721355 Btrfs: account merges/splits properly
My fix

Btrfs: fix merge delalloc logic

only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent.  We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent.  If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation.  This shows up with a case like this

[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]

Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation.  But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations.  The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:28:21 -04:00
Josef Bacik dcdf7f6ddb Btrfs: prepare block group cache before writing
Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff.  To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 10:56:55 -04:00
Josef Bacik ea526d1899 Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Dave could hit this assert consistently running btrfs/078.  This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root.  For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case.  This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty.  This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:47:04 -07:00
Josef Bacik 6a41dd0922 Btrfs: account for the correct number of extents for delalloc reservations
Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Josef Bacik 8461a3de77 Btrfs: fix merge delalloc logic
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas.  We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic.  This fixes the regression that was reported by a user on
list.  Thanks,

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Liu Bo 48da5f0a4c Btrfs: fix comp_oper to get right order
Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Liu Bo b4924a0fa1 Btrfs: catch transaction abortion after waiting for it
This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.

Fsync() can report success when it actually doesn't.  When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.

It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return.  Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.

This add the necessary check and it now passes the test.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:38:23 -07:00
Fabian Frederick d22071293f btrfs: fix sizeof format specifier in btrfs_check_super_valid()
This patch fixes mips compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:38:22 -07:00
Linus Torvalds 84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
Quentin Casasnovas dd9ef135e3 Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:33 -08:00
Filipe Manana 3a8b36f378 Btrfs: fix data loss in the fast fsync path
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:32 -08:00
Josef Bacik f5c0a12280 Btrfs: remove extra run_delayed_refs in update_cowonly_root
This got added with my dirty_bgs patch, it's not needed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:30 -08:00
David Sterba 258ece0212 btrfs: remove shadowing variables in __btrfs_map_block
1) We can safely use the function's 'i'. Fixes warning

fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local
fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here

2) A local variable duplicates name of an argument, we can use the value
directly. Fixes warning

fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter
fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:35 +01:00
David Sterba 093adbcedf btrfs: switch helper macros to static inlines in sysfs.h
The conversion macros use nested container_of that leads to a warning

fs/btrfs/sysfs.c: In function 'btrfs_feature_visible':
fs/btrfs/sysfs.c:183:8: warning: declaration of '__mptr' shadows a previous local
fs/btrfs/sysfs.c:183:8: warning: shadowed declaration is here

Use of functions will add proper type checking.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:02 +01:00
David Sterba 9d644a623e btrfs: cleanup, use correct type in div_u64_rem
div_u64_rem expects u32 for divisior and reminder.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba 47c5713f47 btrfs: replace remaining do_div calls with div_u64 variants
Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba b8b93addde btrfs: cleanup 64bit/32bit divs, provably bounded values
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:00 +01:00
David Sterba 3284da7b7b btrfs: use explicit initializer for seq_elem
Using {} as initializer for struct seq_elem does not properly initialize
the list_head member, but it currently works because it gets set through
btrfs_get_tree_mod_seq if 'seq' is 0.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:59 +01:00
David Sterba f64c7b12f8 btrfs: remove shadowing variables in __btrfs_buffered_write
There are lockstart and lockend defined in the function and not used
after their duplicate definition scope ends, it's safe to reuse them.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba 31e818fe73 btrfs: cleanup, use kmalloc_array/kcalloc array helpers
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba f8c269d722 btrfs: cleanup 64bit/32bit divs, compile time constants
Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that has the hidden
semtantics of changing the 1st argument.

Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:57 +01:00
David Sterba 351810c1d2 btrfs: use cond_resched_lock where possible
Clean the opencoded variant, cond_resched_lock also checks the lock for
contention so it might help in some cases that were not covered by
simple need_resched().

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:56 +01:00
David Sterba 853d8ec4b2 btrfs: need_resched not needed with cond_resched
Cleanup, no special reason to do

if (need_resched())
        cond_resched();

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:55 +01:00
Filipe Manana 84471e2429 Btrfs: incremental send, don't rename a directory too soon
There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/file

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/c /mnt/x
  $ mv /mnt/a /mnt/x/y
  $ mv /mnt/b /mnt/a

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.

Reported-by: Ames Cornish <ames@cornishes.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
David Sterba 1932b7be97 btrfs: fix lost return value due to variable shadowing
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: <stable@vger.kernel.org>	# 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana 5cdf83edb8 Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
The return value from btrfs_lookup_xattr() can be a pointer encoding an
error, therefore deal with it. This fixes commit 5f5bc6b1e2
("Btrfs: make xattr replace operations atomic").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana 5dfe2be7ea Btrfs: fix off-by-one logic error in btrfs_realloc_node
The end_slot variable actually matches the number of pointers in the
node and not the last slot (which is 'nritems - 1'). Therefore in order
to check that the current slot in the for loop doesn't match the last
one, the correct logic is to check if 'i' is less than 'end_slot - 1'
and not 'end_slot - 2'.

Fix this and set end_slot to be 'nritems - 1', as it's less confusing
since the variable name implies it's inclusive rather then exclusive.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana e8c1c76e80 Btrfs: add missing inode update when punching hole
When punching a file hole if we endup only zeroing parts of a page,
because the start offset isn't a multiple of the sector size or the
start offset and length fall within the same page, we were not updating
the inode item. This prevented an fsync from doing anything, if no other
file changes happened in the current transaction, because the fields
in btrfs_inode used to check if the inode needs to be fsync'ed weren't
updated.

This issue is easy to reproduce and the following excerpt from the
xfstest case I made shows how to trigger it:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file.
  $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, this makes btrfs update some btrfs inode specific fields
  # that are used to track if the inode needs to be written/updated to the fsync
  # log or not. After this fsync, the new values for those fields indicate that
  # a subsequent fsync does not need to touch the fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Force a commit of the current transaction. After this point, any operation
  # that modifies the data or metadata of our file, should update those fields in
  # the btrfs inode with values that make the next fsync operation write to the
  # fsync log.
  sync

  # Punch a hole in our file. This small range affects only 1 page.
  # This made the btrfs hole punching implementation write only some zeroes in
  # one page, but it did not update the btrfs inode fields used to determine if
  # the next fsync needs to write to the fsync log.
  $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo

  # Another variation of the previously mentioned case.
  $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo

  # Now fsync the file. This was a no-operation because the previous hole punch
  # operation didn't update the inode's fields mentioned before, so they remained
  # with the values they had after the first fsync - that is, they indicate that
  # it is not needed to write to fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Enable writes and mount the fs. This makes the fsync log replay code run.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Because the last fsync didn't do anything, here the file content matched what
  # it was after the first fsync, before the holes were punched, and not what it
  # was after the holes were punched.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This issue has been around since 2012, when the punch hole implementation
was added, commit 2aaa665581 ("Btrfs: add hole punching").

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Josef Bacik 0c0ef4bc84 Btrfs: abort the transaction if we fail to update the free space cache inode
Our gluster boxes were hitting a problem where they'd run out of space when
updating the block group cache and therefore wouldn't be able to update the free
space inode.  This is a problem because this is how we invalidate the cache and
protect ourselves from errors further down the stack, so if this fails we have
to abort the transaction so we make sure we don't end up with stale free space
cache.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Filipe Manana 4d884fceaa Btrfs: fix fsync race leading to ordered extent memory leaks
We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

         CPU 1                                         CPU 2

btrfs_sync_file()

                                                 btrfs_sync_file()

  mutex_lock(inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
      --> collects ordered extent X
      --> increments ordered
          extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(inode->i_mutex)

                                                   mutex_lock(inode->i_mutex)
  btrfs_sync_log()
     btrfs_wait_logged_extents()
       --> list_del_init(&ordered->log_list)
                                                     btrfs_log_inode()
                                                       btrfs_get_logged_extents()
                                                         --> Adds ordered extent X
                                                             to logged_list because
                                                             at this point:
                                                             list_empty(&ordered->log_list)
                                                             && test_bit(BTRFS_ORDERED_LOGGED,
                                                                         &ordered->flags) == 0
                                                         --> Increments ordered extent
                                                             X's refcount
       --> check if ordered extent's io is
           finished or not, start it if
           necessary and wait for it to finish
       --> sets bit BTRFS_ORDERED_LOGGED
           on ordered extent X's flags
           and adds it to trans->ordered
  btrfs_sync_log() finishes

                                                       btrfs_submit_logged_extents()
                                                     btrfs_log_inode() finishes
                                                   mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes

                                                   btrfs_sync_log()
                                                      btrfs_wait_logged_extents()
                                                        --> Sees ordered extent X has the
                                                            bit BTRFS_ORDERED_LOGGED set in
                                                            its flags
                                                        --> X's refcount is untouched
                                                   btrfs_sync_log() finishes

                                                 btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
        complete
    --> decrements ordered extent X's
        refcount by 1 only, corresponding
        to the increment done by the fsync
        task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

  Btrfs: make sure logged extents complete in the current transaction V3
  commit 50d9aa99bd

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
[ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
[ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
[ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
[ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
[ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
[ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
[ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
[ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
[ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
[ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
[ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
[ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...

Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):

$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
  comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
  hex dump (first 32 bytes):
    01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
    00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
  comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
  hex dump (first 32 bytes):
    02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
  comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
  hex dump (first 32 bytes):
    00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
    00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8114eb34>] create_object+0x172/0x29a
    [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
    [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
    [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
    [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
    [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
    [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
    [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
    [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
    [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
    [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
    [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
    [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
    [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17

CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Linus Torvalds 7dac5cb1bc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "I'm still testing more fixes, but I wanted to get out the fix for the
  btrfs raid5/6 memory corruption I mentioned in my merge window pull"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix allocation size calculations in alloc_btrfs_bio
2015-02-26 10:34:24 -08:00
Linus Torvalds be5e6616dd Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted stuff from this cycle.  The big ones here are multilayer
  overlayfs from Miklos and beginning of sorting ->d_inode accesses out
  from David"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  Documentation/filesystems/Locking: ->get_sb() is long gone
  trylock_super(): replacement for grab_super_passive()
  fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
  SELinux: Use d_is_positive() rather than testing dentry->d_inode
  Smack: Use d_is_positive() rather than testing dentry->d_inode
  TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
  Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
  Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
  VFS: Split DCACHE_FILE_TYPE into regular and special types
  VFS: Add a fallthrough flag for marking virtual dentries
  VFS: Add a whiteout dentry type
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  Infiniband: Fix potential NULL d_inode dereference
  posix_acl: fix reference leaks in posix_acl_create
  autofs4: Wrong format for printing dentry
  ...
2015-02-22 17:42:14 -08:00
David Howells e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
Chris Mason e57cf21e97 Btrfs: fix allocation size calculations in alloc_btrfs_bio
Since commit 8e5cfb55d3 (Btrfs: Make raid_map array be inlined in
btrfs_bio structure), the raid map array is allocated along with the
btrfs bio in alloc_btrfs_bio.  The calculation used to decide how much
we need to allocate was using the wrong parameter passed into the
allocation function.

The passed in real_stripes will be zero if a target replace operation
is not currently running.  We want to use total_stripes instead.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: David Sterba <dsterba@suse.cz>
Tested-by: David Sterba <dsterba@suse.cz>
2015-02-20 06:55:15 -08:00
Linus Torvalds 2b9fb532d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
     in the code and add improvements on top of the scrubbing support
     from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs
     clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and
     Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in
  btrfs raid5/6.  It's not clear yet if this is just made easier to
  trigger with this pull or if its a new bug from the raid5/6 cleanups.
  Dave Sterba is the only one to trigger it so far, but he has a
  consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-02-19 14:36:00 -08:00
David Sterba a4f3d2c4ef btrfs: cleanup, reduce temporary variables in btrfs_read_roots
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:47 +01:00
David Sterba 6f01105819 btrfs: use correct type for workqueue flags
Through all the local wrappers to alloc_workqueue, __alloc_workqueue_key
takes an unsigned int.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:47 +01:00
Eric Sandeen 4bbcaa648d btrfs: factor btrfs_read_roots() out of open_ctree()
Also, remove the two local variables create_uuid_tree
and check_uuid_tree; we can use the existence of
the uuid root and/or the RESCAN_UUID_TREE flag to
determine what action to take.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen 63443bf54a btrfs: factor btrfs_replay_log() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen 2a4581983f btrfs: factor btrfs_init_workqueues() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_workqueues]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen f9e92e40b5 btrfs: factor btrfs_init_qgroup() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_qgroup]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen ad6183680c btrfs: factor btrfs_init_dev_replace_locks() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_dev_replace_locks]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:46 +01:00
Eric Sandeen f37938e0e2 btrfs: factor btrfs_init_btree_inode() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_btree_inode]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen 779a65a495 btrfs: factor btrfs_init_balance() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_balance]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen 638aa7ed46 btrfs: factor btrfs_init_scrub() out of open_ctree()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[renamed to btrfs_init_scrub]
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen 0489234077 btrfs: consistently use fs_info in close_ctree()
close_ctree() has a local fs_info var for convienience;
use it consistently.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Eric Sandeen 9eaed21ef9 btrfs: remove unused fs_info arg from btrfs_close_extra_devices()
The commit:
8dabb74 Btrfs: change core code of btrfs to support the
        device replace operations
added the fs_info argument, but never used it -
just remove it again.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Fabian Frederick 41d6b13ea7 btrfs: fix sizeof format specifier in btrfs_check_super_valid()
This patch fixes mips compilation warning:

fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei 08da757d3f btrfs: cleanup: use for() loop in btrfs_map_bio()
for() is obviously better in these code block, and remove noused
init-value to reduce about 6 bytes binary size.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei a688a04aab btrfs: remove unused chunk_tree argument in several functions
There functions include unused chunk_tree argument from the begining,
it is time to remove them and clean up relative code to prepare value
of this argument in caller.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei b9fd47cde5 btrfs: cleanup: remove no-used alloc_chunk in btrfs_check_data_free_space()
int alloc_chunk is never used in this function, remove it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
David Sterba e8c9f18603 btrfs: constify structs with op functions or static definitions
There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Wang Shilong f749303bda Btrfs: switch to kvfree() helper
A new helper kvfree() in mm/utils.c will do this.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler 01d58472a8 Btrfs: disk-io: replace root args iff only fs_info used
This is the 3rd independent patch of a larger project to cleanup btrfs's
internal usage of btrfs_root. Many functions take btrfs_root only to
grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) csum_tree_block
  2) csum_dirty_buffer
  3) check_tree_block_fsid
  4) btrfs_find_tree_block
  5) clean_tree_block

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler a585e94895 Btrfs: delayed-inode: replace root args iff only fs_info used
This is the second independent patch of a larger project to cleanup
btrfs's internal usage of btrfs_root. Many functions take btrfs_root
only to grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) btrfs_wq_run_delayed_node

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Daniel Dressler b7a0365ec7 Btrfs: ctree: reduce args where only fs_info used
This patch is part of a larger project to cleanup btrfs's internal usage
of struct btrfs_root. Many functions take btrfs_root only to grab a
pointer to fs_info.

This causes programmers to ponder which root can be passed. Since only
the fs_info is read affected functions can accept any root, except this
is only obvious upon inspection.

This patch reduces the specificty of such functions to accept the
fs_info directly.

This patch does not address the two functions in ctree.c (insert_ptr,
and split_item) which only use root for BUG_ONs in ctree.c

This patch affects the following functions:
  1) fixup_low_keys
  2) btrfs_set_item_key_safe

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Filipe Manana a742994aa2 Btrfs: don't remove extents and xattrs when logging new names
If we are recording in the tree log that an inode has new names (new hard
links were added), we would drop items, belonging to the inode, that we
shouldn't:

1) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the extent and xattr items that were
   previously logged. This was done only in memory, since logging a new
   name doesn't imply syncing the log;

2) When the flag BTRFS_INODE_COPY_EVERYTHING is set in the inode's runtime
   flags, we ended up dropping all the xattr items that were previously
   logged. Like the case before, this was done only in memory because
   logging a new name doesn't imply syncing the log.

This led to some surprises in scenarios such as the following:

1) write some extents to an inode;
2) fsync the inode;
3) truncate the inode or delete/modify some of its xattrs
4) add a new hard link for that inode
5) fsync some other file, to force the log tree to be durably persisted
6) power failure happens

The next time the fs is mounted, the fsync log replay code is executed,
and the resulting file doesn't have the content it had when the last fsync
against it was performed, instead if has a content matching what it had
when the last transaction commit happened.

So change the behaviour such that when a new name is logged, only the inode
item and reference items are processed.

This is easy to reproduce with the test I just made for xfstests, whose
main body is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Now shrink our file to 5000 bytes.
  $XFS_IO_PROG -c "truncate 5000" $SCRATCH_MNT/foo

  # Now do an expanding truncate to a size larger than what we had when we last
  # fsync'ed our file. This is just to verify that after power failure and
  # replaying the fsync log, our file matches what it was when we last fsync'ed
  # it - 12Kb size, first 8Kb of data had a value of 0xaa and the last 4Kb of
  # data had a value of 0xcc.
  $XFS_IO_PROG -c "truncate 32K" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs drop all of our file's
  # metadata from the fsync log, including the metadata relative to the
  # extent we just wrote and fsync'ed. This change was made only to the fsync
  # log in memory, so adding the hard link alone doesn't change the persisted
  # fsync log. This happened because the previous truncates set the runtime
  # flag BTRFS_INODE_NEEDS_FULL_SYNC in the btrfs inode structure.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  # After this our persisted fsync log will no longer have metadata for our file
  # foo that points to the extent we wrote and fsync'ed before.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to see a file
  # with a size of 32Kb, with its first 5000 bytes having the value 0xaa and all
  # the remaining bytes with value 0x00.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The expected result is to see a file with a size of 12Kb, with its first 8Kb
  # of data having the value 0xaa and its last 4Kb of data having a value of 0xcc.
  # The btrfs bug used to leave the file as it used te be as of the last
  # transaction commit - that is, with a size of 8Kb with all bytes having a
  # value of 0xaa.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

The test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Filipe Manana 1a4bcf470c Btrfs: fix fsync data loss after adding hard link to inode
We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7)

Test cases for xfstests follow soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Forrest Liu 3d84be7991 Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Removing large amount of block group in a transaction may encounters
BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
will grab metadata reservation from transaction handle, and
btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when
delete unused block group.

The problem can be reproduce by following script

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 1000g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    for j in `seq 1 1 1000`; do
        fallocate -l 1g $mntpath/$j
    done
    # wait cleaner thread remove unused block group
    sleep 300

The call trace that results from the BUG_ON() is:

[  613.093084] ------------[ cut here ]------------
[  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
[  613.105855] invalid opcode: 0000 [#1] SMP
[  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
[  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
[  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
[  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
[  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
[  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
[  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
[  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
[  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
[  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
[  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
[  613.177554] Stack:
[  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
[  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
[  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
[  613.186171] Call Trace:
[  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
[  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
[  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
[  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
[  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
[  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
[  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
[  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
[  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
[  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.218828]  RSP <ffff880039cf7c48>
[  613.220382] ---[ end trace 71073106deb8a457 ]---

This patch replace btrfs_join_transaction() with btrfs_start_transaction() in
btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add()

Signed-off-by: Forrest Liu <forrestl@synology.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Josef Bacik dcab6a3b2a Btrfs: account for large extents with enospc
On our gluster boxes we stream large tar balls of backups onto our fses.  With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent.  The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need.  To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:48 -08:00