Commit Graph

287 Commits

Author SHA1 Message Date
Chris Mason d95603b262 Btrfs: fix uninit variable in repair_eb_io_failure
We'd have to be passing bogus extent buffers for this uninit variable to
actually be used, but set it to zero just in case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-12 15:55:15 -04:00
Linus Torvalds 9613bebb22 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes and features from Chris Mason:
 "We've merged in the error handling patches from SuSE.  These are
  already shipping in the sles kernel, and they give btrfs the ability
  to abort transactions and go readonly on errors.  It involves a lot of
  churn as they clarify BUG_ONs, and remove the ones we now properly
  deal with.

  Josef reworked the way our metadata interacts with the page cache.
  page->private now points to the btrfs extent_buffer object, which
  makes everything faster.  He changed it so we write an whole extent
  buffer at a time instead of allowing individual pages to go down,,
  which will be important for the raid5/6 code (for the 3.5 merge
  window ;)

  Josef also made us more aggressive about dropping pages for metadata
  blocks that were freed due to COW.  Overall, our metadata caching is
  much faster now.

  We've integrated my patch for metadata bigger than the page size.
  This allows metadata blocks up to 64KB in size.  In practice 16K and
  32K seem to work best.  For workloads with lots of metadata, this cuts
  down the size of the extent allocation tree dramatically and fragments
  much less.

  Scrub was updated to support the larger block sizes, which ended up
  being a fairly large change (thanks Stefan Behrens).

  We also have an assortment of fixes and updates, especially to the
  balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
  the defragging code (Liu Bo)."

Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
removal of the second argument to k[un]map_atomic() in commit
7ac687d9e0.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
  Btrfs: update the checks for mixed block groups with big metadata blocks
  Btrfs: update to the right index of defragment
  Btrfs: do not bother to defrag an extent if it is a big real extent
  Btrfs: add a check to decide if we should defrag the range
  Btrfs: fix recursive defragment with autodefrag option
  Btrfs: fix the mismatch of page->mapping
  Btrfs: fix race between direct io and autodefrag
  Btrfs: fix deadlock during allocating chunks
  Btrfs: show useful info in space reservation tracepoint
  Btrfs: don't use crc items bigger than 4KB
  Btrfs: flush out and clean up any block device pages during mount
  btrfs: disallow unequal data/metadata blocksize for mixed block groups
  Btrfs: enhance superblock sanity checks
  Btrfs: change scrub to support big blocks
  Btrfs: minor cleanup in scrub
  Btrfs: introduce common define for max number of mirrors
  Btrfs: fix infinite loop in btrfs_shrink_device()
  Btrfs: fix memory leak in resolver code
  Btrfs: allow dup for data chunks in mixed mode
  Btrfs: validate target profiles only if we are going to use them
  ...
2012-03-30 12:44:29 -07:00
Chris Mason 1d4284bd6e Merge branch 'error-handling' into for-linus
Conflicts:
	fs/btrfs/ctree.c
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/inode.c
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:31:37 -04:00
Josef Bacik ea46679408 Btrfs: deal with read errors on extent buffers differently
Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis.  So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 21:57:36 -04:00
Chris Mason a098d8e8ee Btrfs: loop waiting on writeback
lock_extent_buffer_for_io needs to loop around and make sure the
writeback bits are not set.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 17:04:23 -04:00
Josef Bacik 0b32f4bbb4 Btrfs: ensure an entire eb is written at once
This patch simplifies how we track our extent buffers.  Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 17:04:23 -04:00
Josef Bacik 5df4235ea1 Btrfs: introduce mark_extent_buffer_accessed
Because an eb can have multiple pages we need to make sure that all pages within
the eb are markes as accessed, since releasepage can be called against any page
in the eb.  This will keep us from possibly evicting hot eb's when we're doing
larger than pagesize eb's.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 16:51:09 -04:00
Josef Bacik 3083ee2e18 Btrfs: introduce free_extent_buffer_stale
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory.  So instead of evicting these pages, we
could end up evicting things we actually care about.  Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks.  This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 16:51:08 -04:00
Josef Bacik 115391d231 Btrfs: only use the existing eb if it's count isn't 0
We can run into a problem where we find an eb for our existing page already on
the radix tree but it has a ref count of 0.  It hasn't yet been removed by RCU
yet so this can cause issues where we will use the EB after free.  So do
atomic_inc_not_zero on the exists->refs and if it is zero just do
synchronize_rcu() and try again.  We won't have to worry about new allocators
coming in since they will block on the page lock at this point.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 16:51:08 -04:00
Josef Bacik 4f2de97ace Btrfs: set page->private to the eb
We spend a lot of time looking up extent buffers from pages when we could just
store the pointer to the eb the page is associated with in page->private.  This
patch does just that, and it makes things a little simpler and reduces a bit of
CPU overhead involved with doing metadata IO.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26 16:51:07 -04:00
Chris Mason 727011e07c Btrfs: allow metadata blocks larger than the page size
A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling.  This fixes the code to properly support
large metadata blocks again.

Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.

This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 16:50:37 -04:00
Jeff Mahoney 79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Jeff Mahoney 3fbe5c02ae btrfs: split extent_state ops
set_extent_bit can do exclusive locking but only when called by lock_extent*,

 Drop the exclusive bits argument except when called by lock_extent.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Jeff Mahoney d0082371cf btrfs: drop gfp_t from lock_extent
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
 argument and use GFP_NOFS consistently.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:35 +01:00
Jeff Mahoney 143bede527 btrfs: return void in functions without error conditions
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Jeff Mahoney 355808c296 btrfs: ->submit_bio_hook error push-up
This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.

It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Jeff Mahoney 3444a97255 btrfs: Factor out tree->ops->merge_bio_hook call
In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.

This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:33 +01:00
Jeff Mahoney 6763af84a6 btrfs: Remove set bits return from clear_extent_bit
There is only one caller of clear_extent_bit that checks the return value
and it only checks if it's negative. Since there are no users of the
returned bits functionality of clear_extent_bit, stop returning it
and avoid complicating error handling.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:32 +01:00
Jeff Mahoney c2d904e086 btrfs: Catch locking failures in {set,clear,convert}_extent_bit
The *_state functions can only return 0 or -EEXIST. This patch addresses
the cases where those functions returning -EEXIST represent a locking
failure. It handles them by panicking with an appropriate error message.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:30 +01:00
Cong Wang 7ac687d9e0 btrfs: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:21 +08:00
Chris Mason 5065319052 Btrfs: clear the extent uptodate bits during parent transid failures
If btrfs reads a block and finds a parent transid mismatch, it clears
the uptodate flags on the extent buffer, and the pages inside it.  But
we only clear the uptodate bits in the state tree if the block straddles
more than one page.

This is from an old optimization from to reduce contention on the extent
state tree.  But it is buggy because the code that retries a read from
a different copy of the block is going to find the uptodate state bits
set and skip the IO.

The end result of the bug is that we'll never actually read the good
copy (if there is one).

The fix here is to always clear the uptodate state bits, which is safe
because this code is only called when the parent transid fails.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Liu Bo 692e5759a4 Btrfs: be less strict on finding next node in clear_extent_bit
In clear_extent_bit, it is enough that next node is adjacent in tree level.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-21 16:02:10 +01:00
Liu Bo 9d47c7671d Btrfs: kick out redundant stuff in convert_extent_bit
clear_state_bit will do merge_state for us, so kick out the redundant one.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16 17:23:17 +01:00
Liu Bo 0449314a9c Btrfs: skip states when they does not contain bits to clear
Clearing a range's bits is different with setting them, since we don't
need to touch them when states do not contain bits we want.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16 17:23:17 +01:00
Tsutomu Itoh 285190d99f Btrfs: check return value of lookup_extent_mapping() correctly
This patch corrects error checking of lookup_extent_mapping().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16 17:23:17 +01:00
Tsutomu Itoh 013bd4c336 Btrfs: fix return value check of extent_io_ops
This patch adds the check on the return value of extent_io_ops.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16 17:23:16 +01:00
Jeff Mahoney 87826df0ec btrfs: delalloc for page dirtied out-of-band in fixup worker
We encountered an issue that was easily observable on s/390 systems but
 could really happen anywhere. The timing just seemed to hit reliably
 on s/390 with limited memory.

 The gist is that when an unexpected set_page_dirty() happened, we'd
 run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
 properly set up for delalloc.

 This patch does the following:
 - Performs the missing delalloc in the fixup worker
 - Allow the start hook to return -EBUSY which informs __extent_writepage
   that it should mark the page skipped and not to redirty it. This is
   required since the fixup worker can fail with -ENOSPC and the page
   will have already been redirtied. That causes an Oops in
   drop_outstanding_extents later. Retrying the fixup worker could
   lead to an infinite loop. Deferring the page redirty also saves us
   some cycles since the page would be stuck in a resubmit-redirty loop
   until the fixup worker completes. It's not harmful, just wasteful.
 - If the fixup worker fails, we mark the page and mapping as errored,
   and end the writeback, similar to what we would do had the page
   actually been submitted to writeback.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-02-15 16:40:25 +01:00
Mitch Harder 8bedd51b61 Btrfs: Check for NULL page in extent_range_uptodate
A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors.  The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).

After testing this patch, the user reported that the error with
the NULL pointer oops was solved.  However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c

       for (i = start_i; i < num_pages; i++) {
               page = extent_buffer_page(eb, i);
               wait_on_page_locked(page);
               if (!PageUptodate(page))
                       ret = -EIO;
       }

This patch leaves the issue with the locked page yet to be resolved.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Chris Mason c126dea771 Merge branch 'integrity-check-patch-v2' of git://btrfs.giantdisaster.de/git/btrfs into integration
Conflicts:
	fs/btrfs/ctree.h
	fs/btrfs/super.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:27:58 -05:00
Chris Mason 9785dbdf26 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration 2012-01-16 15:26:31 -05:00
Arne Jansen 5b25f70f42 Btrfs: add nested locking mode for paths
This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-01-04 16:12:29 +01:00
Stefan Behrens 21adbd5cbb Btrfs: integrate integrity check module into btrfs
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-12-21 19:14:17 +01:00
Liu Bo 1cf4ffdb32 Btrfs: drop spin lock when memory alloc fails
Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-08 08:55:47 -05:00
Jan Schmidt f4a8e6563e Btrfs: fix meta data raid-repair merge problem
Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.

The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.

This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-01 09:30:36 -05:00
Josef Bacik 4d479cf010 Btrfs: sectorsize align offsets in fiemap
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop.  This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned.  With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:17 -05:00
Jan Schmidt 32240a913d btrfs: mirror_num should be int, not u64
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:14 -05:00
Jeff Mahoney 745c4d8e16 btrfs: Fix up 32/64-bit compatibility for new ioctls
This patch casts to unsigned long before casting to a pointer and fixes
 the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:13 -05:00
Chris Mason 806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
Chris Mason 531f4b1ae5 Merge branch 'for-chris' of git://github.com/sensille/linux into integration
Conflicts:
	fs/btrfs/ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:05:08 -05:00
Chris Mason bf0da8c183 Btrfs: ClearPageError during writepage and clean_tree_block
Failure testing was tripping up over stale PageError bits in
metadata pages.  If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.

During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.

This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit.  In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:20 -05:00
Chris Mason 01d658f2ca Btrfs: make sure to flush queued bios if write_cache_pages waits
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.

Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:48 -05:00
Liu Bo fee187d9d9 Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2011-10-20 18:10:45 +02:00
Josef Bacik 462d6fac89 Btrfs: introduce convert_extent_bit
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches.  This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:47 -04:00
Josef Bacik 9e4871070b Btrfs: skip looking for delalloc if we don't have ->fill_delalloc
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check.  So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:30 -04:00
Arne Jansen 4bb31e928d btrfs: hooks for readahead
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.

Changes for v2:
 - eliminate race condition

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:44 +02:00
Arne Jansen bb82ab88df btrfs: add an extra wait mode to read_extent_buffer_pages
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.

Changes v5:
 - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
   WAIT_PAGE_LOCK

Change v6:
 - fix bug introduced in v5

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:47:55 +02:00
Jan Schmidt 4a54c8c165 btrfs: Moved repair code from inode.c to extent_io.c
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.

Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt 8ddc7d9cd0 btrfs: add mirror_num to extent_read_full_page
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Linus Torvalds ed8f37370d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
  Btrfs: don't call writepages from within write_full_page
  Btrfs: Remove unused variable 'last_index' in file.c
  Btrfs: clean up for find_first_extent_bit()
  Btrfs: clean up for wait_extent_bit()
  Btrfs: clean up for insert_state()
  Btrfs: remove unused members from struct extent_state
  Btrfs: clean up code for merging extent maps
  Btrfs: clean up code for extent_map lookup
  Btrfs: clean up search_extent_mapping()
  Btrfs: remove redundant code for dir item lookup
  Btrfs: make acl functions really no-op if acl is not enabled
  Btrfs: remove remaining ref-cache code
  Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
  Btrfs: use wait_event()
  Btrfs: check the nodatasum flag when writing compressed files
  Btrfs: copy string correctly in INO_LOOKUP ioctl
  Btrfs: don't print the leaf if we had an error
  btrfs: make btrfs_set_root_node void
  Btrfs: fix oops while writing data to SSD partitions
  Btrfs: Protect the readonly flag of block group
  ...

Fix up trivial conflicts (due to acl and writeback cleanups) in
 - fs/btrfs/acl.c
 - fs/btrfs/ctree.h
 - fs/btrfs/extent_io.c
2011-08-02 21:14:05 -10:00
Josef Bacik 0d10ee2e6d Btrfs: don't call writepages from within write_full_page
When doing a writepage we call writepages to try and write out any other dirty
pages in the area.  This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:37:36 -04:00
Xiao Guangrong 69261c4b6a Btrfs: clean up for find_first_extent_bit()
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:32:39 -04:00
Xiao Guangrong ded91f0814 Btrfs: clean up for wait_extent_bit()
We can just use cond_resched_lock().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:32:38 -04:00
Xiao Guangrong 3150b69969 Btrfs: clean up for insert_state()
Don't duplicate set_state_bits().

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:32:30 -04:00
Jeff Mahoney 1bf85046e4 btrfs: Make extent-io callbacks that never fail return void
The set/clear bit and the extent split/merge hooks only ever return 0.

 Changing them to return void simplifies the error handling cases later.

 This patch changes the hook prototypes, the single implementation of each,
 and the functions that call them to return void instead.

 Since all four of these hooks execute under a spinlock, they're necessarily
 simple.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-01 14:30:43 -04:00
Linus Torvalds 22712200e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
  Btrfs: use the commit_root for reading free_space_inode crcs
  Btrfs: reduce extent_state lock contention for metadata
  Btrfs: remove lockdep magic from btrfs_next_leaf
  Btrfs: make a lockdep class for each root
  Btrfs: switch the btrfs tree locks to reader/writer
  Btrfs: fix deadlock when throttling transactions
  Btrfs: stop using highmem for extent_buffers
  Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
  Btrfs: tag pages for writeback in sync
  Btrfs: fix enospc problems with delalloc
  Btrfs: don't flush delalloc arbitrarily
  Btrfs: use find_or_create_page instead of grab_cache_page
  Btrfs: use a worker thread to do caching
  Btrfs: fix how we merge extent states and deal with cached states
  Btrfs: use the normal checksumming infrastructure for free space cache
  Btrfs: serialize flushers in reserve_metadata_bytes
  Btrfs: do transaction space reservation before joining the transaction
  Btrfs: try to only do one btrfs_search_slot in do_setxattr
2011-07-27 16:43:52 -07:00
Chris Mason ff95acb673 Merge branch 'integration' into for-linus 2011-07-27 16:18:13 -04:00
Chris Mason 19b6caf4ac Btrfs: reduce extent_state lock contention for metadata
For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.

This greatly reduces contention on the state tree lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:47 -04:00
Chris Mason bd681513fa Btrfs: switch the btrfs tree locks to reader/writer
The btrfs metadata btree is the source of significant
lock contention, especially in the root node.   This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking.  Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster.  In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:46 -04:00
Chris Mason a65917156e Btrfs: stop using highmem for extent_buffers
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:45 -04:00
Josef Bacik f7aaa06bff Btrfs: tag pages for writeback in sync
Everybody else does this, we need to do it too.  If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:44 -04:00
Josef Bacik df98b6e2c5 Btrfs: fix how we merge extent states and deal with cached states
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed.  This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree.  So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state.  This way we are sure to cache the correct state.  Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-11 10:00:48 -04:00
Wu Fengguang d46db3d582 writeback: make writeback_control.nr_to_write straight
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09 22:09:01 -07:00
Linus Torvalds e6ece70732 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  btrfs: fix uninitialized variable warning
  btrfs: add helper for fs_info->closing
  Btrfs: add mount -o inode_cache
  btrfs: scrub: add explicit plugging
  btrfs: use btrfs_ino to access inode number
  Btrfs: don't save the inode cache if we are deleting this root
  btrfs: false BUG_ON when degraded
  Btrfs: don't save the inode cache in non-FS roots
  Btrfs: make sure we don't overflow the free space cache crc page
  Btrfs: fix uninit variable in the delayed inode code
  btrfs: scrub: don't reuse bios and pages
  Btrfs: leave spinning on lookup and map the leaf
  Btrfs: check for duplicate entries in the free space cache
  Btrfs: don't try to allocate from a block group that doesn't have enough space
  Btrfs: don't always do readahead
  Btrfs: try not to sleep as much when doing slow caching
  Btrfs: kill BTRFS_I(inode)->block_group
  Btrfs: don't look at the extent buffer level 3 times in a row
  Btrfs: map the node block when looking for readahead targets
  Btrfs: set range_start to the right start in count_range_bits
  ...
2011-06-05 06:17:23 +09:00
Chris Mason ff5714cca9 Merge branch 'for-chris' of
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-28 07:00:39 -04:00
Linus Torvalds a0c3061093 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
  Btrfs: use the device_list_mutex during write_dev_supers
  Btrfs: setup free ino caching in a more asynchronous way
  btrfs scrub: don't coalesce pages that are logically discontiguous
  Btrfs: return -ENOMEM in clear_extent_bit
  Btrfs: add mount -o auto_defrag
  Btrfs: using rcu lock in the reader side of devices list
  Btrfs: drop unnecessary device lock
  Btrfs: fix the race between remove dev and alloc chunk
  Btrfs: fix the race between reading and updating devices
  Btrfs: fix bh leak on __btrfs_open_devices path
  Btrfs: fix unsafe usage of merge_state
  Btrfs: allocate extent state and check the result properly
  fs/btrfs: Add missing btrfs_free_path
  Btrfs: check return value of btrfs_inc_extent_ref()
  Btrfs: return error to caller if read_one_inode() fails
  Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
  Btrfs: return error code to caller when btrfs_del_item fails
  Btrfs: return error code to caller when btrfs_previous_item fails
  btrfs: fix typo 'testeing' -> 'testing'
  btrfs: typo: 'btrfS' -> 'btrfs'
  ...
2011-05-27 13:57:12 -07:00
Chris Mason c309df0786 Btrfs: return -ENOMEM in clear_extent_bit
The btrfs releasepage function depends on ENOMEM coming
back when it is called atomic.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-26 17:52:44 -04:00
Linus Torvalds f8d613e2a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
  xen: cleancache shim to Xen Transcendent Memory
  ocfs2: add cleancache support
  ext4: add cleancache support
  btrfs: add cleancache support
  ext3: add cleancache support
  mm/fs: add hooks to support cleancache
  mm: cleancache core ops functions and config
  fs: add field to superblock to support cleancache
  mm/fs: cleancache documentation

Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26 10:50:56 -07:00
Dan Magenheimer 90a887c9a2 btrfs: add cleancache support
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs.  Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted.  Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:01:56 -06:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Xiao Guangrong c7f895a2b2 Btrfs: fix unsafe usage of merge_state
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Xiao Guangrong 8233767a22 Btrfs: allocate extent state and check the result properly
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
  allowed wait

- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
  we trigger BUG_ON() if it's fail

- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
  the return value of clear_extent_bit() is ignored by many callers

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Josef Bacik af60bed24e Btrfs: set range_start to the right start in count_range_bits
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results.  For example, if the range

[0-8192]

is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096].  So instead set range_start = max(cur_start, state->start).
This makes everything come out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:09 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason 0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Linus Torvalds 268bb0ce3e sanitize <linux/prefetch.h> usage
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

   grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
   grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 12:50:29 -07:00
David Sterba f2a97a9dbd btrfs: remove all unused functions
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

  text    data     bss     dec     hex filename
402081    7464     200  409745   64091 btrfs.ko.base
398620    7144     200  405964   631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:03 +02:00
David Sterba ba14419264 btrfs: drop gfp parameter from alloc_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba f09d1f60e6 btrfs: drop gfp parameter from find_extent_buffer
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba f993c883ad btrfs: drop unused argument from extent_io_tree_init
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba c704005d88 btrfs: unify checking of IS_ERR and null
use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba 306e16ce13 btrfs: rename variables clashing with global function names
reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Linus Torvalds 019793b755 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: cleanup error handling in inode.c
  Btrfs: put the right bio if we have an error
  Btrfs: free bitmaps properly when evicting the cache
  Btrfs: Free free_space item properly in btrfs_trim_block_group()
  btrfs: add missing spin_unlock to a rare exit path
  Btrfs: check return value of kmalloc()
  btrfs: fix wrong allocating flag when reading page
  Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
2011-04-26 08:26:58 -07:00
Itaru Kitayama 43e817a1fd btrfs: fix wrong allocating flag when reading page
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Li Zefan 33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Linus Torvalds adff377bb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...
2011-04-18 12:24:05 -07:00
Chris Mason 0d399205ed Btrfs end_bio_extent_readpage should look for locked bits
A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents.  This
fixes things to make it more effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-16 06:55:39 -04:00
Chris Mason 109b36a2bb Btrfs: make uncache_state unconditional
The extent_io code can take cached pointers into the extent state trees,
and these can make lookups much faster in common operations.  The
caching only happens when specific bits are set that prevent merging
and splitting of the extent state.

A help function was added to uncache the state, and it was testing
the same set of conditionals.  This can leak in very strange corner
cases where the lock bit goes away unexpectedly.

The uncaching should be unconditional.  Once we have a ref on the
extent we should always give it up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-12 20:51:26 -04:00
Arne Jansen 507903b818 btrfs: using cached extent_state in set/unlock combinations
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:45:36 -04:00
Sergei Trofimovich 3387206f26 btrfs: properly handle overlapping areas in memmove_extent_buffer
Fix data corruption caused by memcpy() usage on overlapping data.
I've observed it first when found out usermode linux crash on btrfs.

?all chain is the following:
------------[ cut here ]------------
WARNING: at /home/slyfox/linux-2.6/fs/btrfs/extent_io.c:3900 memcpy_extent_buffer+0x1a5/0x219()
Call Trace:
6fa39a58:  [<601b495e>] _raw_spin_unlock_irqrestore+0x18/0x1c
6fa39a68:  [<60029ad9>] warn_slowpath_common+0x59/0x70
6fa39aa8:  [<60029b05>] warn_slowpath_null+0x15/0x17
6fa39ab8:  [<600efc97>] memcpy_extent_buffer+0x1a5/0x219
6fa39b48:  [<600efd9f>] memmove_extent_buffer+0x94/0x208
6fa39bc8:  [<600becbf>] btrfs_del_items+0x214/0x473
6fa39c78:  [<600ce1b0>] btrfs_delete_one_dir_name+0x7c/0xda
6fa39cc8:  [<600dad6b>] __btrfs_unlink_inode+0xad/0x25d
6fa39d08:  [<600d7864>] btrfs_start_transaction+0xe/0x10
6fa39d48:  [<600dc9ff>] btrfs_unlink_inode+0x1b/0x3b
6fa39d78:  [<600e04bc>] btrfs_unlink+0x70/0xef
6fa39dc8:  [<6007f0d0>] vfs_unlink+0x58/0xa3
6fa39df8:  [<60080278>] do_unlinkat+0xd4/0x162
6fa39e48:  [<600517db>] call_rcu_sched+0xe/0x10
6fa39e58:  [<600452a8>] __put_cred+0x58/0x5a
6fa39e78:  [<6007446c>] sys_faccessat+0x154/0x166
6fa39ed8:  [<60080317>] sys_unlink+0x11/0x13
6fa39ee8:  [<60016b80>] handle_syscall+0x58/0x70
6fa39f08:  [<60021377>] userspace+0x2d4/0x381
6fa39fc8:  [<60014507>] fork_handler+0x62/0x69
---[ end trace 70b0ca2ef0266b93 ]---

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg09302.html

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Linus Torvalds 212a17ab87 Merge branch 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
  Btrfs: fix __btrfs_map_block on 32 bit machines
  btrfs: fix possible deadlock by clearing __GFP_FS flag
  btrfs: check link counter overflow in link(2)
  btrfs: don't mess with i_nlink of unlocked inode in rename()
  Btrfs: check return value of btrfs_alloc_path()
  Btrfs: fix OOPS of empty filesystem after balance
  Btrfs: fix memory leak of empty filesystem after balance
  Btrfs: fix return value of setflags ioctl
  Btrfs: fix uncheck memory allocations
  btrfs: make inode ref log recovery faster
  Btrfs: add btrfs_trim_fs() to handle FITRIM
  Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
  Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
  Btrfs: make update_reserved_bytes() public
  btrfs: return EXDEV when linking from different subvolumes
  Btrfs: Per file/directory controls for COW and compression
  Btrfs: add datacow flag in inode flag
  btrfs: use GFP_NOFS instead of GFP_KERNEL
  Btrfs: check return value of read_tree_block()
  btrfs: properly access unaligned checksum buffer
  ...

Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
2011-03-28 15:31:05 -07:00
liubo 1abe9b8a13 Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
              dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
              dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
        btrfs_ordered_extent_add
        btrfs_ordered_extent_remove
        btrfs_ordered_extent_start
        btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
        btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
        __extent_writepage
        btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
        btrfs_inode_new
        btrfs_inode_request
        btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
        btrfs_sync_file
        btrfs_sync_fs

These show sync arguments.

6) transaction:
        btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:33 -04:00
Linus Torvalds 6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Josef Bacik 850265335f Btrfs: return error if the range we want to map is bogus
Currently if we have corrupt metadata map_extent_buffer will complain about it,
but not return an error so the caller has no idea a problem was hit.  Fix this.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:35 -04:00
Linus Torvalds 0e5b88cd99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: break out of shrink_delalloc earlier
  btrfs: fix not enough reserved space
  btrfs: fix dip leak
  Btrfs: make sure not to return overlapping extents to fiemap
  Btrfs: deal with short returns from copy_from_user
  Btrfs: fix regressions in copy_from_user handling
2011-03-13 16:00:49 -07:00
Jens Axboe 4c63f5646e Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core
Conflicts:
	block/blk-core.c
	block/blk-flush.c
	drivers/md/raid1.c
	drivers/md/raid10.c
	drivers/md/raid5.c
	fs/nilfs2/btnode.c
	fs/nilfs2/mdt.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:58:35 +01:00
Jens Axboe 721a9602e6 block: kill off REQ_UNPLUG
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10 08:52:27 +01:00
Chris Mason ea8efc74bd Btrfs: make sure not to return overlapping extents to fiemap
The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases.  cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.

The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-08 11:58:09 -05:00
Linus Torvalds 4660ba63f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix fiemap bugs with delalloc
  Btrfs: set FMODE_EXCL in btrfs_device->mode
  Btrfs: make btrfs_rm_device() fail gracefully
  Btrfs: Avoid accessing unmapped kernel address
  Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
  Btrfs: allow balance to explicitly allocate chunks as it relocates
  Btrfs: put ENOSPC debugging under a mount option
2011-02-25 14:03:39 -08:00
Chris Mason ec29ed5b40 Btrfs: fix fiemap bugs with delalloc
The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.

This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-23 16:23:20 -05:00
Linus Torvalds 007a14af26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check return value of alloc_extent_map()
  Btrfs - Fix memory leak in btrfs_init_new_device()
  btrfs: prevent heap corruption in btrfs_ioctl_space_info()
  Btrfs: Fix balance panic
  Btrfs: don't release pages when we can't clear the uptodate bits
  Btrfs: fix page->private races
2011-02-15 08:00:35 -08:00