Commit Graph

31122 Commits

Author SHA1 Message Date
Tsutomu Itoh ecc7ada77b Btrfs: fix error handling in btrfs_ioctl_send()
fget() returns NULL if error. So, we should check NULL or not.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:13 -04:00
Tsutomu Itoh ba1eeaac99 Btrfs: remove unused variable in __process_changed_new_xattr()
Variable 'p' is not used any more. So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:12 -04:00
Josef Bacik 54067ae95e Btrfs: various abort cleanups
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort.  This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.

The second half of this are related to the cleanup parts.  First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong.  The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes.  With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:11 -04:00
Josef Bacik fd8b2b6115 Btrfs: cleanup destroy_marked_extents
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:11 -04:00
Josef Bacik abefa55ac1 Btrfs: check return value of commit when recovering log
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out.  We need to do this before
we free up the log_tree_root since the caller will handle all of that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:10 -04:00
Josef Bacik 32b0253803 Btrfs: don't panic if we're trying to drop too many refs
This is just obnoxious.  Just print a message, abort the transaction, and return
an error.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:09 -04:00
Josef Bacik 171f6537ab Btrfs: cleanup fs roots if we fail to mount
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime.  So we need to clean this up
so we don't leave extent buffers hanging around on the cache.  With this patch
we no longer leak extent buffers on failure to mount.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:08 -04:00
Josef Bacik eb384b55ae Btrfs: fix extent logging with O_DIRECT into prealloc
This is the same as the fix from commit

Btrfs: fix bad extent logging

but for O_DIRECT.  I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent.  We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Josef Bacik 416bc6580b Btrfs: fix all callers of read_tree_block
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly.  You need to check
and make sure the extent_buffer is uptodate before you use it.  This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not.  With this we no longer
leak EB's when things go horribly wrong.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Josef Bacik 51bf5f0bc4 Btrfs: only exclude supers in the range of our block group
If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree.  This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it.  This fixes the problem by only adding excluded
extents if it falls in the block group range we care about.  With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:06 -04:00
Josef Bacik 1c24c3ce6a Btrfs: add tree block level sanity check
With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level.  So add this
to the sanity checks in the endio handler for tree blocks.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:05 -04:00
Josef Bacik 5ec8dca761 Btrfs: don't try and free ebs twice in log replay
This work is done by btrfs_free_path() anyway so there's no need for this
duplicate work.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:04 -04:00
Josef Bacik fb7669b5a0 Btrfs: don't BUG_ON() in btrfs_num_copies
A user sent me a btrfs-image that was panicing because of some corruption.  This
is because we pass in a bogus value to btrfs_num_copies, and it panics.  Instead
just return 1.  We only call btrfs_num_copies to see if there are other copies
to try and read for things, so if we just return 1 it will make the callers exit
out with an appropriate error value.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:04 -04:00
Josef Bacik 79fb65a1f6 Btrfs: don't call readahead hook until we have read the entire eb
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map.  Turns out he is using > PAGESIZE leafsizes.  The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus.  Fix this
by only calling the readahead hook once all pages have been read in.  Thanks,

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:03 -04:00
Josef Bacik 9bb91873e3 Btrfs: deal with bad mappings in btrfs_map_block
Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find
a chunk for a particular block we were trying to map.  This happened because the
block was bogus.  We shouldn't be BUG_ON()'ing in this case, just print a
message and return an error.  This came from reada_add_block and it appears to
deal with an error fine so we should be good there.  Thanks,

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:02 -04:00
Josef Bacik d4c7ca86b5 Btrfs: use REQ_META for all metadata IO
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:01 -04:00
Josef Bacik 0a3896d0f5 Btrfs: fix possible infinite loop in slow caching
So I noticed there is an infinite loop in the slow caching code.  If we return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again.  Fix this by only doing btrfs_next_leaf() if we
don't need_resched().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:01 -04:00
Josef Bacik 62dbd7176e Btrfs: fix lockdep warning
The locking order for stuff is

__sb_start_write
ordered_mutex

but with sync() we don't do __sb_start_write for some strange reason, which
means that our iput in wait_ordered_extents could start a transaction which does
the __sb_start_write while we're holding the ordered_mutex.  Fix this by using
delayed iput in sync.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:00 -04:00
Wang Shilong 534e6623b7 Btrfs: add all ioctl checks before user change for quota operations
Since all the quota configurations are loaded in memory, and we can
have ioctl checks before operating in the disk. It is safe to do such
things because qgroup_ioctl_lock is held outside.

Without these extra checks firstly, it should be ok to do user change
for quota operations. For example:

if we want to add an existed qgroup, we will do:
	->add_qgroup_item()
		->add_qgroup_rb()

add_qgroup_item() will return -EEXIST to us, however, qgroups are all
in memory, why not check them in memory firstly.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:59 -04:00
Wang Shilong 3c97185c65 Btrfs: fix missing check about ulist_add() in qgroup.c
ulist_add() may return -ENOMEM, fix missing check about
return value.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:58 -04:00
Stefan Behrens 70023da276 Btrfs: clear received_uuid field for new writable snapshots
For created snapshots, the full root_item is copied from the source
root and afterwards selectively modified. The current code forgets
to clear the field received_uuid. The only problem is that it is
confusing when you look at it with 'btrfs subv list', since for
writable snapshots, the contents of the snapshot can be completely
unrelated to the previously received snapshot.
The receiver ignores such snapshots anyway because he also checks
the field stransid in the root_item and that value used to be reset
to zero for all created snapshots.

This commit changes two things:
- clear the received_uuid field for new writable snapshots.
- don't clear the send/receive related information like the stransid
  for read-only snapshots (which makes them useable as a parent for
  the automatic selection of parents in the receive code).

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:58 -04:00
Josef Bacik b8d7f3ac10 Btrfs: don't force pages under writeback to finish when aborting
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right.  However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong.  We need to
lock the page and wait on page writeback to complete if it is.  There is nothing
unsafe about this since we are discarding the transaction anyway.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:57 -04:00
Wang Shilong ccf7f29d1a Btrfs: remove unused variable in the iterate_extent_inodes()
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:56 -04:00
Liu Bo 0abd5b1724 Btrfs: return error when we specify wrong start to defrag
We need such a sanity check for wrong start when we defrag a file, otherwise,
even with a wrong start that's larger than file size, we can end up changing
not only inode's force compress flag but also FS's incompat flags.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:55 -04:00
Vincent 3c59ccd32a Btrfs: fix reada debug code compilation
This fixes the following errors:

  fs/btrfs/reada.c: In function ‘btrfs_reada_wait’:
  fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
  fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)

Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:55 -04:00
Tsutomu Itoh fd279faefa Btrfs: cleanup of function where btrfs_extend_item() is called
Argument 'trans' became unnecessary from setup_inline_extent_backref()
that called btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:54 -04:00
Tsutomu Itoh 4b90c68015 Btrfs: remove unused argument of btrfs_extend_item()
Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:53 -04:00
Tsutomu Itoh afe5fea72b Btrfs: cleanup of function where fixup_low_keys() is called
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Tsutomu Itoh d6a0a12684 Btrfs: remove unused argument of fixup_low_keys()
Argument 'trans' is not used in fixup_low_keys(). So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Wang Shilong b4fcd6be6b Btrfs: fix confusing edquot happening case
Step to reproduce:
	mkfs.btrfs <disk>
	mount <disk> <mnt>
	dd if=/dev/zero of=/<mnt>/data bs=1M count=10
	sync
	btrfs quota enable <mnt>
	btrfs qgroup create 0/5 <mnt>
	btrfs qgroup limit 5M 0/5 <mnt>
	rm -f /<mnt>/data
	sync
	btrfs qgroup show <mnt>
	dd if=/dev/zero of=data bs=1M count=1

>From the perspective of users, qgroup's referenced or exclusive
is negative,but user can not continue to write data! a workaround
way is to cast u64 to s64 when doing qgroup reservation.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:51 -04:00
Wang Shilong e36902d4cc Btrfs: do not continue if out of memory happens
If out of memory happens, we should return -ENOMEM directly to the caller
rather than continue the work.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:50 -04:00
Nathaniel Yazdani 9c931c5ab2 btrfs: fix minor typo in comment
In the comment describing the sync_writers field of the btrfs_inode
struct, "fsyncing" was misspelled "fsycing."

Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:49 -04:00
Wang Shilong 98ad43be0a Btrfs: cleanup to remove reduplicate code in transaction.c
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:49 -04:00
Jan Schmidt 47fb091fb7 Btrfs: fix unlock after free on rewinded tree blocks
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:48 -04:00
Jan Schmidt 30b0463a93 Btrfs: fix accessing the root pointer in tree mod log functions
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemtion was enabled.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00
Jan Schmidt 90f8d62ebb Btrfs: fix tree mod log regression on root split operations
Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.

This commit removes the erroneous logging of removal of all elements.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00
Miao Xie ceda086424 Btrfs: use a lock to protect incompat/compat flag of the super block
The following case will make the incompat/compat flag of the super block
be recovered.
 Task1					|Task2
 flags = btrfs_super_incompat_flags();	|
					|flags = btrfs_super_incompat_flags();
 flags |= new_flag1;			|
					|flags |= new_flag2;
 btrfs_set_super_incompat_flags(flags);	|
					|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.

In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:46 -04:00
Miao Xie f42a34b2f1 Btrfs: fix unblocked autodefraggers when remount
The new mount option is set after parsing the remount arguments,
so it is wrong that checking the autodefrag is close or not at
btrfs_remount_prepare(). Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:45 -04:00
Wang Shilong f7f82b81d2 Btrfs: add a rb_tree to improve performance of ulist search
Walking backref tree and btrfs quota rely on ulist very much.
This patch tries to use rb_tree to speed up search time.

The original code always checks whether an element
exists before adding a new element, however it costs O(n).

I try to add a rb_tree in the ulist,this is only used to speed up
search. I also do some measurements with quota enabled.

fsstress -p 4 -n 10000

Without this path:
real    0m51.058s       2m4.745s        1m28.222s       1m5.137s
user    0m0.035s        0m0.041s        0m0.105s        0m0.100s
sys     0m12.009s       0m11.246s       0m10.901s       0m10.999s       0m11.287s

With this path:
real    0m55.295s       0m50.960s       1m2.214s        0m48.273s
user    0m0.053s        0m0.095s        0m0.135s        0m0.107s
sys     0m7.766s        0m6.013s        0m6.319s        0m6.030s        0m6.532s

After applying the patch,the execute time is down by ~42%.(11.287s->6.532s)

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:44 -04:00
Stefan Behrens c2c71324ec Btrfs: allow omitting stream header and end-cmd for btrfs send
Two new flags are added to allow omitting the stream header and the
end command for btrfs send streams. This is used in cases where you
send multiple snapshots back-to-back in one stream.

This used to be encoded like this (with 2 snapshots in this example):
<stream header> + <sequence of commands> + <end cmd> +
<stream header> + <sequence of commands> + <end cmd> + EOF

The new format (if the two new flags are used) is this one:
<stream header> + <sequence of commands> +
                  <sequence of commands> + <end cmd>

Note that the currently existing receivers treat <end cmd> only as
an indication that a new <stream header> is following. This means,
you can just skip the sequence <end cmd> <stream header> without
loosing compatibility. As long as an EOF is following, the currently
existing receivers handle the new format (if the two new flags are
used) exactly as the old one.

So what is the benefit of this change? The goal is to be able to use
a single stream (one TCP connection) to multiplex a request/response
handshake plus Btrfs send streams, all in the same stream. In this
case you cannot evaluate an EOF condition as an end of the Btrfs send
stream. You need something else, and the <end cmd> is just perfect
for this purpose.

The summary is:
The format change is driven by the need to send several Btrfs send
streams over a single TCP connections, with the ability for a repeated
request/response handshake in the middle. And this format change does
not break any existing tool, it is completely compatible.

You could compare the old behaviour of the Btrfs send stream to the
one of ftp where you need a seperate request/response channel and
newly opened data transfer channels for each file, while the new
behaviour is more like http using a single stream for everything.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:44 -04:00
Wang Shilong 692206b153 Btrfs: make __merge_refs() return type be void
__merge_refs() always return 0, it is unnecessary
for the caller to check the return value.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:43 -04:00
Wang Shilong 1149ab6bd4 Btrfs: remove some BUG_ONs() when walking backref tree
The only error return value of __add_prelim_ref() is -ENOMEM,
just return errors rather than trigger BUG_ON().

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:42 -04:00
Wang Shilong 92f183aa5b Btrfs: use tree_root to avoid edquot when disabling quota
Steps to reproduce:
	mkfs.btrfs <disk>
	mount <disk> <mnt>
	btrfs quota enable <mnt>
	btrfs sub create <mnt>/subv
	btrfs qgroup limit 10K <mnt>/subv
	btrfs quota disable <mnt>/subv

It is wrong for qgroup to reserve when disabling quota,
so just use tree_root to avoid edquot when disabling quota.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:41 -04:00
Wang Shilong ddb47afa50 Btrfs: fix a warning when updating qgroup limit
Step to reproduce:
	mkfs.btrfs <disk>
	mount <disk> <mnt>
	btrfs quota enable <mnt>
	btrfs qgroup limit 0/1 <mnt>
	dmesg

If the relative qgroup dosen't exist, flag 'BTRFS_QGROUP_STATUS_
FLAG_INCONSISTENT' will be set, and print the noise message.
This is wrong, we can just move find_qgroup_rb() before
update_qgroup_limit_item().this dosen't change the logic of the
function. But it can avoid unnecessary noise message and wrong set of flag.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:41 -04:00
Wang Shilong 3f5e2d3b38 Btrfs: fix missing check in the btrfs_qgroup_inherit()
The original code forgot to check 'inherit', we should
gurantee that all the qgroups in the struct 'inherit' exist.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:40 -04:00
Wang Shilong b7fef4f593 Btrfs: fix missing check before creating a qgroup relation
Step to reproduce:
		mkfs.btrfs <disk>
		mount <disk> <mnt>
		btrfs quota enable <mnt>
		btrfs qgroup assign 0/1 1/1 <mnt>
		umount <mnt>
		btrfs-debug-tree <disk> | grep QGROUP
If we want to add a qgroup relation, we should gurantee that
'src' and 'dst' exist, otherwise, such qgroup relation should
not be allowed to create.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:39 -04:00
Wang Shilong 58400fce5a Btrfs: remove some unnecessary spin_lock usages
We use mutex lock to protect all the user change operations.
So when we are calling find_qgroup_rb() to check whether qgroup
exists, we don't have to hold spin_lock.

Besides, when enabling/disabling quota, it must be single thread
when operations come here. spin lock must be firstly used to
clear quota_root when disabling quota, while enabling quota, spin
lock must be used to complete the last assign work.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:39 -04:00
Wang Shilong f2f6ed3d54 Btrfs: introduce a mutex lock for btrfs quota operations
The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
	->add_qgroup_info_item()
		->add_qgroup_rb()

For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().

What's worse, when we want to add a qgroup relation:
	->add_qgroup_relation_item()
		->add_qgroup_relations()

We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.

To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:38 -04:00
Wang Shilong 7708f029dc Btrfs: creating the subvolume qgroup automatically when enabling quota
Creating the subvolume/snapshots(including root subvolume) qgroup
auotomatically when enabling quota.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:37 -04:00
Zach Brown d4e3991b99 btrfs: abort unlink trans in missed error case
__btrfs_unlink_inode() aborts its transaction when it sees errors after
it removes the directory item.  But it missed the case where
btrfs_del_dir_entries_in_log() returns an error.  If this happens then
the unlink appears to fail but the items have been removed without
updating the directory size.  The directory then has leaked bytes in
i_size and can never be removed.

Adding the missing transaction abort at least makes this failure
consistent with the other failure cases.

I noticed this while reading the code after someone on irc reported
having a directory with i_size but no entries.  I tested it by forcing
btrfs_del_dir_entries_in_log() to return -ENOMEM.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:36 -04:00
Eric Sandeen f63e0cca91 btrfs: ignore device open failures in __btrfs_open_devices
This:

   # mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test

would lead to a blkdev open/close mismatch when the mount fails, and
a permanently busy (opened O_EXCL) sdb2:

   # wipefs -a /dev/sdb2
   wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy

It's because btrfs_open_devices() may open some devices, fail on
the last one, and return that failure stored in "ret."   The mount
then fails, but the caller then does not clean up the open devices.

Chris assures me that:

"btrfs_open_devices just means: go off and open every bdev you can from
this uuid.  It should return success if we opened any of them at all."

So change the logic to ignore any open failures; just skip processing
of that device.  Later on it's decided whether we have enough devices
to continue.

Reported-by: Jan Safranek <jsafrane@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:36 -04:00
Miao Xie e4100d987b Btrfs: improve the performance of the csums lookup
It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.

According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).

 # dd if=<mnt>/file of=/dev/null bs=1M count=1024

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:35 -04:00
Josef Bacik 09a2a8f96e Btrfs: fix bad extent logging
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery.  I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem.  The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents.  So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space.  If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later.  The data and such will still work,
but everything else is broken.  This patch fixes this by not allowing extents
that are on the modified list to be merged.  This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
longer creates a bogus file system after replaying the log.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:34 -04:00
Josef Bacik cc95bef635 Btrfs: log ram bytes properly
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be.  This was
still working out right with compression for some reason but I think we were
getting lucky.  It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it.  With this patch btrfsck no longer complains
after a log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:33 -04:00
Josef Bacik 98ad69cfd2 Btrfs: don't wait on ordered extents if we have a trans open
Dave was hitting a lockdep warning because we're now properly taking the ordered
operations mutex in the ordered wait stuff.  This is because some cases we will
have a trans handle when we are flushing delalloc space, but we can't wait on
ordered extents because we could potentially deadlock, so fix this by not doing
the wait if we have a trans handle.  Thanks

Reported-and-tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:32 -04:00
Josef Bacik 8c579fe745 Btrfs: fix error handling in make/read block group
I noticed that we will add a block group to the space info before we add it to
the block group cache rb tree, so we could potentially allocate from the block
group before it's able to be searched for.  I don't think this is too much of
a problem, the race window is microscopic, but just in case move the tree
insertion to above the space info linking.  This makes it easier to adjust the
error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
error handling setup for these scenarios.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:32 -04:00
Wang Shilong 5c2d867fdc Btrfs: fix double free in the iterate_extent_inodes()
If btrfs_find_all_roots() fails, 'roots' has been freed or 'roots'
fails to allocate. We don't need to free it outside btrfs_find_all_roots()
again.Fix it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:31 -04:00
Wang Shilong f172393952 Btrfs: kill some BUG_ONs() in the find_parent_nodes()
The reason that BUG_ON() happens in these places is just
because of ENOMEM.

We try ro return ENOMEM rather than trigger BUG_ON(), the
caller will abort the transaction thus avoiding the kernel panic.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:30 -04:00
Josef Bacik 41b0fc4280 Btrfs: compare relevant parts of delayed tree refs
A user reported a panic while running a balance.  What was happening was he was
relocating a block, which added the reference to the relocation tree.  Then
relocation would walk through the relocation tree and drop that reference and
free that block, and then it would walk down a snapshot which referenced the
same block and add another ref to the block.  The problem is this was all
happening in the same transaction, so the parent block was free'ed up when we
drop our reference which was immediately available for allocation, and then it
was used _again_ to add a reference for the same block from a different
snapshot.  This resulted in something like this in the delayed ref tree

add ref to 90234880, parent=2067398656, ref_root 1766, level 1
del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
add ref to 90234880, parent=2067398656, ref_root 1767, level 1

as you can see the ref_root's don't match, because when we inc the ref we use
the header owner, which is the original tree the block belonged to, instead of
the data reloc tree.  Then when we remove the extent we use the reloc tree
objectid.  But none of this matters, since it is a shared reference which means
only the parent matters.  When the delayed ref stuff runs it adds all the
increments first, and then does all the drops, to make sure that we don't delete
the ref if we net a positive ref count.  But tree blocks aren't allowed to have
multiple refs from the same block, so this panics when it tries to add the
second ref.  We need the add and the drop to cancel each other out in memory so
we only do the final add.

So to fix this we need to adjust how the delayed refs are added to the tree.
Only the ref_root matters when it is a normal backref, and only the parent
matters when it is a shared backref.  So make our decision based on what ref
type we have.  This allows us to keep the ref_root in memory in case anybody
wants to use it for something else, and it allows the delayed refs to be merged
properly so we don't end up with this panic.

With this patch the users image no longer panics on mount, and it has a clean
fsck after a normal mount/umount cycle.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:29 -04:00
Josef Bacik cf79ffb5b7 Btrfs: fix infinite loop when we abort on mount
Testing my enospc log code I managed to abort a transaction during mount, which
put me into an infinite loop.  This is because of two things, first we don't
reset trans_no_join if we abort during transaction commit, which will force
anybody trying to start a transaction to just loop endlessly waiting for it to
be set to 0.  But this is still just a symptom, the second issue is we don't set
the fs state to error during errors on mount.  This is because we don't want to
do the flip read only thing during mount, but we still really want to set the fs
state to an error to keep us from even getting to the trans_no_join check.  So
fix both of these things, make sure to reset trans_no_join if we abort during a
commit, and make sure we set the fs state to error no matter if we're mounting
or not.  This should keep us from getting into this infinite loop again.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:29 -04:00
Wang Shilong c9a9dbf2cb Btrfs: fix a warning when disabling quota
Steps to reproduce:
	mkfs.btrfs <disk>
	mount <disk> <mnt>
	btrfs quota enable <mnt>
	btrfs sub create <mnt>/subv

	i=1
	while [ $i -le 10000 ]
	do
		dd if=/dev/zero of=<mnt>/subv/data_$i bs=1K count=1
		i=$(($i+1))
		if [ $i -eq 500 ]
		then
			btrfs quota disable $mnt
		fi
	done
	dmesg
Obviously, this warn_on() is unnecessary, and it will be easily triggered.
Just remove it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:28 -04:00
Liu Bo 6b67a32000 Btrfs: pass NULL instead of 0
set_extent_bit()'s (u64 *failed_start) expects NULL not 0.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:27 -04:00
David Sterba 5c50c9b89f btrfs: make subvol creation/deletion killable in the early stages
The subvolume ioctls block on the parent directory mutex that can be
held by other concurrent snapshot activity for a long time. Give the
user at least some chance to get out of this situation by allowing
to send a kill signal.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:26 -04:00
David Sterba 94ef7280e8 btrfs: cover more error codes in btrfs_decode_error
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:25 -04:00
David Sterba 4884b476d7 btrfs: make orphan cleanup less verbose
The messages

  btrfs: unlinked 123 orphans
  btrfs: truncated 456 orphans

are not useful to regular users and raise questions whether there are
problems with the filesystem.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:24 -04:00
David Sterba 5e2a4b25da btrfs: deprecate subvolrootid mount option
This mount option was a workaround when subvol= assumed path relative
to the default subvolume, not the toplevel one. This was fixed long time
ago and subvolrootid has no effect.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:23 -04:00
Simon Kirby c2cf52eb71 Btrfs: Include the device in most error printk()s
With more than one btrfs volume mounted, it can be very difficult to find
out which volume is hitting an error. btrfs_error() will print this, but
it is currently rigged as more of a fatal error handler, while many of
the printk()s are currently for debugging and yet-unhandled cases.

This patch just changes the functions where the device information is
already available. Some cases remain where the root or fs_info is not
passed to the function emitting the error.

This may introduce some confusion with volumes backed by multiple devices
emitting errors referring to the primary device in the set instead of the
one on which the error occurred.

Use btrfs_printk(fs_info, format, ...) rather than writing the device
string every time, and introduce macro wrappers ala XFS for brevity.
Since the function already cannot be used for continuations, print a
newline as part of the btrfs_printk() message rather than at each caller.

Signed-off-by: Simon Kirby <sim@hostway.ca>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:23 -04:00
David Sterba aa8259145e btrfs: update kconfig title
The Kconfig title does not make much sense after the cleanup of
CONFIG_EXPERIMENTAL option, align the wording with other filesystems.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:22 -04:00
David Sterba 9d1a2a3ad5 btrfs: clean snapshots one by one
Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.

A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.

The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:21 -04:00
Zhi Yong Wu 6841ebee6b btrfs: Cleanup some redundant codes in btrfs_log_inode()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:20 -04:00
Zhi Yong Wu 628c8282be btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:20 -04:00
Liu Bo 7abadb6431 Btrfs: share stop worker code
Share the exactly same code of stopping workers.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:19 -04:00
Josef Bacik 3173a18f70 Btrfs: add a incompatible format change for smaller metadata extent refs
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree.  This takes up quite a bit of space.  Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster.  This is not an automatic format change, you
must enable it at mkfs time or with btrfstune.  This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:18 -04:00
Liu Bo be283b2e67 Btrfs: use helper to cleanup tree roots
free_root_pointers() has been introduced to cleanup all of tree roots,
so just use it instead.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:17 -04:00
Liu Bo b0496686ba Btrfs: cleanup unused arguments of btrfs_csum_data
Argument 'root' is no more used in btrfs_csum_data().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:14 -04:00
David Sterba 087488109a btrfs: clean up transaction abort messages
The transaction abort stacktrace is printed only once per module
lifetime, but we'd like to see it each time it happens per mounted
filesystem.  Introduce a fs_state flag that records it.

Tweak the messages around abort:
* add error number to the first abort
* print the exact negative errno from btrfs_decode_error
* clean up btrfs_decode_error and callers
* no dots at the end of the messages

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:52:56 -04:00
David Sterba bbece8a3f0 btrfs: merge save_error_info helpers into one
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:52:55 -04:00
Josef Bacik 74255aa07d Btrfs: add some free space cache tests
We keep hitting bugs in the tree log replay because btrfs_remove_free_space
doesn't account for some corner case.  So add a bunch of tests to try and fully
test btrfs_remove_free_space since the only time it is called is during tree log
replay.  These tests all finish successfully, so as we find more of these bugs
we need to add to these tests to make sure we don't regress in fixing things.
I've hidden the tests behind a Kconfig option, but they take no time to run so
all btrfs developers should have this turned on all the time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:52:54 -04:00
Liu Bo e75206cfdc Btrfs: cleanup unused function
btrfs_abort_devices() is no more used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-04-29 14:58:34 -04:00
Zhao Hongjiang 91d80a84bb aio: fix possible invalid memory access when DEBUG is enabled
dprintk() shouldn't access @ring after it's unmapped.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-26 07:56:18 -07:00
Linus Torvalds 0a82a8d132 Revert "block: add missing block_bio_complete() tracepoint"
This reverts commit 3a366e614d.

Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.

Jens says:
 "It's not quite clear why that is yet, so I think we should just revert
  the commit for 3.9 final (which I'm assuming is pretty close).

  The wifi is crap at the LSF hotel, so sending this email instead of
  queueing up a revert and pull request."

Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18 09:00:26 -07:00
Vyacheslav Dubeyko 12f267a20a hfsplus: fix potential overflow in hfsplus_file_truncate()
Change a u32 to loff_t hfsplus_file_truncate().

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:45 -07:00
Naoya Horiguchi 23d9e48213 fs/binfmt_elf.c: fix hugetlb memory check in vma_dump_size()
Documentation/filesystems/proc.txt says about coredump_filter bitmask,

  Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
  effected by bit 5-6.

However current code can go into the subsequent flag checks of bit 0-4
for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
as written in the document.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:44 -07:00
Naoya Horiguchi a2fce91430 hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
Currently we fail to include any data on hugepages into coredump,
because VM_DONTDUMP is set on hugetlbfs's vma.  This behavior was
recently introduced by commit 314e51b985 ("mm: kill vma flag
VM_RESERVED and mm->reserved_vm counter").

This looks to me a serious regression, so let's fix it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[3.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-17 16:10:44 -07:00
Linus Torvalds bb33db7a07 Merge branches 'timers-urgent-for-linus', 'irq-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull {timer,irq,core} fixes from Thomas Gleixner:

 - timer: bug fix for a cpu hotplug race.

 - irq: single bugfix for a wrong return value, which prevents the
   calling function to invoke the software fallback.

 - core: bugfix which plugs two race confitions which can cause hotplug
   per cpu threads to end up on the wrong cpu.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Don't reinitialize a cpu_base lock on CPU_UP

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: gic: fix irq_trigger return

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kthread: Prevent unpark race which puts threads on the wrong cpu
2013-04-15 07:03:01 -07:00
Linus Torvalds 3792a64fde Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull one more btrfs fix from Chris Mason:
 "This has a recent fix from Josef for our tree log replay code.  It
  fixes problems where the inode counter for the number of bytes in the
  file wasn't getting updated properly during fsync replay.

  The commit did get rebased this morning, but it was only to clean up
  the subject line.  The code hasn't changed."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: make sure nbytes are right after log replay
2013-04-14 10:52:54 -07:00
Suleiman Souhlal 5b55d70833 vfs: Revert spurious fix to spinning prevention in prune_icache_sb
Revert commit 62a3ddef61 ("vfs: fix spinning prevention in prune_icache_sb").

This commit doesn't look right: since we are looking at the tail of the
list (sb->s_inode_lru.prev) if we want to skip an inode, we should put
it back at the head of the list instead of the tail, otherwise we will
keep spinning on it.

Discovered when investigating why prune_icache_sb came top in perf
reports of a swapping load.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v3.2+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-13 16:13:55 -07:00
Josef Bacik 4bc4bee459 Btrfs: make sure nbytes are right after log replay
While trying to track down a tree log replay bug I noticed that fsck was always
complaining about nbytes not being right for our fsynced file.  That is because
the new fsync stuff doesn't wait for ordered extents to complete, so the inodes
nbytes are not necessarily updated properly when we log it.  So to fix this we
need to set nbytes to whatever it is on the inode that is on disk, so when we
replay the extents we can just add the bytes that are being added as we replay
the extent.  This makes it work for the case that we have the wrong nbytes or
the case that we logged everything and nbytes is actually correct.  With this
I'm no longer getting nbytes errors out of btrfsck.

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-04-13 07:35:06 -04:00
Linus Torvalds 0b1fd266bf Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fix from Steve French:
 "Fixes a regression in cifs in which a password which begins with a
  comma is parsed incorrectly as a blank password"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Allow passwords which begin with a delimitor
2013-04-12 15:18:20 -07:00
Thomas Gleixner f2530dc71c kthread: Prevent unpark race which puts threads on the wrong cpu
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0	      	     	 CPU1
Bring up CPU2
create_thread(T)
park(T)
 wait_for_completion()
			 parkme()
			 complete()
sched_set_stop_task()
			 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-12 14:18:43 +02:00
Sachin Prabhu c369c9a4a7 cifs: Allow passwords which begin with a delimitor
Fixes a regression in cifs_parse_mount_options where a password
which begins with a delimitor is parsed incorrectly as being a blank
password.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-04-10 15:54:14 -05:00
Linus Torvalds 51de017007 NFS client bugfixes for Linux 3.9
- Fix a brain fart in nfs41_walk_client_list
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRZZq5AAoJEGcL54qWCgDypU8P/0daWpe+a8TNpXDA0KdYZKYN
 KNXvZkNNk/TtSiQo5gPzRnD4CgZIZ4n+EX9U94gmdNr/UQz7xiL+bHZY4zFtQ574
 i+QMiLbf687anY7vLBL1eKOhKHeBMoIrk2G3iineEUhfzF97cqtgqIou1pSS/BCa
 2kk/w/LRWPOaMpr802y2p9R/mejRtDbTIwaPURTKA3Pw+odwiVib3FXMIoXDI5Iq
 QzH2fl+Q0me/Z2c5Y+KRs5X3gY1MWdhpZUbEpKy3iLAxlgl3gfp7Mxpb61dw5gBz
 Jl2F1lDOzYmU1Uqe88G7w38RnBD0Q7RWtlQzZFMeIQsk1TqPsx9ymFRxaZu1Q6HZ
 +hdpfVsFDhGNTvLZF4YSP4c7AS9s1yEj8erT8Ro90Ar/PuZi15N6HpDzHHAiIQWK
 HsqSLQBrW24cFk2Ybed7YVcFdNxHdR3DDYVVstodnhIw9VwDSvQfPBlhlPqF+Q/9
 onnAMsc6SqHnLhFV7yCF6tB0Of4ZPO0oIeW8C0Hrxo+sPly03BvasAvaSWr3uheh
 wqEtawNm9QQVMdWSA1hA0LV6P887yTRXruT83uC14doPlz5g0hxlvAZQfDC3Ld3J
 ae4HARv3LLFj7Dk9/9yyM6FELyTIe8YvqvH8u9QenPQEmW0VlaPVp73vPEhL5yPA
 TxWSJtquxq5ajpH5lBeI
 =G1ZG
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.9-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull another nfs fixlet from Trond Myklebust:
 "I suddenly noticed that a one-line issue that I _thought_ I had fixed
  with the nfs41_walk_client_list patch was apparently still there in
  the pull request I sent earlier today.  I'm very sorry for not
  catching that in time.

   - Fix a brain fart in nfs41_walk_client_list"

* tag 'nfs-for-3.9-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Doh! Typo in the fix to nfs41_walk_client_list
2013-04-10 10:26:49 -07:00
Trond Myklebust eb04e0ac19 NFSv4: Doh! Typo in the fix to nfs41_walk_client_list
Make sure that we set the status to 0 on success. Missed in testing
because it never appears when doing multiple mounts to _different_
servers.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: <stable@vger.kernel.org> # 3.7.x: 7b1f1fd: NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list
2013-04-10 12:57:29 -04:00
Linus Torvalds f94eeb423b NFS client bugfixes for Linux 3.9
- Stable fix for memory corruption issues in nfs4[01]_walk_client_list
 - Stable fix for an Oopsable bug in rpc_clone_client
 - Another state manager deadlock in the NFSv4 open code
 - Memory leaks in nfs4_discover_server_trunking and rpc_new_client
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRZYu9AAoJEGcL54qWCgDySfwP/R2IdO2nfRzmDCPtvD6pPg8T
 l8Gf97Z/8A3g6WwfvmKNt48D1fKnhAcOaKTZQIZuZePAjI/Yy74DFMof6paiDmsO
 8hMcZgvunZotPwmBmhIwmLOxDYgbpdizDBlITsimnUQLrv78bMw2F/cNCcThYgTI
 Q4sNpZsl4kk1nmOYK/tGBCCkq6mIQhc95QeQPgnl2B/NozpZiIqgzrpWpSWMofn2
 cuSLiuEdmpCdJbgQaPEjSWf+doo/nBn720+Xj2RjmLhTTnWUtAsouElAdMs96Jjz
 cEhSll3nLIygr1xdFF7CD8qFjpbtg/YNhKw3HBCFAgHjrAjr+a3N+eHQOz9QQ6W4
 5OL3Mj0VEkvMrK1Sy76smynQJMJhrsn852Zo2wK2mCp+mHNZlBlML529Y4PJy2Ba
 Up4MteIaOTpKGSnBdzWmqPqro9glqlhrUk/o3XipCzIziWC8yDYjl2J9Ez8B7Ren
 uzvBeevYRX9AmQlmZUAPvx8+xVqA6cr0X2q8/6PqPnrNXP6Ff8+rm6gvH4VozyzJ
 qd/r7Bf1ozFXxoKQOztSiGjI5YiBp4DRXycR5td6eF3nZJipmbxY+WKllhaAakn6
 UY2NsGX2zfxkJMltqd2/xRmHtN+Eif1Uoo35pvzNxzBtPsRxBMIiPhGLglQu98Yj
 2NuwfT4//UNfS6JlBe6E
 =kBf2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.9-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - fix for memory corruption issues in nfs4[01]_walk_client_list (stable)
 - fix for an Oopsable bug in rpc_clone_client (stable)
 - another state manager deadlock in the NFSv4 open code
 - memory leaks in nfs4_discover_server_trunking and rpc_new_client

* tag 'nfs-for-3.9-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix another potential state manager deadlock
  SUNRPC: Fix a potential memory leak in rpc_new_client
  NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list
  NFSv4: Fix a memory leak in nfs4_discover_server_trunking
  SUNRPC: Remove extra xprt_put()
2013-04-10 09:00:51 -07:00
Linus Torvalds e8f2b548de Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A nasty bug in fs/namespace.c caught by Andrey + a couple of less
  serious unpleasantness - ecryptfs misc device playing hopeless games
  with try_module_get() and palinfo procfs support being...  not quite
  correctly done, to be polite."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mnt: release locks on error path in do_loopback
  palinfo fixes
  procfs: add proc_remove_subtree()
  ecryptfs: close rmmod race
2013-04-09 12:22:49 -07:00
Andrey Vagin e9c5d8a562 mnt: release locks on error path in do_loopback
do_loopback calls lock_mount(path) and forget to unlock_mount
if clone_mnt or copy_mnt fails.

[   77.661566] ================================================
[   77.662939] [ BUG: lock held when returning to user space! ]
[   77.664104] 3.9.0-rc5+ #17 Not tainted
[   77.664982] ------------------------------------------------
[   77.666488] mount/514 is leaving the kernel with locks still held!
[   77.668027] 2 locks held by mount/514:
[   77.668817]  #0:  (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff811cca22>] lock_mount+0x32/0xe0
[   77.671755]  #1:  (&namespace_sem){+++++.}, at: [<ffffffff811cca3a>] lock_mount+0x4a/0xe0

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:09:50 -04:00
Al Viro 8ce584c741 procfs: add proc_remove_subtree()
just what it sounds like; do that only to procfs subtrees you've
created - doing that to something shared with another driver is
not only antisocial, but might cause interesting races with
proc_create() and its ilk.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:09:17 -04:00
Al Viro 52f21999c7 ecryptfs: close rmmod race
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:08:16 -04:00
Trond Myklebust fa332941c0 NFSv4: Fix another potential state manager deadlock
Don't hold the NFSv4 sequence id while we check for open permission.
The call to ACCESS may block due to reboot recovery.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-04-09 13:19:35 -04:00
Trond Myklebust 7b1f1fd184 NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list
It is unsafe to use list_for_each_entry_safe() here, because
when we drop the nn->nfs_client_lock, we pin the _current_ list
entry and ensure that it stays in the list, but we don't do the
same for the _next_ list entry. Use of list_for_each_entry() is
therefore the correct thing to do.

Also fix the refcounting in nfs41_walk_client_list().

Finally, ensure that the nfs_client has finished being initialised
and, in the case of NFSv4.1, that the session is set up.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-04-05 16:59:19 -04:00