This happens because writeback_inodes_sb_nr_if_idle does down_read. This
doesn't work for us and it has not been fixed upstream yet, so do it
ourselves and use that instead so we can stop having this stupid long
standing lockup. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to release the reserved space in the error path of delalloc
reservatiom, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When committing a transaction, we may bail out of running delayed refs
due to ENOSPC, and then abort the current transaction to flip into readonly.
But we'll hit a deadlock on ref head's lock since we forget to release
its lock and other cleanup stuff.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an aborted
transaction, we return ENOSPC properly to the person trying to create the
snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The comment is not coincident with the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
"Whether" is misspelled in various comments across the tree; this
fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
I don't think we have the same problem that this was supposed to fix
originally since we can allocate chunks in the enospc path now. This code
is causing us to constantly commit the transaction as we get close to using
all of our available space in our currently allocated chunks, instead of
allocating another chunk and carrying on with life, which is not nice for
performance. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Running delayed refs is faster than running delalloc, so lets do that first
to try and reclaim space. This makes my fs_mark test about 20% faster.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.
Signed-off-by: David Sterba <dsterba@suse.cz>
Everybody is just making stuff up, and it's just used to see if we really do
need to alloc a chunk, and since we do this when we already know we really
do it's just a waste of space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So we have lots of places where we try to preallocate chunks in order to
make sure we have enough space as we make our allocations. This has
historically meant that we're constantly tweaking when we should allocate a
new chunk, and historically we have gotten this horribly wrong so we way
over allocate either metadata or data. To try and keep this from happening
we are going to make it so that the block group item insertion is done out
of band at the end of a transaction. This will allow us to create chunks
even if we are trying to make an allocation for the extent tree. With this
patch my enospc tests run faster (didn't expect this) and more efficiently
use the disk space (this is what I wanted). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed I was seeing large lags when running my torrent test in a vm on my
laptop. While trying to make it lag less I noticed that our overcommit math
was taking into account the number of bytes we wanted to reclaim, not the
number of bytes we actually wanted to allocate, which means we wouldn't
overcommit as often. This patch fixes the overcommit math and makes
shrink_delalloc() use that logic so that it will stop looping faster. We
still have pretty high spikes of latency, but the test now takes 3 minutes
less time (about 5% faster). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Mitch reported a problem where you could get an ENOSPC error when untarring
a kernel git tree onto a 16gb file system with compress-force=zlib. This is
because compression is a huge pain, it will return from ->writepages()
without having actually created any ordered extents. To get around this we
check to see if the async submit counter is up, and if it is wait until it
drops to 0 before doing our normal ordered wait dance. With this patch I
can now untar a kernel git tree onto a 16gb file system without getting
ENOSPC errors. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We should insert/update 6 items(root ref, root backref, dir item, dir index,
root item and parent inode) when creating a snapshot, not 5 items, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Swinging this pendulum back the other way. We've been allocating chunks up
to 2% of the disk no matter how much we actually have allocated. So instead
fix this calculation to only allocate chunks if we have more than 80% of the
space available allocated. Please test this as it will likely cause all
sorts of ENOSPC problems to pop up suddenly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
Basically what happens is the balance relocates a tree block which will drop
the implicit refs for all of its children and adds a full backref. Once the
block is relocated we have to add the implicit refs back, so when we cow the
block again we add the implicit refs for its children back. The problem
comes when the original drop ref doesn't get run before we add the implicit
refs back. The delayed ref stuff will specifically prefer ADD operations
over DROP to keep us from freeing up an extent that will have references to
it, so we try to add the implicit ref before it is actually removed and we
panic. This worked fine before because the add would have just canceled the
drop out and we would have been fine. But the backref walking work needs to
be able to freeze the delayed ref stuff in time so we have this ever
increasing sequence number that gets attached to all new delayed ref updates
which makes us not merge refs and we run into this issue.
So to fix this we need to merge delayed refs. So everytime we run a
clustered ref we need to try and merge all of its delayed refs. The backref
walking stuff locks the delayed ref head before processing, so if we have it
locked we are safe to merge any refs inside of the sequence number. If
there is no sequence number we can merge all refs. Doing this not only
fixes our bug but keeps the delayed ref code from adding and removing
useless refs and batching together multiple refs into one search instead of
one search per delayed ref, which will really help our commit times. I ran
this with Daniels test and 276 and I haven't seen any problems. Thanks,
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With commit
commit d1270cd91f
Author: Arne Jansen <sensille@gmx.net>
Date: Tue Sep 13 15:16:43 2011 +0200
Btrfs: put back delayed refs that are too new
I added a window where the delayed_ref's head->ref_mod code can diverge
from the sum of the remaining refs, because we release the head->mutex
in the middle. This leads to btrfs_lookup_extent_info returning wrong
numbers. This patch fixes this by adjusting the head's ref_mod with each
delayed ref we run.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Arne was complaining about the space cache having mismatching generation
numbers when debugging a deadlock. This is because we can run out of space
in our preallocated range for our space cache if you have a pretty
fragmented amount of space in your pinned space. So just increase the
amount of space we preallocate for space cache so we can be sure to have
enough space. This will only really affect data ranges since their the only
chunks that end up larger than 256MB. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For backref walking, we've introduce delayed ref's sequence. However,
it changes our preallocation behavior.
The story is that when we preallocate an extent and then mark it written
piece by piece, the ideal case should be that we don't need to COW the
extent, which is why we use 'preallocate'.
But we may not make use of preallocation, since when we check for cross refs on
the extent, we may have two ref entries which have the same content except
the sequence value, and we recognize them as cross refs and do COW to allocate
another extent.
So we end up with several pieces of space instead of an whole extent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.
This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
Signed-off-by: Li Zefan <lizefan@huawei.com>
Block group has ro attributes, make dump_space_info show it.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Here is the whole story:
1)
A free space cache consists of two parts:
o free space cache inode, which is special becase it's stored in root tree.
o free space info, which is stored as the above inode's file data.
But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.
2)
We can set a block group readonly when we relocate the block group.
However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.
3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.
4)
However, it's not over, there is still another case, nospace_cache.
With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.
We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
During disk balance, we prealloc new file extent for file data relocation,
but we may fail in 'no available space' case, and it leads to flipping btrfs
into readonly.
It is not necessary to bail out and abort transaction since we do have several
ways to rescue ourselves from ENOSPC case.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc. What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time. So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation. This makes xfstests 224 go much faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv. Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work. The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit. By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often. This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.
This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes. This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes. Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed. So add the
ability to expclitly flush delayed inodes to free up space. Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Like block reserves, reserve a small piece of space on each
transaction start and for delalloc. These are the hooks that
can actually return EDQUOT to the user.
The amount of space reserved is tracked in the transaction
handle.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Normally delayed refs get processed in ascending bytenr order. This
correlates in most cases to the order added. To expose dependencies
on this order, we start to process the tree in the middle instead of
the beginning.
This code is only effective when SCRAMBLE_DELAYED_REFS is defined.
Signed-off-by: Arne Jansen <sensille@gmx.net>
We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.
Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
We track two conditions to decide if we should sleep while waiting for more
delayed refs, the number of delayed refs (num_refs) and the first entry in
the list of blockers (first_seq).
When we suspect staleness, we save num_refs and do one more cycle. If
nothing changes, we then save first_seq for later comparison and do
wait_event. We ought to save first_seq the very same moment we're saving
num_refs. Otherwise we cannot be sure that nothing has changed and we might
start waiting when we shouldn't, which could lead to starvation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right. Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting. So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.
Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
It confuses Smatch that we use two names for the same lock. Plus the
shorter name is nicer. This doesn't change how the code works, it's
just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.
But, a few callers are using it with spinlocks held. Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case. This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
may_commit_transaction() calls
spin_lock(&space_info->lock);
spin_lock(&delayed_rsv->lock);
and update_global_block_rsv() calls
spin_lock(&block_rsv->lock);
spin_lock(&sinfo->lock);
Lockdep complains about this at run time.
Everywhere except in update_global_block_rsv(), the space_info lock is
the outer lock, therefore the locking order in update_global_block_rsv()
is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is basically a good thing if we are interruptible when waiting for
free space, but the generality in which it is implemented currently
leads to system calls being interruptible that are not documented this
way. For example git can't handle interrupted unlink(), leading to
corrupt repos under space pressure.
Instead we raise the bar to only be interruptible by SIGKILL.
Thanks to David Sterba for suggesting this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
The caller expects this function to return with the lock held and
releases it immediately on error.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
A user reported that booting his box up with btrfs root on 3.4 was way
slower than on 3.3 because I removed the ideal caching code. It turns out
that we don't load the free space cache if we're in a commit for deadlock
reasons, but since we're reading the cache and it hasn't changed yet we are
safe reading the inode and free space item from the commit root, so do that
and remove all of the deadlock checks so we don't unnecessarily skip loading
the free space cache. The user reported this fixed the slowness. Thanks,
Tested-by: Calvin Walton <calvin.walton@kepstin.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a regression introduced by fc67c450. spin_is_locked() always
returns 0 on UP kernels, which caused assert in get_restripe_target() to
be fired on every call from btrfs_reduce_alloc_profile() on UP systems.
Remove it completely for now, it's not clear if it's going to be needed
in future.
Reported-by: Bobby Powers <bobbypowers@gmail.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reverts commit 5500cdbe14.
We've had a number of complaints of early enospc that bisect down
to this patch. We'll hae to fix the reservations differently.
CC: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This deadlock comes from xfstests 251.
We'll hold the chunk_mutex throughout the whole of a chunk allocation.
But if we find that we've used up system chunk space, we need to allocate a
new system chunk, but this will lead to a recursion of chunk allocation and end
up with a deadlock on chunk_mutex.
So instead we need to allocate the system chunk first if we find we're in ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
o For space info, the type of space info is useful for debug.
o For transaction handle, its transid is useful.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently if we don't have enough space allocated we go ahead and loop
though devices in the hopes of finding enough space for a chunk of the
*same* type as the one we are trying to relocate. The problem with that
is that if we are trying to restripe the chunk its target type can be
more relaxed than the current one (eg require less devices or less
space). So, when restriping, run checks against the target profile
instead of the current one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add __get_block_group_index() helper to be able to derive block group
index from an arbitary set of flags. Implement get_block_group_index()
in terms of it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Header file is not a good place to define functions. This also moves a
call to alloc_profile_is_valid() down the stack and removes a redundant
check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it
into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
"0" is a valid value for an on-disk chunk profile, but it is not a valid
extended profile. (We have a separate bit for single chunks in extended
case)
Also rename it to alloc_profile_is_valid() for clarity.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add functions to abstract the conversion between chunk and extended
allocation profile formats and switch everybody to use them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This has been causing a lot of confusion for quite a while now and a lot
of users were surprised by this (some of them were even stuck in a
ENOSPC situation which they couldn't easily get out of). The addition
of restriper gives users a clear choice between raid0 and drive concat
setup so there's absolutely no excuse for us to keep doing this.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory. So instead of evicting these pages, we
could end up evicting things we actually care about. Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks. This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been passing nothing but (u64)-1 to find_free_extent for search_end in
all of the callers, so it's completely useless, and we've always been passing 0
in as search_start, so just remove them as function arguments and move
search_start into find_free_extent. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a relic from before we had the disk space cache and it was to make
bootup times when you had btrfs as root not be so damned slow. Now that we have
the disk space cache this isn't a problem anymore and really having this code
casues uneeded fragmentation and complexity, so just remove it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.
This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Commit cb1b69f4 (Btrfs: forced readonly when btrfs_drop_snapshot() fails)
made btrfs_drop_snapshot return void because there were no callers checking
the return value. That is the wrong order to handle error propogation since
the caller will have no idea that an error has occured and continue on
as if nothing went wrong.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
When doing IO with large amounts of data fragmentation, the global block
reserve calulations are too low. This increases them to avoid
ENOSPC crashes.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a bug of btrfs's trim, that is we will trim 0 bytes
after a device delete.
The reproducer:
$ mkfs.btrfs disk1
$ mkfs.btrfs disk2
$ mount disk1 /mnt
$ fstrim -v /mnt
$ btrfs device add disk2 /mnt
$ btrfs device del disk1 /mnt
$ fstrim -v /mnt
This is because after we delete the device, the block group may start from
a non-zero place, which will confuse trim to discard nothing.
Reported-by: Lutz Euler <lutz.euler@freenet.de>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
system chunks by default are very small. This makes them slightly
larger and also fixes the conditional checks to make sure we don't
allocate a billion of them at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and theres no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead. This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This in addition to a script in my btrfs-tracing tree will help track down space
leaks when we're getting space left over in block groups on umount. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I used these tracepoints when figuring out what the cluster stuff was doing, so
add them to mainline in case we need to profile this stuff again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized. Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce. If target profile is not yet available it goes
back to a plain reduce.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time. Instead check the
validity of the passed profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently when new chunks are created respective avail_alloc_bits field
is updated to reflect profiles of all chunks present in the system.
However when chunks are removed profile bits are never cleared.
This patch clears profile bit of respective avail_alloc_bits field when
the last chunk with that profile is removed. Restriper needs this to
properly operate when "downgrading".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.
This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Chunk's type and profile are encoded in u64 flags field. Introduce
masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A bug was triggered while using seed device:
# mkfs.btrfs /dev/loop1
# btrfstune -S 1 /dev/loop1
# mount -o /dev/loop1 /mnt
# btrfs dev add /dev/loop2 /mnt
btrfs: block rsv returned -28
------------[ cut here ]------------
WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]()
...
Call Trace:
...
[<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs]
[<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs]
[<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs]
[<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs]
[<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs]
[<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs]
[<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs]
[<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs]
[<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs]
[<c04f55f7>] do_vfs_ioctl+0x496/0x4cc
[<c04f5660>] sys_ioctl+0x33/0x4f
[<c07b9edf>] sysenter_do_call+0x12/0x38
---[ end trace 906adac595facc7d ]---
Since seed device is readonly, there's no usable space in the filesystem.
Afterwards we add a sprout device to it, and the kernel creates a METADATA
block group and a SYSTEM block group where comes free space we can reserve,
but we still get revervation failure because the global block_rsv hasn't
been updated accordingly.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler. Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>