// SPDX-License-Identifier: GPL-2.0

#include "misc.h"
#include "ctree.h"
#include "space-info.h"
#include "sysfs.h"
#include "volumes.h"
#include "free-space-cache.h"
#include "ordered-data.h"
#include "transaction.h"
#include "block-group.h"

/*
 * HOW DOES SPACE RESERVATION WORK
 *
 * If you want to know about delalloc specifically, there is a separate comment
 * for that with the delalloc code. This comment is about how the whole system
 * works generally.
 *
 * BASIC CONCEPTS
 *
 *   1) space_info. This is the ultimate arbiter of how much space we can use.
 *   There's a description of the bytes_ fields with the struct declaration,
 *   refer to that for specifics on each field. Suffice it to say that for
 *   reservations we care about total_bytes - SUM(space_info->bytes_) when
 *   determining if there is space to make an allocation. There is a space_info
 *   for METADATA, SYSTEM, and DATA areas.
 *
 *   2) block_rsv's. These are basically buckets for every different type of
 *   metadata reservation we have. You can see the comment in the block_rsv
 *   code on the rules for each type, but generally block_rsv->reserved is how
 *   much space is accounted for in space_info->bytes_may_use.
 *
 *   3) btrfs_calc*_size. These are the worst case calculations we use, based
 *   on the number of items we will want to modify. We have one for changing
 *   items, and one for inserting new items. Generally we use these helpers to
 *   determine the size of the block reserves, and then use the actual bytes
 *   values to adjust the space_info counters.
 *
 * MAKING RESERVATIONS, THE NORMAL CASE
 *
 *   We call into either btrfs_reserve_data_bytes() or
 *   btrfs_reserve_metadata_bytes(), depending on which we're looking for, with
 *   num_bytes we want to reserve.
 *
 *   ->reserve
 *     space_info->bytes_may_use += num_bytes
 *
 *   ->extent allocation
 *     Call btrfs_add_reserved_bytes() which does
 *     space_info->bytes_may_use -= num_bytes
 *     space_info->bytes_reserved += extent_bytes
 *
 *   ->insert reference
 *     Call btrfs_update_block_group() which does
 *     space_info->bytes_reserved -= extent_bytes
 *     space_info->bytes_used += extent_bytes
 *
 * MAKING RESERVATIONS, FLUSHING NORMALLY (non-priority)
 *
 *   Assume we are unable to simply make the reservation because we do not have
 *   enough space
 *
 *   -> __reserve_bytes
 *     create a reserve_ticket with ->bytes set to our reservation, add it to
 *     the tail of space_info->tickets, kick async flush thread
 *
 *   ->handle_reserve_ticket
 *     wait on ticket->wait for ->bytes to be reduced to 0, or ->error to be set
 *     on the ticket.
 *
 *   -> btrfs_async_reclaim_metadata_space/btrfs_async_reclaim_data_space
 *     Flushes various things attempting to free up space.
 *
 *   -> btrfs_try_granting_tickets()
 *     This is called by anything that either subtracts space from
 *     space_info->bytes_may_use, ->bytes_pinned, etc, or adds to the
 *     space_info->total_bytes. This loops through the ->priority_tickets and
 *     then the ->tickets list checking to see if the reservation can be
 *     completed. If it can the space is added to space_info->bytes_may_use and
 *     the ticket is woken up.
 *
 *   -> ticket wakeup
 *     Check if ->bytes == 0, if so we got our reservation and we can carry
 *     on, if not return the appropriate error (ENOSPC, but can be EINTR if we
 *     were interrupted.)
 *
 * MAKING RESERVATIONS, FLUSHING HIGH PRIORITY
 *
 *   Same as the above, except we add ourselves to the
 *   space_info->priority_tickets, and we do not use ticket->wait, we simply
 *   call flush_space() ourselves for the states that are safe for us to call
 *   without deadlocking and hope for the best.
 *
 * THE FLUSHING STATES
 *
 *   Generally speaking we will have two cases for each state, a "nice" state
 *   and an "ALL THE THINGS" state. In btrfs we delay a lot of work in order to
 *   reduce the locking overhead on the various trees, and even to keep from
 *   doing any work at all in the case of delayed refs. Each of these delayed
 *   things however holds reservations, and so letting them run allows us to
 *   reclaim space so we can make new reservations.
 *
 *   FLUSH_DELAYED_ITEMS
 *     Every inode has a delayed item to update the inode. Take a simple write
 *     for example, we would update the inode item at write time to update the
 *     mtime, and then again at finish_ordered_io() time in order to update the
 *     isize or bytes. We keep these delayed items to coalesce these operations
 *     into a single operation done on demand. These are an easy way to reclaim
 *     metadata space.
 *
 *   FLUSH_DELALLOC
 *     Look at the delalloc comment to get an idea of how much space is reserved
 *     for delayed allocation. We can reclaim some of this space simply by
 *     running delalloc, but usually we need to wait for ordered extents to
 *     reclaim the bulk of this space.
 *
 *   FLUSH_DELAYED_REFS
 *     We have a block reserve for the outstanding delayed refs space, and every
 *     delayed ref operation holds a reservation. Running these is a quick way
 *     to reclaim space, but we want to hold this until the end because COW can
 *     churn a lot and we can avoid making some extent tree modifications if we
 *     are able to delay for as long as possible.
 *
 *   ALLOC_CHUNK
 *     We will skip this the first time through space reservation, because of
 *     overcommit and we don't want to have a lot of useless metadata space when
 *     our worst case reservations will likely never come true.
 *
 *   RUN_DELAYED_IPUTS
 *     If we're freeing inodes we're likely freeing checksums, file extent
 *     items, and extent tree items. Loads of space could be freed up by these
 *     operations, however they won't be usable until the transaction commits.
 *
 *   COMMIT_TRANS
 *     may_commit_transaction() is the ultimate arbiter on whether we commit the
 *     transaction or not. In order to avoid constantly churning we do all the
 *     above flushing first and then commit the transaction as the last resort.
 *     However we need to take into account things like pinned space that would
 *     be freed, plus any delayed work we may not have gotten rid of in the case
 *     of metadata.
 *
 * OVERCOMMIT
 *
 *   Because we hold so many reservations for metadata we will allow you to
 *   reserve more space than is currently free in the currently allocated
 *   metadata space. This only happens with metadata, data does not allow
 *   overcommitting.
 *
 *   You can see the current logic for when we allow overcommit in
 *   btrfs_can_overcommit(), but it only applies to unallocated space. If there
 *   is no unallocated space to be had, all reservations are kept within the
 *   free space in the allocated metadata chunks.
 *
 *   Because of overcommitting, you generally want to use the
 *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
 *   thing with or without extra unallocated space.
 */

u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
			  bool may_use_included)
{
	ASSERT(s_info);
	return s_info->bytes_used + s_info->bytes_reserved +
		s_info->bytes_pinned + s_info->bytes_readonly +
		(may_use_included ? s_info->bytes_may_use : 0);
}

/*
 * after adding space to the filesystem, we need to clear the full flags
 * on all the space infos.
 */
void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
{
	struct list_head *head = &info->space_info;
	struct btrfs_space_info *found;

	list_for_each_entry(found, head, list)
		found->full = 0;
}

static int create_space_info(struct btrfs_fs_info *info, u64 flags)
{
	struct btrfs_space_info *space_info;
	int i;
	int ret;

	space_info = kzalloc(sizeof(*space_info), GFP_NOFS);
	if (!space_info)
		return -ENOMEM;

	ret = percpu_counter_init(&space_info->total_bytes_pinned, 0,
				 GFP_KERNEL);
	if (ret) {
		kfree(space_info);
		return ret;
	}

	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
		INIT_LIST_HEAD(&space_info->block_groups[i]);
	init_rwsem(&space_info->groups_sem);
	spin_lock_init(&space_info->lock);
	space_info->flags = flags & BTRFS_BLOCK_GROUP_TYPE_MASK;
	space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
	INIT_LIST_HEAD(&space_info->ro_bgs);
	INIT_LIST_HEAD(&space_info->tickets);
	INIT_LIST_HEAD(&space_info->priority_tickets);

	ret = btrfs_sysfs_add_space_info_type(info, space_info);
	if (ret)
		return ret;

	list_add(&space_info->list, &info->space_info);
	if (flags & BTRFS_BLOCK_GROUP_DATA)
		info->data_sinfo = space_info;

	return ret;
}

int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
{
	struct btrfs_super_block *disk_super;
	u64 features;
	u64 flags;
	int mixed = 0;
	int ret;

	disk_super = fs_info->super_copy;
	if (!btrfs_super_root(disk_super))
		return -EINVAL;

	features = btrfs_super_incompat_flags(disk_super);
	if (features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS)
		mixed = 1;

	flags = BTRFS_BLOCK_GROUP_SYSTEM;
	ret = create_space_info(fs_info, flags);
	if (ret)
		goto out;

	if (mixed) {
		flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA;
		ret = create_space_info(fs_info, flags);
	} else {
		flags = BTRFS_BLOCK_GROUP_METADATA;
		ret = create_space_info(fs_info, flags);
		if (ret)
			goto out;

		flags = BTRFS_BLOCK_GROUP_DATA;
		ret = create_space_info(fs_info, flags);
	}
out:
	return ret;
}

void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
			     u64 total_bytes, u64 bytes_used,
			     u64 bytes_readonly,
			     struct btrfs_space_info **space_info)
{
	struct btrfs_space_info *found;
	int factor;

	factor = btrfs_bg_type_to_factor(flags);

	found = btrfs_find_space_info(info, flags);
	ASSERT(found);
	spin_lock(&found->lock);
	found->total_bytes += total_bytes;
	found->disk_total += total_bytes * factor;
	found->bytes_used += bytes_used;
	found->disk_used += bytes_used * factor;
	found->bytes_readonly += bytes_readonly;
	if (total_bytes > 0)
		found->full = 0;
	btrfs_try_granting_tickets(info, found);
	spin_unlock(&found->lock);
	*space_info = found;
}

struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
					       u64 flags)
{
	struct list_head *head = &info->space_info;
	struct btrfs_space_info *found;

	flags &= BTRFS_BLOCK_GROUP_TYPE_MASK;

	list_for_each_entry(found, head, list) {
		if (found->flags & flags)
			return found;
	}
	return NULL;
}

static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
			  struct btrfs_space_info *space_info,
			  enum btrfs_reserve_flush_enum flush)
{
	u64 profile;
	u64 avail;
	int factor;

	if (space_info->flags & BTRFS_BLOCK_GROUP_SYSTEM)
		profile = btrfs_system_alloc_profile(fs_info);
	else
		profile = btrfs_metadata_alloc_profile(fs_info);

	avail = atomic64_read(&fs_info->free_chunk_space);

	/*
	 * If we have dup, raid1 or raid10 then only half of the free
	 * space is actually usable. For raid56, the space info used
	 * doesn't include the parity drive, so we don't have to
	 * change the math
	 */
	factor = btrfs_bg_type_to_factor(profile);
	avail = div_u64(avail, factor);

	/*
	 * If we aren't flushing all things, let us overcommit up to
	 * 1/2 of the space. If we can flush, don't let us overcommit
	 * too much, let it overcommit up to 1/8 of the space.
	 */
	if (flush == BTRFS_RESERVE_FLUSH_ALL)
		avail >>= 3;
	else
		avail >>= 1;
	return avail;
}

int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
			 struct btrfs_space_info *space_info, u64 bytes,
			 enum btrfs_reserve_flush_enum flush)
{
	u64 avail;
	u64 used;

	/* Don't overcommit when in mixed mode */
	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
		return 0;

	used = btrfs_space_info_used(space_info, true);
	avail = calc_available_free_space(fs_info, space_info, flush);

	if (used + bytes < space_info->total_bytes + avail)
		return 1;
	return 0;
}

static void remove_ticket(struct btrfs_space_info *space_info,
			  struct reserve_ticket *ticket)
{
	if (!list_empty(&ticket->list)) {
		list_del_init(&ticket->list);
		ASSERT(space_info->reclaim_size >= ticket->bytes);
		space_info->reclaim_size -= ticket->bytes;
	}
}

/*
 * This is for space we already have accounted in space_info->bytes_may_use, so
 * basically when we're returning space from block_rsv's.
 */
void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
				struct btrfs_space_info *space_info)
{
	struct list_head *head;
	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_NO_FLUSH;

	lockdep_assert_held(&space_info->lock);

	head = &space_info->priority_tickets;
again:
	while (!list_empty(head)) {
		struct reserve_ticket *ticket;
		u64 used = btrfs_space_info_used(space_info, true);

		ticket = list_first_entry(head, struct reserve_ticket, list);

		/* Check and see if our ticket can be satisfied now. */
		if ((used + ticket->bytes <= space_info->total_bytes) ||
		    btrfs_can_overcommit(fs_info, space_info, ticket->bytes,
					 flush)) {
			btrfs_space_info_update_bytes_may_use(fs_info,
							      space_info,
							      ticket->bytes);
			remove_ticket(space_info, ticket);
			ticket->bytes = 0;
			space_info->tickets_id++;
			wake_up(&ticket->wait);
		} else {
			break;
		}
	}

	if (head == &space_info->priority_tickets) {
		head = &space_info->tickets;
		flush = BTRFS_RESERVE_FLUSH_ALL;
		goto again;
	}
}

#define DUMP_BLOCK_RSV(fs_info, rsv_name)				\
do {									\
	struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;		\
	spin_lock(&__rsv->lock);					\
	btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",	\
		   __rsv->size, __rsv->reserved);			\
	spin_unlock(&__rsv->lock);					\
} while (0)

static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
				    struct btrfs_space_info *info)
{
	lockdep_assert_held(&info->lock);

	btrfs_info(fs_info, "space_info %llu has %llu free, is %sfull",
		   info->flags,
		   info->total_bytes - btrfs_space_info_used(info, true),
		   info->full ? "" : "not ");
	btrfs_info(fs_info,
		"space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu",
		info->total_bytes, info->bytes_used, info->bytes_pinned,
		info->bytes_reserved, info->bytes_may_use,
		info->bytes_readonly);

	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
}

void btrfs_dump_space_info(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *info, u64 bytes,
|
|
|
|
int dump_block_groups)
|
|
|
|
{
|
2019-10-30 02:20:18 +08:00
|
|
|
struct btrfs_block_group *cache;
|
2019-08-23 03:19:04 +08:00
|
|
|
int index = 0;
|
|
|
|
|
|
|
|
spin_lock(&info->lock);
|
|
|
|
__btrfs_dump_space_info(fs_info, info);
|
|
|
|
spin_unlock(&info->lock);
|
|
|
|
|
2019-06-19 04:09:24 +08:00
|
|
|
if (!dump_block_groups)
|
|
|
|
return;
|
|
|
|
|
|
|
|
down_read(&info->groups_sem);
|
|
|
|
again:
|
|
|
|
list_for_each_entry(cache, &info->block_groups[index], list) {
|
|
|
|
spin_lock(&cache->lock);
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s",
|
2019-10-24 00:48:22 +08:00
|
|
|
cache->start, cache->length, cache->used, cache->pinned,
|
2019-06-19 04:09:24 +08:00
|
|
|
cache->reserved, cache->ro ? "[readonly]" : "");
|
|
|
|
spin_unlock(&cache->lock);
|
btrfs: fix lockdep splat from btrfs_dump_space_info
When running with -o enospc_debug you can get the following splat if one
of the dump_space_info's trip
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc5+ #20 Tainted: G OE
------------------------------------------------------
dd/563090 is trying to acquire lock:
ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]
but task is already holding lock:
ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&cache->lock){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
find_free_extent+0x7ef/0x13b0 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x530 [btrfs]
btrfs_cow_block+0x106/0x210 [btrfs]
commit_cowonly_roots+0x55/0x300 [btrfs]
btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x36/0x70
cleanup_mnt+0x104/0x160
task_work_run+0x5f/0x90
__prepare_exit_to_usermode+0x1bd/0x1c0
do_syscall_64+0x5e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #2 (&space_info->lock){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
clear_state_bit+0x81/0x1a0 [btrfs]
__clear_extent_bit+0x25c/0x5d0 [btrfs]
clear_extent_bit+0x15/0x20 [btrfs]
btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
truncate_cleanup_page+0x47/0xe0
truncate_inode_pages_range+0x238/0x840
truncate_pagecache+0x44/0x60
btrfs_setattr+0x202/0x5e0 [btrfs]
notify_change+0x33b/0x490
do_truncate+0x76/0xd0
path_openat+0x687/0xa10
do_filp_open+0x91/0x100
do_sys_openat2+0x215/0x2d0
do_sys_open+0x44/0x80
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&tree->lock#2){+.+.}-{2:2}:
_raw_spin_lock+0x25/0x30
find_first_extent_bit+0x32/0x150 [btrfs]
write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
__btrfs_write_out_cache+0x172/0x480 [btrfs]
btrfs_write_out_cache+0x7a/0xf0 [btrfs]
btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
commit_cowonly_roots+0x245/0x300 [btrfs]
btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
close_ctree+0xf9/0x2f5 [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x36/0x70
cleanup_mnt+0x104/0x160
task_work_run+0x5f/0x90
__prepare_exit_to_usermode+0x1bd/0x1c0
do_syscall_64+0x5e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&ctl->tree_lock){+.+.}-{2:2}:
__lock_acquire+0x1240/0x2460
lock_acquire+0xab/0x360
_raw_spin_lock+0x25/0x30
btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_space_info+0xf4/0x120 [btrfs]
btrfs_reserve_extent+0x176/0x180 [btrfs]
__btrfs_prealloc_file_range+0x145/0x550 [btrfs]
cache_save_setup+0x28d/0x3b0 [btrfs]
btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
btrfs_commit_transaction+0xcc/0xac0 [btrfs]
btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
btrfs_file_write_iter+0x3cf/0x610 [btrfs]
new_sync_write+0x11e/0x1b0
vfs_write+0x1c9/0x200
ksys_write+0x68/0xe0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&ctl->tree_lock --> &space_info->lock --> &cache->lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&cache->lock);
lock(&space_info->lock);
lock(&cache->lock);
lock(&ctl->tree_lock);
*** DEADLOCK ***
6 locks held by dd/563090:
#0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
#1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
#2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
#3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
#4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
#5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]
stack backtrace:
CPU: 3 PID: 563090 Comm: dd Tainted: G OE 5.8.0-rc5+ #20
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
Call Trace:
dump_stack+0x96/0xd0
check_noncircular+0x162/0x180
__lock_acquire+0x1240/0x2460
? wake_up_klogd.part.0+0x30/0x40
lock_acquire+0xab/0x360
? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
_raw_spin_lock+0x25/0x30
? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_free_space+0x2b/0xa0 [btrfs]
btrfs_dump_space_info+0xf4/0x120 [btrfs]
btrfs_reserve_extent+0x176/0x180 [btrfs]
__btrfs_prealloc_file_range+0x145/0x550 [btrfs]
? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
cache_save_setup+0x28d/0x3b0 [btrfs]
btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
btrfs_commit_transaction+0xcc/0xac0 [btrfs]
? start_transaction+0xe0/0x5b0 [btrfs]
btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
? ktime_get_coarse_real_ts64+0xa8/0xd0
? trace_hardirqs_on+0x1c/0xe0
btrfs_file_write_iter+0x3cf/0x610 [btrfs]
new_sync_write+0x11e/0x1b0
vfs_write+0x1c9/0x200
ksys_write+0x68/0xe0
do_syscall_64+0x52/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This is because we're holding the block_group->lock while trying to dump
the free space cache. However we don't need this lock, we just need it
to read the values for the printk, so move the free space cache dumping
outside of the block group lock.
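The pattern behind this fix can be sketched in userspace: hold the group lock only while reading the values that get printed, drop it, and only then call the helper that takes the other lock. This is a minimal single-threaded illustration, not the btrfs code; `demo_lock`, `demo_group`, `demo_dump_free_space` and `demo_dump_group` are hypothetical names, and the toy lock merely records whether it is held.

```c
#include <assert.h>
#include <stdio.h>

/* Toy single-threaded lock that only records whether it is held. */
struct demo_lock {
	int held;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical stand-in for a block group. */
struct demo_group {
	struct demo_lock lock;		/* plays the role of cache->lock */
	struct demo_lock tree_lock;	/* plays the role of ctl->tree_lock */
	unsigned long long used;
	unsigned long long free_extents;
};

/* Takes tree_lock, so it must not be entered with the group lock held. */
static unsigned long long demo_dump_free_space(struct demo_group *g)
{
	unsigned long long n;

	assert(!g->lock.held);
	demo_lock_acquire(&g->tree_lock);
	n = g->free_extents;
	demo_lock_release(&g->tree_lock);
	return n;
}

static unsigned long long demo_dump_group(struct demo_group *g)
{
	unsigned long long used;

	/* Hold the group lock only while reading the values to print. */
	demo_lock_acquire(&g->lock);
	used = g->used;
	demo_lock_release(&g->lock);

	printf("block group has %llu used\n", used);

	/* The free space dump runs after the group lock is dropped. */
	return demo_dump_free_space(g);
}
```

Because the two locks are never held at the same time on this path, the path adds no ordering between them for lockdep to chain.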
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-18 03:12:29 +08:00
		btrfs_dump_free_space(cache, bytes);
	}
	if (++index < BTRFS_NR_RAID_TYPES)
		goto again;
	up_read(&info->groups_sem);
}

static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info,
					u64 to_reclaim)
{
	u64 bytes;
	u64 nr;

	bytes = btrfs_calc_insert_metadata_size(fs_info, 1);
	nr = div64_u64(to_reclaim, bytes);
	if (!nr)
		nr = 1;
	return nr;
}
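A worked version of the arithmetic in calc_reclaim_items_nr(), as a userspace sketch. The worst-case size of a single insertion is not shown in this file, so the formula below (node size times maximum tree height times two) and the constants `DEMO_NODESIZE` and `DEMO_MAX_LEVEL` are assumptions chosen for illustration; only the divide-and-clamp step mirrors the function above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define DEMO_NODESIZE	16384ULL	/* a 16 KiB node size */
#define DEMO_MAX_LEVEL	8ULL		/* assumed maximum tree height */

/* Assumed worst case for inserting items: two nodes per level per item. */
static uint64_t demo_insert_metadata_size(uint64_t num_items)
{
	return DEMO_NODESIZE * DEMO_MAX_LEVEL * 2 * num_items;
}

/* Mirrors calc_reclaim_items_nr(): integer division, never less than 1. */
static uint64_t demo_reclaim_items_nr(uint64_t to_reclaim)
{
	uint64_t bytes = demo_insert_metadata_size(1);
	uint64_t nr = to_reclaim / bytes;

	return nr ? nr : 1;
}
```

Under these assumptions one item costs 256 KiB, so a 1 MiB reclaim target maps to 4 items, and any target below 256 KiB still maps to 1 item.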

#define EXTENT_SIZE_PER_ITEM	SZ_256K

/*
 * shrink metadata reservation for delalloc
 */
static void shrink_delalloc(struct btrfs_fs_info *fs_info,
			    struct btrfs_space_info *space_info,
			    u64 to_reclaim, bool wait_ordered)
{
	struct btrfs_trans_handle *trans;
	u64 delalloc_bytes;
	u64 dio_bytes;
	u64 items;
	long time_left;
	int loops;

	/* Calculate the number of pages to flush for the space reservation. */
	if (to_reclaim == U64_MAX) {
		items = U64_MAX;
	} else {
		/*
		 * to_reclaim is set to however much metadata we need to
		 * reclaim, but reclaiming that much data doesn't really track
		 * exactly, so increase the amount to reclaim by 2x in order to
		 * make sure we're flushing enough delalloc to hopefully reclaim
		 * some metadata reservations.
		 */
		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
		to_reclaim = items * EXTENT_SIZE_PER_ITEM;
	}

	trans = (struct btrfs_trans_handle *)current->journal_info;

	delalloc_bytes = percpu_counter_sum_positive(
						&fs_info->delalloc_bytes);
	dio_bytes = percpu_counter_sum_positive(&fs_info->dio_bytes);
	if (delalloc_bytes == 0 && dio_bytes == 0) {
		if (trans)
			return;
		if (wait_ordered)
			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
		return;
	}

	/*
	 * If we are doing more ordered than delalloc we need to just wait on
	 * ordered extents, otherwise we'll waste time trying to flush delalloc
	 * that likely won't give us the space back we need.
	 */
	if (dio_bytes > delalloc_bytes)
		wait_ordered = true;

	loops = 0;
	while ((delalloc_bytes || dio_bytes) && loops < 3) {
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
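The skip-flag approach described above can be sketched as follows. This is a simplified userspace model, not the btrfs implementation: `demo_inode`, `DEMO_INODE_NO_DELALLOC_FLUSH` and `demo_start_delalloc_inodes` are hypothetical names, and "flushing" is reduced to clearing a dirty marker.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical runtime flag bit; the real flag name is not shown here. */
#define DEMO_INODE_NO_DELALLOC_FLUSH	(1UL << 0)

struct demo_inode {
	unsigned long runtime_flags;
	int dirty;	/* has delalloc waiting to be flushed */
};

/* Flush every dirty inode except those marked as not flushable. */
static size_t demo_start_delalloc_inodes(struct demo_inode *inodes, size_t n)
{
	size_t flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (inodes[i].runtime_flags & DEMO_INODE_NO_DELALLOC_FLUSH)
			continue;	/* clone in progress, leave it alone */
		if (inodes[i].dirty) {
			inodes[i].dirty = 0;
			flushed++;
		}
	}
	return flushed;
}
```

In this model the cloning task would set the flag before dirtying the page and clear it once the clone finishes, after which the inode is picked up by the next flush pass.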
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 19:55:58 +08:00
		btrfs_start_delalloc_roots(fs_info, items, true);

		loops++;
		if (wait_ordered && !trans) {
			btrfs_wait_ordered_roots(fs_info, items, 0, (u64)-1);
		} else {
			time_left = schedule_timeout_killable(1);
			if (time_left)
				break;
		}

		spin_lock(&space_info->lock);
		if (list_empty(&space_info->tickets) &&
		    list_empty(&space_info->priority_tickets)) {
			spin_unlock(&space_info->lock);
			break;
		}
		spin_unlock(&space_info->lock);

		delalloc_bytes = percpu_counter_sum_positive(
						&fs_info->delalloc_bytes);
		dio_bytes = percpu_counter_sum_positive(&fs_info->dio_bytes);
	}
}

/**
 * may_commit_transaction - possibly commit the transaction if it's ok to
 * @fs_info - the filesystem we're reserving space on
 * @space_info - the space_info whose pending tickets we're trying to satisfy
 *
 * This will check to make sure that committing the transaction will actually
 * get us somewhere and then commit the transaction if it does. Otherwise it
 * will return -ENOSPC.
 */
static int may_commit_transaction(struct btrfs_fs_info *fs_info,
				  struct btrfs_space_info *space_info)
{
	struct reserve_ticket *ticket = NULL;
	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
	struct btrfs_block_rsv *trans_rsv = &fs_info->trans_block_rsv;
	struct btrfs_trans_handle *trans;
	u64 reclaim_bytes = 0;
	u64 bytes_needed = 0;
	u64 cur_free_bytes = 0;

	trans = (struct btrfs_trans_handle *)current->journal_info;
	if (trans)
		return -EAGAIN;

	spin_lock(&space_info->lock);
	cur_free_bytes = btrfs_space_info_used(space_info, true);
	if (cur_free_bytes < space_info->total_bytes)
		cur_free_bytes = space_info->total_bytes - cur_free_bytes;
	else
		cur_free_bytes = 0;

	if (!list_empty(&space_info->priority_tickets))
		ticket = list_first_entry(&space_info->priority_tickets,
					  struct reserve_ticket, list);
	else if (!list_empty(&space_info->tickets))
		ticket = list_first_entry(&space_info->tickets,
					  struct reserve_ticket, list);
	if (ticket)
		bytes_needed = ticket->bytes;

	if (bytes_needed > cur_free_bytes)
		bytes_needed -= cur_free_bytes;
	else
		bytes_needed = 0;
	spin_unlock(&space_info->lock);

	if (!bytes_needed)
		return 0;

	trans = btrfs_join_transaction(fs_info->extent_root);
	if (IS_ERR(trans))
		return PTR_ERR(trans);

	/*
	 * See if there is enough pinned space to make this reservation, or if
	 * we have block groups that are going to be freed, allowing us to
	 * possibly do a chunk allocation the next loop through.
	 */
	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
	    __percpu_counter_compare(&space_info->total_bytes_pinned,
				     bytes_needed,
				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
		goto commit;

	/*
	 * See if there is some space in the delayed insertion reserve for this
	 * reservation. If the space_info's don't match (like for DATA or
	 * SYSTEM) then just go enospc, reclaiming this space won't recover any
	 * space to satisfy those reservations.
	 */
	if (space_info != delayed_rsv->space_info)
		goto enospc;

	spin_lock(&delayed_rsv->lock);
	reclaim_bytes += delayed_rsv->reserved;
	spin_unlock(&delayed_rsv->lock);

	spin_lock(&delayed_refs_rsv->lock);
	reclaim_bytes += delayed_refs_rsv->reserved;
	spin_unlock(&delayed_refs_rsv->lock);

	spin_lock(&trans_rsv->lock);
	reclaim_bytes += trans_rsv->reserved;
	spin_unlock(&trans_rsv->lock);

	if (reclaim_bytes >= bytes_needed)
		goto commit;
	bytes_needed -= reclaim_bytes;

	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
				     bytes_needed,
				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
		goto enospc;

commit:
	return btrfs_commit_transaction(trans);
enospc:
	btrfs_end_transaction(trans);
	return -ENOSPC;
}
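The decision made by may_commit_transaction() reduces to a few comparisons, which the following userspace sketch walks through. It is a simplification under stated assumptions: `demo_counters` and `demo_should_commit` are hypothetical, the per-CPU batched comparison is replaced by an exact one, and the check that the delayed reserve belongs to the same space_info is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the counters the function consults. */
struct demo_counters {
	uint64_t total_bytes;
	uint64_t used_bytes;	/* everything counted as used, incl. may_use */
	uint64_t pinned;	/* bytes a commit is expected to unpin */
	uint64_t reclaimable;	/* delayed + delayed refs + trans reserves */
	bool have_free_bgs;	/* block groups pending deletion */
};

/* Returns true when a commit looks worthwhile for a ticket of this size. */
static bool demo_should_commit(const struct demo_counters *c,
			       uint64_t ticket_bytes)
{
	uint64_t free_bytes = c->used_bytes < c->total_bytes ?
			      c->total_bytes - c->used_bytes : 0;
	uint64_t needed = ticket_bytes > free_bytes ?
			  ticket_bytes - free_bytes : 0;

	if (!needed)
		return false;	/* free space already covers the ticket */
	if (c->have_free_bgs || c->pinned >= needed)
		return true;
	if (c->reclaimable >= needed)
		return true;
	/* Whatever the reserves cannot cover must come from pinned space. */
	return c->pinned >= needed - c->reclaimable;
}
```

For example, with no free space, a 100 byte ticket, 40 bytes pinned and 50 bytes in the reserves, the remaining 50 bytes exceed the pinned amount, so the sketch declines to commit; with 60 bytes pinned it commits.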

/*
 * Try to flush some data based on policy set by @state. This is only advisory
 * and may fail for various reasons. The caller is supposed to examine the
 * state of @space_info to detect the outcome.
 */
static void flush_space(struct btrfs_fs_info *fs_info,
			struct btrfs_space_info *space_info, u64 num_bytes,
			int state)
{
	struct btrfs_root *root = fs_info->extent_root;
	struct btrfs_trans_handle *trans;
	int nr;
	int ret = 0;

	switch (state) {
	case FLUSH_DELAYED_ITEMS_NR:
	case FLUSH_DELAYED_ITEMS:
		if (state == FLUSH_DELAYED_ITEMS_NR)
			nr = calc_reclaim_items_nr(fs_info, num_bytes) * 2;
		else
			nr = -1;

		trans = btrfs_join_transaction(root);
		if (IS_ERR(trans)) {
			ret = PTR_ERR(trans);
			break;
		}
		ret = btrfs_run_delayed_items_nr(trans, nr);
		btrfs_end_transaction(trans);
		break;
	case FLUSH_DELALLOC:
	case FLUSH_DELALLOC_WAIT:
		shrink_delalloc(fs_info, space_info, num_bytes,
				state == FLUSH_DELALLOC_WAIT);
		break;
	case FLUSH_DELAYED_REFS_NR:
	case FLUSH_DELAYED_REFS:
		trans = btrfs_join_transaction(root);
		if (IS_ERR(trans)) {
			ret = PTR_ERR(trans);
			break;
		}
		if (state == FLUSH_DELAYED_REFS_NR)
			nr = calc_reclaim_items_nr(fs_info, num_bytes);
		else
			nr = 0;
		btrfs_run_delayed_refs(trans, nr);
		btrfs_end_transaction(trans);
		break;
	case ALLOC_CHUNK:
	case ALLOC_CHUNK_FORCE:
		trans = btrfs_join_transaction(root);
		if (IS_ERR(trans)) {
			ret = PTR_ERR(trans);
			break;
		}
		ret = btrfs_chunk_alloc(trans,
				btrfs_get_alloc_profile(fs_info, space_info->flags),
				(state == ALLOC_CHUNK) ? CHUNK_ALLOC_NO_FORCE :
					CHUNK_ALLOC_FORCE);
		btrfs_end_transaction(trans);
		if (ret > 0 || ret == -ENOSPC)
			ret = 0;
		break;
	case RUN_DELAYED_IPUTS:
		/*
		 * If we have pending delayed iputs then we could free up a
		 * bunch of pinned space, so make sure we run the iputs before
		 * we do our pinned bytes check below.
		 */
		btrfs_run_delayed_iputs(fs_info);
		btrfs_wait_on_delayed_iputs(fs_info);
		break;
	case COMMIT_TRANS:
		ret = may_commit_transaction(fs_info, space_info);
		break;
	default:
		ret = -ENOSPC;
		break;
	}

	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
				ret);
	return;
}

static inline u64
btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
				 struct btrfs_space_info *space_info)
{
	u64 used;
	u64 avail;
	u64 expected;
	u64 to_reclaim = space_info->reclaim_size;

	lockdep_assert_held(&space_info->lock);

	avail = calc_available_free_space(fs_info, space_info,
					  BTRFS_RESERVE_FLUSH_ALL);
	used = btrfs_space_info_used(space_info, true);

	/*
	 * We may be flushing because suddenly we have less space than we had
	 * before, and now we're well over-committed based on our current free
	 * space. If that's the case add in our overage so we make sure to put
	 * appropriate pressure on the flushing state machine.
	 */
	if (space_info->total_bytes + avail < used)
		to_reclaim += used - (space_info->total_bytes + avail);

	if (to_reclaim)
		return to_reclaim;

	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
				 BTRFS_RESERVE_FLUSH_ALL))
		return 0;

	used = btrfs_space_info_used(space_info, true);

	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
				 BTRFS_RESERVE_FLUSH_ALL))
		expected = div_factor_fine(space_info->total_bytes, 95);
	else
		expected = div_factor_fine(space_info->total_bytes, 90);

	if (used > expected)
		to_reclaim = used - expected;
	else
		to_reclaim = 0;
	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
				     space_info->bytes_reserved);
	return to_reclaim;
}
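The tail of btrfs_calc_reclaim_metadata_size(), where no ticket is pending and a background target is derived from usage thresholds, can be illustrated with a small userspace calculation. This sketch assumes div_factor_fine(num, factor) computes num * factor / 100; `demo_div_factor_fine` and `demo_background_reclaim` are hypothetical names, and `reclaimable` stands for bytes_may_use + bytes_reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Percent helper, assumed to behave like div_factor_fine(). */
static uint64_t demo_div_factor_fine(uint64_t num, int factor)
{
	if (factor == 100)
		return num;
	return num * (uint64_t)factor / 100;
}

/*
 * Background reclaim target: aim for 95% usage when there is still room to
 * overcommit and 90% otherwise, capped by what is actually reclaimable.
 */
static uint64_t demo_background_reclaim(uint64_t total, uint64_t used,
					uint64_t reclaimable,
					bool can_overcommit)
{
	uint64_t expected = demo_div_factor_fine(total,
						 can_overcommit ? 95 : 90);
	uint64_t to_reclaim = used > expected ? used - expected : 0;

	return to_reclaim < reclaimable ? to_reclaim : reclaimable;
}
```

With a total of 1000 units and 970 used, the target is 20 units while overcommit is still possible and 70 units once it is not; the cap keeps the result from exceeding the reservations that flushing could actually release.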

static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
					struct btrfs_space_info *space_info,
					u64 used)
{
	u64 thresh = div_factor_fine(space_info->total_bytes, 98);

	/* If we're just plain full then async reclaim just slows us down. */
	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
		return 0;

	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
		return 0;

	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
}

static bool steal_from_global_rsv(struct btrfs_fs_info *fs_info,
				  struct btrfs_space_info *space_info,
				  struct reserve_ticket *ticket)
{
	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
	u64 min_bytes;

	if (global_rsv->space_info != space_info)
		return false;

	spin_lock(&global_rsv->lock);
	min_bytes = div_factor(global_rsv->size, 1);
	if (global_rsv->reserved < min_bytes + ticket->bytes) {
		spin_unlock(&global_rsv->lock);
		return false;
	}
	global_rsv->reserved -= ticket->bytes;
	remove_ticket(space_info, ticket);
	ticket->bytes = 0;
	wake_up(&ticket->wait);
	space_info->tickets_id++;
	if (global_rsv->reserved < global_rsv->size)
		global_rsv->full = 0;
	spin_unlock(&global_rsv->lock);

	return true;
}
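The floor check in steal_from_global_rsv() can be shown with plain numbers. This sketch assumes div_factor(num, factor) computes num * factor / 10, so a factor of 1 keeps a tenth of the reserve's size untouched; `demo_rsv`, `demo_div_factor` and `demo_steal` are hypothetical names, and locking and ticket handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_rsv {
	uint64_t size;
	uint64_t reserved;
	bool full;
};

/* Tenths helper, assumed to behave like div_factor(). */
static uint64_t demo_div_factor(uint64_t num, int factor)
{
	if (factor == 10)
		return num;
	return num * (uint64_t)factor / 10;
}

/* Grant `bytes` from the reserve only if 10% of its size stays untouched. */
static bool demo_steal(struct demo_rsv *rsv, uint64_t bytes)
{
	uint64_t min_bytes = demo_div_factor(rsv->size, 1);

	if (rsv->reserved < min_bytes + bytes)
		return false;
	rsv->reserved -= bytes;
	if (rsv->reserved < rsv->size)
		rsv->full = false;
	return true;
}
```

A full reserve of 1000 units can therefore hand out at most 900 units; once it is down to the 100 unit floor, even a 1 unit request is refused.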

/*
 * maybe_fail_all_tickets - we've exhausted our flushing, start failing tickets
 * @fs_info - fs_info for this fs
 * @space_info - the space info we were flushing
 *
 * We call this when we've exhausted our flushing ability and haven't made
 * progress in satisfying tickets. The reservation code handles tickets in
 * order, so if there is a large ticket first and then smaller ones we could
 * very well satisfy the smaller tickets. This will attempt to wake up any
 * tickets in the list to catch this case.
 *
 * This function returns true if it was able to make progress by clearing out
 * other tickets, or if it stumbles across a ticket that was smaller than the
 * first ticket.
 */
static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info,
				   struct btrfs_space_info *space_info)
{
	struct reserve_ticket *ticket;
	u64 tickets_id = space_info->tickets_id;
	u64 first_ticket_bytes = 0;

	if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) {
		btrfs_info(fs_info, "cannot satisfy tickets, dumping space info");
		__btrfs_dump_space_info(fs_info, space_info);
	}

	while (!list_empty(&space_info->tickets) &&
	       tickets_id == space_info->tickets_id) {
		ticket = list_first_entry(&space_info->tickets,
					  struct reserve_ticket, list);

		if (ticket->steal &&
		    steal_from_global_rsv(fs_info, space_info, ticket))
			return true;

		/*
		 * may_commit_transaction will avoid committing the transaction
		 * if it doesn't feel like the space reclaimed by the commit
		 * would result in the ticket succeeding. However if we have a
		 * smaller ticket in the queue it may be small enough to be
		 * satisfied by committing the transaction, so if any
		 * subsequent ticket is smaller than the first ticket go ahead
		 * and send us back for another loop through the enospc flushing
		 * code.
		 */
		if (first_ticket_bytes == 0)
			first_ticket_bytes = ticket->bytes;
		else if (first_ticket_bytes > ticket->bytes)
			return true;

		if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
			btrfs_info(fs_info, "failing ticket with %llu bytes",
				   ticket->bytes);

		remove_ticket(space_info, ticket);
		ticket->error = -ENOSPC;
		wake_up(&ticket->wait);

		/*
		 * We're just throwing tickets away, so more flushing may not
		 * trip over btrfs_try_granting_tickets, so we need to call it
		 * here to see if we can make progress with the next ticket in
		 * the list.
		 */
		btrfs_try_granting_tickets(fs_info, space_info);
	}
	return (tickets_id != space_info->tickets_id);
}

/*
 * This is for normal flushers, we can wait all goddamned day if we want to. We
 * will loop and continuously try to flush as long as we are making progress.
 * We count progress as clearing off tickets each time we have to loop.
 */
static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
{
	struct btrfs_fs_info *fs_info;
	struct btrfs_space_info *space_info;
	u64 to_reclaim;
	int flush_state;
	int commit_cycles = 0;
	u64 last_tickets_id;

	fs_info = container_of(work, struct btrfs_fs_info, async_reclaim_work);
	space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);

	spin_lock(&space_info->lock);
	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
	if (!to_reclaim) {
		space_info->flush = 0;
		spin_unlock(&space_info->lock);
		return;
	}
	last_tickets_id = space_info->tickets_id;
	spin_unlock(&space_info->lock);

	flush_state = FLUSH_DELAYED_ITEMS_NR;
	do {
		flush_space(fs_info, space_info, to_reclaim, flush_state);
		spin_lock(&space_info->lock);
		if (list_empty(&space_info->tickets)) {
			space_info->flush = 0;
			spin_unlock(&space_info->lock);
			return;
		}
		to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info,
							      space_info);
		if (last_tickets_id == space_info->tickets_id) {
			flush_state++;
		} else {
			last_tickets_id = space_info->tickets_id;
			flush_state = FLUSH_DELAYED_ITEMS_NR;
			if (commit_cycles)
				commit_cycles--;
		}

		/*
		 * We don't want to force a chunk allocation until we've tried
		 * pretty hard to reclaim space. Think of the case where we
		 * freed up a bunch of space and so have a lot of pinned space
		 * to reclaim. We would rather use that than possibly create an
		 * underutilized metadata chunk. So if this is our first run
		 * through the flushing state machine skip ALLOC_CHUNK_FORCE and
		 * commit the transaction. If nothing has changed the next go
		 * around then we can force a chunk allocation.
		 */
		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
			flush_state++;

		if (flush_state > COMMIT_TRANS) {
			commit_cycles++;
			if (commit_cycles > 2) {
				if (maybe_fail_all_tickets(fs_info, space_info)) {
					flush_state = FLUSH_DELAYED_ITEMS_NR;
					commit_cycles--;
				} else {
					space_info->flush = 0;
				}
			} else {
				flush_state = FLUSH_DELAYED_ITEMS_NR;
			}
		}
		spin_unlock(&space_info->lock);
	} while (flush_state <= COMMIT_TRANS);
}
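The state transitions in the loop above can be isolated into a small pure function, which makes the escalation rules easier to follow. This is a simplified userspace sketch: the `demo_state` list is a shortened, hypothetical version of the real flush states, `demo_next_state` is not a kernel function, and the maybe_fail_all_tickets() step is reduced to "stop" (a return value above `DEMO_COMMIT_TRANS`).

```c
#include <assert.h>

/* Hypothetical, shortened list of flush states in escalation order. */
enum demo_state {
	DEMO_FLUSH_ITEMS = 1,
	DEMO_FLUSH_DELALLOC,
	DEMO_ALLOC_CHUNK_FORCE,
	DEMO_COMMIT_TRANS,
};

/*
 * Compute the next state after one flush attempt. `progress` is nonzero when
 * a ticket was satisfied since the last attempt. Returns a value greater than
 * DEMO_COMMIT_TRANS when the machine should stop and start failing tickets.
 */
static int demo_next_state(int state, int progress, int *commit_cycles)
{
	if (progress) {
		state = DEMO_FLUSH_ITEMS;
		if (*commit_cycles)
			(*commit_cycles)--;
	} else {
		state++;
	}

	/* Do not force a chunk allocation on the first full pass. */
	if (state == DEMO_ALLOC_CHUNK_FORCE && !*commit_cycles)
		state++;

	if (state > DEMO_COMMIT_TRANS) {
		(*commit_cycles)++;
		if (*commit_cycles <= 2)
			state = DEMO_FLUSH_ITEMS;
	}
	return state;
}
```

Without progress the state only moves forward; the forced chunk allocation is skipped until one full pass has ended in a commit attempt, and after the third fruitless pass the function signals that the loop should give up.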

/*
 * FLUSH_DELALLOC_WAIT:
 * Space is freed from flushing delalloc in one of two ways.
 *
 * 1) compression is on and we allocate less space than we reserved
 * 2) we are overwriting existing space
 *
 * For #1 that extra space is reclaimed as soon as the delalloc pages are
 * COWed, by way of btrfs_add_reserved_bytes() which adds the actual extent
 * length to ->bytes_reserved, and subtracts the reserved space from
 * ->bytes_may_use.
 *
 * For #2 this is trickier. Once the ordered extent runs we will drop the
 * extent in the range we are overwriting, which creates a delayed ref for
 * that freed extent. This however is not reclaimed until the transaction
 * commits, thus the next stages.
 *
 * RUN_DELAYED_IPUTS
 * If we are freeing inodes, we want to make sure all delayed iputs have
 * completed, because they could have been on an inode with i_nlink == 0, and
 * thus have been truncated and freed up space. But again this space is not
 * immediately re-usable, it comes in the form of a delayed ref, which must be
 * run and then the transaction must be committed.
 *
 * FLUSH_DELAYED_REFS
 * The above two cases generate delayed refs that will affect
 * ->total_bytes_pinned. However this counter can be inconsistent with
 * reality if there are outstanding delayed refs. This is because we adjust
 * the counter based solely on the current set of delayed refs and disregard
 * any on-disk state which might include more refs. So for example, if we
 * have an extent with 2 references, but we only drop 1, we'll see that there
 * is a negative delayed ref count for the extent and assume that the space
 * will be freed, and thus increase ->total_bytes_pinned.
 *
 * Running the delayed refs gives us the actual real view of what will be
 * freed at the transaction commit time. This stage will not actually free
 * space for us, it just makes sure that may_commit_transaction() has all of
 * the information it needs to make the right decision.
 *
 * COMMIT_TRANS
 * This is where we reclaim all of the pinned space generated by the previous
 * two stages. We will not commit the transaction if we don't think we're
 * likely to satisfy our request, which means if our current free space +
 * total_bytes_pinned < reservation we will not commit. This is why the
 * previous states are actually important, to make sure we know for sure
 * whether committing the transaction will allow us to make progress.
btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang. This
turned out to be because we'd get into this spot where we were no longer
making progress on data reservations because our exit condition was not
met. The logic is basically
while (!space_info->full && !list_empty(&space_info->tickets))
flush_space(space_info, flush_state);
where flush state is our various flush states, but doesn't include
ALLOC_CHUNK_FORCE. This is because we actually lead with allocating
chunks, and so the assumption was that once you got to the actual
flushing states you could no longer allocate chunks. This was a stupid
assumption, because you could have deleted block groups that would be
reclaimed by a transaction commit, thus unsetting space_info->full.
This is essentially what happens with generic/102, and so sometimes
you'd get stuck in the flushing loop because we weren't allocating
chunks, but flushing space wasn't giving us what we needed to make
progress.
Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
that way we will eventually bail out because we did end up with
space_info->full if we free'd a chunk previously. Otherwise, as is the
case for this test, we'll allocate our chunk and continue on our happy
merry way.
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-26 04:56:59 +08:00
|
|
|
*
|
|
|
|
* ALLOC_CHUNK_FORCE
|
|
|
|
* For data we start with alloc chunk force, however we could have been full
|
|
|
|
* before, and then the transaction commit could have freed new block groups,
|
|
|
|
* so if we now have space to allocate do the force chunk allocation.
|
2020-07-21 22:22:34 +08:00
|
|
|
*/
|
2020-07-21 22:22:33 +08:00
|
|
|
static const enum btrfs_flush_state data_flush_states[] = {
|
|
|
|
FLUSH_DELALLOC_WAIT,
|
|
|
|
RUN_DELAYED_IPUTS,
|
|
|
|
FLUSH_DELAYED_REFS,
|
|
|
|
COMMIT_TRANS,
|
2020-08-26 04:56:59 +08:00
|
|
|
ALLOC_CHUNK_FORCE,
|
2020-07-21 22:22:33 +08:00
|
|
|
};
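The COMMIT_TRANS note in the comment above says a commit is skipped when current free space plus total_bytes_pinned is smaller than the reservation. A minimal sketch of that check follows, assuming plain 64-bit byte counts. The helper name commit_may_help() is hypothetical and is not a kernel function; the real decision is made in may_commit_transaction().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of btrfs): returns true when committing
 * the transaction could plausibly satisfy a reservation of `need` bytes.
 * Overflow of free_bytes + pinned_bytes is ignored in this sketch.
 */
static bool commit_may_help(uint64_t free_bytes, uint64_t pinned_bytes,
			    uint64_t need)
{
	/*
	 * If free + pinned < need, even reclaiming every pinned byte
	 * would fall short, so the commit is not worth its cost.
	 */
	return free_bytes + pinned_bytes >= need;
}
```

This is also why the earlier stages matter in the sketch: if pinned_bytes is overestimated because delayed refs have not run yet, the check can approve a commit that frees less than expected.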
|
|
|
|
|
|
|
|
static void btrfs_async_reclaim_data_space(struct work_struct *work)
|
2019-06-19 04:09:25 +08:00
|
|
|
{
|
2020-07-21 22:22:33 +08:00
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
u64 last_tickets_id;
|
|
|
|
int flush_state = 0;
|
|
|
|
|
|
|
|
fs_info = container_of(work, struct btrfs_fs_info, async_data_reclaim_work);
|
|
|
|
space_info = fs_info->data_sinfo;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
while (!space_info->full) {
|
|
|
|
flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
while (flush_state < ARRAY_SIZE(data_flush_states)) {
|
|
|
|
flush_space(fs_info, space_info, U64_MAX,
|
|
|
|
data_flush_states[flush_state]);
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (list_empty(&space_info->tickets)) {
|
|
|
|
space_info->flush = 0;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last_tickets_id == space_info->tickets_id) {
|
|
|
|
flush_state++;
|
|
|
|
} else {
|
|
|
|
last_tickets_id = space_info->tickets_id;
|
|
|
|
flush_state = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flush_state >= ARRAY_SIZE(data_flush_states)) {
|
|
|
|
if (space_info->full) {
|
|
|
|
if (maybe_fail_all_tickets(fs_info, space_info))
|
|
|
|
flush_state = 0;
|
|
|
|
else
|
|
|
|
space_info->flush = 0;
|
|
|
|
} else {
|
|
|
|
flush_state = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
}
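The state walk in btrfs_async_reclaim_data_space() can be modelled in user space to show why ALLOC_CHUNK_FORCE has to be the last entry of data_flush_states[]. The sketch below is an assumption-laden toy: every model_* name is hypothetical, only COMMIT_TRANS and ALLOC_CHUNK_FORCE have any effect, the initial forced-allocation loop is left out, and a step budget stands in for the hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model states; they mirror data_flush_states[] in order. */
enum model_state {
	M_FLUSH_DELALLOC_WAIT,
	M_RUN_DELAYED_IPUTS,
	M_FLUSH_DELAYED_REFS,
	M_COMMIT_TRANS,
	M_ALLOC_CHUNK_FORCE,
	M_NR_STATES,
};

struct model_space {
	bool full;			/* no more chunks can be allocated */
	int tickets;			/* outstanding reservations */
	unsigned long tickets_id;	/* bumped whenever a ticket is granted */
	int pending_chunks;		/* chunks a commit would release */
	int free_chunks;		/* chunks available for allocation */
};

static void model_flush(struct model_space *s, enum model_state state)
{
	switch (state) {
	case M_COMMIT_TRANS:
		/* A commit releases deleted block groups and clears ->full. */
		if (s->pending_chunks > 0) {
			s->free_chunks += s->pending_chunks;
			s->pending_chunks = 0;
			s->full = false;
		}
		break;
	case M_ALLOC_CHUNK_FORCE:
		/* Allocating a chunk grants one ticket; failing marks us full. */
		if (s->free_chunks > 0) {
			s->free_chunks--;
			if (s->tickets > 0) {
				s->tickets--;
				s->tickets_id++;
			}
		} else {
			s->full = true;
		}
		break;
	default:
		break;	/* the other states free nothing in this model */
	}
}

/*
 * Returns the number of flush steps taken until all tickets are served,
 * -1 if the walk ends with the space full (tickets would be failed), or
 * -2 if the step budget runs out, which models the infinite loop.
 */
static int model_reclaim(struct model_space *s, bool with_alloc_state,
			 int max_steps)
{
	size_t nr = with_alloc_state ? M_NR_STATES : M_NR_STATES - 1;
	size_t state = 0;
	unsigned long last_id = s->tickets_id;
	int steps = 0;

	while (state < nr && steps < max_steps) {
		model_flush(s, (enum model_state)state);
		steps++;
		if (s->tickets == 0)
			return steps;
		if (last_id == s->tickets_id) {
			state++;		/* no progress: escalate */
		} else {
			last_id = s->tickets_id;
			state = 0;		/* progress: start over */
		}
		if (state >= nr) {
			if (s->full)
				return -1;
			state = 0;
		}
	}
	return -2;
}
```

Starting from a full space with one ticket and one chunk pending release, the walk without the final allocation state keeps resetting because the commit cleared `full` but nothing ever allocates; with the state included, the fifth step serves the ticket.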
|
|
|
|
|
|
|
|
void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
INIT_WORK(&fs_info->async_reclaim_work, btrfs_async_reclaim_metadata_space);
|
|
|
|
INIT_WORK(&fs_info->async_data_reclaim_work, btrfs_async_reclaim_data_space);
|
2019-06-19 04:09:25 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static const enum btrfs_flush_state priority_flush_states[] = {
|
|
|
|
FLUSH_DELAYED_ITEMS_NR,
|
|
|
|
FLUSH_DELAYED_ITEMS,
|
|
|
|
ALLOC_CHUNK,
|
|
|
|
};
|
|
|
|
|
2019-08-02 06:19:37 +08:00
|
|
|
static const enum btrfs_flush_state evict_flush_states[] = {
|
|
|
|
FLUSH_DELAYED_ITEMS_NR,
|
|
|
|
FLUSH_DELAYED_ITEMS,
|
|
|
|
FLUSH_DELAYED_REFS_NR,
|
|
|
|
FLUSH_DELAYED_REFS,
|
|
|
|
FLUSH_DELALLOC,
|
|
|
|
FLUSH_DELALLOC_WAIT,
|
|
|
|
ALLOC_CHUNK,
|
|
|
|
COMMIT_TRANS,
|
|
|
|
};
|
|
|
|
|
2019-06-19 04:09:25 +08:00
|
|
|
static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
|
2019-08-02 06:19:36 +08:00
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket,
|
|
|
|
const enum btrfs_flush_state *states,
|
|
|
|
int states_nr)
|
2019-06-19 04:09:25 +08:00
|
|
|
{
|
|
|
|
u64 to_reclaim;
|
|
|
|
int flush_state;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
2019-11-27 00:25:53 +08:00
|
|
|
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info);
|
2019-06-19 04:09:25 +08:00
|
|
|
if (!to_reclaim) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
flush_state = 0;
|
|
|
|
do {
|
2019-08-02 06:19:36 +08:00
|
|
|
flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
|
2019-06-19 04:09:25 +08:00
|
|
|
flush_state++;
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ticket->bytes == 0) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
2019-08-02 06:19:36 +08:00
|
|
|
} while (flush_state < states_nr);
|
2019-06-19 04:09:25 +08:00
|
|
|
}
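Unlike the async workers, priority_reclaim_metadata_space() walks its state list exactly once and stops as soon as the ticket is satisfied. A rough user-space sketch of that single pass is shown below. The function name, the freed_by_state array, and the rule that a ticket is granted once the accumulated freed bytes cover it are all assumptions made for illustration; the kernel calls flush_space() and checks ticket->bytes under the space_info lock instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: freed_by_state[i] is the number of bytes that
 * flushing state i makes available. Returns how many states were run
 * before the ticket was granted, or -1 if the list was exhausted.
 */
static int model_priority_reclaim(uint64_t *ticket_bytes,
				  const uint64_t *freed_by_state,
				  size_t states_nr)
{
	uint64_t available = 0;

	for (size_t i = 0; i < states_nr; i++) {
		available += freed_by_state[i];
		if (available >= *ticket_bytes) {
			*ticket_bytes = 0;	/* granted, stop flushing */
			return (int)(i + 1);
		}
	}
	return -1;	/* caller would turn this into -ENOSPC */
}
```

The single pass reflects the design constraint of priority callers: they cannot wait on the async worker, so they flush on their own behalf through a short list of states and give up afterwards.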
|
|
|
|
|
2020-07-21 22:22:26 +08:00
|
|
|
static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
2020-07-21 22:22:33 +08:00
|
|
|
struct reserve_ticket *ticket)
|
2020-07-21 22:22:26 +08:00
|
|
|
{
|
|
|
|
while (!space_info->full) {
|
|
|
|
flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
if (ticket->bytes == 0) {
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-08-02 06:19:34 +08:00
|
|
|
static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket)
|
2019-06-19 04:09:25 +08:00
|
|
|
|
|
|
|
{
|
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
while (ticket->bytes > 0 && ticket->error == 0) {
|
|
|
|
ret = prepare_to_wait_event(&ticket->wait, &wait, TASK_KILLABLE);
|
|
|
|
if (ret) {
|
Btrfs: fix race leading to metadata space leak after task received signal
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job to run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queuing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 17:53:41 +08:00
|
|
|
/*
|
|
|
|
* Delete us from the list. After we unlock the space
|
|
|
|
* info, we don't want the async reclaim job to reserve
|
|
|
|
* space for this ticket. If that would happen, then the
|
|
|
|
* ticket's task would not know that space was reserved
|
|
|
|
* despite getting an error, resulting in a space leak
|
|
|
|
* (bytes_may_use counter of our space_info).
|
|
|
|
*/
|
2020-04-07 18:38:49 +08:00
|
|
|
remove_ticket(space_info, ticket);
|
2019-08-02 06:19:34 +08:00
|
|
|
ticket->error = -EINTR;
|
2019-06-19 04:09:25 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
schedule();
|
|
|
|
|
|
|
|
finish_wait(&ticket->wait, &wait);
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
}
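The ordering that wait_reserve_ticket() enforces on a signal (remove the ticket from the list first, then set the error) can be illustrated with a small user-space model. Everything here is hypothetical: the leak_* types and functions only imitate the roles of the ticket list, bytes_may_use, and the async reclaim worker, and there is no locking because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct leak_ticket {
	uint64_t bytes;		/* bytes still wanted; 0 means granted */
	int error;		/* 0 or a negative errno */
	bool on_list;		/* still visible to the reclaim worker */
};

struct leak_space {
	uint64_t bytes_may_use;
};

/* Models the reclaim worker granting a ticket it finds on the list. */
static void leak_grant(struct leak_space *s, struct leak_ticket *t)
{
	if (!t->on_list)
		return;
	s->bytes_may_use += t->bytes;
	t->bytes = 0;
	t->on_list = false;
}

/* Models the waiter being interrupted by a signal. */
static void leak_signal(struct leak_ticket *t, bool remove_from_list)
{
	if (remove_from_list)
		t->on_list = false;	/* the fix: hide the ticket first */
	t->error = -EINTR;
}

/*
 * Returns the bytes left accounted in bytes_may_use after the waiter
 * failed, i.e. the size of the leak for the chosen ordering.
 */
static uint64_t leak_after_signal(bool remove_from_list)
{
	struct leak_space s = { .bytes_may_use = 0 };
	struct leak_ticket t = { .bytes = 4096, .error = 0, .on_list = true };

	leak_signal(&t, remove_from_list);
	leak_grant(&s, &t);	/* worker runs after the waiter gave up */
	return t.error ? s.bytes_may_use : 0;
}
```

With the ticket left on the list, the worker still reserves 4096 bytes for a task that has already returned an error, and nothing releases them. Removing the ticket first makes the later grant a no-op.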
|
|
|
|
|
2019-08-02 06:19:35 +08:00
|
|
|
/**
|
|
|
|
* handle_reserve_ticket - do the appropriate flushing and waiting for a ticket
|
|
|
|
* @fs_info: the fs
|
|
|
|
* @space_info: the space_info for the reservation
|
|
|
|
* @ticket: the ticket for the reservation
|
|
|
|
* @flush: how much we can flush
|
|
|
|
*
|
|
|
|
* This does the work of figuring out how to flush for the ticket, waiting for
|
|
|
|
* the reservation, and returning the appropriate error if there is one.
|
|
|
|
*/
|
|
|
|
static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info,
|
|
|
|
struct reserve_ticket *ticket,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2019-08-02 06:19:37 +08:00
|
|
|
switch (flush) {
|
2020-07-21 22:22:33 +08:00
|
|
|
case BTRFS_RESERVE_FLUSH_DATA:
|
2019-08-02 06:19:37 +08:00
|
|
|
case BTRFS_RESERVE_FLUSH_ALL:
|
2020-03-14 03:58:05 +08:00
|
|
|
case BTRFS_RESERVE_FLUSH_ALL_STEAL:
|
2019-08-02 06:19:35 +08:00
|
|
|
wait_reserve_ticket(fs_info, space_info, ticket);
|
2019-08-02 06:19:37 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_RESERVE_FLUSH_LIMIT:
|
2019-08-02 06:19:36 +08:00
|
|
|
priority_reclaim_metadata_space(fs_info, space_info, ticket,
|
|
|
|
priority_flush_states,
|
|
|
|
ARRAY_SIZE(priority_flush_states));
|
2019-08-02 06:19:37 +08:00
|
|
|
break;
|
|
|
|
case BTRFS_RESERVE_FLUSH_EVICT:
|
|
|
|
priority_reclaim_metadata_space(fs_info, space_info, ticket,
|
|
|
|
evict_flush_states,
|
|
|
|
ARRAY_SIZE(evict_flush_states));
|
|
|
|
break;
|
2020-07-21 22:22:26 +08:00
|
|
|
case BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE:
|
2020-07-21 22:22:33 +08:00
|
|
|
priority_reclaim_data_space(fs_info, space_info, ticket);
|
2020-07-21 22:22:26 +08:00
|
|
|
break;
|
2019-08-02 06:19:37 +08:00
|
|
|
default:
|
|
|
|
ASSERT(0);
|
|
|
|
break;
|
|
|
|
}
|
2019-08-02 06:19:35 +08:00
|
|
|
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
ret = ticket->error;
|
|
|
|
if (ticket->bytes || ticket->error) {
|
2019-10-25 17:53:41 +08:00
|
|
|
/*
|
2020-03-14 03:58:09 +08:00
|
|
|
* We were a priority ticket, so we need to delete ourselves
|
|
|
|
* from the list. Because we could have other priority tickets
|
|
|
|
* behind us that require less space, run
|
|
|
|
* btrfs_try_granting_tickets() to see if their reservations can
|
|
|
|
* now be made.
|
2019-10-25 17:53:41 +08:00
|
|
|
*/
|
2020-03-14 03:58:09 +08:00
|
|
|
if (!list_empty(&ticket->list)) {
|
|
|
|
remove_ticket(space_info, ticket);
|
|
|
|
btrfs_try_granting_tickets(fs_info, space_info);
|
|
|
|
}
|
|
|
|
|
2019-08-02 06:19:35 +08:00
|
|
|
if (!ret)
|
|
|
|
ret = -ENOSPC;
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
ASSERT(list_empty(&ticket->list));
|
Btrfs: fix race leading to metadata space leak after task received signal
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).
Here's the sequence of steps leading to the space leak:
1) A task tries to create a file for example, so it ends up trying to
start a transaction at btrfs_create();
2) The filesystem is currently in a state where there is not enough
metadata free space to satisfy the transaction's needs. So at
space-info.c:__reserve_metadata_bytes() we create a ticket and
add it to the list of tickets of the space info object. Also,
because the metadata async reclaim job is not running, we queue
a job ro run metadata reclaim;
3) In the meanwhile the task receives a signal (like SIGTERM from
a kill command for example);
4) After queing the async reclaim job, at __reserve_metadata_bytes(),
we unlock the metadata space info and call handle_reserve_ticket();
5) That last function calls wait_reserve_ticket(), which acquires the
lock from the metadata space info. Then in the first iteration of
its while loop, it calls prepare_to_wait_event(), which returns
-ERESTARTSYS because the task has a pending signal. As a result,
we set the error field of the ticket to -EINTR and exit the while
loop without deleting the ticket from the list of tickets (in the
space info object). After exiting the loop we unlock the space info;
6) The async reclaim job is able to release enough metadata, acquires
the metadata space info's lock and then reserves space for the ticket,
since the ticket is still in the list of (non-priority) tickets. The
space reservation happens at btrfs_try_granting_tickets(), called from
maybe_fail_all_tickets(). This increments the bytes_may_use counter
from the metadata space info object, sets the ticket's bytes field to
zero (meaning success, that space was reserved) and removes it from
the list of tickets;
7) wait_reserve_ticket() returns, with the error field of the ticket
set to -EINTR. Then handle_reserve_ticket() just propagates that error
to the caller. Because an error was returned, the caller does not
release the reserved space, since the expectation is that any error
means no space was reserved.
Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.
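
The sequence in steps 5) to 7) and the effect of the fix can be sketched with a small userspace model. This is only an illustration under stated assumptions, not the kernel code: every `model_*` name is hypothetical, the list membership is reduced to a boolean, locking is omitted, `-4` stands in for `-EINTR`, and the 196608 byte ticket size is picked to match the `may_use` value in the dump for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct reserve_ticket. */
struct model_ticket {
	uint64_t bytes;	/* bytes still wanted; 0 means space was reserved */
	int error;	/* non-zero if the wait failed */
	bool queued;	/* models !list_empty(&ticket->list) */
};

/* Hypothetical stand-in for the metadata space_info. */
struct model_space_info {
	uint64_t bytes_may_use;
};

struct model_outcome {
	uint64_t leaked;	/* bytes reserved although the caller saw an error */
	bool invariant_ok;	/* !(bytes == 0 && error), as in the new assertion */
};

/* Step 5: the wait is interrupted by a pending signal. */
static void model_wait_interrupted(struct model_ticket *t, bool apply_fix)
{
	t->error = -4;	/* stand-in for -EINTR */
	if (apply_fix)
		t->queued = false;	/* the fix: remove the ticket on error */
}

/* Step 6: the reclaim worker grants space to a ticket that is still queued. */
static void model_grant(struct model_space_info *si, struct model_ticket *t)
{
	if (!t->queued)
		return;
	si->bytes_may_use += t->bytes;
	t->bytes = 0;
	t->queued = false;
}

/* Runs steps 5 to 7 once, with or without the fix applied. */
static struct model_outcome model_run(bool apply_fix)
{
	struct model_space_info si = { .bytes_may_use = 0 };
	struct model_ticket t = { .bytes = 196608, .error = 0, .queued = true };
	struct model_outcome out;

	model_wait_interrupted(&t, apply_fix);
	model_grant(&si, &t);

	/* Step 7: on error the caller releases nothing, so a grant is a leak. */
	out.invariant_ok = !(t.bytes == 0 && t.error);
	out.leaked = t.error ? si.bytes_may_use : 0;
	return out;
}
```

In this model, `model_run(false)` ends with an error set, a zero `bytes` field and 196608 bytes left in `bytes_may_use`, while `model_run(true)` leaves nothing reserved and keeps the invariant intact.
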
Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.
This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.
When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
CPU: 0 PID: 13240 Comm: umount Tainted: G W L 5.3.0-rc8-btrfs-next-48+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
(...)
RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
FS: 00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x1ad/0x390 [btrfs]
generic_shutdown_super+0x6c/0x110
kill_anon_super+0xe/0x30
btrfs_kill_super+0x12/0xa0 [btrfs]
deactivate_locked_super+0x3a/0x70
cleanup_mnt+0xb4/0x160
task_work_run+0x7e/0xc0
exit_to_usermode_loop+0xfa/0x100
do_syscall_64+0x1cb/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4274d2cb37
(...)
RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bcf4b235461b26f6 ]---
BTRFS info (device sdb): space_info 4 has 19116032 free, is full
BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
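
The "free" figure in the dump above is consistent with the total minus the sum of the other counters, and the non-zero may_use at unmount time is what the leak looks like. A minimal sketch of that arithmetic follows; the struct and function names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Hypothetical container for the counters printed in the dump. */
struct model_dump {
	uint64_t total;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t may_use;
	uint64_t readonly;
};

/* Free space as total minus the sum of all the accounted counters. */
static uint64_t model_dump_free(const struct model_dump *d)
{
	return d->total - (d->used + d->pinned + d->reserved +
			   d->may_use + d->readonly);
}
```

With the values from the dump (total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536) this yields 19116032, matching the "has 19116032 free" line.
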
Fixes: 374bf9c5cd7d0b ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 17:53:41 +08:00
|
|
|
/*
|
|
|
|
* Check that we can't have an error set if the reservation succeeded,
|
|
|
|
* as that would confuse tasks and lead them to error out without
|
|
|
|
* releasing reserved space (if an error happens the expectation is that
|
|
|
|
* space wasn't reserved at all).
|
|
|
|
*/
|
|
|
|
ASSERT(!(ticket->bytes == 0 && ticket->error));
|
2019-08-02 06:19:35 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-03-14 03:58:08 +08:00
|
|
|
/*
|
|
|
|
* This returns true if this flush state will go through the ordinary flushing
|
|
|
|
* code.
|
|
|
|
*/
|
|
|
|
static inline bool is_normal_flushing(enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
return (flush == BTRFS_RESERVE_FLUSH_ALL) ||
|
|
|
|
(flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
|
|
|
|
}
|
|
|
|
|
2019-06-19 04:09:25 +08:00
|
|
|
/**
|
|
|
|
* __reserve_bytes - try to reserve bytes from the given space info
|
|
|
|
* @fs_info - the filesystem
|
|
|
|
* @space_info - the space info we want to allocate from
|
|
|
|
* @orig_bytes - the number of bytes we want
|
|
|
|
* @flush - whether or not we can flush to make our reservation
|
|
|
|
*
|
|
|
|
* This will reserve orig_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
|
|
|
*/
|
2020-07-21 22:22:28 +08:00
|
|
|
static int __reserve_bytes(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_space_info *space_info, u64 orig_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
2019-06-19 04:09:25 +08:00
|
|
|
{
|
2020-07-21 22:22:33 +08:00
|
|
|
struct work_struct *async_work;
|
2019-06-19 04:09:25 +08:00
|
|
|
struct reserve_ticket ticket;
|
|
|
|
u64 used;
|
|
|
|
int ret = 0;
|
2019-08-23 03:10:54 +08:00
|
|
|
bool pending_tickets;
|
2019-06-19 04:09:25 +08:00
|
|
|
|
|
|
|
ASSERT(orig_bytes);
|
|
|
|
ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_ALL);
|
|
|
|
|
2020-07-21 22:22:33 +08:00
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_DATA)
|
|
|
|
async_work = &fs_info->async_data_reclaim_work;
|
|
|
|
else
|
|
|
|
async_work = &fs_info->async_reclaim_work;
|
|
|
|
|
2019-06-19 04:09:25 +08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
ret = -ENOSPC;
|
|
|
|
used = btrfs_space_info_used(space_info, true);
|
2020-03-14 03:58:08 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want NO_FLUSH allocations to jump everybody, they can
|
|
|
|
* generally handle ENOSPC in a different way, so treat them the same as
|
|
|
|
* normal flushers when it comes to skipping pending tickets.
|
|
|
|
*/
|
|
|
|
if (is_normal_flushing(flush) || (flush == BTRFS_RESERVE_NO_FLUSH))
|
|
|
|
pending_tickets = !list_empty(&space_info->tickets) ||
|
|
|
|
!list_empty(&space_info->priority_tickets);
|
|
|
|
else
|
|
|
|
pending_tickets = !list_empty(&space_info->priority_tickets);
|
2019-06-19 04:09:25 +08:00
|
|
|
|
|
|
|
/*
|
2019-06-26 02:11:31 +08:00
|
|
|
* Carry on if we have enough space (short-circuit) OR call
|
|
|
|
* can_overcommit() to ensure we can overcommit to continue.
|
2019-06-19 04:09:25 +08:00
|
|
|
*/
|
2019-08-23 03:10:54 +08:00
|
|
|
if (!pending_tickets &&
|
|
|
|
((used + orig_bytes <= space_info->total_bytes) ||
|
2020-01-17 22:07:39 +08:00
|
|
|
btrfs_can_overcommit(fs_info, space_info, orig_bytes, flush))) {
|
2019-06-19 04:09:25 +08:00
|
|
|
btrfs_space_info_update_bytes_may_use(fs_info, space_info,
|
|
|
|
orig_bytes);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we couldn't make a reservation then set up our reservation ticket
|
|
|
|
* and kick the async worker if it's not already running.
|
|
|
|
*
|
|
|
|
* If we are a priority flusher then we just need to add our ticket to
|
|
|
|
* the list and we will do our own flushing further down.
|
|
|
|
*/
|
|
|
|
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
|
|
|
|
ticket.bytes = orig_bytes;
|
|
|
|
ticket.error = 0;
|
2020-03-10 17:00:35 +08:00
|
|
|
space_info->reclaim_size += ticket.bytes;
|
2019-06-19 04:09:25 +08:00
|
|
|
init_waitqueue_head(&ticket.wait);
|
2020-03-14 03:58:05 +08:00
|
|
|
ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
|
|
|
|
if (flush == BTRFS_RESERVE_FLUSH_ALL ||
|
2020-07-21 22:22:33 +08:00
|
|
|
flush == BTRFS_RESERVE_FLUSH_ALL_STEAL ||
|
|
|
|
flush == BTRFS_RESERVE_FLUSH_DATA) {
|
2019-06-19 04:09:25 +08:00
|
|
|
list_add_tail(&ticket.list, &space_info->tickets);
|
|
|
|
if (!space_info->flush) {
|
|
|
|
space_info->flush = 1;
|
|
|
|
trace_btrfs_trigger_flush(fs_info,
|
|
|
|
space_info->flags,
|
|
|
|
orig_bytes, flush,
|
|
|
|
"enospc");
|
2020-07-21 22:22:33 +08:00
|
|
|
queue_work(system_unbound_wq, async_work);
|
2019-06-19 04:09:25 +08:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
list_add_tail(&ticket.list,
|
|
|
|
&space_info->priority_tickets);
|
|
|
|
}
|
|
|
|
} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
|
|
|
|
used += orig_bytes;
|
|
|
|
/*
|
|
|
|
* We will do the space reservation dance during log replay,
|
|
|
|
* which means we won't have fs_info->fs_root set, so don't do
|
|
|
|
* the async reclaim as we will panic.
|
|
|
|
*/
|
|
|
|
if (!test_bit(BTRFS_FS_LOG_RECOVERING, &fs_info->flags) &&
|
2019-11-27 00:25:53 +08:00
|
|
|
need_do_async_reclaim(fs_info, space_info, used) &&
|
2019-06-19 04:09:25 +08:00
|
|
|
!work_busy(&fs_info->async_reclaim_work)) {
|
|
|
|
trace_btrfs_trigger_flush(fs_info, space_info->flags,
|
|
|
|
orig_bytes, flush, "preempt");
|
|
|
|
queue_work(system_unbound_wq,
|
|
|
|
&fs_info->async_reclaim_work);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
|
|
|
|
return ret;
|
|
|
|
|
2019-08-02 06:19:35 +08:00
|
|
|
return handle_reserve_ticket(fs_info, space_info, &ticket, flush);
|
2019-06-19 04:09:25 +08:00
|
|
|
}
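
The fast-path decision made under the space_info lock in the function above can be summarized by a small standalone model. This is a sketch under stated assumptions rather than the kernel logic itself: the `model_*` names and the reduced enum are hypothetical, the two ticket lists are reduced to booleans, and the result of `btrfs_can_overcommit()` is passed in as a plain flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of enum btrfs_reserve_flush_enum. */
enum model_flush {
	MODEL_NO_FLUSH,
	MODEL_FLUSH_ALL,
	MODEL_FLUSH_ALL_STEAL,
	MODEL_PRIORITY,	/* any other (priority) flush state */
};

/* Hypothetical reduction of struct btrfs_space_info. */
struct model_sinfo {
	uint64_t total_bytes;
	uint64_t used;
	bool has_tickets;		/* normal ticket list is non-empty */
	bool has_priority_tickets;	/* priority ticket list is non-empty */
};

/*
 * Returns true if the reservation can be granted immediately, without
 * queueing a ticket.
 */
static bool model_can_reserve_now(const struct model_sinfo *s, uint64_t bytes,
				  enum model_flush flush, bool can_overcommit)
{
	bool pending;

	/* Normal flushers and NO_FLUSH wait behind every queued ticket. */
	if (flush == MODEL_FLUSH_ALL || flush == MODEL_FLUSH_ALL_STEAL ||
	    flush == MODEL_NO_FLUSH)
		pending = s->has_tickets || s->has_priority_tickets;
	else
		pending = s->has_priority_tickets;

	/* Enough room outright, or overcommit is allowed. */
	return !pending &&
	       (s->used + bytes <= s->total_bytes || can_overcommit);
}
```

A priority flusher is held back only by other priority tickets, which is why it can still succeed in this model while normal tickets are queued.
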
|
|
|
|
|
|
|
|
/**
|
|
|
|
* btrfs_reserve_metadata_bytes - try to reserve bytes from the block_rsv's space
|
|
|
|
* @root - the root we're allocating for
|
|
|
|
* @block_rsv - the block_rsv we're allocating for
|
|
|
|
* @orig_bytes - the number of bytes we want
|
|
|
|
* @flush - whether or not we can flush to make our reservation
|
|
|
|
*
|
|
|
|
* This will reserve orig_bytes number of bytes from the space info associated
|
|
|
|
* with the block_rsv. If there is not enough space it will make an attempt to
|
|
|
|
* flush out space to make room. It will do this by flushing delalloc if
|
|
|
|
* possible or committing the transaction. If flush is 0 then no attempts to
|
|
|
|
* regain reservations will be made and this will fail if there is not enough
|
|
|
|
* space already.
|
|
|
|
*/
|
|
|
|
int btrfs_reserve_metadata_bytes(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 orig_bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = root->fs_info;
|
|
|
|
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
|
|
|
|
int ret;
|
|
|
|
|
2020-07-21 22:22:28 +08:00
|
|
|
ret = __reserve_bytes(fs_info, block_rsv->space_info, orig_bytes, flush);
|
2019-06-19 04:09:25 +08:00
|
|
|
if (ret == -ENOSPC &&
|
|
|
|
unlikely(root->orphan_cleanup_state == ORPHAN_CLEANUP_STARTED)) {
|
|
|
|
if (block_rsv != global_rsv &&
|
|
|
|
!btrfs_block_rsv_use_bytes(global_rsv, orig_bytes))
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info:enospc",
|
|
|
|
block_rsv->space_info->flags,
|
|
|
|
orig_bytes, 1);
|
|
|
|
|
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
|
|
|
|
btrfs_dump_space_info(fs_info, block_rsv->space_info,
|
|
|
|
orig_bytes, 0);
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
2020-07-21 22:22:25 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* btrfs_reserve_data_bytes - try to reserve data bytes for an allocation
|
|
|
|
* @fs_info - the filesystem
|
|
|
|
* @bytes - the number of bytes we need
|
|
|
|
* @flush - how we are allowed to flush
|
|
|
|
*
|
|
|
|
* This will reserve bytes from the data space info. If there is not enough
|
|
|
|
* space then we will attempt to flush space as specified by flush.
|
|
|
|
*/
|
|
|
|
int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
|
|
|
|
enum btrfs_reserve_flush_enum flush)
|
|
|
|
{
|
|
|
|
struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
|
2020-07-21 22:22:28 +08:00
|
|
|
int ret;
|
2020-07-21 22:22:25 +08:00
|
|
|
|
2020-07-21 22:22:28 +08:00
|
|
|
ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
|
|
|
|
flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
|
2020-07-21 22:22:25 +08:00
|
|
|
ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
|
|
|
|
|
2020-07-21 22:22:28 +08:00
|
|
|
ret = __reserve_bytes(fs_info, data_sinfo, bytes, flush);
|
|
|
|
if (ret == -ENOSPC) {
|
|
|
|
trace_btrfs_space_reservation(fs_info, "space_info:enospc",
|
2020-07-21 22:22:25 +08:00
|
|
|
data_sinfo->flags, bytes, 1);
|
2020-07-21 22:22:28 +08:00
|
|
|
if (btrfs_test_opt(fs_info, ENOSPC_DEBUG))
|
|
|
|
btrfs_dump_space_info(fs_info, data_sinfo, bytes, 0);
|
|
|
|
}
|
2020-07-21 22:22:25 +08:00
|
|
|
return ret;
|
|
|
|
}
|