Commit Graph

10099 Commits

Author SHA1 Message Date
Nikolay Borisov d6ade6894e btrfs: don't opencode extent_changeset_free
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Jiapeng Chong 7000babdda btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned
Fix the following coccicheck warnings:

./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function
'dev_extent_hole_check_zoned' with return type bool.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Filipe Manana 2ce73c6335 btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.

This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  initial_file_count=500000
  add_files $initial_file_count 0 4

  echo
  echo "Creating first snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  echo
  echo "Adding more files..."
  add_files $((initial_file_count / 4)) $initial_file_count 4

  echo
  echo "Updating 1/50th of the initial files..."
  for ((i = 1; i < $initial_file_count; i += 50)); do
      xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
  done

  echo
  echo "Creating second snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap2

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing incremental send..."
  start=$(date +%s)
  btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
  end=$(date +%s)
  echo
  echo "Incremental send took $((end - start)) seconds"

  umount $MNT

Before this change, incremental send duration:

  with $initial_file_count == 200000:  51 seconds
  with $initial_file_count == 500000: 168 seconds

After this change, incremental send duration:

  with $initial_file_count == 200000:   39 seconds (-26.7%)
  with $initial_file_count == 500000:  125 seconds (-29.4%)

For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.

While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Filipe Manana 19358b154f btrfs: add btree read ahead for full send operations
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.

This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  initial_file_count=500000
  add_files $initial_file_count 0 4

  echo
  echo "Creating first snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  echo
  echo "Adding more files..."
  add_files $((initial_file_count / 4)) $initial_file_count 4

  echo
  echo "Updating 1/50th of the initial files..."
  for ((i = 1; i < $initial_file_count; i += 50)); do
      xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
  done

  echo
  echo "Creating second snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap2

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing incremental send..."
  start=$(date +%s)
  btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
  end=$(date +%s)
  echo
  echo "Incremental send took $((end - start)) seconds"

  umount $MNT

Before this change, full send duration:

  with $initial_file_count == 200000:  165 seconds
  with $initial_file_count == 500000:  407 seconds

After this change, full send duration:

  with $initial_file_count == 200000:  149 seconds (-10.2%)
  with $initial_file_count == 500000:  353 seconds (-14.2%)

For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves. The roots were at level 2.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Nikolay Borisov 98686ffc71 btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata
btrfs_block_rsv_add can return only ENOSPC since it's called with
NO_FLUSH modifier. This so simplify the logic in
btrfs_delayed_inode_reserve_metadata to exploit this invariant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add assert and comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:15 +02:00
Nikolay Borisov 8e3c9d3cf8 btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadata
It's only used for tracepoint to obtain the inode number, but we already
have the ino from btrfs_delayed_node::inode_id.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov ae396a3b7a btrfs: simplify commit logic in try_flush_qgroup
It's no longer expected to call this function with an open transaction
so all the workarounds concerning this can be removed. In fact it'll
constitute a bug to call this function with a transaction already held
so WARN in this case.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Anand Jain e5ce988690 btrfs: scrub: drop a few function declarations
Drop function declarations at the beginning of the file scrub.c. These
functions are defined before they are used in the same file and don't
need forward declaration.

No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Anand Jain f4639636b6 btrfs: change return type to bool in btrfs_extent_readonly
btrfs_extent_readonly() checks if the block group is readonly, the bool
return type should be used.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Anand Jain 05947ae186 btrfs: unexport btrfs_extent_readonly() and make it static
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
move it from extent-tree.c to inode.c and declare it as static.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov b6e9f16c5f btrfs: replace open coded while loop with proper construct
btrfs_inc_block_group_ro wants to ensure that the current transaction is
not running dirty block groups, if it is it waits and loops again.
That logic is currently implemented using a goto label. Actually using
a proper do {} while() construct doesn't hurt readability nor does it
introduce excessive nesting and makes the relevant code stand out by
being encompassed in the loop construct. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov 20bbf20e95 btrfs: replace offset_in_entry with in_range
No point in duplicating the functionality just use the generic helper
that has the same semantics.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov cca5de97ae btrfs: make find_desired_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Nikolay Borisov bfc78479eb btrfs: make btrfs_replace_file_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Qu Wenruo 0b3dcd131d btrfs: fix comment for btrfs ordered extent flag bits
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit 3c198fe064 ("btrfs: rework the order of
btrfs_ordered_extent::flags") but the fixup did not get merged in time.

The 4 types are for ordered extent itself, not for direct io.
Only 3 types support direct io, REGULAR/NOCOW/PREALLOC.

Fix the comment to reflect that.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:14 +02:00
Miklos Szeredi 97fc297754 btrfs: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: David Sterba <dsterba@suse.com>
2021-04-12 15:04:29 +02:00
Linus Torvalds 7d90072491 for-5.12-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBy9DoACgkQxWXV+ddt
 WDtqdxAAnK4zx79k5ok6nlj8JlOfReimX4wPYYigiiKGY40cfQUZ1YUqbDscvrt+
 cbzvqJuMU/V/UVaPW/CLmNi5XpNlSmj0229iwy59BIcpXfgtAMTsa1zsY4teZ/AT
 3noNuT15CTeybwii0nT++AkqJbCbwXc5ItccGh9ZMOQwXuA5IUVTAzKrulUJoxXN
 zt23lX/ivtSfUH+pMMIG6wMVG2eGIP5m9drw+2n0yK08gt+oprLYnaAaE389mXgb
 TIRBafeBY7UA1YEcA4JDBDMNa0L8yWSV+XiMhxw7Ear7KoROAunKNbsG8USll6zb
 zBftfO+Gzv86wVvvPXg2KR8Qs9vyJMw2bOROFKzOnd+wQQ76v0XefOhNUUN98E6g
 tLTmCH+M1B1Qm1j2hVyOect/PMY51xqJA9xwlTtAbqIcz4qyOtfTR9KqqlWxVKJW
 9pAEMII063xEKVxgv2khOhewEjOgqa4v9YFQjVXdcHPKvGTAYBeoJA735+WnQ1HZ
 okPC5k3DoEcVZEkUPvespEsAqm+RoBufNxWmQ7hq5N3IwZAXsIwTlhysgrXQWyc9
 aTigWBq6rQ/bMz/57vI626+MAMh3StL+UOxlWiT+GToInpjZwoxZ0lgQdD6vUfUm
 T90T2930+PTkykQM9sNdQygGiH0J5FzkvneYvpkOYJ/+vphsRiA=
 =MuRt
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more patch that we'd like to get to 5.12 before release.

  It's changing where and how the superblock is stored in the zoned
  mode. It is an on-disk format change but so far there are no
  implications for users as the proper mkfs support hasn't been merged
  and is waiting for the kernel side to settle.

  Until now, the superblocks were derived from the zone index, but zone
  size can differ per device. This is changed to be based on fixed
  offset values, to make it independent of the device zone size.

  The work on that got a bit delayed, we discussed the exact locations
  to support potential device sizes and usecases. (Partially delayed
  also due to my vacation.) Having that in the same release where the
  zoned mode is declared usable is highly desired, there are userspace
  projects that need to be updated to recognize the feature. Pushing
  that to the next release would make things harder to test"

* tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: move superblock logging zone location
2021-04-11 11:53:36 -07:00
Naohiro Aota 53b74fa990 btrfs: zoned: move superblock logging zone location
Moves the location of the superblock logging zones. The new locations of
the logging zones are now determined based on fixed block addresses
instead of on fixed zone numbers.

The old placement method based on fixed zone numbers causes problems when
one needs to inspect a file system image without access to the drive zone
information. In such case, the super block locations cannot be reliably
determined as the zone size is unknown. By locating the superblock logging
zones using fixed addresses, we can scan a dumped file system image without
the zone information since a super block copy will always be present at or
after the fixed known locations.

Introduce the following three pairs of zones containing fixed offset
locations, regardless of the device zone size.

  - primary superblock: offset   0B (and the following zone)
  - first copy:         offset 512G (and the following zone)
  - Second copy:        offset   4T (4096G, and the following zone)

If a logging zone is outside of the disk capacity, we do not record the
superblock copy.

The first copy position is much larger than for a non-zoned filesystem,
which is at 64M.  This is to avoid overlapping with the log zones for
the primary superblock. This higher location is arbitrary but allows
supporting devices with very large zone sizes, plus some space around in
between.

Such large zone size is unrealistic and very unlikely to ever be seen in
real devices. Currently, SMR disks have a zone size of 256MB, and we are
expecting ZNS drives to be in the 1-4GB range, so this limit gives us
room to breathe. For now, we only allow zone sizes up to 8GB. The
maximum zone size that would still fit in the space is 256G.

The fixed location addresses are somewhat arbitrary, with the intent of
maintaining superblock reliability for smaller and larger devices, with
the preference for the latter. For this reason, there are two superblocks
under the first 1T. This should cover use cases for physical devices and
for emulated/device-mapper devices.

The superblock logging zones are reserved for superblock logging and
never used for data or metadata blocks. Note that we only reserve the
two zones per primary/copy actually used for superblock logging. We do
not reserve the ranges of zones possibly containing superblocks with the
largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G).

The zones containing the fixed location offsets used to store
superblocks on a non-zoned volume are also reserved to avoid confusion.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-10 12:13:16 +02:00
Sami Tolvanen 4f0f586bf0 treewide: Change list_sort to use const pointers
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
2021-04-08 16:04:22 -07:00
Linus Torvalds 701c09c988 for-5.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBctBgACgkQxWXV+ddt
 WDu1nA//bzuPwW3nO+enE+ipi4t6UJTJpHLeDgdMshWwhBIHVt+oFxTUIt4Zd0kT
 0hJ+mbNrZHzmDmzpb6ifQn0D6k+wq6zbsEgLtwgmPmBszaXIw46FvnYnxd9FtCde
 9SQzBKa86i/KMkRtaIvpUcunniIo5Aj0Hvu0oPgTKObqiB4HP2nV6rKody+mP9JW
 RanWbBi0JvI4UE/J2Ud1sNWFdDtVpXpcktj1dsI8gbsYNR05HpM08SEUgeF/ts3I
 yB/L18I5CUeFHyo/yogbj7kkikugPGsmOj/A86UZ6x3NxWoC+m7UXoGrO2/qlFem
 qd3ioXZKlnPqeX29kAy/REa3xjE61istlDVC/vckqmXBfYc6WK/KAJvFAGI+/3VI
 9HvIbBokUQzekhFlA02RTqGcasStXX7VSeJyzyAbXjGhZQKfFTHR8ZBtrREiVBC9
 58K+g8SSqIb/9iJqYV4h82lSBRSdf9kHx7CSB2gOBuifihY+chVr4Xzhq12IlXbK
 TNlue0BTwYLJStwx2dnY2beLbLG34/4FNRsuAR/9JsCio7Bfj0qN8htIyvfsiMxr
 mkrH7+Ykd10FqC8uu6MHiW9k428871Era3B97TgyQ0V17ehh4IN0v9V7kckk9EWw
 3omaPwuF2FGfFOoTR7ipKO0nDx0/y2knnDSTsWknNG09Ciwa+Ww=
 =SuJv
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes for issues that have some user visibility and are simple enough
  for this time of development cycle:

   - a few fixes for rescue= mount option, adding more checks for
     missing trees

   - fix sleeping in atomic context on qgroup deletion

   - fix subvolume deletion on mount

   - fix build with M= syntax

   - fix checksum mismatch error message for direct io"

* tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix check_data_csum() error message for direct I/O
  btrfs: fix sleep while in non-sleep context during qgroup removal
  btrfs: fix subvolume/snapshot deletion not triggered on mount
  btrfs: fix build when using M=fs/btrfs
  btrfs: do not initialize dev replace for bad dev root
  btrfs: initialize device::fs_info always
  btrfs: do not initialize dev stats if we have no dev_root
  btrfs: zoned: remove outdated WARN_ON in direct IO
2021-03-25 15:38:22 -07:00
Linus Torvalds 81aa0968b7 for-5.12-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBTeBsACgkQxWXV+ddt
 WDtwcBAAoto5Pbc3Lvt0aha3qn9q/Ms9lNU3YIwTjqXV3lIRKksWCS7kQmWlFmLz
 dILhdRBg1iWVh8qbeqpL5su7yNJduypsY/ImJroukb/BzwQViFRDGy5qIc56qLH2
 OVTx4LQ0zdqVdD86Qj0mt9ilSjgXYN+J53IUjsSSyJIpgt3vVcfjCYSkFO8zBiMH
 eliRtYShzJHkjEwVWLZRzk76oTnFQEC28IdYJ4y95mYl2wCABfTU2ylSeVDTtc6O
 x+fNMHHRmde2nbsHc+0eMm7rYLXuzvyx/tY17u6A6iwEQLGjE4rXOVZ7kA93WgAd
 YTXhM/B+YFfirNh029Av/MJP+2t9YBEODAHl1tnOdM0mfvXkpimaW0jvUEhi5f6I
 ZGu5FytscsgjyUK827WL7bZKO8WMzTLQvB3ryZ9UcrHm3QbZ7xGdoBE2L86p4Euw
 LiXUALdOWeYjFKSW9WWKrtQBtdjlLQYqJt+hL0ifaGlnfoi2G+DQeKtL9ZAKH5Cu
 gcjDUewnJtYPLyDOCRjQPFcts/MD5o81qMLeEwshmZT/bNMD9JOGEppCxBWGWSCx
 dYGq04Wib/dN710i5jB1XbJboBmT2SZDyBeiKTpCXs5mECBU00uWkkO98oId1YS3
 wHu9qyGUOi2g88V27jH593/JstUYn6zyxJYIZX84mzcxOqZlKuo=
 =auMP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "There are still regressions being found and fixed in the zoned mode
  and subpage code, the rest are fixes for bugs reported by users.

  Regressions:

   - subpage block support:
      - readahead works on the proper block size
      - fix last page zeroing

   - zoned mode:
      - linked list corruption for tree log

  Fixes:

   - qgroup leak after falloc failure

   - tree mod log and backref resolving:
      - extent buffer cloning race when resolving backrefs
      - pin deleted leaves with active tree mod log users

   - drop debugging flag from slab cache"

* tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: always pin deleted leaves when there are active tree mod log users
  btrfs: fix race when cloning extent buffer during rewind of an old root
  btrfs: fix slab cache flags for free space tree bitmap
  btrfs: subpage: make readahead work properly
  btrfs: subpage: fix wild pointer access during metadata read failure
  btrfs: zoned: fix linked list corruption after log root tree allocation failure
  btrfs: fix qgroup data rsv leak caused by falloc failure
  btrfs: track qgroup released data in own variable in insert_prealloc_file_extent
  btrfs: fix wrong offset to zero out range beyond i_size
2021-03-18 13:38:42 -07:00
Omar Sandoval c1d6abdac4 btrfs: fix check_data_csum() error message for direct I/O
Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
check_data_csum()") replaced the start parameter to check_data_csum()
with page_offset(), but page_offset() is not meaningful for direct I/O
pages. Bring back the start parameter.

Fixes: 265d4ac03f ("btrfs: sink parameter start and len to check_data_csum")
CC: stable@vger.kernel.org # 5.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-18 21:25:11 +01:00
Filipe Manana 0bb7883009 btrfs: fix sleep while in non-sleep context during qgroup removal
While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
producing the following trace:

  [821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
  [821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
  [821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G        W         5.11.6 #15
  [821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
  [821.843647] Call Trace:
  [821.843650]  dump_stack+0xa1/0xfb
  [821.843656]  ___might_sleep+0x144/0x160
  [821.843659]  mutex_lock+0x17/0x40
  [821.843662]  kernfs_remove_by_name_ns+0x1f/0x80
  [821.843666]  sysfs_remove_group+0x7d/0xe0
  [821.843668]  sysfs_remove_groups+0x28/0x40
  [821.843670]  kobject_del+0x2a/0x80
  [821.843672]  btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
  [821.843685]  __del_qgroup_rb+0x12/0x150 [btrfs]
  [821.843696]  btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
  [821.843707]  btrfs_ioctl+0x3129/0x36a0 [btrfs]
  [821.843717]  ? __mod_lruvec_page_state+0x5e/0xb0
  [821.843719]  ? page_add_new_anon_rmap+0xbc/0x150
  [821.843723]  ? kfree+0x1b4/0x300
  [821.843725]  ? mntput_no_expire+0x55/0x330
  [821.843728]  __x64_sys_ioctl+0x5a/0xa0
  [821.843731]  do_syscall_64+0x33/0x70
  [821.843733]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [821.843736] RIP: 0033:0x4cd3fb
  [821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
  [821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
  [821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
  [821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
  [821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049

Fix this by removing the qgroup sysfs entry while not holding the spinlock,
since the spinlock is only meant for protection of the qgroup rbtree.

Reported-by: Stuart Shelton <srcshelton@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7A5485BB-0628-419D-A4D3-27B1AF47E25A@gmail.com/
Fixes: 49e5fb4621 ("btrfs: qgroup: export qgroups in sysfs")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-18 16:53:24 +01:00
Filipe Manana 8d488a8c7b btrfs: fix subvolume/snapshot deletion not triggered on mount
During the mount procedure we are calling btrfs_orphan_cleanup() against
the root tree, which will find all orphans items in this tree. When an
orphan item corresponds to a deleted subvolume/snapshot (instead of an
inode space cache), it must not delete the orphan item, because that will
cause btrfs_find_orphan_roots() to not find the orphan item and therefore
not add the corresponding subvolume root to the list of dead roots, which
results in the subvolume's tree never being deleted by the cleanup thread.

The same applies to the remount from RO to RW path.

Fix this by making btrfs_find_orphan_roots() run before calling
btrfs_orphan_cleanup() against the root tree.

A test case for fstests will follow soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
Fixes: 638331fa56 ("btrfs: fix transaction leak and crash after cleaning up orphans on RO mount")
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:22 +01:00
David Sterba ebd99a6b34 btrfs: fix build when using M=fs/btrfs
There are people building the module with M= that's supposed to be used
for external modules. This got broken in e9aa7c285d ("btrfs: enable
W=1 checks for btrfs").

  $ make M=fs/btrfs
  scripts/Makefile.lib:10: *** Recursive variable 'KBUILD_CFLAGS' references itself (eventually).  Stop.
  make: *** [Makefile:1755: modules] Error 2

There's a difference compared to 'make fs/btrfs/btrfs.ko' which needs
to rebuild a few more things and also the dependency modules need to be
available. It could fail with eg.

  WARNING: Symbol version dump "Module.symvers" is missing.
	   Modules may not have dependencies or modversions.

In some environments it's more convenient to rebuild just the btrfs
module by M= so let's make it work.

The problem is with recursive variable evaluation in += so the
conditional C options are stored in a temporary variable to avoid the
recursion.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:18 +01:00
Josef Bacik 3cb894972f btrfs: do not initialize dev replace for bad dev root
While helping Neal fix his broken file system I added a debug patch to
catch if we were calling btrfs_search_slot with a NULL root, and this
stack trace popped:

  we tried to search with a NULL root
  CPU: 0 PID: 1760 Comm: mount Not tainted 5.11.0-155.nealbtrfstest.1.fc34.x86_64 #1
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
  Call Trace:
   dump_stack+0x6b/0x83
   btrfs_search_slot.cold+0x11/0x1b
   ? btrfs_init_dev_replace+0x36/0x450
   btrfs_init_dev_replace+0x71/0x450
   open_ctree+0x1054/0x1610
   btrfs_mount_root.cold+0x13/0xfa
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x131/0x3d0
   ? legacy_get_tree+0x27/0x40
   ? btrfs_show_options+0x640/0x640
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   path_mount+0x441/0xa80
   __x64_sys_mount+0xf4/0x130
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f644730352e

Fix this by not starting the device replace stuff if we do not have a
NULL dev root.

Reported-by: Neal Gompa <ngompa13@gmail.com>
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:14 +01:00
Josef Bacik 820a49dafc btrfs: initialize device::fs_info always
Neal reported a panic trying to use -o rescue=all

  BUG: kernel NULL pointer dereference, address: 0000000000000030
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 696 Comm: mount Tainted: G        W         5.12.0-rc2+ #296
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200
  RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070
  RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400
  RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150
  R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000
  R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150
  FS:  00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0
  Call Trace:
   ? btrfs_init_dev_stats+0x1f/0xf0
   ? kmem_cache_alloc+0xef/0x1f0
   btrfs_init_dev_stats+0x5f/0xf0
   open_ctree+0x10cb/0x1720
   btrfs_mount_root.cold+0x12/0xea
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   path_mount+0x433/0xa00
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens because when we call btrfs_init_dev_stats we do
device->fs_info->dev_root.  However device->fs_info isn't initialized
because we were only calling btrfs_init_devices_late() if we properly
read the device root.  However we don't actually need the device root to
init the devices, this function simply assigns the devices their
->fs_info pointer properly, so this needs to be done unconditionally
always so that we can properly dereference device->fs_info in rescue
cases.

Reported-by: Neal Gompa <ngompa13@gmail.com>
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:12 +01:00
Josef Bacik 82d62d06db btrfs: do not initialize dev stats if we have no dev_root
Neal reported a panic trying to use -o rescue=all

  BUG: kernel NULL pointer dereference, address: 0000000000000030
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1
  RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0
  RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10
  RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000
  RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000
  R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00
  R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10
  FS:  00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0
  Call Trace:
  ? btrfs_init_dev_stats+0x1f/0xf0
  btrfs_init_dev_stats+0x62/0xf0
  open_ctree+0x1019/0x15ff
  btrfs_mount_root.cold+0x13/0xfa
  legacy_get_tree+0x27/0x40
  vfs_get_tree+0x25/0xb0
  vfs_kern_mount.part.0+0x71/0xb0
  btrfs_mount+0x131/0x3d0
  ? legacy_get_tree+0x27/0x40
  ? btrfs_show_options+0x640/0x640
  legacy_get_tree+0x27/0x40
  vfs_get_tree+0x25/0xb0
  path_mount+0x441/0xa80
  __x64_sys_mount+0xf4/0x130
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f698c04e52e

This happens because we unconditionally attempt to initialize device
stats on mount, but we may not have been able to read the device root.
Fix this by skipping initializing the device stats if we do not have a
device root.

Reported-by: Neal Gompa <ngompa13@gmail.com>
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:09 +01:00
Johannes Thumshirn f3da882eae btrfs: zoned: remove outdated WARN_ON in direct IO
In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if
we're submitting a DIO write on a zoned filesystem but are not using
REQ_OP_ZONE_APPEND to submit the IO to the block device.

This is a left over from a previous version where btrfs_dio_iomap_begin()
didn't use btrfs_use_zone_append() to check for sequential write only
zones.

It is an oversight from the development phase. In v11 (I think) I've
added 08f455593f ("btrfs: zoned: cache if block group is on a
sequential zone") and forgot to remove the WARN_ON_ONCE() for
544d24f9de ("btrfs: zoned: enable zone append writing for direct IO").

When developing auto relocation I got hit by the WARN as a block groups
where relocated to conventional zone and the dio code calls
btrfs_use_zone_append() introduced by 08f455593f to check if it can
use zone append (a.k.a. if it's a sequential zone) or not and sets the
appropriate flags for iomap.

I've never hit it in testing before, as I was relying on emulation to
test the conventional zones code but this one case wasn't hit, because
on emulation fs_info->max_zone_append_size is 0 and the WARN doesn't
trigger either.

Fixes: 544d24f9de ("btrfs: zoned: enable zone append writing for direct IO")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:41:51 +01:00
Filipe Manana 485df75554 btrfs: always pin deleted leaves when there are active tree mod log users
When freeing a tree block we may end up adding its extent back to the
free space cache/tree, as long as there are no more references for it,
it was created in the current transaction and writeback for it never
happened. This is generally fine, however when we have tree mod log
operations it can result in inconsistent versions of a btree after
unwinding extent buffers with the recorded tree mod log operations.

This is because:

* We only log operations for nodes (adding and removing key/pointers),
  for leaves we don't do anything;

* This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
  for a node that points to a leaf that was deleted;

* Before we apply the logged operation to unwind a node, we can have
  that leaf's extent allocated again, either as a node or as a leaf, and
  possibly for another btree. This is possible if the leaf was created in
  the current transaction and writeback for it never started, in which
  case btrfs_free_tree_block() returns its extent back to the free space
  cache/tree;

* Then, before applying the tree mod log operation, some task allocates
  the metadata extent just freed before, and uses it either as a leaf or
  as a node for some btree (can be the same or another one, it does not
  matter);

* After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
  get the target node with an item pointing to the metadata extent that
  now has content different from what it had before the leaf was deleted.
  It might now belong to a different btree and be a node and not a leaf
  anymore.

  As a consequence, the results of searches after the unwinding can be
  unpredictable and produce unexpected results.

So make sure we pin extent buffers corresponding to leaves when there
are tree mod log users.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 20:32:22 +01:00
Filipe Manana dbcc7d57bf btrfs: fix race when cloning extent buffer during rewind of an old root
While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:

  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:1210!
  invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
  Code: 05 48 8d 74 10 (...)
  RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
  RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
  RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
  RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
  R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
  R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
  FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
  Call Trace:
   btrfs_search_old_slot+0x265/0x10d0
   ? lock_acquired+0xbb/0x600
   ? btrfs_search_slot+0x1090/0x1090
   ? free_extent_buffer.part.61+0xd7/0x140
   ? free_extent_buffer+0x13/0x20
   resolve_indirect_refs+0x3e9/0xfc0
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? add_prelim_ref.part.11+0x150/0x150
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? lock_acquired+0xbb/0x600
   ? __kasan_check_write+0x14/0x20
   ? do_raw_spin_unlock+0xa8/0x140
   ? rb_insert_color+0x30/0x360
   ? prelim_ref_insert+0x12d/0x430
   find_parent_nodes+0x5c3/0x1830
   ? resolve_indirect_refs+0xfc0/0xfc0
   ? lock_release+0xc8/0x620
   ? fs_reclaim_acquire+0x67/0xf0
   ? lock_acquire+0xc7/0x510
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x160/0x210
   ? lock_release+0xc8/0x620
   ? fs_reclaim_acquire+0x67/0xf0
   ? lock_acquire+0xc7/0x510
   ? poison_range+0x38/0x40
   ? unpoison_range+0x14/0x40
   ? trace_hardirqs_on+0x55/0x120
   btrfs_find_all_roots_safe+0x142/0x1e0
   ? find_parent_nodes+0x1830/0x1830
   ? btrfs_inode_flags_to_xflags+0x50/0x50
   iterate_extent_inodes+0x20e/0x580
   ? tree_backref_for_extent+0x230/0x230
   ? lock_downgrade+0x3d0/0x3d0
   ? read_extent_buffer+0xdd/0x110
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? lock_acquired+0xbb/0x600
   ? __kasan_check_write+0x14/0x20
   ? _raw_spin_unlock+0x22/0x30
   ? __kasan_check_write+0x14/0x20
   iterate_inodes_from_logical+0x129/0x170
   ? iterate_inodes_from_logical+0x129/0x170
   ? btrfs_inode_flags_to_xflags+0x50/0x50
   ? iterate_extent_inodes+0x580/0x580
   ? __vmalloc_node+0x92/0xb0
   ? init_data_container+0x34/0xb0
   ? init_data_container+0x34/0xb0
   ? kvmalloc_node+0x60/0x80
   btrfs_ioctl_logical_to_ino+0x158/0x230
   btrfs_ioctl+0x205e/0x4040
   ? __might_sleep+0x71/0xe0
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? getrusage+0x4b6/0x9c0
   ? __kasan_check_read+0x11/0x20
   ? lock_release+0xc8/0x620
   ? __might_fault+0x64/0xd0
   ? lock_acquire+0xc7/0x510
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? __kasan_check_read+0x11/0x20
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? __kasan_check_read+0x11/0x20
   ? lock_release+0xc8/0x620
   ? __task_pid_nr_ns+0xd3/0x250
   ? lock_acquire+0xc7/0x510
   ? __fget_files+0x160/0x230
   ? __fget_light+0xf2/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7fd1976e2427
  Code: 00 00 90 48 8b 05 (...)
  RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
  RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
  RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
  R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
  R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
  Modules linked in:
  ---[ end trace ec8931a1c36e57be ]---

  (gdb) l *(__tree_mod_log_rewind+0x3b1)
  0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
  1205                     * the modification. as we're going backwards, we do the
  1206                     * opposite of each operation here.
  1207                     */
  1208                    switch (tm->op) {
  1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
  1210                            BUG_ON(tm->slot < n);
  1211                            fallthrough;
  1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
  1213                    case MOD_LOG_KEY_REMOVE:
  1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);

Here's what happens to hit that BUG_ON():

1) We have one tree mod log user (through fiemap or the logical ino ioctl),
   with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;

2) Another task is at ctree.c:balance_level() and we have eb X currently as
   the root of the tree, and we promote its single child, eb Y, as the new
   root.

   Then, at ctree.c:balance_level(), we call:

      tree_mod_log_insert_root(eb X, eb Y, 1);

3) At tree_mod_log_insert_root() we create tree mod log elements for each
   slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
   with a ->logical pointing to ebX->start. These are placed in an array
   named tm_list.
   Lets assume there are N elements (N pointers in eb X);

4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
   element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
   ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
   to the level of eb X and ->generation set to the generation of eb X;

5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
   tm_list as argument. After that, tree_mod_log_free_eb() calls
   __tree_mod_log_insert() for each member of tm_list in reverse order,
   from highest slot in eb X, slot N - 1, to slot 0 of eb X;

6) __tree_mod_log_insert() sets the sequence number of each given tree mod
   log operation - it increments fs_info->tree_mod_seq and sets
   fs_info->tree_mod_seq as the sequence number of the given tree mod log
   operation.

   This means that for the tm_list created at tree_mod_log_insert_root(),
   the element corresponding to slot 0 of eb X has the highest sequence
   number (1 + N), and the element corresponding to the last slot has the
   lowest sequence number (2);

7) Then, after inserting tm_list's elements into the tree mod log rbtree,
   the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
   sequence number, which is N + 2;

8) Back to ctree.c:balance_level(), we free eb X by calling
   btrfs_free_tree_block() on it. Because eb X was created in the current
   transaction, has no other references and writeback did not happen for
   it, we add it back to the free space cache/tree;

9) Later some other task T allocates the metadata extent from eb X, since
   it is marked as free space in the space cache/tree, and uses it as a
   node for some other btree;

10) The tree mod log user task calls btrfs_search_old_slot(), which calls
    get_old_root(), and finally that calls __tree_mod_log_oldest_root()
    with time_seq == 1 and eb_root == eb Y;

11) First iteration of the while loop finds the tree mod log element with
    sequence number N + 2, for the logical address of eb Y and of type
    MOD_LOG_ROOT_REPLACE;

12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
    of the loop, and set root_logical to point to tm->old_root.logical
    which corresponds to the logical address of eb X;

13) On the next iteration of the while loop, the call to
    tree_mod_log_search_oldest() returns the smallest tree mod log element
    for the logical address of eb X, which has a sequence number of 2, an
    operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
    the old slot N - 1 of eb X (eb X had N items in it before being freed);

14) We then break out of the while loop and return the tree mod log operation
    of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
    eb X, to get_old_root();

15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
    and set "logical" to the logical address of eb X, which was the old
    root. We then call tree_mod_log_search() passing it the logical
    address of eb X and time_seq == 1;

16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
    which results in adding a tree mod log operation of type
    MOD_LOG_KEY_ADD to the tree mod log - this is done at
    ctree.c:insert_ptr() - but after adding the tree mod log operation
    and before updating the number of items in eb X from 0 to 1...

17) The task at get_old_root() calls tree_mod_log_search() and gets the
    tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
    Then it enters the following if branch:

    if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
       (...)
    } (...)

    Calls read_tree_block() for eb X, which gets a reference on eb X but
    does not lock it - task T has it locked.
    Then it clones eb X while it has nritems set to 0 in its header, before
    task T sets nritems to 1 in eb X's header. From hereupon we use the
    clone of eb X which no other task has access to;

18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
    mod log operation we just got from tree_mod_log_search() in the
    previous step and the cloned version of eb X;

19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
    of items set in eb X's clone, which is 0. Then we enter the while loop,
    and in its first iteration we process the MOD_LOG_KEY_ADD operation,
    which just decrements "n" from 0 to (u32)-1, since "n" is declared with
    a type of u32. At the end of this iteration we call rb_next() to find the
    next tree mod log operation for eb X, that gives us the mod log operation
    of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
    number of N + 1 (steps 3 to 6);

20) Then we go back to the top of the while loop and trigger the following
    BUG_ON():

        (...)
        switch (tm->op) {
        case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                 BUG_ON(tm->slot < n);
                 fallthrough;
        (...)

    Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.

Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Fixes: 834328a849 ("Btrfs: tree mod log's old roots could still be part of the tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 20:32:17 +01:00
David Sterba 34e49994d0 btrfs: fix slab cache flags for free space tree bitmap
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
that's a debugging flag and not always enabled. Also the other slabs are
created with at least SLAB_MEM_SPREAD that we want as well to average
the memory placement cost.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 3acd48507d ("btrfs: fix allocation of free space cache v1 bitmap pages")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 20:32:08 +01:00
Qu Wenruo 60484cd9d5 btrfs: subpage: make readahead work properly
In readahead infrastructure, we are using a lot of hard coded PAGE_SHIFT
while we're not doing anything specific to PAGE_SIZE.

One of the most affected part is the radix tree operation of
btrfs_fs_info::reada_tree.

If using PAGE_SHIFT, subpage metadata readahead is broken and does no
help reading metadata ahead.

Fix the problem by using btrfs_fs_info::sectorsize_bits so that
readahead could work for subpage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 11:06:21 +01:00
Qu Wenruo d9bb77d51e btrfs: subpage: fix wild pointer access during metadata read failure
[BUG]
When running fstests for btrfs subpage read-write test, it has a very
high chance to crash at generic/475 with the following stack:

 BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10
 Unable to handle kernel paging request at virtual address ffff80001157e7c0
 CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G        WC        5.12.0-rc2-custom+ #5
 Hardware name: Khadas VIM3 (DT)
 Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
 pc : queued_spin_lock_slowpath+0x1a0/0x390
 lr : do_raw_spin_lock+0xc4/0x11c
 Call trace:
  queued_spin_lock_slowpath+0x1a0/0x390
  _raw_spin_lock+0x68/0x84
  btree_readahead_hook+0x38/0xc0 [btrfs]
  end_bio_extent_readpage+0x504/0x5f4 [btrfs]
  bio_endio+0x170/0x1a4
  end_workqueue_fn+0x3c/0x60 [btrfs]
  btrfs_work_helper+0x1b0/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30
 Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827)

[CAUSE]
In end_bio_extent_readpage(), if we hit an error during read, we will
handle the error differently for data and metadata.
For data we queue a repair, while for metadata, we record the error and
let the caller choose what to do.

But the code is still using page->private to grab extent buffer, which
no longer points to extent buffer for subpage metadata pages.

Thus this wild pointer access leads to above crash.

[FIX]
Introduce a helper, find_extent_buffer_readpage(), to grab extent
buffer.

The difference against find_extent_buffer_nospinlock() is:

- Also handles regular sectorsize == PAGE_SIZE case
- No extent buffer refs increase/decrease
  As extent buffer under IO must have non-zero refs, so this is safe

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-16 11:06:18 +01:00
Filipe Manana e3d3b41576 btrfs: zoned: fix linked list corruption after log root tree allocation failure
When using a zoned filesystem, while syncing the log, if we fail to
allocate the root node for the log root tree, we are not removing the
log context we allocated on stack from the list of log contexts of the
log root tree. This means after the return from btrfs_sync_log() we get
a corrupted linked list.

Fix this by allocating the node before adding our stack allocated context
to the list of log contexts of the log root tree.

Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15 16:57:19 +01:00
Qu Wenruo a3ee79bd8f btrfs: fix qgroup data rsv leak caused by falloc failure
[BUG]
When running fsstress with only falloc workload, and a very low qgroup
limit set, we can get qgroup data rsv leak at unmount time.

 BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480
 BTRFS error (device dm-0): qgroup reserved space leaked

The minimal reproducer looks like:

  #!/bin/bash
  dev=/dev/test/test
  mnt="/mnt/btrfs"
  fsstress=~/xfstests-dev/ltp/fsstress
  runtime=8

  workload()
  {
          umount $dev &> /dev/null
          umount $mnt &> /dev/null
          mkfs.btrfs -f $dev > /dev/null
          mount $dev $mnt

          btrfs quota en $mnt
          btrfs quota rescan -w $mnt
          btrfs qgroup limit 16m 0/5 $mnt

          $fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \
  		-d $mnt -v > /tmp/fsstress

          umount $mnt
          if dmesg | grep leak ; then
		echo "!!! FAILED !!!"
  		exit 1
          fi
  }

  for (( i=0; i < $runtime; i++)); do
          echo "=== $i/$runtime==="
          workload
  done

Normally it would fail before round 4.

[CAUSE]
In function insert_prealloc_file_extent(), we first call
btrfs_qgroup_release_data() to know how many bytes are reserved for
qgroup data rsv.

Then use that @qgroup_released number to continue our work.

But after we call btrfs_qgroup_release_data(), we should either queue
@qgroup_released to delayed ref or free them manually in error path.

Unfortunately, we lack the error handling to free the released bytes,
leaking qgroup data rsv.

All the error handling function outside won't help at all, as we have
released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED
bit is already cleared, thus all btrfs_qgroup_free_data() call won't
free any data rsv.

[FIX]
Add free_qgroup tag to manually free the released qgroup data rsv.

Reported-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Fixes: 9729f10a60 ("btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15 16:57:15 +01:00
Qu Wenruo fbf48bb0b1 btrfs: track qgroup released data in own variable in insert_prealloc_file_extent
There is a piece of weird code in insert_prealloc_file_extent(), which
looks like:

	ret = btrfs_qgroup_release_data(inode, file_offset, len);
	if (ret < 0)
		return ERR_PTR(ret);
	if (trans) {
		ret = insert_reserved_file_extent(trans, inode,
						  file_offset, &stack_fi,
						  true, ret);
	...
	}
	extent_info.is_new_extent = true;
	extent_info.qgroup_reserved = ret;
	...

Note how the variable @ret is abused here, and if anyone is adding code
just after btrfs_qgroup_release_data() call, it's super easy to
overwrite the @ret and cause tons of qgroup related bugs.

Fix such abuse by introducing new variable @qgroup_released, so that we
won't reuse the existing variable @ret.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15 16:57:12 +01:00
Qu Wenruo d2dcc8ed8e btrfs: fix wrong offset to zero out range beyond i_size
[BUG]
The test generic/091 fails , with the following output:

  fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
  mapped writes DISABLED
  Seed set to 1
  main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
  main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
  skipping zero size read
  truncating to largest ever: 0xe400
  copying to largest ever: 0x1f400
  cloning to largest ever: 0x70000
  cloning to largest ever: 0x77000
  fallocating to largest ever: 0x7a120
  Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<<
  ...

[CAUSE]
In commit c28ea613fa ("btrfs: subpage: fix the false data csum mismatch error")
end_bio_extent_readpage() changes to only zero the range inside the bvec
for incoming subpage support.

But that commit is using incorrect offset to calculate the start.

For subpage, we can have a case that the whole bvec is beyond isize,
thus we need to calculate the correct offset.

But the offending commit is using @end (bvec end), other than @start
(bvec start) to calculate the start offset.

This means, we only zero the last byte of the bvec, not from the isize.
This stupid bug makes the range beyond isize is not properly zeroed, and
failed above test.

[FIX]
Use correct @start to calculate the range start.

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: c28ea613fa ("btrfs: subpage: fix the false data csum mismatch error")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-15 16:56:54 +01:00
Linus Torvalds ce307084c9 block-5.12-2021-03-12-v2
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBLzKsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi0ID/9djN1db0OrAjQgWdOQsKwzcPG4fmVRHJAu
 Zi8SPRj0ByonWGaPWjiSi297/j00dfYFFIXaB1Pfo4j0wX0IK8bJINl0G8SN6Dag
 WYBBrT/5rCQgD8fjQ1XhuzuqLwxwcZfYXAnCAlqABG18nPk532D4dX2CMEasl8F7
 XWTTj5PqHDN4bCcriH1GEA5S+2nmoz5YXjNZEDcY3/pQMdyb8Jo9mRfZubkrnRxK
 c9fz2LjUz0IRaSb+9PILY5qDLOSIh+vHOIk/3BKW9DoqU/S3kTTr4twqnOclfVPH
 VgJM9b+sHveVCztCJ9bnNGkW7HWjUQa8gb/B40NBxKEhw7w/HCjykhhxd+QTUQTM
 GJVMRGYWhzuUEuU1M1hArPua0GLmPKSvC0CRgbKRmgPNjshTquZPJnBBFwv2wZKQ
 GkrwktdK9ihE1ya4gu20MupST3PIpT3jtc6NAizr6DCy0wJ0Z1X5KYnFdbtS79No
 I9qPC8lu3AcZq6NXdBfTO9ngIdiUwi9AfSYj7koS/4dmnVccVJmaj0/NNmVp2Ro3
 HtaObanBnTi9v8YHl8WgX6lq5RjuQ204fXmd0No4mHFvgxsl7YaX+JBts7S3A2Nf
 PoQLqmulcLmzT3EVuEg279aXw2rbnyWHARbF/5/tIr4JcugtLJhwFnBA5YgFreq9
 lSbqgoKSHw==
 =qHyO
 -----END PGP SIGNATURE-----

Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Mostly just random fixes all over the map.

  The only odd-one-out change is finally getting the rename of
  BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the
  multipage bvec change, but it's been left.

  Do it now to avoid hassles around changes piling up for the next merge
  window.

  Summary:

   - NVMe pull request:
      - one more quirk (Dmitry Monakhov)
      - fix max_zone_append_sectors initialization (Chaitanya Kulkarni)
      - nvme-fc reset/create race fix (James Smart)
      - fix status code on aborts/resets (Hannes Reinecke)
      - fix the CSS check for ZNS namespaces (Chaitanya Kulkarni)
      - fix a use after free in a debug printk in nvme-rdma (Lv Yunlong)

   - Follow-up NVMe error fix for NULL 'id' (Christoph)

   - Fixup for the bd_size_lock being IRQ safe, now that the offending
     driver has been dropped (Damien).

   - rsxx probe failure error return (Jia-Ju)

   - umem probe failure error return (Wei)

   - s390/dasd unbind fixes (Stefan)

   - blk-cgroup stats summing fix (Xunlei)

   - zone reset handling fix (Damien)

   - Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph)

   - Suppress uevent trigger for hidden devices (Daniel)

   - Fix handling of discard on busy device (Jan)

   - Fix stale cache issue with zone reset (Shin'ichiro)"

* tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block:
  nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
  block: Discard page cache of zone reset target range
  block: Suppress uevent for hidden device when removed
  block: rename BIO_MAX_PAGES to BIO_MAX_VECS
  nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
  nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
  nvme-core: check ctrl css before setting up zns
  nvme-fc: fix racing controller reset and create association
  nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
  nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
  nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
  nvme: simplify error logic in nvme_validate_ns()
  nvme: set max_zone_append_sectors nvme_revalidate_zones
  block: rsxx: fix error return code of rsxx_pci_probe()
  block: Fix REQ_OP_ZONE_RESET_ALL handling
  umem: fix error return code in mm_pci_probe()
  blk-cgroup: Fix the recursive blkg rwstat
  s390/dasd: fix hanging IO request during DASD driver unbind
  s390/dasd: fix hanging DASD driver unbind
  block: Try to handle busy underlying device on discard
2021-03-12 13:25:49 -08:00
Christoph Hellwig a8affc03a9 block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed.  Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 07:47:48 -07:00
Linus Torvalds f09b04cc64 for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBCOi4ACgkQxWXV+ddt
 WDtXvw//TWx3m05qHJqqG8V90uel8hB2J5vd4CA2r62Je1G8RDho57Bo7fyvL4l+
 mdCPt+INajb0mpp0IoHMtyLHefojgNOsrX6FAK1/gjnLkjRLFZ3wQqkA34Ue9pNs
 2u+rMY6eB105iaS3VejEmiebr++MZfjfQRV+GXU336AEeOEDZdgol8o6jMyde5TO
 zRH9Dni5Sy/YAGGAb0vaoG2BMyVigrqkbjkzwjYChbUj/KuyffAgQj0v8BvsC9Y6
 DnPD5yrt5kSZzuqQFH7c2jxLN0cvW+tJ0znCpnwn/nmiCALbl6y2a4dmewC32TwJ
 II+3OPGpYudafLJEP15qafsJb7LmEfnGwUIrfEZbyb4lQG12uyYOdP3IN7+8td14
 fd29GE62w5aErsmurcMFj/x43k4DIfcqC8b+Y+S27JZF1szh7ExCfoYC/6c5e5Qf
 j6/6RtRSVqdxImRd0QYv3mCIeSG0CH2UR/1otvC81jRTHRyB3r6TV8wPLo+5K/Rk
 ongKZ+BQa5RUk8skdFburhrkDDKgfBcjlexl5Gsqw+D/xTGNAcVnNQrTtW9sTSle
 hB3b7CunXA1eCyui2SIqN1dR8hwao4b9RzYNs3y2jWjSPZD/Bp0BdQ8oxSPvIWkX
 a8kauFGhKhY2Tdqau+CQ4UbbQWzEB7FulkPCOLiHDDZjyxIvAA4=
 =tlU3
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "More regression fixes and stabilization.

  Regressions:

   - zoned mode
      - count zone sizes in wider int types
      - fix space accounting for read-only block groups

   - subpage: fix page tail zeroing

  Fixes:

   - fix spurious warning when remounting with free space tree

   - fix warning when creating a directory with smack enabled

   - ioctl checks for qgroup inheritance when creating a snapshot

   - qgroup
      - fix missing unlock on error path in zero range
      - fix amount of released reservation on error
      - fix flushing from unsafe context with open transaction,
        potentially deadlocking

   - minor build warning fixes"

* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: do not account freed region of read-only block group as zone_unusable
  btrfs: zoned: use sector_t for zone sectors
  btrfs: subpage: fix the false data csum mismatch error
  btrfs: fix warning when creating a directory with smack enabled
  btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
  btrfs: export and rename qgroup_reserve_meta
  btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
  btrfs: fix spurious free_space_tree remount warning
  btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
  btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
  btrfs: ref-verify: use 'inline void' keyword ordering
2021-03-05 12:21:14 -08:00
Naohiro Aota badae9c869 btrfs: zoned: do not account freed region of read-only block group as zone_unusable
We migrate zone unusable bytes to read-only bytes when a block group is
set to read-only, and account all the free region as bytes_readonly.
Thus, we should not increase block_group->zone_unusable when the block
group is read-only.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-04 16:16:58 +01:00
Naohiro Aota d734492a14 btrfs: zoned: use sector_t for zone sectors
We need to use sector_t for zone_sectors, or it would set the zone size
to zero when the size >= 4GB (= 2^24 sectors) by shifting the
zone_sectors value by SECTOR_SHIFT. We're assuming zones sizes up to
8GiB.

Fixes: 5b31646898 ("btrfs: get zone information of zoned block devices")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-04 16:16:55 +01:00
Qu Wenruo c28ea613fa btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).

With some extra debug info added, we have the following traces:

  0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
  0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
  0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
  0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
  0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
  0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
  0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
  0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
  1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
  1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
  1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
  7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0

[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.

But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.

This means for subpage if we have an inode whose content looks like below:

  0       16K     32K     48K     64K
  |///////|       |///////|       |

  |//| = data needs to be read from disk
  |  | = hole

And i_size is 64K initially.

Then the following race can happen:

		T1		|		T2
--------------------------------+--------------------------------
btrfs_do_readpage()		|
|- isize = 64K;			|
|  At this time, the isize is 	|
|  64K				|
|				|
|- submit_extent_page()		|
|  submit previous assembled bio|
|  assemble bio for [0, 16K)	|
|				|
|- submit_extent_page()		|
   submit read bio for [0, 16K) |
   assemble read bio for	|
   [32K, 48K)			|
 				|
				| btrfs_setsize()
				| |- i_size_write(, 16K);
				|    Now i_size is only 16K
end_io() for [0K, 16K)		|
|- end_bio_extent_readpage()	|
   |- btrfs_verify_data_csum()  |
   |  No csum error		|
   |- i_size = 16K;		|
   |- zero_user_segment(16K,	|
      PAGE_SIZE);		|
      !!! We zeroed range	|
      !!! [32K, 48K)		|
				| end_io for [32K, 48K)
				| |- end_bio_extent_readpage()
				|    |- btrfs_verify_data_csum()
				|       ! CSUM MISMATCH !
				|       ! As the range is zeroed now !

[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.

The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 17:48:00 +01:00
Filipe Manana fd57a98d6f btrfs: fix warning when creating a directory with smack enabled
When we have smack enabled, during the creation of a directory smack may
attempt to add a "smack transmute" xattr on the inode, which results in
the following warning and trace:

  WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0
  Modules linked in: nft_objref nf_conntrack_netbios_ns (...)
  CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:start_transaction+0x489/0x4f0
  Code: e9 be fc ff ff (...)
  RSP: 0018:ffffc90001887d10 EFLAGS: 00010202
  RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003
  RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000
  RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004
  R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2
  R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000
  FS:  00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? slab_free_freelist_hook+0xea/0x1b0
   ? trace_hardirqs_on+0x1c/0xe0
   btrfs_setxattr_trans+0x3c/0xf0
   __vfs_setxattr+0x63/0x80
   smack_d_instantiate+0x2d3/0x360
   security_d_instantiate+0x29/0x40
   d_instantiate_new+0x38/0x90
   btrfs_mkdir+0x1cf/0x1e0
   vfs_mkdir+0x14f/0x200
   do_mkdirat+0x6d/0x110
   do_syscall_64+0x2d/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f673196ae6b
  Code: 8b 05 11 (...)
  RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
  RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b
  RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d
  RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000
  R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000
  R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0
  irq event stamp: 11029
  hardirqs last  enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670
  hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670
  softirqs last  enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20
  softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20

This happens because at btrfs_mkdir() we call d_instantiate_new() while
holding a transaction handle, which results in the following call chain:

  btrfs_mkdir()
     trans = btrfs_start_transaction(root, 5);

     d_instantiate_new()
        smack_d_instantiate()
            __vfs_setxattr()
                btrfs_setxattr_trans()
                   btrfs_start_transaction()
                      start_transaction()
                         WARN_ON()
                           --> a tansaction start has TRANS_EXTWRITERS
                               set in its type
                         h->orig_rsv = h->block_rsv
                         h->block_rsv = NULL

     btrfs_end_transaction(trans)

Besides the warning triggered at start_transaction, we set the handle's
block_rsv to NULL which may cause some surprises later on.

So fix this by making btrfs_setxattr_trans() not start a transaction when
we already have a handle on one, stored in current->journal_info, and use
that handle. We are good to use the handle because at btrfs_mkdir() we did
reserve space for the xattr and the inode item.

Reported-by: Casey Schaufler <casey@schaufler-ca.com>
CC: stable@vger.kernel.org # 5.4+
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 17:47:56 +01:00
Nikolay Borisov 4d14c5cde5 btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past multiple commits:

 * ae5e070eac ("btrfs: qgroup: don't try to wait flushing if we're
already holding a transaction")

 * 6f23277a49 ("btrfs: qgroup: don't commit transaction when we already
 hold the handle")

Tried to solve various aspects of this but this was always a
whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_timeout at ffffffffa52a1bdd
  #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
  #4  start_delalloc_inodes at ffffffffc0380db5
  #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
  #6  try_flush_qgroup at ffffffffc03f04b2
  #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
  #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
  #9  btrfs_update_inode at ffffffffc0385ba8
 #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTIION OPENED
 #11  touch_atime at ffffffffa4cf0000
 #12  generic_file_read_iter at ffffffffa4c1f123
 #13  new_sync_read at ffffffffa4ccdc8a
 #14  vfs_read at ffffffffa4cd0849
 #15  ksys_read at ffffffffa4cd0bd1
 #16  do_syscall_64 at ffffffffa4a052eb
 #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This will cause an asynchronous work to flush the delalloc inodes to
happen which can try to acquire the same delayed_node mutex:

  PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_preempt_disabled at ffffffffa529e80a
  #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
  #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
  #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
  #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
  #7  cow_file_range at ffffffffc03879c1
  #8  btrfs_run_delalloc_range at ffffffffc038894c
  #9  writepage_delalloc at ffffffffc03a3c8f
 #10  __extent_writepage at ffffffffc03a4c01
 #11  extent_write_cache_pages at ffffffffc03a500b
 #12  extent_writepages at ffffffffc03a6de2
 #13  do_writepages at ffffffffa4c277eb
 #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
 #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
 #16  normal_work_helper at ffffffffc03b706c
 #17  process_one_work at ffffffffa4aba4e4
 #18  worker_thread at ffffffffa4aba6fd
 #19  kthread at ffffffffa4ac0a3d
 #20  ret_from_fork at ffffffffa54001ff

To fully address those cases the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves it by calling qgroup_reserve_meta directly which will
either succeed without flushing or will fail and return -EDQUOT. In the
latter case that return value is going to be propagated to
btrfs_dirty_inode which will fallback to start a new transaction. That's
fine as the majority of time we expect the inode will have
BTRFS_DELAYED_NODE_INODE_DIRTY flag set which will result in directly
copying the in-memory state.

Fixes: c53e965360 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 17:17:09 +01:00
Nikolay Borisov 80e9baed72 btrfs: export and rename qgroup_reserve_meta
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:58:30 +01:00
Nikolay Borisov 0f9c03d824 btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
Following commit f218ea6c47 ("btrfs: delayed-inode: Remove wrong
qgroup meta reservation calls") this function now reserves num_bytes,
rather than the fixed amount of nodesize. As such this requires the
same amount to be freed in case of failure. Fix this by adjusting
the amount we are freeing.

Fixes: f218ea6c47 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:56:04 +01:00
Boris Burkov c55a4319c4 btrfs: fix spurious free_space_tree remount warning
The intended logic of the check is to catch cases where the desired
free_space_tree setting doesn't match the mounted setting, and the
remount is anything but ro->rw. However, it makes the mistake of
checking equality on a masked integer (btrfs_test_opt) against a boolean
(btrfs_fs_compat_ro).

If you run the reproducer:
  $ mount -o space_cache=v2 dev mnt
  $ mount -o remount,ro mnt

you would expect no warning, because the remount is not attempting to
change the free space tree setting, but we do see the warning.

To fix this, add explicit bool type casts to the condition.

I tested a variety of transitions:
sudo mount -o space_cache=v2 /dev/vg0/lv0 mnt/lol
(fst enabled)
mount -o remount,ro mnt/lol
(no warning, no fst change)
sudo mount -o remount,rw,space_cache=v1,clear_cache
(no warning, ro->rw)
sudo mount -o remount,rw,space_cache=v2 mnt
(warning, rw->rw with change)
sudo mount -o remount,ro mnt
(no warning, no fst change)
sudo mount -o remount,rw,space_cache=v2 mnt
(no warning, no fst change)

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 5.11
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:55:55 +01:00
Dan Carpenter 5011c5a663 btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct.  Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.

Fixes: 6f72c7e20d ("Btrfs: add qgroup inheritance")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:55:47 +01:00
Nikolay Borisov 4f6a49de64 btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.

Fixes: a7f8b1c2ac ("btrfs: file: reserve qgroup space after the hole punch range is locked")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:55:44 +01:00
Randy Dunlap aedb9d9089 btrfs: ref-verify: use 'inline void' keyword ordering
Fix build warnings of function signature when CONFIG_STACKTRACE is not
enabled by reordering the 'inline' and 'void' keywords.

../fs/btrfs/ref-verify.c:221:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
 static void inline __save_stack_trace(struct ref_action *ra)
../fs/btrfs/ref-verify.c:225:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
 static void inline __print_stack_trace(struct btrfs_fs_info *fs_info,

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-02 16:55:40 +01:00
Linus Torvalds 7a7fd0de4a Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull kmap conversion updates from David Sterba:
 "This contains changes regarding kmap API use and eg conversion from
  kmap_atomic to kmap_local_page.

  The API belongs to memory management but to save cross-tree
  dependency headaches we've agreed to take it through the btrfs tree
  because there are some trivial conversions possible, while the rest
  will need some time and getting the easy cases out of the way would be
  convenient.

  The changes can be grouped:

   - function exports, new helpers

   - new VM_BUG_ON for additional verification; it's been discussed if
     it should be VM_BUG_ON or BUG_ON, the former was chosen due to
     performance reasons

   - code replaced by relevant helpers"

[ This is an updated version of a request that originally came in during
  the merge window, but I asked for some updates:

    https://lore.kernel.org/lkml/cover.1614090658.git.dsterba@suse.com/

  which is why this got merge after the merge window closed.  - Linus ]

* 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use copy_highpage() instead of 2 kmaps()
  btrfs: use memcpy_[to|from]_page() and kmap_local_page()
  mm/highmem: Add VM_BUG_ON() to mem*_page() calls
  mm/highmem: Introduce memcpy_page(), memmove_page(), and memset_page()
  mm/highmem: Convert memcpy_[to|from]_page() to kmap_local_page()
  mm/highmem: Lift memcpy_[to|from]_page to core
2021-03-01 11:24:18 -08:00
Linus Torvalds c608aca57d for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmA85UwACgkQxWXV+ddt
 WDsdeA/8DXM6pMGaLkYcvkGvR53/vWwQlKq+i+3zuc41fYFJ7k+DQ7/K5hDbEMoM
 E7YsksoRlNVruH/ZvSdtx1exQ/tNrTdqPuds/UR31lIvS2NX9OZZToGWoC8VmrNw
 eS9yAwz/7JKUBA6MlMxZFv89OJoHUX9brPSeZVA8hOo3jDr5LXVm0IBskYOBUDRx
 JIvt+lkJLKMXPWxwUt3hbkbFPAUQVxYYavhJhWiXT9gdxF+eRgjMI0EN43vBMN2y
 kZtoZGeWR64heo9ehFzYMDlAVyph/loGovQ7m6XVzkk5DQGitg0vs3iAG46WjEXt
 jxt0ZKmJQwJb3/zNPd8VlLMhULGc56jcq8uhaC2pXjhy18p7EAXml+fH51BExLYK
 11hiWtWsrbTsZuYgr6fpqVFukkL/yyH/s7iCWT8Wn+AoPg2fUD99F5nkKT2T0Sso
 t7MyJVlTdq8avWbTB+8kFx8+Hy1TsRz3Ic2Zpm8+F3KeVflrb31jJIp3cxPCdfUp
 fWX+7VDjKVt00Ti7uP0fAaFO4hn2FjYcWzR3KOjomWox+8LVxB8PbD4H8jD7As2a
 5gGGOULmkiZej7hcP6J6zvnmgZIVAGPsSGSVfZtPh4VGiycL3DozcD0x5QerLchR
 NZDyIBh2KGE0cRr+cjkPxDyeqfGXQ7VUjp13CBriCkER8SOmBdw=
 =QJEy
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "This is the first batch of fixes that usually arrive during the merge
  window code freeze. Regressions and stable material.

  Regressions:

   - fix deadlock in log sync in zoned mode

   - fix bugs in subpage mode still wrongly assuming sectorsize == page
     size

  Fixes:

   - fix missing kunmap of the Q stripe in RAID6

   - block group fixes:
      - fix race between extent freeing/allocation when using bitmaps
      - avoid double put of block group when emptying cluster

   - swapfile fixes:
      - fix swapfile writes vs running scrub
      - fix swapfile activation vs snapshot creation

   - fix stale data exposure after cloning a hole with NO_HOLES enabled

   - remove tree-checker check that does not work in case information
     from other leaves is necessary"

* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix deadlock on log sync
  btrfs: avoid double put of block group when emptying cluster
  btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
  btrfs: tree-checker: do not error out if extent ref hash doesn't match
  btrfs: fix race between swap file activation and snapshot creation
  btrfs: fix race between writes to swap files and scrub
  btrfs: avoid checking for RO block group twice during nocow writeback
  btrfs: fix race between extent freeing/allocation when using bitmaps
  btrfs: make check_compressed_csum() to be subpage compatible
  btrfs: make btrfs_submit_compressed_read() subpage compatible
  btrfs: fix raid6 qstripe kmap
2021-03-01 11:17:37 -08:00
Ira Weiny 80cc838423 btrfs: use copy_highpage() instead of 2 kmaps()
There are many places where kmap/memove/kunmap patterns occur.

This pattern exists in the core common function copy_highpage().

Use copy_highpage to avoid open coding the use of kmap and leverages the
core functions use of kmap_local_page().

Development of this patch was aided by the following coccinelle script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/copypage/kunmap pattern and replace with copy_highpage calls
//
// NOTE: The expressions in the copy page version of this kmap pattern are
// overly complex and so these all need individual attention.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// Then a copy_page where we have 2 pages involved.
//
@ copy_page_rule @
expression page, page2, To, From, Size;
identifier ptr, ptr2;
type VP, VP2;
@@

/* kmap */
(
-VP ptr = kmap(page);
...
-VP2 ptr2 = kmap(page2);
|
-VP ptr = kmap_atomic(page);
...
-VP2 ptr2 = kmap_atomic(page2);
|
-ptr = kmap(page);
...
-ptr2 = kmap(page2);
|
-ptr = kmap_atomic(page);
...
-ptr2 = kmap_atomic(page2);
)

// 1 or more copy versions of the entire page
<+...
(
-copy_page(To, From);
+copy_highpage(To, From);
|
-memmove(To, From, Size);
+memmoveExtra(To, From, Size);
)
...+>

/* kunmap */
(
-kunmap(page2);
...
-kunmap(page);
|
-kunmap(page);
...
-kunmap(page2);
|
-kmap_atomic(ptr2);
...
-kmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on copy_page_rule
@
identifier copy_page_rule.ptr;
identifier copy_page_rule.ptr2;
type VP, VP1;
type VP2, VP21;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;
-VP2 ptr2;
	... when != ptr2;
? VP21 ptr2;

// </smpl>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-26 12:45:15 +01:00
Ira Weiny 3590ec5899 btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs.

This pattern was lifted to the core common functions
memcpy_[to|from]_page().

Use these new functions to reduce the code, eliminate direct uses of
kmap, and leverage the new core functions use of kmap_local_page().

Also, there is 1 place where a kmap/memcpy is followed by an
optional memset.  Here we leave the kmap open coded to avoid remapping
the page but use kmap_local_page() directly.

Development of this patch was aided by the coccinelle script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate.  Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// simple memcpy version
//
@ memcpy_rule1 @
expression page, T, F, B, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memcpy(ptr + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(ptr, F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, ptr + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, ptr, B);
+memcpy_from_page(T, page, 0, B);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule1
@
identifier memcpy_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

//
// Some callers kmap without a temp pointer
//
@ memcpy_rule2 @
expression page, T, Off, F, B;
@@

<+...
(
-memcpy(kmap(page) + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(kmap(page), F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, kmap(page) + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, kmap(page), B);
+memcpy_from_page(T, page, 0, B);
)
...+>
-kunmap(page);
// No need for the ptr variable removal

//
// Catch all
//
@ memcpy_rule3 @
expression page;
expression GenTo, GenFrom, GenSize;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memcpy
// match a catch all to be evaluated by hand.
//
-memcpy(GenTo, GenFrom, GenSize);
+memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
+memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule3
@
identifier memcpy_rule3.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

// <smpl>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-26 12:45:15 +01:00
Christoph Hellwig 87fa0f3eb2 mm/filemap: rename generic_file_buffered_read to filemap_read
Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.

Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Linus Torvalds 7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Johannes Thumshirn 6e37d24599 btrfs: zoned: fix deadlock on log sync
Lockdep with fstests test case btrfs/041 detected a unsafe locking
scenario when we allocate the log node on a zoned filesystem.

btrfs/041
 ============================================
 WARNING: possible recursive locking detected
 5.11.0-rc7+ #939 Not tainted
 --------------------------------------------
 xfs_io/698 is trying to acquire lock:
 ffff88810cd673a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x3d1/0xee0 [btrfs]

 but task is already holding lock:
 ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&root->log_mutex);
   lock(&root->log_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by xfs_io/698:
  #0: ffff88810cd66620 (sb_internal){.+.+}-{0:0}, at: btrfs_sync_file+0x2c3/0x570 [btrfs]
  #1: ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]

 stack backtrace:
 CPU: 0 PID: 698 Comm: xfs_io Not tainted 5.11.0-rc7+ #939
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
 Call Trace:
  dump_stack+0x77/0x97
  __lock_acquire.cold+0xb9/0x32a
  lock_acquire+0xb5/0x400
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  __mutex_lock+0x7b/0x8d0
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  ? btrfs_sync_log+0x3d1/0xee0 [btrfs]
  ? find_first_extent_bit+0x9f/0x100 [btrfs]
  ? __mutex_unlock_slowpath+0x35/0x270
  btrfs_sync_log+0x3d1/0xee0 [btrfs]
  btrfs_sync_file+0x3a8/0x570 [btrfs]
  __x64_sys_fsync+0x34/0x60
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens, because we are taking the ->log_mutex albeit it has already
been locked.

Also while at it, fix the bogus unlock of the tree_log_mutex in the error
handling.

Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:08:48 +01:00
Josef Bacik 95c85fba1f btrfs: avoid double put of block group when emptying cluster
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.

Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.

Fixes: fa9c0d795f ("Btrfs: rework allocation clustering")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:45 +01:00
Filipe Manana 3660d0bcdb btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.

The conditions for this to happen are the following:

1) We have a file with a size of, for example, 1280K;

2) There is a written (non-prealloc) extent for the file range from 1024K
   to 1280K with a length of 256K;

3) This particular file extent layout is durably persisted, so that the
   existing superblock persisted on disk points to a subvolume root where
   the file has that exact file extent layout and state;

4) The file is truncated to a smaller size, to an offset lower than the
   start offset of its last extent, for example to 800K. The truncate sets
   the full sync runtime flag on the inode;

6) Fsync the file to log it and clear the full sync runtime flag;

7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
   into the file with a destination offset that starts at or beyond the
   256K file extent item we had - for example to offset 1024K;

8) Since the clone operation does not find extents in the source range,
   we end up in the if branch at the bottom of btrfs_clone() where we
   punch a hole for the file range starting at offset 1024K by calling
   btrfs_replace_file_extents(). There we end up not setting the full
   sync flag on the inode, because we don't know we are being called in
   a clone context (and not fallocate's punch hole operation), and
   neither do we create an extent map to represent a hole because the
   requested range is beyond eof;

9) A further fsync to the file will be a fast fsync, since the clone
   operation did not set the full sync flag, and therefore it relies on
   modified extent maps to correctly log the file layout. But since
   it does not find any extent map marking the range from 1024K (the
   previous eof) to the new eof, it does not log a file extent item
   for that range representing the hole;

10) After a power failure no hole for the range starting at 1024K is
   punched and we end up exposing stale data from the old 256K extent.

Turning this into exact steps:

  $ mkfs.btrfs -f -O no-holes /dev/sdi
  $ mount /dev/sdi /mnt

  # Create our test file with 3 extents of 256K and a 256K hole at offset
  # 256K. The file has a size of 1280K.
  $ xfs_io -f -s \
              -c "pwrite -S 0xab -b 256K 0 256K" \
              -c "pwrite -S 0xcd -b 256K 512K 256K" \
              -c "pwrite -S 0xef -b 256K 768K 256K" \
              -c "pwrite -S 0x73 -b 256K 1024K 256K" \
              /mnt/sdi/foobar

  # Make sure it's durably persisted. We want the last committed super
  # block to point to this particular file extent layout.
  sync

  # Now truncate our file to a smaller size, falling within a position of
  # the second extent. This sets the full sync runtime flag on the inode.
  # Then fsync the file to log it and clear the full sync flag from the
  # inode. The third extent is no longer part of the file and therefore
  # it is not logged.
  $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar

  # Now do a clone operation that only clones the hole and sets back the
  # file size to match the size it had before the truncate operation
  # (1280K).
  $ xfs_io \
        -c "reflink /mnt/foobar 256K 1024K 256K" \
        -c "fsync" \
        /mnt/foobar

  # File data before power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1310720

  <power fail>

  # Mount the fs again to replay the log tree.
  $ mount /dev/sdi /mnt

  # File data after power failure:
  $ od -A d -t x1 /mnt/foobar
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
  *
  0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
  *
  0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
  *
  1310720

The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.

The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.

So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.

A test case for fstests will follow soon.

Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:45 +01:00
Josef Bacik 1119a72e22 btrfs: tree-checker: do not error out if extent ref hash doesn't match
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system.  Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like

key.objectid	= <bytenr>
key.type	= <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset	= hash(tree, owner, offset)

However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction.  This is relatively easy to
reproduce, simply do something like the following

  xfs_io -f -c "pwrite 0 1M" file
  offset=2

  for i in {0..10000}
  do
	  xfs_io -c "reflink file 0 ${offset}M 1M" file
	  offset=$(( offset + 2 ))
  done

  xfs_io -c "reflink file 0 17999258914816 1M" file
  xfs_io -c "reflink file 0 35998517829632 1M" file
  xfs_io -c "reflink file 0 53752752058368 1M" file

  btrfs filesystem sync

And the sync will error out because we'll abort the transaction.  The
magic values above are used because they generate hash collisions with
the first file in the main subvol.

The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.

Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes: 0785a9aacf ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:44 +01:00
Filipe Manana dd0734f2a8 btrfs: fix race between swap file activation and snapshot creation
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.

However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.

Basically what can happen is:

1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
   There it sees that root->nr_swapfiles has a value of 0 so it continues;

2) Task B enters btrfs_swap_activate(). It is not aware that another task
   started snapshot creation but it did not finish yet. It increments
   root->nr_swapfiles from 0 to 1;

3) Task B checks that the file meets all requirements to be an active
   swap file - it has NOCOW set, there are no snapshots for the inode's
   root at the moment, no file holes, no reflinked extents, etc;

4) Task B returns success and now the file is an active swap file;

5) Task A commits the transaction to create the snapshot and finishes.
   The swap file's extents are now shared between the original root and
   the snapshot;

6) A write into an extent of the swap file is attempted - there is a
   snapshot of the file's root, so we fall back to COW mode and therefore
   the physical location of the extent changes on disk.

So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.

Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:35 +01:00
Filipe Manana 195a49eaf6 btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.

However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.

Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.

Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 18:07:15 +01:00
Filipe Manana 20903032cd btrfs: avoid checking for RO block group twice during nocow writeback
During the nocow writeback path, we currently iterate the rbtree of block
groups twice: once for checking if the target block group is RO with the
call to btrfs_extent_readonly()), and once again for getting a nocow
reference on the block group with a call to btrfs_inc_nocow_writers().

Since btrfs_inc_nocow_writers() already returns false when the target
block group is RO, remove the call to btrfs_extent_readonly(). Not only
we avoid searching the blocks group rbtree twice, it also helps reduce
contention on the lock that protects it (specially since it is a spin
lock and not a read-write lock). That may make a noticeable difference
on very large filesystems, with thousands of allocated block groups.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:55 +01:00
Nikolay Borisov 3c17916510 btrfs: fix race between extent freeing/allocation when using bitmaps
During allocation the allocator will try to allocate an extent using
cluster policy. Once the current cluster is exhausted it will remove the
entry under btrfs_free_cluster::lock and subsequently acquire
btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
because there exists a race condition between removing the entry under
one lock and doing the necessary accounting holding a different lock
since extent freeing only uses the 2nd lock. This can result in the
following situation:

T1:                                    T2:
btrfs_alloc_from_cluster               insert_into_bitmap <holds tree_lock>
 if (entry->bytes == 0)                   if (block_group && !list_empty(&block_group->cluster_list)) {
    rb_erase(entry)

 spin_unlock(&cluster->lock);
   (total_bitmaps is still 4)           spin_lock(&cluster->lock);
                                         <doesn't find entry in cluster->root>
 spin_lock(&ctl->tree_lock);             <goes to new_bitmap label, adds
<blocked since T2 holds tree_lock>       <a new entry and calls add_new_bitmap>
					    recalculate_thresholds  <crashes,
                                              due to total_bitmaps
					      becoming 5 and triggering
					      an ASSERT>

To fix this ensure that once depleted, the cluster entry is deleted when
both cluster lock and tree locks are held in the allocator (T1), this
ensures that even if there is a race with a concurrent
insert_into_bitmap call it will correctly find the entry in the cluster
and add the new space to it.

CC: <stable@vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:31 +01:00
Qu Wenruo 04d4ba4c90 btrfs: make check_compressed_csum() to be subpage compatible
Currently check_compressed_csum() completely relies on sectorsize ==
PAGE_SIZE to do checksum verification for compressed extents.

To make it subpage compatible, this patch will:
- Do extra calculation for the csum range
  Since we have multiple sectors inside a page, we need to only hash
  the range we want, not the full page anymore.

- Do sector-by-sector hash inside the page

With this patch and previous conversion on
btrfs_submit_compressed_read(), now we can read subpage compressed
extents properly, and do proper csum verification.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:27 +01:00
Qu Wenruo be6a13613f btrfs: make btrfs_submit_compressed_read() subpage compatible
For compressed read, we always submit page read using page size.  This
doesn't work well with subpage, as for subpage one page can contain
several sectors.  Such submission will read range out of what we want,
and cause problems.

Thankfully to make it subpage compatible, we only need to change how the
last page of the compressed extent is read.

Instead of always adding a full page to the compressed read bio, if we're
at the last page, calculate the size using compressed length, so that we
only add part of the range into the compressed read bio.

Since we are here, also change the PAGE_SIZE used in
lookup_extent_mapping() to sectorsize.
This modification won't cause any functional change, as
lookup_extent_mapping() can handle the case where the search range is
larger than found extent range.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:25 +01:00
Ira Weiny d70cef0d46 btrfs: fix raid6 qstripe kmap
When a qstripe is required an extra page is allocated and mapped.  There
were 3 problems:

1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
   number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
   through the loop.

The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping.  So map the
page for the duration of the loop.

Similarly, improve the algorithm by mapping the parity page just 1 time.

Fixes: 5a6ac9eacb ("Btrfs, raid56: support parity scrub on raid56")
CC: stable@vger.kernel.org # 4.4.x: c17af96554a8: btrfs: raid56: simplify tracking of Q stripe presence
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:21 +01:00
Linus Torvalds 582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds 4f016a316f New code for 5.12:
- Adjust the final parameter of iomap_dio_rw.
 - Add a new flag to request that iomap directio writes return EAGAIN if
   the write is not a pure overwrite within EOF; this will be used to
   reduce lock contention with unaligned direct writes on XFS.
 - Amend XFS' directio code to eliminate exclusive locking for unaligned
   direct writes if the circumstances permit
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
 tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
 H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
 5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
 f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
 yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
 sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
 qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
 xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
 jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
 59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
 NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
 =AR07
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "The big change in this cycle is some new code to make it possible for
  XFS to try unaligned directio overwrites without taking locks. If the
  block is fully written and within EOF (i.e. doesn't require any
  further fs intervention) then we can let the unlocked write proceed.
  If not, we fall back to synchronizing direct writes.

  Summary:

   - Adjust the final parameter of iomap_dio_rw.

   - Add a new flag to request that iomap directio writes return EAGAIN
     if the write is not a pure overwrite within EOF; this will be used
     to reduce lock contention with unaligned direct writes on XFS.

   - Amend XFS' directio code to eliminate exclusive locking for
     unaligned direct writes if the circumstances permit"

* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: reduce exclusive locking on unaligned dio
  xfs: split the unaligned DIO write code out
  xfs: improve the reflink_bounce_dio_write tracepoint
  xfs: simplify the read/write tracepoints
  xfs: remove the buffered I/O fallback assert
  xfs: cleanup the read/write helper naming
  xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  xfs: factor out a xfs_ilock_iocb helper
  iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
  iomap: pass a flags argument to iomap_dio_rw
  iomap: rename the flags variable in __iomap_dio_rw
2021-02-21 10:29:20 -08:00
Linus Torvalds 6f3952cbe0 for-5.12-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
 WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
 arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
 qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
 ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
 hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
 2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
 3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
 bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
 wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
 8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
 fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
 =0O6r
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings updates of space handling, performance improvements or bug
  fixes. The subpage block size and zoned mode features have reached
  state where they're usable but with limitations.

  Performance or related:

   - do not block on deleted block group mutex in the cleaner, avoids
     some long stalls

   - improved flushing: make it work better with ticket space
     reservations and avoid excessive transaction commits in some
     scenarios, slightly improves throughput for random write load

   - preemptive background flushing: separate the logic from ticket
     reservations, improve the accounting and decisions when to flush in
     low space conditions

   - less lock contention related to running delayed refs, let just one
     thread do the flushing when there are many inside transaction
     commit

   - dbench workload improvements: avoid unnecessary work when logging
     inodes, fewer fallbacks to transaction commit and thus less waiting
     for it (+7% throughput, -20% latency)

  Core:

   - subpage block size
      - currently read-only support
      - refactor and generalize code where sectorsize is assumed to be
        page size, add the subpage handling everywhere
      - the read-write support is on the way, page sizes are still
        limited to 4K or 64K

   - zoned mode, first working version but with limitations
      - SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
        location for structures" and chunked allocation
      - superblock as the only fixed data structure needs special
        handling, uses 2 consecutive zones as a ring buffer
      - tree-log support with a dedicated block group to avoid unordered
        writes
      - emulated zones on non-zoned devices
      - not yet working
      - all non-single block group profiles, requires more zone write
        pointer synchronization between the multiple block groups
      - fitrim due to dependency on space cache, can be implemented

  Fixes:

   - ref-verify: proper tree owner and node level tracking

   - fix pinned byte accounting, causing some early ENOSPC now more
     likely due to other changes in delayed refs

  Other:

   - error handling fixes and improvements

   - more error injection points

   - more function documentation

   - more and updated tracepoints

   - subset of W=1 checked by default

   - update comments to allow more automatic kdoc parameter checks"

* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
  btrfs: zoned: enable to mount ZONED incompat flag
  btrfs: zoned: deal with holes writing out tree-log pages
  btrfs: zoned: reorder log node allocation on zoned filesystem
  btrfs: zoned: serialize log transaction on zoned filesystems
  btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
  btrfs: split alloc_log_tree()
  btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
  btrfs: zoned: enable relocation on a zoned filesystem
  btrfs: zoned: support dev-replace in zoned filesystems
  btrfs: zoned: implement copying for zoned device-replace
  btrfs: zoned: implement cloning for zoned device-replace
  btrfs: zoned: mark block groups to copy for device-replace
  btrfs: zoned: do not use async metadata checksum on zoned filesystems
  btrfs: zoned: wait for existing extents before truncating
  btrfs: zoned: serialize metadata IO
  btrfs: zoned: introduce dedicated data write path for zoned filesystems
  btrfs: zoned: enable zone append writing for direct IO
  btrfs: zoned: use ZONE_APPEND write for zoned mode
  btrfs: save irq flags when looking up an ordered extent
  btrfs: zoned: cache if block group is on a sequential zone
  ...
2021-02-21 10:00:39 -08:00
Linus Torvalds e42ee56fe5 for-5.11-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAmlkAACgkQxWXV+ddt
 WDuwNxAAiBAhEwPllzyU86p4RMMip5pa24zu11HkTya65yGk6EFuj4zTlx/L5Fn6
 JOjxwlPqaTItER1PYJ5HRdIy1Y2E4eWEiDLolvmvDCPZrfKRKhBU1MZbgXwDbp+Z
 pwaJGIm5ZaXDGyuFge3bKA48BERfqxRBO3qIOZ0tzgsUFLlZ2d9EdDc99093/J6k
 QzIijXQjFnvnB2MNawN1b/KQ63xqXLo2hemKcKIFCxJHm9eaet/qwGHl5iuR5ScY
 bOGCWvLSkCXceartDur3msOZXur09YLyfeYmE9dj1FN3aNu97sW8VivWRrs3aglK
 if51iYrrjKSnDr4SOK28S5UYdgeStb/qWWtosdcMsQVBo0t7iCnGT2psGaQCkdfG
 FChqbs2uXlbJrojlelV6xbaU3S2D2MtSz5mF+I2G5MpQbj1jkhYE9ZTUQeibcd7o
 l+edn/VJvVK4X0NAX8pIWJ4nFY1HqUTyfn28IQ7ymBhyyUloIoazvSkBuSWy6iy0
 9aPpohOKjCw8Y3MbgcIfIEJhdK+aIKF8ZPh52+zcXQzf1OtSryVarLHsNXWm9vJ8
 tHsRHCzrbLFdAXZccT6YlerzPs4+PVf44UknDbFCg7sLcG04NIGGrMXOtTHwgEZL
 BEywTjAMlMDjrEXouxYAPNPnEg/NlvQGZYRvBnxrtZE4G2fxJ7o=
 =7w6G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A regression fix caused by a refactoring in 5.11.

  A corrupted superblock wouldn't be detected by checksum verification
  due to wrongly placed initialization of the checksum length, thus
  making memcmp always work"

* tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize fs_info::csum_size earlier in open_ctree
2021-02-13 11:55:29 -08:00
Su Yue 83c68bbcb6 btrfs: initialize fs_info::csum_size earlier in open_ctree
User reported that btrfs-progs misc-tests/028-superblock-recover fails:

      [TEST/misc]   028-superblock-recover
  unexpected success: mounted fs with corrupted superblock
  test failed for case 028-superblock-recover

The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.

Commit 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size
everywhere") replaces all calls to btrfs_super_csum_size by
fs_info::csum_size. The calls include the place where fs_info->csum_size
is not initialized. So btrfs_check_super_csum() passes because memcmp()
with len 0 always returns 0.

Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
csum type in superblock is valid in open_ctree().

Link: https://github.com/kdave/btrfs-progs/issues/250
Fixes: 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size everywhere")
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-12 14:48:24 +01:00
Naohiro Aota 9d294a685f btrfs: zoned: enable to mount ZONED incompat flag
This final patch adds the ZONED incompat flag to the supported flags
and enables to mount ZONED flagged file system.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:52:24 +01:00
Naohiro Aota b528f46713 btrfs: zoned: deal with holes writing out tree-log pages
Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. This happens when someone,
e.g., a concurrent transaction commit, writes a dirty extent in this
tree-log commit.

If we are not going to wait for the extents, we can hope the concurrent
writing fills the hole for us. So, we can ignore the error in this case and
hope the next write will succeed.

If we want to wait for them and got the error, we cannot wait for them
because it will cause a deadlock. So, let's bail out to a full commit in
this case.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:52:24 +01:00
Naohiro Aota 3ddebf27fc btrfs: zoned: reorder log node allocation on zoned filesystem
This is the 3/3 patch to enable tree-log on zoned filesystems.

The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the writing order of them. So, the
writing causes unaligned write errors.

Reorder the allocation of them by delaying allocation of the root node of
"fs_info->log_root_tree," so that the node buffers can go out sequentially
to devices.

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:48:41 +01:00
Naohiro Aota fa1a0f42a0 btrfs: zoned: serialize log transaction on zoned filesystems
This is the 2/3 patch to enable tree-log on zoned filesystems.

Since we can start more than one log transactions per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at
the time of a log transaction commit. The nodes of the global log root
tree (fs_info->log_root_tree), also have the same problem with mixed
allocation.

Serializes log transactions by waiting for a committing transaction when
someone tries to start a new transaction, to avoid the mixed allocation
problem. We must also wait for running log transactions from another
subvolume, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when other subvolumes already allocated the global log root
tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:48:37 +01:00
Naohiro Aota 40ab3be102 btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams.  As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:08 +01:00
Naohiro Aota 6ab6ebb760 btrfs: split alloc_log_tree()
This is a preparation patch for the next patch. Split alloc_log_tree()
into two parts. The first one allocating the tree structure, remains in
alloc_log_tree() and the second part allocating the tree node, which is
moved into btrfs_alloc_log_tree_node().

Also export the latter part is to be used in the next patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota f7ef5287a6 btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:

(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
    the new extent
(3) Relocate the corresponding block group

Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.

Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.

Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 32430c6148 btrfs: zoned: enable relocation on a zoned filesystem
Currently fallocate() is disabled on a zoned filesystem. Since current
relocation process relies on preallocation to move file data extents, it
must be handled differently.

On a zoned filesystem, we just truncate the inode to the size that we
wanted to pre-allocate. Then, we flush dirty pages on the file before
finishing the relocation process. run_delalloc_zoned() will handle all
the allocations and submit IOs to the underlying layers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 7db1c5d14d btrfs: zoned: support dev-replace in zoned filesystems
This is 4/4 patch to implement device-replace on zoned filesystems.

Even after the copying is done, the write pointers of the source device
and the destination device may not be synchronized. For example, when
the last allocated extent is freed before device-replace process, the
extent is not copied, leaving a hole there.

Synchronize the write pointers by writing zeroes to the destination
device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota de17addce7 btrfs: zoned: implement copying for zoned device-replace
This is 3/4 patch to implement device-replace on zoned filesystems.

This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.

The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit 042528f8d8 ("Btrfs: fix
block group remaining RO forever after error during device replace").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 6143c23ccc btrfs: zoned: implement cloning for zoned device-replace
This is 2/4 patch to implement device replace for zoned filesystems.

In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).

Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.

This commit also handles cloning of device reset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 78ce9fc269 btrfs: zoned: mark block groups to copy for device-replace
This is the 1/4 patch to support device-replace on zoned filesystems.

We have two types of IOs during the device replace process. One is an IO
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other one is an IO to "clone" (by
handle_ops_on_dev_replace()) new incoming write IOs from users to the
source device into the target device.

Cloning incoming IOs can break the sequential write rule in on target
device. When a write is mapped in the middle of a block group, the IO is
directed to the middle of a target device zone, which breaks the
sequential write requirement.

However, the cloning function cannot be disabled since incoming IOs
targeting already copied device extents must be cloned so that the IO is
executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.

Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 4eef29ef63 btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.

To preserve write bio ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as
a single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple
zones are being written simultaneously, HDD seek overhead lowers the
achievable maximum bandwidth, resulting again in a per zone checksum
serialization not affecting the performance.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 24c0a7227f btrfs: zoned: wait for existing extents before truncating
When truncating a file, file buffers which have already been allocated
but not yet written may be truncated. Truncating these buffers could
cause breakage of a sequential write pattern in a block group if the
truncated blocks are for example followed by blocks allocated to another
file. To avoid this problem, always wait for write out of all unwritten
buffers before proceeding with the truncate execution.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 0bc09ca129 btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota 42c0110009 btrfs: zoned: introduce dedicated data write path for zoned filesystems
If more than one IO is issued for one file extent, these IO can be
written to separate regions on a device. Since we cannot map one file
extent to such a separate area on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.

The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write a
part of an ordered extent when specified a region to write e.g., when
its called from fdatasync().

Introduce a dedicated (uncompressed buffered) data write path for zoned
filesystems, that will COW the region and write it at once.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota 544d24f9de btrfs: zoned: enable zone append writing for direct IO
Likewise to buffered IO, enable zone append writing for direct IO when
its used on a zoned block device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Naohiro Aota d8e3fb106f btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 24533f6a9a btrfs: save irq flags when looking up an ordered extent
A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.

btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:06 +01:00
Johannes Thumshirn 08f455593f btrfs: zoned: cache if block group is on a sequential zone
On a zoned filesystem, cache if a block group is on a sequential write
only zone.

On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
writing data, therefore provide btrfs_use_zone_append() to figure out if
IO is targeting a sequential write only zone and we can use
REQ_OP_ZONE_APPEND for data writing.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota 138082f366 btrfs: extend btrfs_rmap_block for specifying a device
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.

Extend the function to match to a specified device. The old functionality
of querying all devices is left intact by specifying NULL as target
device.

A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.

Also export the function for later use.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Johannes Thumshirn cacb2cea46 btrfs: zoned: check if bio spans across an ordered extent
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.

Ensure that constructing bio does not span more than an ordered extent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota d22002fd37 btrfs: zoned: split ordered extent when bio is sent
For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.

extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.

Several sanity checks need to be done in extract_ordered_extent() e.g.

- We cannot split once end_bio'd ordered extent because we cannot divide
  ordered->bytes_left for the split ones
- We do not expect a compressed ordered extent
- We should not have checksum list because we omit the list splitting.
  Since the function is called before btrfs_wq_submit_bio() or
  btrfs_csum_one_bio(), this should be always ensured.

We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota cfe94440d1 btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.

Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota e1326f0339 btrfs: zoned: use bio_add_zone_append_page
A zoned device has its own hardware restrictions e.g. max_zone_append_size
when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). We need target device
to use bio_add_zone_append_page(), so this commit reads the chunk
information to cache the target device to btrfs_io_bio(bio)->device.

Caching only the target device is sufficient here as zoned filesystems
only supports the single profile at the moment. Once more profiles will be
supported btrfs_io_bio can hold an extent_map to be able to check for the
restrictions of all devices the btrfs_bio will be mapped to.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 953651eb30 btrfs: factor out helper adding a page to bio
Factor out adding a page to a bio from submit_extent_page().  The page
is added only when bio_flags are the same, contiguous and the added page
fits in the same stripe as pages in the bio.

Condition checks are reordered to allow early return to avoid possibly
heavy btrfs_bio_fits_in_stripe() calling.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota dcba6e48b5 btrfs: zoned: reset zones of unused block groups
We must reset the zones of a deleted unused block group to rewind the
zones' write pointers to the zones' start.

To do this, we can use the DISCARD_SYNC code to do the reset when the
filesystem is running on zoned devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 011b41bffa btrfs: zoned: advance allocation pointer after tree log node
Since the allocation info of a tree log node is not recorded in the extent
tree, calculate_alloc_pointer() cannot detect this node, so the pointer
can be over a tree node.

Replaying the log calls btrfs_remove_free_space() for each node in the
log tree.

So, advance the pointer after the node to not allocate over it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota d3575156f6 btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Naohiro Aota 2eda57089e btrfs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Therefore the allocator never manages bitmaps or clusters. Also, add
assertions to the corresponding functions.

As zone append writing is used, it would be unnecessary to track the
allocation offset, as the allocator only needs to check available space.
But by tracking and returning the offset as an allocated region, we can
skip modification of ordered extents and checksum information when there
is no IO reordering.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 169e0da91a btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.

Pinned bytes are always reclaimed to the unusable space. But, when an
allocated region is returned before using e.g., the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For the situation, this commit
introduces "btrfs_add_free_space_unused". This behaves the same as
btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
rewinds the allocation offset.

Because the read-only bytes tracks free but unusable bytes when the block
group is read-only, we need to migrate the zone_unusable bytes to
read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota a94794d50d btrfs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.

But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.

For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 08e11a3db0 btrfs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address within
a block group. To facilitate this, add an "alloc_offset" to the
block-group to track the logical addresses of the write pointer.

This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.

For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.

We need fine-grained logical to physical mapping to support such separated
physical address issue. Since it should require additional metadata type,
disable non-single profiles for now.

This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 381a696eb5 btrfs: zoned: verify device extent is aligned to zone
Add a check in verify_one_dev_extent() to ensure that a device extent on
a zoned block device is aligned to the respective zone boundary.

If it isn't, mark the filesystem as unclean.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Naohiro Aota 1cd6121f2a btrfs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.

To implement the allocator, we need to extend the following functions for
a zoned filesystem.

- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size

init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.

dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.

dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.

With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.

Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Johannes Thumshirn 3c9daa09cc btrfs: zoned: allow zoned filesystems on non-zoned block devices
Run a zoned filesystem on non-zoned devices. This is done by "slicing up"
the block device into static sized chunks and fake a conventional zone on
each of them. The emulated zone size is determined from the size of device
extent.

This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:21 +01:00
Naohiro Aota 1cb3dc3f79 btrfs: zoned: disallow fitrim on zoned filesystems
The implementation of fitrim depends on space cache, which is not used
and disabled for zoned extent allocator. So the current code does not
work with zoned filesystem.

In the future, we can implement fitrim for zoned filesystems by enabling
space cache (but, only for fitrim) or scanning the extent tree at fitrim
time.  For now, disallow fitrim on zoned filesystems.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Johannes Thumshirn b53429bad3 btrfs: zoned: do not load fs_info::zoned from incompat flag
Don't set the zoned flag in fs_info as soon as we're encountering the
incompat filesystem flag for a zoned filesystem on mount. The zoned flag
in fs_info is in a union together with the zone_size, so setting it too
early will result in setting an incorrect zone_size as well.

Once the correct zone_size is read from the device, we can rely on the
zoned flag in fs_info as well to determine if the filesystem is zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Johannes Thumshirn 4afd2fe835 btrfs: release path before calling to btrfs_load_block_group_zone_info
Since we have no write pointer in conventional zones, we cannot
determine the allocation offset from it. Instead, we set the allocation
offset after the highest addressed extent. This is done by reading the
extent tree in btrfs_load_block_group_zone_info().

However, this function is called from btrfs_read_block_groups(), so the
read lock for the tree node could be recursively taken.

To avoid this unsafe locking scenario, release the path before reading
the extent tree to get the allocation offset.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Naohiro Aota d6639b35da btrfs: zoned: use regular super block location on zone emulation
A zoned filesystem currently has a superblock at the beginning of the
superblock logging zones if the zones are conventional. This difference
in superblock position causes a chicken-and-egg problem for filesystems
with emulated zones. Since the device is a regular (non-zoned) device,
we cannot know if the filesystem is regular or zoned while reading the
superblock. But, to load the superblock, we need to see if it is
emulated zoned or not.

Place the superblocks at the same location as they are on regular
filesystem on regular devices to solve the problem. It is possible
because it's ensured that all the superblock locations are at an
(emulated) conventional zone on regular devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:19 +01:00
Naohiro Aota 7365104236 btrfs: zoned: defer loading zone info after opening trees
This is a preparation patch to implement zone emulation on a regular
device.

To emulate a zoned filesystem on a regular (non-zoned) device, we need to
decide an emulated zone size. Instead of making it a compile-time static
value, we'll make it configurable at mkfs time. Since we have one zone ==
one device extent restriction, we can determine the emulated zone size
from the size of a device extent. We can extend btrfs_get_dev_zone_info()
to show a regular device filled with conventional zones once the zone size
is decided.

The current call site of btrfs_get_dev_zone_info() during the mount process
is earlier than loading the file system trees so that we don't know the
size of a device extent at this point. Thus we can't slice a regular device
to conventional zones.

This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone
info for all the devices. And, it places this function in open_ctree()
after loading the trees.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:16 +01:00
Filipe Manana 72c9925f87 btrfs: fix extent buffer leak on failure to copy root
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().

So fix that by unlocking the extent buffer and dropping our reference on
it before returning.

Fixes: be20aa9dba ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:04 +01:00
Qu Wenruo 2c4d8cb737 btrfs: explain page locking and readahead in read_extent_buffer_pages()
In read_extent_buffer_pages(), if we failed to lock the page atomically,
we just exit with return value 0.

This is counter-intuitive, as normally if we can't lock what we need, we
would return something like EAGAIN.

But that return hides under (wait == WAIT_NONE) branch, which only gets
triggered for readahead.

And for readahead, if we failed to lock the page, it means the extent
buffer is either being read by other thread, or has been read and is
under modification.  Either way the eb will or has been cached, thus
readahead has no need to wait for it.

Add comment on this counter-intuitive behavior.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:04 +01:00
Qu Wenruo 0bb3eb3ee8 btrfs: allow read-only mount of 4K sector size fs on 64K page system
This adds the basic RO mount ability for 4K sector size on 64K page
system.

Currently we only plan to support 4K and 64K page system.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 92082d4097 btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:

  btrfs_do_read_page()
  {
	while (cur <= end) {
		/* No need to read from disk */
		if (HOLE/PREALLOC/INLINE){
			memset();
			set_extent_uptodate();
			continue;
		}
		/* Read from disk */
		ret = submit_extent_page(end_bio_extent_readpage);
  }

  end_bio_extent_readpage()
  {
	endio_readpage_uptodate_page_status();
  }

This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.

But for subpage, there is more work to be done in page status update:

- Page Unlock condition
  Unlike regular page size == sectorsize case, we can no longer just
  unlock a page.
  Only the last reader of the page can unlock the page.
  This means, we can unlock the page either in the while() loop, or in
  the endio function.

- Page uptodate condition
  Since we have multiple sectors to read for a page, we can only mark
  the full page uptodate if all sectors are uptodate.

To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:

- begin_page_read()
  For regular case, it does nothing.
  For subpage case, it updates the reader counters so that later
  end_page_read() can know who is the last one to unlock the page.

- end_page_read()
  This is just endio_readpage_uptodate_page_status() renamed.
  The original name is a little too long and too specific for endio.

  The new thing added is the condition for page unlock.
  Now for subpage data, we unlock the page if we're the last reader.

This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.

Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 32443de338 btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only
  call this function after page is mapped to an inode.

  One call site attaches pages to inode manually, thus we have to modify
  the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling.
  Calling set_page_extent_mapped() as soon as possible can simply the
  error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 371cdc0700 btrfs: introduce subpage metadata validation check
For subpage metadata validation check, there are some differences:

- Read must finish in one bvec
  Since we're just reading one subpage range in one page, it should
  never be split into two bios nor two bvecs.

- How to grab the existing eb
  Instead of grabbing eb using page->private, we have to go search radix
  tree as we don't have any direct pointer at hand.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 4325cb2293 btrfs: support subpage in endio_readpage_update_page_status()
To handle subpage status update, add the following:

- Use btrfs_page_*() subpage-aware helpers to update page status
  Now we can handle both cases well.

- No page unlock for subpage metadata
  Since subpage metadata doesn't utilize page locking at all, skip it.
  For subpage data locking, it's handled in later commits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 4012daf769 btrfs: introduce read_extent_buffer_subpage()
Introduce a helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.

The difference between regular and subpage routines are:

- No page locking
  Here we completely rely on extent locking.
  Page locking can reduce the concurrency greatly, as if we lock one
  page to read one extent buffer, all the other extent buffers in the
  same page will have to wait.

- Extent uptodate condition
  Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE check,
  We also need to check btrfs_subpage::uptodate_bitmap.

- No page iteration
  Just one page, no need to loop, this greatly simplified the subpage
  routine.

This patch only implements the bio submit part, no endio support yet.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo d1e86e3fc3 btrfs: support subpage in try_release_extent_buffer()
Unlike the original try_release_extent_buffer(),
try_release_subpage_extent_buffer() will iterate through all the ebs in
the page, and try to release each.

We can release the full page only after there's no private attached,
which means all ebs of that page have been released as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 92d83e9436 btrfs: support subpage in btrfs_clone_extent_buffer
For btrfs_clone_extent_buffer(), it's mostly the same code of
__alloc_dummy_extent_buffer(), except it has extra page copy.

So to make it subpage compatible, we only need to:

- Call set_extent_buffer_uptodate() instead of SetPageUptodate()
  This will set correct uptodate bit for subpage and regular sector size
  cases.

Since we're calling set_extent_buffer_uptodate() which will also set
EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
either.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 251f2acc71 btrfs: support subpage in set/clear_extent_buffer_uptodate()
To support subpage in set_extent_buffer_uptodate and
clear_extent_buffer_uptodate we only need to use the subpage-aware
helpers to update the page bits.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 03a816b32b btrfs: introduce helpers for subpage error status
Introduce the following functions to handle subpage error status:

- btrfs_subpage_set_error()
- btrfs_subpage_clear_error()
- btrfs_subpage_test_error()
  These helpers can only be called when the page has subpage attached
  and the range is ensured to be inside the page.

- btrfs_page_set_error()
- btrfs_page_clear_error()
- btrfs_page_test_error()
  These helpers can handle both regular sector size and subpage without
  problem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo a1d767c11c btrfs: introduce helpers for subpage uptodate status
Introduce the following functions to handle subpage uptodate status:

- btrfs_subpage_set_uptodate()
- btrfs_subpage_clear_uptodate()
- btrfs_subpage_test_uptodate()
  These helpers can only be called when the page has subpage attached
  and the range is ensured to be inside the page.

- btrfs_page_set_uptodate()
- btrfs_page_clear_uptodate()
- btrfs_page_test_uptodate()
  These helpers can handle both regular sector size and subpage.
  Although caller should still ensure that the range is inside the page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 09bc1f0fb8 btrfs: attach private to dummy extent buffer pages
There are locations where we allocate dummy extent buffers for temporary
usage, like in tree_mod_log_rewind() or get_old_root().

These dummy extent buffers will be handled by the same eb accessors, and
if they don't have page::private subpage eb accessors could fail.

To address such problems, make __alloc_dummy_extent_buffer() attach
page private for dummy extent buffers too.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 8ff8466d29 btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.

Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.

For subpage case, handle detaching page private.

For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.

For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.

But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:

  Extent buffer A is the new extent buffer which will be allocated,
  while extent buffer B is the last existing extent buffer of the page.

  		T1 (eb A) 	 |		T2 (eb B)
  -------------------------------+------------------------------
  alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
  |- p = find_or_create_page()   | |
  |- attach_extent_buffer_page() | |
  |				 | |- detach_extent_buffer_page()
  |				 |    |- if (!page_range_has_eb())
  |				 |    |  No new eb in the page range yet
  |				 |    |  As new eb A hasn't yet been
  |				 |    |  inserted into radix tree.
  |				 |    |- btrfs_detach_subpage()
  |				 |       |- detach_page_private();
  |- radix_tree_insert()	 |

  Then we have a metadata eb whose page has no private bit.

To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.

In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock.  Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.

The section is marked by:

- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:02 +01:00
Qu Wenruo 819822107d btrfs: make grab_extent_buffer_from_page() handle subpage case
For subpage case, grab_extent_buffer() can't really get an extent buffer
just from btrfs_subpage.

We have radix tree lock protecting us from inserting the same eb into
the tree.  Thus we don't really need to do the extra hassle, just let
alloc_extent_buffer() handle the existing eb in radix tree.

Now if two ebs are being allocated as the same time, one will fail with
-EEIXST when inserting into the radix tree.

So for grab_extent_buffer(), just always return NULL for subpage case.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 760f991f14 btrfs: make attach_extent_buffer_page() handle subpage case
For subpage case, we need to allocate additional memory for each
metadata page.

So we need to:

- Allow attach_extent_buffer_page() to return int to indicate allocation
  failure

- Allow manually pre-allocate subpage memory for alloc_extent_buffer()
  As we don't want to use GFP_ATOMIC under spinlock, we introduce
  btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
  purpose.
  (The simple wrap for btrfs_free_subpage() is for later convert to
   kmem_cache. Already internally tested without problem)

- Preallocate btrfs_subpage structure for alloc_extent_buffer()
  We don't want to call memory allocation with spinlock held, so
  do preallocation before we acquire mapping->private_lock.

- Handle subpage and regular case differently in
  attach_extent_buffer_page()
  For regular case, no change, just do the usual thing.
  For subpage case, allocate new memory or use the preallocated memory.

For future subpage metadata, we will make use of radix tree to grab
extent buffer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo cac06d843f btrfs: introduce the skeleton of btrfs_subpage structure
For sectorsize < page size support, we need a structure to record extra
status info for each sector of a page.

Introduce the skeleton structure, all subpage related code would go to
subpage.[ch].

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 62c053fbb2 btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
For the incoming subpage support, UNMAPPED extent buffer will have
different behavior in btrfs_release_extent_buffer().

This means we need to set UNMAPPED bit early before calling
btrfs_release_extent_buffer().

Currently there is only one caller which relies on
btrfs_release_extent_buffer() in its error path while set UNMAPPED bit
late:
- btrfs_clone_extent_buffer()

Make it subpage compatible by setting the UNMAPPED bit early, since
we're here, also move the UPTODATE bit early.

There is another caller, __alloc_dummy_extent_buffer(), setting
UNMAPPED bit late, but that function clean up the allocated page
manually, thus no need for any modification.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Qu Wenruo 6869b0a8be btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.

However page writeback and dirty bits are conflicting (at least for
sector size == PAGE_SIZE case), this means these two have to be always
updated together.

This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana d0c2f4fa55 btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9627107     0.153    61.938
 Close        7072076     0.001     3.175
 Rename        407633     1.222    44.439
 Unlink       1943895     0.658    44.440
 Deltree          256    17.339   110.891
 Mkdir            128     0.003     0.009
 Qpathinfo    8725406     0.064    17.850
 Qfileinfo    1529516     0.001     2.188
 Qfsinfo      1599884     0.002     1.457
 Sfileinfo     784200     0.005     3.562
 Find         3373513     0.411    30.312
 WriteX       4802132     0.053    29.054
 ReadX       15089959     0.002     5.801
 LockX          31344     0.002     0.425
 UnlockX        31344     0.001     0.173
 Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

After patchset, with 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9931568     0.111    25.597
 Close        7295730     0.001     2.171
 Rename        420549     0.982    49.714
 Unlink       2005366     0.497    39.015
 Deltree          256    11.149    89.242
 Mkdir            128     0.002     0.014
 Qpathinfo    9001863     0.049    20.761
 Qfileinfo    1577730     0.001     2.546
 Qfsinfo      1650508     0.002     3.531
 Sfileinfo     809031     0.005     5.846
 Find         3480259     0.309    23.977
 WriteX       4952505     0.043    41.283
 ReadX       15568127     0.002     5.476
 LockX          32338     0.002     0.978
 UnlockX        32338     0.001     2.032
 Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

 --> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    8956748     0.342   108.312
 Close        6579660     0.001     3.823
 Rename        379209     2.396    81.897
 Unlink       1808625     1.108   131.148
 Deltree          256    25.632   172.176
 Mkdir            128     0.003     0.018
 Qpathinfo    8117615     0.131    55.916
 Qfileinfo    1423495     0.001     2.635
 Qfsinfo      1488496     0.002     5.412
 Sfileinfo     729472     0.007     8.643
 Find         3138598     0.855    78.321
 WriteX       4470783     0.102    79.442
 ReadX       14038139     0.002     7.578
 LockX          29158     0.002     0.844
 UnlockX        29158     0.001     0.567
 Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9069003     0.303    43.193
 Close        6662328     0.001     3.888
 Rename        383976     2.194    46.418
 Unlink       1831080     1.022    43.873
 Deltree          256    24.037   155.763
 Mkdir            128     0.002     0.005
 Qpathinfo    8219173     0.137    30.233
 Qfileinfo    1441203     0.001     3.204
 Qfsinfo      1507092     0.002     4.055
 Sfileinfo     738775     0.006     5.431
 Find         3177874     0.936    38.170
 WriteX       4526152     0.084    39.518
 ReadX       14213562     0.002    24.760
 LockX          29522     0.002     1.221
 UnlockX        29522     0.001     0.694
 Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

 --> +6.8% throughput, -18.1% max latency

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana 64d6b281ba btrfs: remove unnecessary check_parent_dirs_for_sync()
Whenever we fsync an inode, if it is a directory, a regular file that was
created in the current transaction or has last_unlink_trans set to the
generation of the current transaction, we check if any of its ancestor
inodes (and the inode itself if it is a directory) can not be logged and
need a fallback to a full transaction commit - if so, we return with a
value of 1 in order to fallback to a transaction commit.

However we often do not need to fallback to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is
   not an explicit request to log it and it is not needed neither to
   guarantee the consistency of the inode originally asked to be logged
   (fsynced) nor its immediate parent;

2) The ancestor inode was already logged before, in which case any link,
   unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fallback immediately to a transaction
commit when we are logging a directory inode that can not be logged and
needs a full transaction commit. All we need to protect is the case where
after renaming a file someone fsyncs only the old directory, which would
result is losing the renamed file after a log replay.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:01 +01:00
Filipe Manana 0e44cb3f94 btrfs: skip logging inodes already logged when logging new entries
When logging new directory entries of a directory, we log the inodes of
new dentries and the inodes of dentries pointing to directories that
may have been created in past transactions. For the case of directories
we log in full mode, which can be particularly expensive for large
directories.

We do use btrfs_inode_in_log() to skip already logged inodes, however for
that helper to return true, it requires that the log transaction used to
log the inode to be already committed. This means that when we have more
than one task using the same log transaction we can end up logging an
inode multiple times, which is a waste of time and not necessary since
the log will be committed by one of the tasks and the others will wait for
the log transaction to be committed before returning to user space.

So simply replace the use of btrfs_inode_in_log() with the new helper
function need_log_inode(), introduced in a previous commit.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana 3e6a86a193 btrfs: skip logging directories already logged when logging all parents
Some times when we fsync an inode we need to do a full log of all its
ancestors (due to unlink, link or rename operations), which can be an
expensive operation, specially if the directories are large.

However if we find an ancestor directory inode that is already logged in
the current transaction, and has no inserted/updated/deleted xattrs since
it was last logged, we can skip logging the directory again. We are safe
to skip that since we know that for logged directories, any link, unlink
or rename operations that implicate the directory will update the log as
necessary.

So use the helper need_log_dir(), introduced in a previous commit, to
detect already logged directories that can be skipped.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana ab12313a9f btrfs: avoid logging new ancestor inodes when logging new inode
When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.

So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana e593e54ed1 btrfs: stop setting nbytes when filling inode item for logging
When we fill an inode item for logging we are setting its nbytes field
with the value returned by inode_get_bytes() (a VFS API), however we do
not need it because it is not used during log replay. In fact, for fast
fsyncs, when we call inode_get_bytes() we may even get an outdated value
for nbytes because the nbytes field of the inode is only updated when
ordered extents complete, and a fast fsync only waits for writeback to
complete, it does not wait for ordered extent completion.

So just remove the setup of nbytes and add an explicit comment mentioning
why we do not set it. This also avoids adding contention on the inode's
i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
inode_get_bytes() which is also called by our stat callback
(btrfs_getattr()).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Filipe Manana ddffcf6fb5 btrfs: remove unnecessary directory inode item update when deleting dir entry
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.

This is not needed since commit d555438b6e ("Btrfs: drop dir i_size
when adding new names on replay"), where we explicitly ignore the i_size
of directory inode items on log replay. Before that we used it but it
was buggy as mentioned in that commit's change log (i_size got a larger
value then it should have).

So stop updating the i_size of the directory inode item in the log, as
that is a waste of time, adds more log contention to the log tree and
often results in COWing more extent buffers for the log tree.

This code path is triggered often during dbench workloads for example.
This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Michal Rostecki 4203431319 btrfs: let callers of btrfs_get_io_geometry pass the em
Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.

Change that to passing the extent mapping to the btrfs_get_io_geometry()
function as an argument.

This could improve performance in some cases.  For very large
filesystems, i.e. several thousands of allocated chunks, not only this
avoids searching two times the rbtree, saving time, it may also help
reducing contention on the lock that protects the tree - thinking of
writeback starting for multiple inodes, other tasks allocating or
removing chunks, and anything else that requires access to the rbtree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Michal Rostecki <mrostecki@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Filipe's analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:00 +01:00
Qu Wenruo 951c80f83d btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
Commit dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could
span across a page") make btrfs_invalidapage() to search all ordered
extents.

The offending code looks like this:

  again:
	  start = page_start;
	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
	  if (ordred) {
		  end = min(page_end,
			    ordered->file_offset + ordered->num_bytes - 1);

		  /* Do the cleanup */

		  start = end + 1;
		  if (start < page_end)
			  goto again;
	  }

The behavior is indeed necessary for the incoming subpage support, but
when it iterates through all the ordered extents, it also resets the
search range @start.

This means, for the following cases, we can double account the ordered
extents, causing its bytes_left underflow:

	Page offset
	0		16K		32K
	|<--- OE 1  --->|<--- OE 2 ---->|

As the first iteration will find ordered extent (OE) 1, which doesn't
cover the full page, thus after cleanup code, we need to retry again.
But again label will reset start to page_start, and we got OE 1 again,
which causes double accounting on OE 1, and cause OE 1's byte_left to
underflow.

This problem can only happen for subpage case, as for regular sectorsize
== PAGE_SIZE case, we will always find a OE ends at or after page end,
thus no way to trigger the problem.

Move the again label after start = page_start.  There will be more
comprehensive rework to convert the open coded loop to a proper while
loop for subpage support.

Fixes: dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could span across a page")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Abaci Team a4559e6f6f btrfs: simplify condition in __btrfs_run_delayed_items
Fix the following coccicheck warnings:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Filipe Manana 2965194b77 btrfs: remove wrong comment for can_nocow_extent()
The comment for can_nocow_extent() says that the function will flush
ordered extents, however that never happens and was never true before the
comment was added in commit e4ecaf90bc ("btrfs: add comments for
btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for
the function btrfs_check_can_nocow(), which after that commit was renamed
to check_can_nocow(). So just remove that part of the comment.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik e5ad49e215 btrfs: add a trace class for dumping the current ENOSPC state
Often when I'm debugging ENOSPC related issues I have to resort to
printing the entire ENOSPC state with trace_printk() in different spots.
This gets pretty annoying, so add a trace state that does this for us.
Then add a trace point at the end of preemptive flushing so you can see
the state of the space_info when we decide to exit preemptive flushing.
This helped me figure out we weren't kicking in the preemptive flushing
soon enough.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 4b02b00fe5 btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 88a777a6e5 btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good
start, but some workloads are particularly abusive and can quickly
overwhelm the preemptive flushing code and drive us into using tickets.

Handle this by clamping down on our threshold for starting and
continuing to run preemptive flushing.  This is particularly important
for our overcommit case, as we can really drive the file system into
overages and then it's more difficult to pull it back as we start to
actually fill up the file system.

The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
calculate for overcommit is the baseline.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 2e294c6049 btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing.  Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik 9f42d37748 btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets.  We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space.  Then
move the preemptive flushing logic into need_preemptive_reclaim().  We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are not pending tickets currently, so we do not need to check it, simply
do the fuzzy logic to check if we're getting low on space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik f205edf773 btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have
space_info->reclaim_size set and we do not need to do background
reclaim.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik ae7913ba52 btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is
poorly named.  This is more checking if we need to preemptively flush
space, so rename it to need_preemptive_reclaim.

Also switch it to bool and make it plain static as followup patches will
move more code here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 576fa34830 btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.

However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.

When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing.  However this meant that we essentially
never preemptively flushed.  This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.

The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction.  This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer.  With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on.  Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing.  Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik f00c42dd4c btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction().  This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 5deb17e18e btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO.  The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.

However this is true of all ordered extents.  If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents.  We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik ac1ea10e75 btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.

I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns.  Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik 91e79a83ff btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our
kernels and it's because I used the wrong enum type for what I passed
into flush_space, caught by -Wenum-conversion.  Change the argument to
be explicitly the enum we're expecting to make everything consistent.
Maybe eventually gcc will catch errors like this.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Roman Anasal 8898038309 btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
struct send_ctx *sctx but when used in changed_cb it is immediately
cast to `struct send_ctx *sctx = ctx;`.

changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
- full_send_tree already passes a struct send_ctx *sctx
- btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
- void *ctx in btrfs_compare_trees is only used to be passed to changed_cb

So casting to/from void *ctx seems unnecessary and directly using
struct send_ctx *sctx instead provides better type-safety.

The original reason for using void *ctx in the first place seems to have
been dropped with 1b51d6fce4 ("btrfs: send: remove indirect callback
parameter for changed_cb").

Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik 488bc2a2d2 btrfs: run delayed refs less often in commit_cowonly_roots
We love running delayed refs in commit_cowonly_roots, but it is a bit
excessive.  I was seeing cases of running 3 or 4 refs a few times in a
row during this time.  Instead simply:

- update all of the roots first
- then run delayed refs
- then handle the empty block groups case
- and then if we have any more dirty roots do the whole thing again

This allows us to be much more efficient with our delayed ref running,
as we can batch a few more operations at once.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik dac348e925 btrfs: stop running all delayed refs during snapshot
This was added in commit 361048f586 ("Btrfs: fix full backref problem
when inserting shared block reference") to address a problem where we
hit the following BUG_ON() in alloc_reserved_tree_block

        if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
                BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));

However this BUG_ON() is bogus, and was removed by previous commit:

  btrfs: remove bogus BUG_ON in alloc_reserved_tree_block

We no longer need to run delayed refs because of this, and can remove
this flushing here.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik b7774425e0 btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
The fix 361048f586 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flushing at subvolume
creation time in order to avoid hitting this particular BUG_ON().

Before this fix, we were tripping the BUG_ON() by

1. Modify snapshot A, which creates blocks with a normal reference for
   snapshot A, as A is the owner of these blocks.  We now have delayed
   refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
   children blocks of the root node for the new root B, thus creating
   more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
   must push FULL_BACKREF references to the children of any block that A
   COWs down it's path to its target key.
4. Delayed refs are run.  Because these are newly allocated blocks, we
   have ->must_insert_reserved reserved set on the delayed ref head, we
   call into alloc_reserved_tree_block() to add the extent item, and
   then add our ref.  At the time of this fix, we were ordering
   FULL_BACKREF delayed ref operations first, so we'd go to add this
   reference and then BUG_ON() because we didn't have the FULL_BACKREF
   flag set.

The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A.  This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually COW down and generate FULL_BACKREF entries.  Thus the problem
went away.

However this BUG_ON() is actually completely bogus.  The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block, it must only be set on the actual parent itself.  Consider
the example provided above.  If we COW down one path from A, any nodes
are going to have a FULL_BACKREF ref pushed down to _all_ of their
children, but not all of the children are going to have FULL_BACKREF
set.  It is completely valid to have an extent item with normal and full
backrefs without FULL_BACKREF actually set on the block itself.

As a final note, I have been testing with the patch (applied after this
one)

  btrfs: stop running all delayed refs during snapshot

which removed this flushing.  My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON().  This is actually because
at the time of 361048f586, we ordered SHARED keys _before_ normal
references, and thus they would get run first.  However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik 2a4d84c11a btrfs: move delayed ref flushing for qgroup into qgroup helper
The commit d672633545 ("btrfs: qgroup: Make snapshot accounting work
with new extent-oriented qgroup.") added a flush of the delayed refs
during snapshot creation in order to get the qgroup accounting properly.
However this code has changed and been moved to it's own helper that is
skipped if qgroups are turned off.  Move the flushing to the helper, as
we do not need it when qgroups are turned off.

Also add a comment explaining why it exists, and why it doesn't actually
save us.  This will be helpful later when we try to fix qgroup
accounting properly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00
Josef Bacik ad368f3394 btrfs: only run delayed refs once before committing
We try to pre-flush the delayed refs when committing, because we want to
do as little work as possible in the critical section of the transaction
commit.

However doing this twice can lead to very long transaction commit delays
as other threads are allowed to continue to generate more delayed refs,
which potentially delays the commit by multiple minutes in very extreme
cases.

So simply stick to one pre-flush, and then continue the rest of the
transaction commit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 61a56a992f btrfs: delayed refs pre-flushing should only run the heads we have
Previously our delayed ref running used the total number of items as the
items to run.  However we changed that to number of heads to run with
the delayed_refs_rsv, as generally we want to run all of the operations
for one bytenr.

But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
number of items that we have.  This is generally fine, but if we have
some operation generation loads of delayed refs while we're doing this
pre-flushing in the transaction commit, we'll just spin forever doing
delayed refs.

Fix this to simply pick the number of delayed refs we currently have,
that way we do not end up doing a lot of extra work that's being
generated in other threads.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik e19eb11f4f btrfs: only let one thread pre-flush delayed refs in commit
I've been running a stress test that runs 20 workers in their own
subvolume, which are running an fsstress instance with 4 threads per
worker, which is 80 total fsstress threads.  In addition to this I'm
running balance in the background as well as creating and deleting
snapshots.  This test takes around 12 hours to run normally, going
slower and slower as the test goes on.

The reason for this is because fsstress is running fsync sometimes, and
because we're messing with block groups we often fall through to
btrfs_commit_transaction, so will often have 20-30 threads all calling
btrfs_commit_transaction at the same time.

These all get stuck contending on the extent tree while they try to run
delayed refs during the initial part of the commit.

This is suboptimal, really because the extent tree is a single point of
failure we only want one thread acting on that tree at once to reduce
lock contention.

Fix this by making the flushing mechanism a bit operation, to make it
easy to use test_and_set_bit() in order to make sure only one task does
this initial flush.

Once we're into the transaction commit we only have one thread doing
delayed ref running, it's just this initial pre-flush that is
problematic.  With this patch my stress test takes around 90 minutes to
run, instead of 12 hours.

The memory barrier is not necessary for the flushing bit as it's
ordered, unlike plain int. The transaction state accessed in
btrfs_should_end_transaction could be affected by that too as it's not
always used under transaction lock. Upon Nikolay's analysis in [1]
it's not necessary:

  In should_end_transaction it's read without holding any locks. (U)

  It's modified in btrfs_cleanup_transaction without holding the
  fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.

  set in cleanup_transaction under fs_info->trans_lock (L)
  set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L)
  set in btrfs_commit_trans to COMMIT_UNBLOCK under
  fs_info->trans_lock.(L)

  set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this
  point the transaction is finished and fs_info->running_trans is NULL (U
  but irrelevant).

  So by the looks of it we can have a concurrent READ race with a WRITE,
  due to reads not taking a lock. In this case what we want to ensure is
  we either see new or old state. I consulted with Will Deacon and he said
  that in such a case we'd want to annotate the accesses to ->state with
  (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I
  don't think this could happen but I imagine at some point KCSAN would
  flag such an access as racy (which it is).

[1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comments regarding memory barrier ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik ddfd08cb04 btrfs: do not block on deleted bgs mutex in the cleaner
While running some stress tests I started getting hung task messages.
This is because the delete unused block groups code has to take the
delete_unused_bgs_mutex to do it's work, which is taken by balance to
make sure we don't delete block groups while we're balancing.

The problem is that balance can take a while, and so we were getting
hung task warnings.  We don't need to block and run these things, and
the cleaner is needed to do other work, so trylock on this mutex and
just bail if we can't acquire it right away.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 867ed321f9 btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing.  However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:

btrfs_inc_extent_ref
  btrfs_copy_root
    create_reloc_root
      btrfs_init_reloc_root
	btrfs_record_root_in_trans
	  btrfs_start_transaction
	    btrfs_update_inode
	      btrfs_update_time
		touch_atime
		  file_accessed
		    btrfs_file_mmap

This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem.  Fix this by aborting the transaction if we fail
to do the reference modification.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik eddda68d97 btrfs: add asserts for deleting backref cache nodes
A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code.  We have two
methods of freeing a backref node

- btrfs_backref_free_node: this just is kfree() essentially.
- btrfs_backref_drop_node: this actually unlinks the node and cleans up
  everything and then calls btrfs_backref_free_node().

We should mostly be using btrfs_backref_drop_node(), to make sure the
node is properly unlinked from the backref cache, and only use
btrfs_backref_free_node() when we know the node isn't actually linked to
the backref cache.  We made a mistake here and thus got the KASAN splat.

Make this style of issue easier to find by adding some ASSERT()'s to
btrfs_backref_free_node() and adjusting our deletion stuff to properly
init the list so we can rely on list_empty() checks working properly.

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427
  RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427
  RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
  R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001
  R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888112402900
   which belongs to the cache kmalloc-128 of size 128
  The buggy address is located 80 bytes inside of
   128-byte region [ffff888112402900, ffff888112402980)
  The buggy address belongs to the page:
  page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402
  flags: 0x17ffe0000000200(slab)
  raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440
  raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                   ^
   ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik f78743fbda btrfs: do not warn if we can't find the reloc root when looking up backref
The backref code is looking for a reloc_root that corresponds to the
given fs root.  However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item.  This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.

Because of this we do not want to WARN_ON().  This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root.  However we can easily hit this case with error injection, and
thus should not do a WARN_ON().

CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:56 +01:00
Josef Bacik 938fcbfb0c btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
While doing error injection testing with my relocation patches I hit the
following assert:

  assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3357!
  invalid opcode: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 24351 Comm: umount Tainted: G        W         5.10.0-rc3+ #193
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a
  RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282
  RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000
  RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050
  RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000
  R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100
  R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100
  FS:  00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0
  Call Trace:
   btrfs_free_block_groups.cold+0x55/0x55
   close_ctree+0x2c5/0x306
   ? fsnotify_destroy_marks+0x14/0x100
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20
   deactivate_locked_super+0x36/0xa0
   cleanup_mnt+0x12d/0x190
   task_work_run+0x5c/0xa0
   exit_to_user_mode_prepare+0x1b1/0x1d0
   syscall_exit_to_user_mode+0x54/0x280
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happened because I injected an error in btrfs_cow_block() while
running the dirty block groups.  When we run the dirty block groups, we
splice the list onto a local list to process.  However if an error
occurs, we only cleanup the transactions dirty block group list, not any
pending block groups we have on our locally spliced list.

In fact if we fail to allocate a path in this function we'll also fail
to clean up the splice list.

Fix this by splicing the list back onto the transaction dirty block
group list so that the block groups are cleaned up.  Then add a 'out'
label and have the error conditions jump to out so that the errors are
handled properly.  This also has the side-effect of fixing a problem
where we would clear 'ret' on error because we unconditionally ran
btrfs_run_delayed_refs().

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik c78a10aebb btrfs: fix reloc root leak with 0 ref reloc roots on recovery
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later.  The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because we never call __add_reloc_root() on the reloc_root.

This exit condition simply isn't correct really.  During normal
operation we can remove ourselves from the rb tree and then we're meant
to clean up later at merge_reloc_roots() time, and this happens
correctly.  During recovery we're depending on free_reloc_roots() to
drop our references, but we're short-circuiting.

Fix this by continuing to check if we're on the list and dropping
ourselves from the reloc_control root list and dropping our reference
appropriately.  Change the corresponding BUG_ON() to an ASSERT() that
does the correct thing if we aren't in the rb tree.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nigel Christian 2e626e5673 btrfs: remove repeated word in struct member comment
Comment for processed extent end of range has an unnecessary "in",
remove it.

Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik 81e75ac74e btrfs: account for new extents being deleted in total_bytes_pinned
My recent patch set "A variety of lock contention fixes", found here

https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/
(Tracked in https://github.com/btrfs/linux/issues/86)

that reduce lock contention on the extent root by running delayed refs
less often resulted in a regression in generic/371.  This test
fallocate()'s the fs until it's full, deletes all the files, and then
tries to fallocate() until full again.

Before these patches we would run all of the delayed refs during
flushing, and then would commit the transaction because we had plenty of
pinned space to recover in order to allocate.  However my patches made
it so we weren't running the delayed refs as aggressively, which meant
that we appeared to have less pinned space when we were deciding to
commit the transaction.

We use the space_info->total_bytes_pinned to approximate how much space
we have pinned.  It's approximate because if we remove a reference to an
extent we may free it, but there may be more references to it than we
know of at that point, but we account it as pinned at the creation time,
and then it's properly accounted when the delayed ref runs.

The way we account for pinned space is if the
delayed_ref_head->total_ref_mod is < 0, because that is clearly a
freeing option.  However there is another case, and that is where
->total_ref_mod == 0 && ->must_insert_reserved == 1.

When we allocate a new extent, we have ->total_ref_mod == 1 and we have
->must_insert_reserved == 1.  This is used to indicate that it is a
brand new extent and will need to have its extent entry added before we
modify any references on the delayed ref head.  But if we subsequently
remove that extent reference, our ->total_ref_mod will be 0, and that
space will be pinned and freed.  Accounting for this case properly
allows for generic/371 to pass with my delayed refs patches applied.

It's important to note that this problem exists without the referenced
patches, it just was uncovered by them.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Josef Bacik 2187374f35 btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
Currently we pass things around to figure out if we maybe freeing data
based on the state of the delayed refs head.  This makes the accounting
sort of confusing and hard to follow, as it's distinctly separate from
the delayed ref heads stuff, but also depends on it entirely.

Fix this by explicitly adjusting the space_info->total_bytes_pinned in
the delayed refs code.  We now have two places where we modify this
counter, once where we create the delayed and destroy the delayed refs,
and once when we pin and unpin the extents.  This means there is a
slight overlap between delayed refs and the pin/unpin mechanisms, but
this is simply used by the ENOSPC infrastructure to determine if we need
to commit the transaction, so there's no adverse affect from this, we
might simply commit thinking it will give us enough space when it might
not.

CC: stable@vger.kernel.org # 5.10
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nikolay Borisov e9aa7c285d btrfs: enable W=1 checks for btrfs
Now that the btrfs' codebase is clean of almost all W=1 warnings let's
enable those checks unconditionally for the entire fs/btrfs/ and its
subdirectories to catch potential errors during development.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:55 +01:00
Nikolay Borisov 8c31a3dbaa btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev
This fixes warning:

fs/btrfs/zoned.c:491:6: warning: variable ‘zone_size’ set but not used [-Wunused-but-set-variable]
  491 |  u64 zone_size;

which got introduced in 12659251ca ("btrfs: implement log-structured
superblock for ZONED mode"). We'll enable the warning by default and
want clean build until the relevant zoned patches land.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 3bed2da1b0 btrfs: fix parameter description for functions in extent_io.c
This makes the file W=1 clean and fixes the following warnings:

fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages'
fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov d98b188ea4 btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clear for W=1 warnings, namely the
following ones are fixed:

fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction'
fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes'
fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov b762d1d08d btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space
Fixes following warnings:

fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'inode' not described in 'btrfs_inode_rsv_release'
fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_inode_rsv_release'
fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'reserved' not described in 'btrfs_delalloc_release_space'
fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_delalloc_release_space'
fs/btrfs/delalloc-space.c:472: warning: Excess function parameter 'release_bytes' description in 'btrfs_delalloc_release_space'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 6e353e3b3c btrfs: document btrfs_check_shared parameters
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 2639631d34 btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs
Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:54 +01:00
Nikolay Borisov 9ee9b97990 btrfs: document fs_info in btrfs_rmap_block
Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 9241969547 btrfs: document now parameter of peek_discard_list
Fixes fs/btrfs/discard.c:203: warning: Function parameter or member 'now' not described in 'peek_discard_list'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov f092cf3cfd btrfs: improve parameter description for __btrfs_write_out_cache
Fixes following W=1 warnings:
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache'
fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache'

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 696eb22b67 btrfs: fix parameter description in delayed-ref.c functions
This fixes the following warnings:

fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_release'
fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'nr' not described in 'btrfs_delayed_refs_rsv_release'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'fs_info' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'src' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'num_bytes' not described in 'btrfs_migrate_to_delayed_refs_rsv'
fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_refill'
fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'flush' not described in 'btrfs_delayed_refs_rsv_refill'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov ca4207ae13 btrfs: fix function description formats in file-item.c
This fixes following W=1 warnings:

fs/btrfs/file-item.c:27: warning: Cannot understand  * @inode:  the inode we want to update the disk_i_size for
 on line 27 - I thought it was a doc line
fs/btrfs/file-item.c:65: warning: Cannot understand  * @inode - the inode we're modifying
 on line 65 - I thought it was a doc line
fs/btrfs/file-item.c:91: warning: Cannot understand  * @inode - the inode we're modifying
 on line 91 - I thought it was a doc line

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 9ad37bb3ff btrfs: fix parameter description of btrfs_add_extent_mapping
This fixes the following compiler warnings:

fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping'
fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Nikolay Borisov 401bd2dd12 btrfs: document modified parameter of add_extent_mapping
Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member
'modified' not described in 'add_extent_mapping'

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:53 +01:00
Qu Wenruo 3c198fe064 btrfs: rework the order of btrfs_ordered_extent::flags
[BUG]
There is a long existing bug in the last parameter of
btrfs_add_ordered_extent(), in commit 771ed689d2 ("Btrfs: Optimize
compressed writeback and reads") back to 2008.

In that ancient commit btrfs_add_ordered_extent() expects the @type
parameter to be one of the following:

- BTRFS_ORDERED_REGULAR
- BTRFS_ORDERED_NOCOW
- BTRFS_ORDERED_PREALLOC
- BTRFS_ORDERED_COMPRESSED

But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.

Ironically extra check in __btrfs_add_ordered_extent() won't set the bit
if we see (type == IO_DONE || type == IO_COMPLETE), and avoid any
obvious bug.

But this still leads to regular COW ordered extent having no bit to
indicate its type in various trace events, rendering REGULAR bit
useless.

[FIX]
Change the following aspects to avoid such problem:

- Reorder btrfs_ordered_extent::flags
  Now the type bits go first (REGULAR/NOCOW/PREALLCO/COMPRESSED), then
  DIRECT bit, finally extra status bits like IO_DONE/COMPLETE/IOERR.

- Add extra ASSERT() for btrfs_add_ordered_extent_*()

- Remove @type parameter for btrfs_add_ordered_extent_compress()
  As the only valid @type here is BTRFS_ORDERED_COMPRESSED.

- Remove the unnecessary special check for IO_DONE/COMPLETE in
  __btrfs_add_ordered_extent()
  This is just to make the code work, with extra ASSERT(), there are
  limited values can be passed in.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Yang Li fe3b7bb085 btrfs: remove redundant NULL check before kvfree
Fix below warnings reported by coccicheck:
./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
functions is not needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Josef Bacik 7e2a870a59 btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
Zygo reported the following panic when testing my error handling patches
for relocation:

  kernel BUG at fs/btrfs/backref.c:2545!
  invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G        W 14
  Hardware name: QEMU Standard PC (i440FX + PIIX,

  Call Trace:
   btrfs_backref_error_cleanup+0x4df/0x530
   build_backref_tree+0x1a5/0x700
   ? _raw_spin_unlock+0x22/0x30
   ? release_extent_buffer+0x225/0x280
   ? free_extent_buffer.part.52+0xd7/0x140
   relocate_tree_blocks+0x2a6/0xb60
   ? kasan_unpoison_shadow+0x35/0x50
   ? do_relocation+0xc10/0xc10
   ? kasan_kmalloc+0x9/0x10
   ? kmem_cache_alloc_trace+0x6a3/0xcb0
   ? free_extent_buffer.part.52+0xd7/0x140
   ? rb_insert_color+0x342/0x360
   ? add_tree_block.isra.36+0x236/0x2b0
   relocate_block_group+0x2eb/0x780
   ? merge_reloc_roots+0x470/0x470
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x18f0
   ? pvclock_clocksource_read+0xeb/0x190
   ? btrfs_relocate_chunk+0x120/0x120
   ? lock_contended+0x620/0x6e0
   ? do_raw_spin_lock+0x1e0/0x1e0
   ? do_raw_spin_unlock+0xa8/0x140
   btrfs_ioctl_balance+0x1f9/0x460
   btrfs_ioctl+0x24c8/0x4380
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurs because of this check

  if (RB_EMPTY_NODE(&upper->rb_node))
	  BUG_ON(!list_empty(&node->upper));

As we are dropping the backref node, if we discover that our upper node
in the edge we just cleaned up isn't linked into the cache that we are
now done with this node, thus the BUG_ON().

However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges.  All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed.  We could very well have many
edges still left to cleanup on this node.

The fact is we simply do not need this check, we can just process all of
the edges only for this node, because below this check we do the
following

  if (list_empty(&upper->lower)) {
	  list_add_tail(&upper->lower, &cache->leaves);
	  upper->lowest = 1;
  }

If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later.  If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.

Fix this problem by dropping this logic altogether.  With this fix I no
longer see the panic when testing with error injection in the backref
code.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Josef Bacik f7ba2d3751 btrfs: keep track of the root owner for relocation reads
While testing the error paths in relocation, I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc3+ #206 Not tainted
  ------------------------------------------------------
  btrfs-balance/1571 is trying to acquire lock:
  ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0

  but task is already holding lock:
  ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-tree-00){++++}-{3:3}:
	 down_write_nested+0x43/0x80
	 __btrfs_tree_lock+0x27/0x100
	 btrfs_search_slot+0x248/0x890
	 relocate_tree_blocks+0x490/0x650
	 relocate_block_group+0x1ba/0x5d0
	 kretprobe_trampoline+0x0/0x50

  -> #1 (btrfs-csum-01){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x5ab/0x890
	 btrfs_del_csums+0x10b/0x3c0
	 __btrfs_free_extent+0x49d/0x8e0
	 __btrfs_run_delayed_refs+0x283/0x11f0
	 btrfs_run_delayed_refs+0x86/0x220
	 btrfs_start_dirty_block_groups+0x2ba/0x520
	 kretprobe_trampoline+0x0/0x50

  -> #0 (&head_ref->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0x116/0x3e0
	 __mutex_lock+0x7e/0x7b0
	 btrfs_lookup_extent_info+0x156/0x3b0
	 walk_down_proc+0x1c3/0x280
	 walk_down_tree+0x64/0xe0
	 btrfs_drop_subtree+0x182/0x260
	 do_relocation+0x52e/0x660
	 relocate_tree_blocks+0x2ae/0x650
	 relocate_block_group+0x1ba/0x5d0
	 kretprobe_trampoline+0x0/0x50

  other info that might help us debug this:

  Chain exists of:
    &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-tree-00);
				 lock(btrfs-csum-01);
				 lock(btrfs-tree-00);
    lock(&head_ref->mutex);

   *** DEADLOCK ***

  5 locks held by btrfs-balance/1571:
   #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40
   #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300
   #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0
   #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
   #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

  stack backtrace:
  CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1167/0x2150
   lock_acquire+0x116/0x3e0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   __mutex_lock+0x7e/0x7b0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   ? btrfs_lookup_extent_info+0x156/0x3b0
   ? release_extent_buffer+0x124/0x170
   ? _raw_spin_unlock+0x1f/0x30
   ? release_extent_buffer+0x124/0x170
   btrfs_lookup_extent_info+0x156/0x3b0
   walk_down_proc+0x1c3/0x280
   walk_down_tree+0x64/0xe0
   btrfs_drop_subtree+0x182/0x260
   do_relocation+0x52e/0x660
   relocate_tree_blocks+0x2ae/0x650
   ? add_tree_block+0x149/0x1b0
   relocate_block_group+0x1ba/0x5d0
   elfcorehdr_read+0x40/0x40
   ? elfcorehdr_read+0x40/0x40
   ? btrfs_balance+0x796/0xf40
   ? __kthread_parkme+0x66/0x90
   ? btrfs_balance+0xf40/0xf40
   ? balance_kthread+0x37/0x50
   ? kthread+0x137/0x150
   ? __kthread_bind_mask+0x60/0x60
   ? ret_from_fork+0x1f/0x30

As you can see this is bogus, we never take another tree's lock under
the csum lock.  This happens because sometimes we have to read tree
blocks from disk without knowing which root they belong to during
relocation.  We defaulted to an owner of 0, which translates to an fs
tree.  This is fine as all fs trees have the same class, but obviously
isn't fine if the block belongs to a COW only tree.

Thankfully COW only trees only have their owners root as a reference to
them, and since we already look up the extent information during
relocation, go ahead and check and see if this block might belong to a
COW only tree, and if so save the owner in the tree_block struct.  This
allows us to read_tree_block with the proper owner, which gets rid of
this lockdep splat.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo c0f0a9e716 btrfs: introduce helper to grab an existing extent buffer from a page
This patch will extract the code to grab an extent buffer from a page
into a helper, grab_extent_buffer_from_page().

This reduces one indent level, and provides the work place for later
expansion for subapge support.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo c0fab48095 btrfs: update comment for btrfs_dirty_pages
The original comment is from the initial merge, which has several
problems:

- No holes check any more
- No inline decision is made

Update the out-of-date comment with more correct one.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo 6bc5636a67 btrfs: refactor __extent_writepage_io() to improve readability
The refactoring involves the following modifications:

- iosize alignment
  In fact we don't really need to manually do alignment at all.
  All extent maps should already be aligned, thus basic ASSERT() check
  would be enough.

- redundant variables
  We have extra variable like blocksize/pg_offset/end.
  They are all unnecessary.

  @blocksize can be replaced by sectorsize size directly, and it's only
  used to verify the em start/size is aligned.

  @pg_offset can be easily calculated using @cur and page_offset(page).

  @end is just assigned from @page_end and never modified, use
  "start + PAGE_SIZE - 1" directly and remove @page_end.

- remove some BUG_ON()s
  The BUG_ON()s are for extent map, which we have tree-checker to check
  on-disk extent data item and runtime check.
  ASSERT() should be enough.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:52 +01:00
Qu Wenruo 0c64c33c60 btrfs: rename parameter offset to disk_bytenr in submit_extent_page
The parameter offset is confusing, it's supposed to be the disk bytenr
of metadata/data.  Rename it to disk_bytenr and update the comment.

Also rename each offset passed to submit_extent_page() as @disk_bytenr
so they're consistent.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Qu Wenruo 58f74b2203 btrfs: refactor btrfs_dec_test_* functions for ordered extents
The refactoring involves the following modifications:

- Return bool instead of int

- Parameter update for @cached of btrfs_dec_test_first_ordered_pending()
  For btrfs_dec_test_first_ordered_pending(), @cached is only used to
  return the finished ordered extent.
  Rename it to @finished_ret.

- Comment updates

  * Change one stale comment
    Which still refers to btrfs_dec_test_ordered_pending(), but the
    context is calling  btrfs_dec_test_first_ordered_pending().
  * Follow the common comment style for both functions
    Add more detailed descriptions for parameters and the return value
  * Move the reason why test_and_set_bit() is used into the call sites

- Change how the return value is calculated
  The most anti-human part of the return value is:

    if (...)
	ret = 1;
    ...
    return ret == 0;

  This means, when we set ret to 1, the function returns 0.
  Change the local variable name to @finished, and directly return the
  value of it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Qu Wenruo 523929f1ca btrfs: make btrfs_dio_private::bytes u32
btrfs_dio_private::bytes is only assigned from bio::bi_iter::bi_size,
which is never larger than U32.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov d7830b7155 btrfs: remove always true condition in btrfs_start_delalloc_roots
Following the rework in e076ab2a2c ("btrfs: shrink delalloc pages
instead of full inodes") the nr variable is no longer passed by
reference to start_delalloc_inodes hence it cannot change. Additionally
we are always guaranteed for it to be positive number hence it's
redundant to have it as a condition in the loop. Simply remove that
usage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov 9db4dc241e btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64 which gets instantly translated either to LONG_MAX
(if U64_MAX is passed) or cast to an unsigned long (which is in fact,
wrong because writeback_control::nr_to_write is a signed, long type).

Just convert the function's argument to be long time which obviates the
need to manually convert u64 value to a long. Adjust all call sites
which pass U64_MAX to pass LONG_MAX. Finally ensure that in
shrink_delalloc the u64 is converted to a long without overflowing,
resulting in a negative number.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Filipe Manana 9c4a062a94 btrfs: send: remove stale code when checking for shared extents
After commit 040ee6120c ("Btrfs: send, improve clone range") we do not
use anymore the data_offset field of struct backref_ctx, as after that we
do all the necessary checks for the data offset of file extent items at
clone_range(). Since there are no more users of data_offset from that
structure, remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Nikolay Borisov 7056bf69e5 btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device
Instead of having three 'if' to handle non-NULL return value consolidate
this in one 'if (ret)'. That way the code is more obvious:

 - Always drop delete_unused_bgs_mutex if ret is not NULL
 - If ret is negative -> goto done
 - If it's 1 -> reset ret to 0, release the path and finish the loop.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:51 +01:00
Josef Bacik 1478143ac8 btrfs: ref-verify: make sure owner is set for all refs
I noticed that shared ref entries in ref-verify didn't have the proper
owner set, which caused me to think there was something seriously wrong.
However the problem is if we have a parent we simply weren't filling out
the owner part of the reference, even though we have it.

Fix this by making sure we set all the proper fields when we modify a
reference, this way we'll have the proper owner if a problem happens and
we don't waste time thinking we're updating the wrong level.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik 0d73a11c62 btrfs: ref-verify: pass down tree block level when building refs
I noticed that sometimes I would have the wrong level printed out with
ref-verify while testing some error injection related problems.  This is
because we only get the level from the main extent item, but our
references could go off the current leaf into another, and at that point
we lose our level.

Fix this by keeping track of the last tree block level that we found,
the same way we keep track of our bytenr and num_bytes, in case we
happen to wander into another leaf while still processing the references
for a bytenr.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik 1fec12a560 btrfs: noinline btrfs_should_cancel_balance
I was attempting to reproduce a problem that Zygo hit, but my error
injection wasn't firing for a few of the common calls to
btrfs_should_cancel_balance.  This is because the compiler decided to
inline it at these spots.  Keep this from happening by explicitly
marking the function as noinline so that error injection will always
work.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Josef Bacik f75e2b79b5 btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block
The following patches are going to address error handling in relocation,
in order to test those patches I need to be able to inject errors in
btrfs_search_slot and btrfs_cow_block, as we call both of these pretty
often in different cases during relocation.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 69948022c9 btrfs: remove new_dirid argument from btrfs_create_subvol_root
It's no longer used. While at it also remove new_dirid in create_subvol
as it's used in a single place and open code it. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 23125104d8 btrfs: make btrfs_root::free_objectid hold the next available objectid
Adjust the way free_objectid is being initialized, it now stores
BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary,
BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit
that now it becomes unnecessary to explicitly initialize free_objectid
for a newly create fs root.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov 6b8fad576a btrfs: rename btrfs_root::highest_objectid to free_objectid
This reflects the true purpose of the member as it's being used solely
in context where a new objectid is being allocated. Future changes will
also change the way it's being used to closely follow this semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 543068a217 btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid
This better reflects the semantics of the function i.e no search is
performed whatsoever.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 453e487386 btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid
This function is used to initialize the in-memory
btrfs_root::highest_objectid member, which is used to get an available
objectid. Rename it to better reflect its semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov 149716570b btrfs: cleanup local variables in btrfs_file_write_iter
First replace all inode instances with a pointer to btrfs_inode. This
removes multiple invocations of the BTRFS_I macro, subsequently remove
2 local variables as they are called only once and simply refer to
them directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Zhihao Cheng 3cc64e7ebf btrfs: clarify error returns values in __load_free_space_cache
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.

Fixes: a67509c300 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Josef Bacik 4f4317c13a btrfs: fix error handling in commit_fs_roots
While doing error injection I would sometimes get a corrupt file system.
This is because I was injecting errors at btrfs_search_slot, but would
only do it one time per stack.  This uncovered a problem in
commit_fs_roots, where if we get an error we would just break.  However
we're in a nested loop, the first loop being a loop to find all the
dirty fs roots, and then subsequent root updates would succeed clearing
the error value.

This isn't likely to happen in real scenarios, however we could
potentially get a random ENOMEM once and then not again, and we'd end up
with a corrupted file system.  Fix this by moving the error checking
around a bit to the main loop, as this is the only place where something
will fail, and return the error as soon as it occurs.

With this patch my reproducer no longer corrupts the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Linus Torvalds c05d51c773 for-5.11-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAUIkAACgkQxWXV+ddt
 WDsWVg/+IIEk9H1v9q9ShvVmPvmnlT8/0ywj1hdwFMBkFBjIeU8tBz9ZMGPXCzrF
 XemmWKChVOnR3SIq/bMrwuRC/Gv/pBvwVshXLP51YJHv7lSGX0Ayrb27BFQcVaC/
 3QhpE7veEiqxwLyMj+LWG4hE2X+oqiqzrXCpeC5un4zEluT45RSKooqueQ4jM8aw
 DrKLQA57a1YEIqrE2KQzy5A6BnSNyxPXEEX34kbugmmen46Fh77hrwme1K9vQn1t
 v3/V4LcarXADxxokAxU2Igb/vK0+BN33NOYsBwLWWD4kUaTGS4KczsDOowkRRTMH
 /qiQUdca0X7ElR+VFl8rgB8PxuJcZ87aCdsMkErUA4sjxyp11VDIeEgirPNAcXtR
 b+1LIkn3k3l8JzkKyXwDuZuNBsh0idTY24IE+QDBMIGq+jE1N6N3t5gEwa2NeaiP
 9O5QnS5XAJCo8a9+gp1aF5z94vwQwvf9TA80nGrnpxGmXEEEZ9PgXsc4JON1Blhn
 NtJDwBPzEjHCEYdE73/lRMsLmYeGhpRugKb+lQ+OTo2iZzxH2SjWn9vXKiN7vAp2
 zysjzdPfkY5BLggH5cPg0fuRaf/Is00EeVqn3eA7QsFKDhrpoPFBO+aV5xeshsaz
 8fjt7kkXFb+Vyy4SDvmPioJQ7/MFZ5Czn+BL1JwO4l/vYcEMUzM=
 =/yHv
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes for a late rc:

   - fix lockdep complaint on 32bit arches and also remove an unsafe
     memory use due to device vs filesystem lifetime

   - two fixes for free space tree:

      * race during log replay and cache rebuild, now more likely to
        happen due to changes in this dev cycle

      * possible free space tree corruption with online conversion
        during initial tree population"

* tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix log replay failure due to race with space cache rebuild
  btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
  btrfs: fix possible free space tree corruption with online conversion
2021-01-29 13:54:40 -08:00
Christoph Hellwig 616c6a6884 btrfs: use bio_kmalloc in __alloc_device
Use bio_kmalloc instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27 09:51:48 -07:00
Filipe Manana 9ad6d91f05 btrfs: fix log replay failure due to race with space cache rebuild
After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.

If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:

    btrfs_cache_block_group(cache, 1);

That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).

So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:

static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
       if (ret < 0 || search_start != *offset)
            return -EINVAL;
(...)

It's the "search_start != *offset" that results in the condition being
evaluated to true.

When this happens we got the following in dmesg/syslog:

[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed

We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.

However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.

So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().

This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
e747853cae ("btrfs: load free space cache asynchronously"), but in
fact the issue has been there for years, it was just much less likely
to happen.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:53 +01:00
Su Yue c41ec4529d btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
This effectively reverts commit d5c8238849 ("btrfs: convert
data_seqcount to seqcount_mutex_t").

While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):

  [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
  [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
  [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
  [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
  [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
  [66.441485] Call Trace:
  [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
  [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
  [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
  [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
  [66.441700]  ? __lock_acquire+0x35f/0x2650
  [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
  [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
  [66.441724]  ? call_rcu+0x2d3/0x530
  [66.441731]  ? __might_fault+0x41/0x90
  [66.441736]  ? kvm_sched_clock_read+0x15/0x50
  [66.441740]  ? sched_clock+0x8/0x10
  [66.441745]  ? sched_clock_cpu+0x13/0x180
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
  [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441785]  ? __might_fault+0x89/0x90
  [66.441791]  __do_fast_syscall_32+0x54/0x80
  [66.441796]  do_fast_syscall_32+0x32/0x70
  [66.441801]  do_SYSENTER_32+0x15/0x20
  [66.441805]  entry_SYSENTER_32+0x9f/0xf2
  [66.441808] EIP: 0xab7b5549
  [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
  [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
  [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
  [66.441833] irq event stamp: 42579
  [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
  [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
  [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
  [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60

  ========================================================================
  btrfs_remove_chunk+0x58b/0x7b0:
  __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
  (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
  (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
  ========================================================================

The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.

After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init().  In this scenario, its second
parameter info->chunk_mutex  is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.

The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.

Commit d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:50 +01:00
Josef Bacik 2f96e40212 btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups.  While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:37 +01:00
Christoph Hellwig 309dca309f block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device.  From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:17:20 -07:00
Christian Brauner 549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner ba73d98745
namei: handle idmapped mounts in may_*() helpers
The may_follow_link(), may_linkat(), may_lookup(), may_open(),
may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
helpers determine whether the caller is privileged enough to perform the
associated operations. Let them handle idmapped mounts by mapping the
inode or fsids according to the mount's user namespace. Afterwards the
checks are identical to non-idmapped inodes. The patch takes care to
retrieve the mount's user namespace right before performing permission
checks and passing it down into the fileystem so the user namespace
can't change in between by someone idmapping a mount that is currently
not idmapped. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner 0d56a4518d
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner e65ce2a50c
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.

The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.

In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.

Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner 2f221d6f7b
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 21cb47be6f
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christoph Hellwig 2f63296578 iomap: pass a flags argument to iomap_dio_rw
Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument.  The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-01-23 10:06:09 -08:00
Linus Torvalds 9791581c04 for-5.11-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAIojwACgkQxWXV+ddt
 WDstnw/+O0KSsK6ChZCNdjqFAgWL41RYj0fPOgM/8xlNaQyYHS0Jczeoud6m/2Wm
 U41kTb/a6xpmx0Z/2uf/5pDIBPFld/IUuUf/AdJsMzy8Bpky2/sfg6Kmx0tKGLXQ
 1WKp9ox0MlAUI0Tz/jGfX5rwsIgWKYKIF2iGUio/H1ktR3l+cXlmLWsSIB43F6VL
 AjKRRyFCNU//dV7syNMmmj9yU0HpSs53SpWxUIURuTFaE71LyUgzaxDTlZ6c/PET
 e4wdf8nl0wzEESCgSUPdh2AWNNiTEbbGhhhNi9250PUyQki2f4AGBlxVSLZH/fDn
 6PbBDvefW4umCMeMxxmgnYJU6tG78qg/LvxzZXt54rOtB0WMbrIl0u7hFCVhQ3Qk
 nqrS4tqeaL+OeuR6xamBMaRohgRFa9S+QVjTwtDFo/oVYH4TVvQDfKQS6GsWwDvB
 ySzz3WewoFqhe47cMsy28Dg49xkDSIJIr5hmSNGSXTreZ2JIa+qLKywoH87+YDIE
 ql0PN47z4NB+MbWDV7SZM8DCVqiQ7+1LOV9bPmqfvNl3YTfvXyMaoPLmWWVstPr2
 iyhXrvESgm1s2RCF1a0tXIkv82L6QYjJ3eeEDsvAmtKBouNL9BnMvwi3zW5yKiry
 m1qj7C7e6C1TivYitcCfbRCKqeAnUv8VwkSbW9BvNDe7i5AD++U=
 =gSYr
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more one line fixes for various bugs, stable material.

   - fix send when emitting clone operation from the same file and root

   - fix double free on error when cleaning backrefs

   - lockdep fix during relocation

   - handle potential error during reloc when starting transaction

   - skip running delayed refs during commit (leftover from code removal
     in this dev cycle)"

* tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't clear ret in btrfs_start_dirty_block_groups
  btrfs: fix lockdep splat in btrfs_recover_relocation
  btrfs: do not double free backref nodes on error
  btrfs: don't get an EINTR during drop_snapshot for reloc
  btrfs: send: fix invalid clone operations when cloning from the same file and root
  btrfs: no need to run delayed refs after commit_fs_roots during commit
2021-01-20 14:15:33 -08:00
Josef Bacik 34d1eb0e59 btrfs: don't clear ret in btrfs_start_dirty_block_groups
If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly.  Fix this by only running the delayed refs
if there was no failure.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 16:00:11 +01:00
Josef Bacik fb28610097 btrfs: fix lockdep splat in btrfs_recover_relocation
While testing the error paths of relocation I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc6+ #217 Not tainted
  ------------------------------------------------------
  mount/779 is trying to acquire lock:
  ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340

  but task is already holding lock:
  ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-root-00){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x462/0x8f0
	 btrfs_update_root+0x55/0x2b0
	 btrfs_drop_snapshot+0x398/0x750
	 clean_dirty_subvols+0xdf/0x120
	 btrfs_recover_relocation+0x534/0x5a0
	 btrfs_start_pre_rw_mount+0xcb/0x170
	 open_ctree+0x151f/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (sb_internal#2){.+.+}-{0:0}:
	 start_transaction+0x444/0x700
	 insert_balance_item.isra.0+0x37/0x320
	 btrfs_balance+0x354/0xf40
	 btrfs_ioctl_balance+0x2cf/0x380
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1120/0x1e10
	 lock_acquire+0x116/0x370
	 __mutex_lock+0x7e/0x7b0
	 btrfs_recover_balance+0x2f0/0x340
	 open_ctree+0x1095/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(sb_internal#2);
				 lock(btrfs-root-00);
    lock(&fs_info->balance_mutex);

   *** DEADLOCK ***

  2 locks held by mount/779:
   #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  stack backtrace:
  CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1120/0x1e10
   lock_acquire+0x116/0x370
   ? btrfs_recover_balance+0x2f0/0x340
   __mutex_lock+0x7e/0x7b0
   ? btrfs_recover_balance+0x2f0/0x340
   ? btrfs_recover_balance+0x2f0/0x340
   ? rcu_read_lock_sched_held+0x3f/0x80
   ? kmem_cache_alloc_trace+0x2c4/0x2f0
   ? btrfs_get_64+0x5e/0x100
   btrfs_recover_balance+0x2f0/0x340
   open_ctree+0x1095/0x1726
   btrfs_mount_root.cold+0x12/0xea
   ? rcu_read_lock_sched_held+0x3f/0x80
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   ? __kmalloc_track_caller+0x2f2/0x320
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   ? capable+0x3a/0x60
   path_mount+0x433/0xc10
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is straightforward to fix, simply release the path before we setup
the balance_ctl.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:59 +01:00
Josef Bacik 49ecc679ab btrfs: do not double free backref nodes on error
Zygo reported the following KASAN splat:

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache().  This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up.  However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache.  Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.

Fixes: 75bfb9aff4 ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:53 +01:00
Josef Bacik 18d3bff411 btrfs: don't get an EINTR during drop_snapshot for reloc
This was partially fixed by f3e3d9cc35 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction.  The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.

Fixes: f3e3d9cc35 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:47 +01:00
Filipe Manana 518837e650 btrfs: send: fix invalid clone operations when cloning from the same file and root
When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.

Example reproducer:

  $ cat test-send-clone.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  # Create our test file with a single and large extent (1M) and with
  # different content for different file ranges that will be reflinked
  # later.
  xfs_io -f \
         -c "pwrite -S 0xab 0 128K" \
         -c "pwrite -S 0xcd 128K 128K" \
         -c "pwrite -S 0xef 256K 256K" \
         -c "pwrite -S 0x1a 512K 512K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now do a series of changes to our file such that we end up with
  # different parts of the extent reflinked into different file offsets
  # and we overwrite a large part of the extent too, so no file extent
  # items refer to that part that was overwritten. This used to confuse
  # the algorithm used by the kernel to figure out which file ranges to
  # clone, making it attempt to clone from a source range starting at
  # the current eof of the file, resulting in the receiver to fail since
  # it is an invalid clone operation.
  #
  xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
         -c "reflink $MNT/foobar 0K 512K 256K" \
         -c "reflink $MNT/foobar 512K 128K 256K" \
         -c "pwrite -S 0x73 384K 640K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  echo -e "\nFile digest in the original filesystem:"
  md5sum $MNT/snap2/foobar

  # Now unmount the filesystem, create a new one, mount it and try to
  # apply both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  # Must match what we got in the original filesystem of course.
  echo -e "\nFile digest in the new filesystem:"
  md5sum $MNT/snap2/foobar

  umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

  $ ./test-send-clone.sh
  wrote 131072/131072 bytes at offset 0
  128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
  wrote 131072/131072 bytes at offset 131072
  128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
  wrote 262144/262144 bytes at offset 262144
  256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
  wrote 524288/524288 bytes at offset 524288
  512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  linked 983040/983040 bytes at offset 1048576
  960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
  linked 262144/262144 bytes at offset 524288
  256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
  linked 262144/262144 bytes at offset 131072
  256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
  wrote 655360/655360 bytes at offset 393216
  640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2

  File digest in the original filesystem:
  9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to foobar: Invalid argument

  File digest in the new filesystem:
  132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
   the extent is shared and can be (fully or partially) found at file
   offset 0.

   At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
   64K and, since the file extent item at file offset 0 has a data offset
   of 0, it issues a clone operation, from the same file and root, that
   has a source range offset of 64K, destination offset of 1M and a length
   of 64K, since the extent item at file offset 0 refers only to the first
   128K of the shared extent.

   After this clone operation, the file size (and eof) at the receiver is
   increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
   it checks for the next file extent item, which starts at file offset
   128K. This file extent item has a data offset of 0 and a length of
   256K, so a clone operation with a source range offset of 256K, a
   destination offset of 1088K (1M + 64K) and length of 128K is issued.

   After this operation the file size (and eof) at the receiver increases
   from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
   it checks for the next file extent item, located at file offset 384K.
   This file extent item points to a different extent, not the one we want
   to clone, with a length of 640K. So we issue a write operation into the
   file range 1216K (1088K + 128K, end of the last clone operation), with
   a length of 640K and with a data matching the one we can find for that
   range in send root.

   After this operation, the file size (and eof) at the receiver increases
   from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
   we look into the file extent item, which is for file offset 1M and it
   points to the extent we want to clone, with a data offset of 64K and a
   length of 960K.

   However this matches the file offset we started with, the start of the
   range to clone into. So we can't for sure find any file extent item
   from here onwards with the rest of the data we want to clone, yet we
   proceed and since the file extent item points to the shared extent,
   with a data offset of 64K, we issue a clone operation with a source
   range starting at file offset 1856K, which matches the file extent
   item's offset, 1M, plus the amount of data cloned and written so far,
   which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
   operation is invalid since the source range offset matches the current
   eof of the file in the receiver. We should have stopped looking for
   extents to clone at this point and instead fallback to write, which
   would simply the contain the data in the file range from 1856K to
   1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-12 15:35:09 +01:00
David Sterba 14ff8e1970 btrfs: no need to run delayed refs after commit_fs_roots during commit
The inode number cache has been removed in this dev cycle, there's one
more leftover. We don't need to run the delayed refs again after
commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
is no more since 5297199a8b ("btrfs: remove inode number cache
feature").

Nothing else between commit_fs_roots and btrfs_qgroup_account_extents
could create new delayed refs so the qgroup consistency should be safe.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-12 15:35:04 +01:00
Linus Torvalds 6e68b9961f for-5.11-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/8jD4ACgkQxWXV+ddt
 WDteWQ//QcpD6STpLwAC+g6zJyJln7Au9lfQvawugvOJssbtdPkJQP3ZiK+Izwi/
 /xagu6XMazJM+47acNJKDNntOqVkp+O6CxEbLU+rL/D288L3HEGxayZ2LL90wm6J
 tbIebOE+BSVZ/5oe0jVdqZXwYvUtTiJ7PoFgrZPXJCnddSitZRD3tC4Wmi/Yo5+0
 +7CW6PT3/s7KARwYXpgpMM5vi8qO2nfHfTUdRlSh59g7zC/TH7HiitL6roHzlX1k
 g/aaKYLVcg62OPpw7ZXwde/qH8n1TR+H5WX6vBInqd/9jYcNkVGqijCgBeL1TJkN
 Vx/b69ccODK2GNzuuYoo3k3XvSwZWsOTZp+k4y3EZ1cMONMo1snu/xglYsvSZvUL
 lNCQlA9hIZNskRwEvkEea68/bQdiOl6xezgR9tajMlmz7oCsV/Cz/MJ+RfqaxdH3
 bV6eTTex67lQfzAda+gN+zjBrFzQdmK700gKimdzF1XfcYmmCIdZVX8Gm/N6ldQN
 LNRe8zYRaqrmRk9PQ355RqYDZmft/wLiUV6V0j74oV65WpPe2R4pULWdmPAGm6Oj
 UWM+ZR3u9m8asg7ghKYgct2pxCS3+gLbDNXNcOSxYxthEEZB2JqkAMjtjCfwJilN
 PXfuXaBKRmRck+AcYfbBrfJOljQ+zAJdTK/Rid40TwwpFCe/jjY=
 =G3R4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "More material for stable trees.

   - tree-checker: check item end overflow

   - fix false warning during relocation regarding extent type

   - fix inode flushing logic, caused notable performance regression
     (since 5.10)

   - debugging fixups:
      - print correct offset for reloc tree key
      - pass reliable fs_info pointer to error reporting helper"

* tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: shrink delalloc pages instead of full inodes
  btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
  btrfs: tree-checker: check if chunk item end overflows
  btrfs: prevent NULL pointer dereference in extent_io_tree_panic
  btrfs: print the actual offset in btrfs_root_name
2021-01-11 14:18:56 -08:00
Josef Bacik e076ab2a2c btrfs: shrink delalloc pages instead of full inodes
Commit 38d715f494 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot.  However this introduced a pretty serious
performance regression.  To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush.  The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221ce
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us.  This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe.  I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

  5.9		0m54.258s
  5.10		1m26.212s
  5.10+patch	0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:

  5.10.5            4m00s
  5.10.5+patch      1m08s
  5.11-rc2	    5m14s
  5.11-rc2+patch    1m30s

Reported-by: René Rebe <rene@exactcode.de>
Fixes: 38d715f494 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add my test results ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-08 16:36:44 +01:00
Qu Wenruo 50e31ef486 btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.

Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.

The dmesg would look like:

  [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
  [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
  [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
  [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

[CAUSE]
The ENOENT error is returned from the following call chain:

  add_data_references()
  |- delete_v1_space_cache();
     |- if (!found)
	   return -ENOENT;

The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.

With extra debugging, the offending tree block looks like this:

  leaf bytenr = 42676709441536, data_bytenr = 34626327621632

                ctime 1567904822.739884119 (2019-09-08 03:07:02)
                mtime 0.0 (1970-01-01 01:00:00)
                otime 0.0 (1970-01-01 01:00:00)
        item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
                generation 1517381 type 2 (prealloc)
                prealloc data disk byte 34626327621632 nr 262144 <<<
                prealloc data offset 0 nr 262144
        item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
                generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
                lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
                uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.

[FIX]
The code is modified in commit 19b546d7a1 ("btrfs: relocation: Use
btrfs_find_all_leafs to locate data extent parent tree leaves"), before
that commit, we use something like

  "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
ignoring BTRFS_FILE_EXTENT_PREALLOC.

Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.

Reported-by: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
Fixes: 19b546d7a1 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: stable@vger.kernel.org # 5.6+
Tested-By: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Su Yue 347fb0cfc9 btrfs: tree-checker: check if chunk item end overflows
While mounting a crafted image provided by user, kernel panics due to
the invalid chunk item whose end is less than start.

  [66.387422] loop: module loaded
  [66.389773] loop0: detected capacity change from 262144 to 0
  [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
  [66.431061] BTRFS info (device loop0): disk space caching is enabled
  [66.431078] BTRFS info (device loop0): has skinny extents
  [66.437101] BTRFS error: insert state: end < start 29360127 37748736
  [66.437136] ------------[ cut here ]------------
  [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
  [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G           O      5.11.0-rc1-custom #45
  [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
  [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
  [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
  [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
  [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
  [66.437447] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.437451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.437460] PKRU: 55555554
  [66.437464] Call Trace:
  [66.437475]  set_extent_bit+0x652/0x740 [btrfs]
  [66.437539]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.437576]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.437621]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.437674]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.437708]  ? kvm_sched_clock_read+0x18/0x40
  [66.437739]  open_ctree+0xb32/0x1734 [btrfs]
  [66.437781]  ? bdi_register_va+0x1b/0x20
  [66.437788]  ? super_setup_bdi_name+0x79/0xd0
  [66.437810]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.437854]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437873]  legacy_get_tree+0x34/0x60
  [66.437880]  vfs_get_tree+0x2d/0xc0
  [66.437888]  vfs_kern_mount.part.0+0x78/0xc0
  [66.437897]  vfs_kern_mount+0x13/0x20
  [66.437902]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.437940]  ? kfree+0x5ff/0x670
  [66.437944]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437962]  legacy_get_tree+0x34/0x60
  [66.437974]  vfs_get_tree+0x2d/0xc0
  [66.437983]  path_mount+0x48c/0xd30
  [66.437998]  __x64_sys_mount+0x108/0x140
  [66.438011]  do_syscall_64+0x38/0x50
  [66.438018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.438023] RIP: 0033:0x7f0138827f6e
  [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [66.438078] irq event stamp: 18169
  [66.438082] hardirqs last  enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0
  [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0
  [66.438092] softirqs last  enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438103] ---[ end trace e114b111db64298b ]---
  [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
  [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
  [66.441069] ------------[ cut here ]------------
  [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
  [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G        W  O      5.11.0-rc1-custom #45
  [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [66.458356] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.459841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.462196] PKRU: 55555554
  [66.462692] Call Trace:
  [66.463139]  set_extent_bit.cold+0x30/0x98 [btrfs]
  [66.464049]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.490466]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.514097]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.534976]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.555718]  ? kvm_sched_clock_read+0x18/0x40
  [66.575758]  open_ctree+0xb32/0x1734 [btrfs]
  [66.595272]  ? bdi_register_va+0x1b/0x20
  [66.614638]  ? super_setup_bdi_name+0x79/0xd0
  [66.633809]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.652938]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.671925]  legacy_get_tree+0x34/0x60
  [66.690300]  vfs_get_tree+0x2d/0xc0
  [66.708221]  vfs_kern_mount.part.0+0x78/0xc0
  [66.725808]  vfs_kern_mount+0x13/0x20
  [66.742730]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.759350]  ? kfree+0x5ff/0x670
  [66.775441]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.791750]  legacy_get_tree+0x34/0x60
  [66.807494]  vfs_get_tree+0x2d/0xc0
  [66.823349]  path_mount+0x48c/0xd30
  [66.838753]  __x64_sys_mount+0x108/0x140
  [66.854412]  do_syscall_64+0x38/0x50
  [66.869673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.885093] RIP: 0033:0x7f0138827f6e
  [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [67.216138] ---[ end trace e114b111db64298c ]---
  [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [67.465436] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [67.511660] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [67.558449] PKRU: 55555554
  [67.581146] note: mount[613] exited with preempt_count 2

The image has a chunk item which has a logical start 37748736 and length
18446744073701163008 (-8M). The calculated end 29360127 overflows.
EEXIST was caught by insert_state() because of the duplicate end and
extent_io_tree_panic() was called.

Add overflow check of chunk item end to tree checker so it can be
detected early at mount time.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Su Yue 29b665cc51 btrfs: prevent NULL pointer dereference in extent_io_tree_panic
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04 ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Josef Bacik 71008734d2 btrfs: print the actual offset in btrfs_root_name
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid.  Fix this helper to take the key
so we have access to the offset when we need it.

Fixes: 457f1864b5 ("btrfs: pretty print leaked root name")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Linus Torvalds 71c061d244 for-5.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/0cI8ACgkQxWXV+ddt
 WDspQw/8DcC8zhGgunk0m2kcXd6dFOGbsr3hNGCsgUSKESRw6AgTZ0rJf/QLjayF
 /vaJWzQW9ijfZ92fWZS+mrmskk0N8RFOsEvkCRLesgRaasbrkchLBo5HGQasOBEV
 LXyU878GrBkNaHzClJz+JdU26i0d17BFdddgtZVQ1St9Wd9ecc7Q6iqG80RWFeE7
 uVbhv+QjocM3EieOnwIy5Mz6jZgJLYwqw7/y2njKduBeJtbt1K1j/y7IJk0WFMUM
 8eUpDL6vlAHB8FjV2wWOzO46bbEaUpaBADM6yabrq0lnM0kr7Rb+WV/WSLM/AZ3g
 Hzs4qROOEP+zjfZ5nYjJQDJRMpSipZomsUY5uMZnhRxlZuHPaoBotRRzs5AIZYj2
 BnkfucOcjxS/JTBD//ltJXE8RxbMIyMBBBipbBwqmxOkR9gM9BPuJ6iJPfUX//gG
 1GHJ+FPns8ua3JW21ih6H31xNEPS36tsywvE8yCEtEWMxCFCBwgGu+4D8KpGBjtY
 ySFxkxxAbTuFi9fqSE/mBC+6lpbVTO0OvizuoEQh8C2izkXRbDsDVgPN8d7rCW7h
 Cdox4DUp61sNf+G3ll9Dv9ceAXroZTVRTHGjlav6NAFpydz3yPo5x54Ex7S+k3oN
 BAcZEl1Tl3hz4WxF8Ywc+yJ8n8l9AVa3KcYRXVbyVjTGg+JjU94=
 =jlQf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes that arrived before the end of the year:

   - a bunch of fixes related to transaction handle lifetime wrt various
     operations (umount, remount, qgroup scan, orphan cleanup)

   - async discard scheduling fixes

   - fix item size calculation when item keys collide for extend refs
     (hardlinks)

   - fix qgroup flushing from running transaction

   - fix send, wrong file path when there is an inode with a pending
     rmdir

   - fix deadlock when cloning inline extent and low on free metadata
     space"

* tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: run delayed iputs when remounting RO to avoid leaking them
  btrfs: add assertion for empty list of transactions at late stage of umount
  btrfs: fix race between RO remount and the cleaner task
  btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
  btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
  btrfs: merge critical sections of discard lock in workfn
  btrfs: fix racy access to discard_ctl data
  btrfs: fix async discard stall
  btrfs: tests: initialize test inodes location
  btrfs: send: fix wrong file path when there is an inode with a pending rmdir
  btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
  btrfs: correctly calculate item size used when item key collision happens
  btrfs: fix deadlock when cloning inline extent and low on free metadata space
2021-01-06 11:19:08 -08:00
Filipe Manana a8cc263eb5 btrfs: run delayed iputs when remounting RO to avoid leaking them
When remounting RO, after setting the superblock with the RO flag, the
cleaner task will start sleeping and do nothing, since the call to
btrfs_need_cleaner_sleep() keeps returning 'true'. However, when the
cleaner task goes to sleep, the list of delayed iputs may not be empty.

As long as we are in RO mode, the cleaner task will keep sleeping and
never run the delayed iputs. This means that if a filesystem unmount
is started, we get into close_ctree() with a non-empty list of delayed
iputs, and because the filesystem is in RO mode and is not in an error
state (or a transaction aborted), btrfs_error_commit_super() and
btrfs_commit_super(), which run the delayed iputs, are never called,
and later we fail the assertion that checks if the delayed iputs list
is empty:

  assertion failed: list_empty(&fs_info->delayed_iputs), in fs/btrfs/disk-io.c:4049
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3153!
  invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 1 PID: 3780621 Comm: umount Tainted: G             L    5.6.0-rc2-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
  Code: 8b 7b 58 48 85 ff 74 (...)
  RSP: 0018:ffffb748c89bbdf8 EFLAGS: 00010246
  RAX: 0000000000000051 RBX: ffff9608f2584000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffff91998988 RDI: 00000000ffffffff
  RBP: ffff9608f25870d8 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0cbc500
  R13: ffffffff92411750 R14: 0000000000000000 R15: ffff9608f2aab250
  FS:  00007fcbfaa66c80(0000) GS:ffff960936c80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fffc2c2dd38 CR3: 0000000235e54002 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x1a2/0x2e6 [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x93/0xc0
   exit_to_usermode_loop+0xf9/0x100
   do_syscall_64+0x20d/0x260
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7fcbfaca6307
  Code: eb 0b 00 f7 d8 64 89 (...)
  RSP: 002b:00007fffc2c2ed68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 0000558203b559b0 RCX: 00007fcbfaca6307
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000558203b55bc0
  RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fffc2c2dad0
  R10: 0000558203b55bf0 R11: 0000000000000246 R12: 0000558203b55bc0
  R13: 00007fcbfadcc204 R14: 0000558203b55aa8 R15: 0000000000000000
  Modules linked in: btrfs dm_flakey dm_log_writes (...)
  ---[ end trace d44d303790049ef6 ]---

So fix this by making the remount RO path run any remaining delayed iputs
after waiting for the cleaner to become inactive.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 15:00:08 +01:00
Filipe Manana 0a31daa4b6 btrfs: add assertion for empty list of transactions at late stage of umount
Add an assertion to close_ctree(), after destroying all the work queues,
to verify we do not have any transaction still open or committing at that
at that point. If we have any, it means something is seriously wrong and
that can cause memory leaks and use-after-free problems. This is motivated
by the previous patches that fixed bugs where we ended up leaking an open
transaction after unmounting the filesystem.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 15:00:06 +01:00
Filipe Manana a0a1db70df btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.

The following sequence of steps explains how the race happens.

1) The filesystem is mounted in RW mode and the cleaner task is running.
   This means that currently BTRFS_FS_CLEANER_RUNNING is set at
   fs_info->flags;

2) The cleaner task is currently running delayed iputs for example;

3) A filesystem RO remount operation starts;

4) The RO remount task calls btrfs_commit_super(), which commits any
   currently open transaction, and it finishes;

5) At this point the cleaner task is still running and it creates a new
   transaction by doing one of the following things:

   * When running the delayed iput() for an inode with a 0 link count,
     in which case at btrfs_evict_inode() we start a transaction through
     the call to evict_refill_and_join(), use it and then release its
     handle through btrfs_end_transaction();

   * When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
     a transaction is started at btrfs_drop_snapshot() and then its handle
     is released through a call to btrfs_end_transaction_throttle();

   * When the remount task was still running, and before the remount task
     called btrfs_delete_unused_bgs(), the cleaner task also called
     btrfs_delete_unused_bgs() and it picked and removed one block group
     from the list of unused block groups. Before the cleaner task started
     a transaction, through btrfs_start_trans_remove_block_group() at
     btrfs_delete_unused_bgs(), the remount task had already called
     btrfs_commit_super();

6) So at this point the filesystem is in RO mode and we have an open
   transaction that was started by the cleaner task;

7) Shortly after a filesystem unmount operation starts. At close_ctree()
   we stop the transaction kthread before it had a chance to commit the
   transaction, since less than 30 seconds (the default commit interval)
   have elapsed since the last transaction was committed;

8) We end up calling iput() against the btree inode at close_ctree() while
   there is an open transaction, and since that transaction was used to
   update btrees by the cleaner, we have dirty pages in the btree inode
   due to COW operations on metadata extents, and therefore writeback is
   triggered for the btree inode.

   So btree_write_cache_pages() is invoked to flush those dirty pages
   during the final iput() on the btree inode. This results in creating a
   bio and submitting it, which makes us end up at
   btrfs_submit_metadata_bio();

9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
   that calls btrfs_wq_submit_bio(), because check_async_write() returned
   a value of 1. This value of 1 is because we did not have hardware
   acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
   set in fs_info->flags;

10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
  Code: f0 01 00 00 48 39 c2 75 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
  RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
  RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 48 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c6 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Code: 48 83 bb b0 03 00 00 00 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
  RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c7 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Code: ad de 49 be 22 01 00 (...)
  RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
  RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
  RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c8 ]---
  BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
  BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
  BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

  stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
  Code: 54 55 53 48 89 f3 (...)
  RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
  RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
  RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
  R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
  FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
   btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
   submit_one_bio+0x61/0x70 [btrfs]
   btree_write_cache_pages+0x414/0x450 [btrfs]
   ? kobject_put+0x9a/0x1d0
   ? trace_hardirqs_on+0x1b/0xf0
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   ? free_debug_processing+0x1e1/0x2b0
   do_writepages+0x43/0xe0
   ? lock_acquired+0x199/0x490
   __writeback_single_inode+0x59/0x650
   writeback_single_inode+0xaf/0x120
   write_inode_now+0x94/0xd0
   iput+0x187/0x2b0
   close_ctree+0x2c6/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f3cfebabee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
  RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
  R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

  =============================================================================
  BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x0000000050cbdd61 @offset=12104
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        sync_filesystem+0x74/0x90
        generic_shutdown_super+0x22/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x0000000086e9b0ff @offset=12776
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
        commit_cowonly_roots+0x248/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000001a340018 @offset=4408
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_commit_transaction+0x60/0xc40 [btrfs]
        create_subvol+0x56a/0x990 [btrfs]
        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        btrfs_ioctl+0x1a92/0x36f0 [btrfs]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x000000002b46292a @offset=13648
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------
  INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? __mutex_unlock_slowpath+0x45/0x2a0
   kmem_cache_destroy+0x55/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000004cf95ea8 @offset=6264
  INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.

This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 15:00:02 +01:00
Filipe Manana 638331fa56 btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
When we delete a root (subvolume or snapshot), at the very end of the
operation, we attempt to remove the root's orphan item from the root tree,
at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any
error from btrfs_del_orphan_item() since it is not a serious problem and
the next time the filesystem is mounted we remove such stray orphan items
at btrfs_find_orphan_roots().

However if the filesystem is mounted RO and we have stray orphan items for
any previously deleted root, we can end up leaking a transaction and other
data structures when unmounting the filesystem, as well as crashing if we
do not have hardware acceleration for crc32c available.

The steps that lead to the transaction leak are the following:

1) The filesystem is mounted in RW mode;

2) A subvolume is deleted;

3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root,
   it gets a failure at btrfs_del_orphan_item(), which is ignored, due to
   an ENOMEM when allocating a path for example. So the orphan item for
   the root remains in the root tree;

4) The filesystem is unmounted;

5) The filesystem is mounted RO (-o ro). During the mount path we call
   btrfs_find_orphan_roots(), which iterates the root tree searching for
   orphan items. It finds the orphan item for our deleted root, and since
   it can not find the root, it starts a transaction to delete the orphan
   item (by calling btrfs_del_orphan_item());

6) The RO mount completes;

7) Before the transaction kthread commits the transaction created for
   deleting the orphan item (i.e. less than 30 seconds elapsed since the
   mount, the default commit interval), a filesystem unmount operation is
   started;

8) At close_ctree(), we stop the transaction kthread, but we still have a
   transaction open with at least one dirty extent buffer, a leaf for the
   tree root which was COWed when deleting the orphan item;

9) We then proceed to destroy the work queues, free the roots and block
   groups, etc. After that we drop the last reference on the btree inode by
   calling iput() on it. Since there are dirty pages for the btree inode,
   corresponding to the COWed extent buffer, btree_write_cache_pages() is
   invoked to flush those dirty pages. This results in creating a bio and
   submitting it, which makes us end up at btrfs_submit_metadata_bio();

10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
    that calls btrfs_wq_submit_bio(), because check_async_write() returned
    a value of 1. This value of 1 is because we did not have hardware
    acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
    set in fs_info->flags;

11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

 ------------[ cut here ]------------
 WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
 Code: f0 01 00 00 48 39 c2 75 (...)
 RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 01 48 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c6 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
 Code: 48 83 bb b0 03 00 00 00 (...)
 RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
 RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 01 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c7 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
 Code: ad de 49 be 22 01 00 (...)
 RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c8 ]---
 BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
 BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

 stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
 CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
 Code: 54 55 53 48 89 f3 (...)
 RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
 FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
  btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
  submit_one_bio+0x61/0x70 [btrfs]
  btree_write_cache_pages+0x414/0x450 [btrfs]
  ? kobject_put+0x9a/0x1d0
  ? trace_hardirqs_on+0x1b/0xf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? free_debug_processing+0x1e1/0x2b0
  do_writepages+0x43/0xe0
  ? lock_acquired+0x199/0x490
  __writeback_single_inode+0x59/0x650
  writeback_single_inode+0xaf/0x120
  write_inode_now+0x94/0xd0
  iput+0x187/0x2b0
  close_ctree+0x2c6/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f3cfebabee7
 Code: ff 0b 00 f7 d8 64 89 01 (...)
 RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
 =============================================================================
 BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? lock_release+0x20e/0x4c0
  kmem_cache_destroy+0x55/0x120
  btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x0000000050cbdd61 @offset=12104
 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        sync_filesystem+0x74/0x90
        generic_shutdown_super+0x22/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 INFO: Object 0x0000000086e9b0ff @offset=12776
 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
        commit_cowonly_roots+0x248/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 =============================================================================
 BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? lock_release+0x20e/0x4c0
  kmem_cache_destroy+0x55/0x120
  btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x000000001a340018 @offset=4408
 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_commit_transaction+0x60/0xc40 [btrfs]
        create_subvol+0x56a/0x990 [btrfs]
        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        btrfs_ioctl+0x1a92/0x36f0 [btrfs]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 INFO: Object 0x000000002b46292a @offset=13648
 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 =============================================================================
 BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? __mutex_unlock_slowpath+0x45/0x2a0
  kmem_cache_destroy+0x55/0x120
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x000000004cf95ea8 @offset=6264
 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

So fix this by calling btrfs_find_orphan_roots() in the mount path only if
we are mounting the filesystem in RW mode. It's pointless to have it called
for RO mounts anyway, since despite adding any deleted roots to the list of
dead roots, we will never have the roots deleted until the filesystem is
remounted in RW mode, as the cleaner kthread does nothing when we are
mounted in RO - btrfs_need_cleaner_sleep() always returns true and the
cleaner spends all time sleeping, never cleaning dead roots.

This is accomplished by moving the call to btrfs_find_orphan_roots() from
open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that
if later the filesystem is remounted RW, we populate the list of dead
roots and have the cleaner task delete the dead roots.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:59 +01:00
Filipe Manana cb13eea3b4 btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
If we remount a filesystem in RO mode while the qgroup rescan worker is
running, we can end up having it still running after the remount is done,
and at unmount time we may end up with an open transaction that ends up
never getting committed. If that happens we end up with several memory
leaks and can crash when hardware acceleration is unavailable for crc32c.
Possibly it can lead to other nasty surprises too, due to use-after-free
issues.

The following steps explain how the problem happens.

1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
   running;

2) We remount the filesystem in RO mode, and never stop/pause the rescan
   worker, so after the remount the rescan worker is still running. The
   important detail here is that the rescan task is still running after
   the remount operation committed any ongoing transaction through its
   call to btrfs_commit_super();

3) The rescan is still running, and after the remount completed, the
   rescan worker started a transaction, after it finished iterating all
   leaves of the extent tree, to update the qgroup status item in the
   quotas tree. It does not commit the transaction, it only releases its
   handle on the transaction;

4) A filesystem unmount operation starts shortly after;

5) The unmount task, at close_ctree(), stops the transaction kthread,
   which had not had a chance to commit the open transaction since it was
   sleeping and the commit interval (default of 30 seconds) has not yet
   elapsed since the last time it committed a transaction;

6) So after stopping the transaction kthread we still have the transaction
   used to update the qgroup status item open. At close_ctree(), when the
   filesystem is in RO mode and no transaction abort happened (or the
   filesystem is in error mode), we do not expect to have any transaction
   open, so we do not call btrfs_commit_super();

7) We then proceed to destroy the work queues, free the roots and block
   groups, etc. After that we drop the last reference on the btree inode
   by calling iput() on it. Since there are dirty pages for the btree
   inode, corresponding to the COWed extent buffer for the quotas btree,
   btree_write_cache_pages() is invoked to flush those dirty pages. This
   results in creating a bio and submitting it, which makes us end up at
   btrfs_submit_metadata_bio();

8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
   that calls btrfs_wq_submit_bio(), because check_async_write() returned
   a value of 1. This value of 1 is because we did not have hardware
   acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
   set in fs_info->flags;

9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
   workqueue at fs_info->workers, which was already freed before by the
   call to btrfs_stop_all_workers() at close_ctree(). This results in an
   invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
  Code: f0 01 00 00 48 39 c2 75 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
  RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
  RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 48 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c6 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Code: 48 83 bb b0 03 00 00 00 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
  RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c7 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Code: ad de 49 be 22 01 00 (...)
  RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
  RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
  RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c8 ]---
  BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
  BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
  BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

  stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
  Code: 54 55 53 48 89 f3 (...)
  RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
  RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
  RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
  R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
  FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
   btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
   submit_one_bio+0x61/0x70 [btrfs]
   btree_write_cache_pages+0x414/0x450 [btrfs]
   ? kobject_put+0x9a/0x1d0
   ? trace_hardirqs_on+0x1b/0xf0
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   ? free_debug_processing+0x1e1/0x2b0
   do_writepages+0x43/0xe0
   ? lock_acquired+0x199/0x490
   __writeback_single_inode+0x59/0x650
   writeback_single_inode+0xaf/0x120
   write_inode_now+0x94/0xd0
   iput+0x187/0x2b0
   close_ctree+0x2c6/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f3cfebabee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
  RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
  R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

  =============================================================================
  BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x0000000050cbdd61 @offset=12104
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
	btrfs_free_tree_block+0x128/0x360 [btrfs]
	__btrfs_cow_block+0x489/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	sync_filesystem+0x74/0x90
	generic_shutdown_super+0x22/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x0000000086e9b0ff @offset=12776
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
	commit_cowonly_roots+0x248/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000001a340018 @offset=4408
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
	btrfs_free_tree_block+0x128/0x360 [btrfs]
	__btrfs_cow_block+0x489/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	btrfs_commit_transaction+0x60/0xc40 [btrfs]
	create_subvol+0x56a/0x990 [btrfs]
	btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
	__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
	btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
	btrfs_ioctl+0x1a92/0x36f0 [btrfs]
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x33/0x80
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x000000002b46292a @offset=13648
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? __mutex_unlock_slowpath+0x45/0x2a0
   kmem_cache_destroy+0x55/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000004cf95ea8 @offset=6264
  INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
	__slab_alloc.isra.0+0x109/0x1c0
	kmem_cache_alloc+0x7bb/0x830
	btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
	btrfs_cow_block+0xf7/0x220 [btrfs]
	btrfs_search_slot+0x62a/0xc40 [btrfs]
	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
	open_ctree+0x125a/0x18a0 [btrfs]
	btrfs_mount_root.cold+0x13/0xed [btrfs]
	legacy_get_tree+0x30/0x60
	vfs_get_tree+0x28/0xe0
	fc_mount+0xe/0x40
	vfs_kern_mount.part.0+0x71/0x90
	btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
	kmem_cache_free+0x34c/0x3c0
	__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
	commit_cowonly_roots+0xfb/0x300 [btrfs]
	btrfs_commit_transaction+0x367/0xc40 [btrfs]
	close_ctree+0x113/0x2fa [btrfs]
	generic_shutdown_super+0x6c/0x100
	kill_anon_super+0x14/0x30
	btrfs_kill_super+0x12/0x20 [btrfs]
	deactivate_locked_super+0x31/0x70
	cleanup_mnt+0x100/0x160
	task_work_run+0x68/0xb0
	exit_to_user_mode_prepare+0x1bb/0x1c0
	syscall_exit_to_user_mode+0x4b/0x260
	entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

Fix this issue by having the remount path stop the qgroup rescan worker
when we are remounting RO and teach the rescan worker to stop when a
remount is in progress. If later a remount in RW mode happens, we are
already resuming the qgroup rescan worker through the call to
btrfs_qgroup_rescan_resume(), so we do not need to worry about that.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:57 +01:00
Pavel Begunkov 8fc058597a btrfs: merge critical sections of discard lock in workfn
btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
ktime.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:54 +01:00
Pavel Begunkov 1ea2872fc6 btrfs: fix racy access to discard_ctl data
Because only one discard worker may be running at any given point, it
could have been safe to modify ->prev_discard, etc. without
synchronization, if not for @override flag in
btrfs_discard_schedule_work() and delayed_work_pending() returning false
while workfn is running.

That may lead to torn reads of u64 for some architectures, but that's
not a big problem as only slightly affects the discard rate.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:53 +01:00
Pavel Begunkov ea9ed87c73 btrfs: fix async discard stall
Might happen that bg->discard_eligible_time was changed without
rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
time, peek_discard_list() returns NULL, and all work halts and goes to
sleep without further rescheduling even there are block groups to
discard.

It happens pretty often, but not so visible from the userspace because
after some time it usually will be kicked off anyway by someone else
calling btrfs_discard_reschedule_work().

Fix it by continue rescheduling if block group discard lists are not
empty.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:51 +01:00
Josef Bacik 675a4fc8f3 btrfs: tests: initialize test inodes location
I noticed that sometimes the module failed to load because the self
tests failed like this:

  BTRFS: selftest: fs/btrfs/tests/inode-tests.c:963 miscount, wanted 1, got 0

This turned out to be because sometimes the btrfs ino would be the btree
inode number, and thus we'd skip calling the set extent delalloc bit
helper, and thus not adjust ->outstanding_extents.

Fix this by making sure we initialize test inodes with a valid inode
number so that we don't get random failures during self tests.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:49 +01:00
Filipe Manana 0b3f407e67 btrfs: send: fix wrong file path when there is an inode with a pending rmdir
When doing an incremental send, if we have a new inode that happens to
have the same number that an old directory inode had in the base snapshot
and that old directory has a pending rmdir operation, we end up computing
a wrong path for the new inode, causing the receiver to fail.

Example reproducer:

  $ cat test-send-rmdir.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  mkdir $MNT/dir
  touch $MNT/dir/file1
  touch $MNT/dir/file2
  touch $MNT/dir/file3

  # Filesystem looks like:
  #
  # .                                     (ino 256)
  # |----- dir/                           (ino 257)
  #         |----- file1                  (ino 258)
  #         |----- file2                  (ino 259)
  #         |----- file3                  (ino 260)
  #

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now remove our directory and all its files.
  rm -fr $MNT/dir

  # Unmount the filesystem and mount it again. This is to ensure that
  # the next inode that is created ends up with the same inode number
  # that our directory "dir" had, 257, which is the first free "objectid"
  # available after mounting again the filesystem.
  umount $MNT
  mount $DEV $MNT

  # Now create a new file (it could be a directory as well).
  touch $MNT/newfile

  # Filesystem now looks like:
  #
  # .                                     (ino 256)
  # |----- newfile                        (ino 257)
  #

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  # Now unmount the filesystem, create a new one, mount it and try to apply
  # both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null

  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  umount $MNT

When running the test, the receive operation for the incremental stream
fails:

  $ ./test-send-rmdir.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: chown o257-9-0 failed: No such file or directory

So fix this by tracking directories that have a pending rmdir by inode
number and generation number, instead of only inode number.

A test case for fstests follows soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Tested-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:50:16 +01:00
Qu Wenruo ae5e070eac btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
There is a chance of racing for qgroup flushing which may lead to
deadlock:

	Thread A		|	Thread B
   (not holding trans handle)	|  (holding a trans handle)
--------------------------------+--------------------------------
__btrfs_qgroup_reserve_meta()   | __btrfs_qgroup_reserve_meta()
|- try_flush_qgroup()		| |- try_flush_qgroup()
   |- QGROUP_FLUSHING bit set   |    |
   |				|    |- test_and_set_bit()
   |				|    |- wait_event()
   |- btrfs_join_transaction()	|
   |- btrfs_commit_transaction()|

			!!! DEAD LOCK !!!

Since thread A wants to commit transaction, but thread B is holding a
transaction handle, blocking the commit.
At the same time, thread B is waiting for thread A to finish its commit.

This is just a hot fix, and would lead to more EDQUOT when we're near
the qgroup limit.

The proper fix would be to make all metadata/data reservations happen
without holding a transaction handle.

CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:50:07 +01:00
ethanwu 9a66497156 btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.

item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.

When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.

And split_leaf return EOVERFLOW from following code:

  if (extend && data_size + btrfs_item_size_nr(l, slot) +
      sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
      return -EOVERFLOW;

In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.

However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:

  data_size = sizeof(*di) + name_len;
  if (data_size + btrfs_item_size_nr(leaf, slot) +
      sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))

the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.

The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly

  [436149.586170] ------------[ cut here ]------------
  [436149.586173] BTRFS: Transaction aborted (error -75)
  [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
  [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
  [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
  [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
  [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
  [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
  [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
  [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
  [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
  [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
  [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
  [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
  [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [436149.586296] Call Trace:
  [436149.586311]  vfs_rename+0x383/0x920
  [436149.586313]  ? vfs_rename+0x383/0x920
  [436149.586315]  do_renameat2+0x4ca/0x590
  [436149.586317]  __x64_sys_rename+0x20/0x30
  [436149.586324]  do_syscall_64+0x5a/0x120
  [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [436149.586332] RIP: 0033:0x7fa3133b1d37
  [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
  [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
  [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
  [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
  [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00

Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction.  It simply
creates files and rename them to forged names that leads to
hash collision.

There are two ways to fix this. One is to simply revert the patch
878f2d2cb3 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.

The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it correct leaf
space check in collision case. This fix will not account
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.

  1. extent-tree.c: lookup_inline_extent_backref
  2. file-item.c: btrfs_csum_file_blocks

to make the logic of btrfs_search_slot more clear, we add a flag
search_for_extension in btrfs_path.

This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:50:00 +01:00
Filipe Manana 3d45f221ce btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:

* the async reclaim task queued a delalloc work to flush delalloc for
  the destination inode of the clone operation;

* the task executing that delalloc work gets blocked waiting for the
  range with the dirty page to be unlocked, which is currently locked
  by the task doing the clone operation;

* the async reclaim task blocks waiting for the delalloc work to complete;

* the cloning task is waiting on the waitqueue of its reservation ticket
  while holding the range with the dirty page locked in the inode's
  io_tree;

* if metadata space is not released by some other task (like delalloc for
  some other inode completing for example), the clone task waits forever
  and as a consequence the delalloc work and async reclaim tasks will hang
  forever as well. Releasing more space on the other hand may require
  starting a transaction, which will hang as well when trying to reserve
  metadata space, resulting in a deadlock between all these tasks.

When this happens, traces like the following show up in dmesg/syslog:

  [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
  [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
  [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
  [87452.326136] Call Trace:
  [87452.326737]  __schedule+0x5d1/0xcf0
  [87452.327390]  schedule+0x45/0xe0
  [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  [87452.328894]  ? finish_wait+0x90/0x90
  [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  [87452.330133]  ? __mod_memcg_state+0x8e/0x160
  [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
  [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  [87452.332007]  ? lock_release+0x20e/0x4c0
  [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
  [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
  [87452.333653]  ? lock_acquire+0x1a3/0x490
  [87452.334177]  do_writepages+0x43/0xe0
  [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
  [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
  [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
  [87452.337838]  process_one_work+0x24e/0x5e0
  [87452.338437]  worker_thread+0x50/0x3b0
  [87452.339137]  ? process_one_work+0x5e0/0x5e0
  [87452.339884]  kthread+0x153/0x170
  [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
  [87452.341153]  ret_from_fork+0x22/0x30
  [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
  [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
  [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  [87452.345655] Call Trace:
  [87452.346305]  __schedule+0x5d1/0xcf0
  [87452.346947]  ? kvm_clock_read+0x14/0x30
  [87452.347676]  ? wait_for_completion+0x81/0x110
  [87452.348389]  schedule+0x45/0xe0
  [87452.349077]  schedule_timeout+0x30c/0x580
  [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [87452.350340]  ? lock_acquire+0x1a3/0x490
  [87452.351006]  ? try_to_wake_up+0x7a/0xa20
  [87452.351541]  ? lock_release+0x20e/0x4c0
  [87452.352040]  ? lock_acquired+0x199/0x490
  [87452.352517]  ? wait_for_completion+0x81/0x110
  [87452.353000]  wait_for_completion+0xab/0x110
  [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
  [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  [87452.354455]  flush_space+0x24f/0x660 [btrfs]
  [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  [87452.355565]  process_one_work+0x24e/0x5e0
  [87452.356024]  worker_thread+0x20f/0x3b0
  [87452.356487]  ? process_one_work+0x5e0/0x5e0
  [87452.356973]  kthread+0x153/0x170
  [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
  [87452.357880]  ret_from_fork+0x22/0x30
  (...)
  < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
  (...)
  [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
  [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
  [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
  [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
  [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
  [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
  [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
  [92867.447920] Call Trace:
  [92867.448435]  __schedule+0x5d1/0xcf0
  [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [92867.449423]  schedule+0x45/0xe0
  [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
  [92867.450576]  ? finish_wait+0x90/0x90
  [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
  [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
  [92867.453848]  ? lock_release+0x20e/0x4c0
  [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
  [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
  [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
  [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
  [92867.457213]  do_clone_file_range+0xd4/0x1f0
  [92867.457828]  vfs_clone_file_range+0x4d/0x230
  [92867.458355]  ? lock_release+0x20e/0x4c0
  [92867.458890]  ioctl_file_clone+0x8f/0xc0
  [92867.459377]  do_vfs_ioctl+0x342/0x750
  [92867.459913]  __x64_sys_ioctl+0x62/0xb0
  [92867.460377]  do_syscall_64+0x33/0x80
  [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  (...)
  < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
  (...)

Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).

Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.

This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.

CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621c ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:49:50 +01:00
Linus Torvalds ac7ac4618c for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
 wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
 N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
 e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
 fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
 IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
 Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
 Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
 GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
 Q1Xqs6xwww==
 =zo4w
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Another series of killing more code than what is being added, again
  thanks to Christoph's relentless cleanups and tech debt tackling.

  This contains:

   - blk-iocost improvements (Baolin Wang)

   - part0 iostat fix (Jeffle Xu)

   - Disable iopoll for split bios (Jeffle Xu)

   - block tracepoint cleanups (Christoph Hellwig)

   - Merging of struct block_device and hd_struct (Christoph Hellwig)

   - Rework/cleanup of how block device sizes are updated (Christoph
     Hellwig)

   - Simplification of gendisk lookup and removal of block device
     aliasing (Christoph Hellwig)

   - Block device ioctl cleanups (Christoph Hellwig)

   - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

   - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

   - sbitmap improvements (Pavel Begunkov)

   - Hybrid polling fix (Pavel Begunkov)

   - bvec iteration improvements (Pavel Begunkov)

   - Zone revalidation fixes (Damien Le Moal)

   - blk-throttle limit fix (Yu Kuai)

   - Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...
2020-12-16 12:57:51 -08:00
Linus Torvalds f1ee3b8829 for-5.11-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/XdB4ACgkQxWXV+ddt
 WDv41g//dOkrwjAVBfDUwRT/yKqojyEsZB1aNyHlPHFw8KEw5oIW7wxR4oqXi2ed
 /i9KIJe4E9AfqAiexhLvA+Wyt/Sgwz+k4ys82PKhhRNQn7LE4tvhSBUu6JYJDU09
 6I1jagya7ILa8akFXZTmVbXdliI4Ab+pcXWAmQYK/xPVDxYTSsBf4o4MilNBA9FS
 lTwwBh5GTEtIkubr2yVd3pKfF4fT2g1hd+yglpHaOzpcrLMNN4hj4sUFlLbx/FlJ
 MWo+914cSNKJoebbnqhK9djD9hggaaXnNooqfBOXUhZN0VN9rQoKb5tW+TREQmFm
 shrmBSqN7CaqKfSOMZs7WOnTuTvmV/825PnLqDqcTUaLw+BgdyacpO9WflgfSs16
 Cdvagr1SqbrSQ/3WYCpbqPLDNP3XuZ6+m5OWizf6fhyo8xdFcUHZgRC8qejDlycy
 V/zP0c5OYOMi5vo6x/zhrD7Uft7xoFUVcSJCe8WPri082d9LbA2BqwCsullD60PQ
 K/fsmlHs5Uxxy3MFgBPVDdWGgaa9rQ2vXequezbozBIIeeVL+Q9zkeyBFSYuFeE8
 HToRE9B9BUEUh+p1JxPjOdFH/m+sKe1WMdmRLQthMzfOiNWW7pp/nL5rl4BUVmjm
 58dQS73Cj/YNdBomRJXPPtgKIJPAWRrzU/JBcwAdMoKy57oh9NQ=
 =5YAS
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "We have a mix of all kinds of changes, feature updates, core stuff,
  performance improvements and lots of cleanups and preparatory changes.

  User visible:

   - export filesystem generation in sysfs

   - new features for mount option 'rescue':
       - what's currently supported is exported in sysfs
       - 'ignorebadroots'/'ibadroots' - continue even if some essential
         tree roots are not usable (extent, uuid, data reloc, device,
         csum, free space)
       - 'ignoredatacsums'/'idatacsums' - skip checksum verification on
         data
       - 'all' - now enables 'ignorebadroots' + 'ignoredatacsums' +
         'nologreplay'

   - export read mirror policy settings to sysfs, new policies will be
     added in the future

   - remove inode number cache feature (mount -o inode_cache), obsoleted
     in 5.9

  User visible fixes:

   - async discard scheduling fixes on high loads

   - update inode byte counter atomically so stat() does not report
     wrong value in some cases

   - free space tree fixes:
       - correctly report status of v2 after remount
       - clear v1 cache inodes when v2 is newly enabled after remount

  Core:

   - switch own tree lock implementation to standard rw semaphore:
       - one-level lock nesting is not required anymore, the last use of
         this was in free space that's now loaded asynchronously
       - own implementation of adaptive spinning before taking mutex has
         been part of rwsem
       - performance seems to be better in general, much better (+tens
         of percents) for some workloads
       - lockdep does not complain

   - finish direct IO conversion to iomap infrastructure, remove
     temporary workaround for DSYNC after iomap API updates

   - preparatory work to support data and metadata blocks smaller than
     page:
       - generalize code that assumes sectorsize == PAGE_SIZE, lots of
         refactoring
       - planned namely for 64K pages (eg. arm64, ppc64)
       - scrub read-only support

   - preparatory work for zoned allocation mode (SMR/ZBC/ZNS friendly):
       - disable incompatible features
       - round-robin superblock write

   - free space cache (v1) is loaded asynchronously, remove tree path
     recursion

   - slightly improved time tacking for transaction kthread wake ups

  Performance improvements (note that the numbers depend on load type or
  other features and weren't run on the same machine):

   - skip unnecessary work:
       - do not start readahead for csum tree when scrubbing non-data
         block groups
       - do not start and wait for delalloc on snapshot roots on
         transaction commit
       - fix race when defragmenting leads to unnecessary IO

   - dbench speedups (+throughput%/-max latency%):
       - skip unnecessary searches for xattrs when logging an inode
         (+10.8/-8.2)
       - stop incrementing log batch when joining log transaction (1-2)
       - unlock path before checking if extent is shared during nocow
         writeback (+5.0/-20.5), on fio load +9.7% throughput/-9.8%
         runtime
       - several tree log improvements, eg. removing unnecessary
         operations, fixing races that lead to additional work
         (+12.7/-8.2)

   - tree-checker error branches annotated with unlikely() (+3%
     throughput)

  Other:

   - cleanups

   - lockdep fixes

   - more btrfs_inode conversions

   - error variable cleanups"

* tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
  btrfs: scrub: allow scrub to work with subpage sectorsize
  btrfs: scrub: support subpage data scrub
  btrfs: scrub: support subpage tree block scrub
  btrfs: scrub: always allocate one full page for one sector for RAID56
  btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
  btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
  btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
  btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  btrfs: update num_extent_pages to support subpage sized extent buffer
  btrfs: don't allow tree block to cross page boundary for subpage support
  btrfs: calculate inline extent buffer page size based on page size
  btrfs: factor out btree page submission code to a helper
  btrfs: make btrfs_verify_data_csum follow sector size
  btrfs: pass bio_offset to check_data_csum() directly
  btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
  btrfs: fix lockdep warning when creating free space tree
  btrfs: skip space_cache v1 setup when not using it
  btrfs: remove free space items when disabling space cache v1
  btrfs: warn when remount will not change the free space tree
  btrfs: use superblock state to print space_cache mount option
  ...
2020-12-15 18:40:42 -08:00
Linus Torvalds edd7ab7684 The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic implementation
     which builds the base for the kmap_local() API and make the
     kmap_atomic() interface wrappers which handle the disabling/enabling of
     preemption and pagefaults.
 
   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.
 
   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a mapping
     is established. It has to disable migration instead to guarantee that
     the virtual address of the mapped slot is the same accross preemption.
 
   - Provide better debug facilities: guard pages and enforced utilization
     of the mapping mechanics on 64bit systems when the architecture allows
     it.
 
   - Provide the new kmap_local() API which can now be used to cleanup the
     kmap_atomic() usage sites all over the place. Most of the usage sites
     do not require the implicit disabling of preemption and pagefaults so
     the penalty on 64bit and 32bit non-highmem systems is removed and quite
     some of the code can be simplified. A wholesale conversion is not
     possible because some usage depends on the implicit side effects and
     some need to be cleaned up because they work around these side effects.
 
     The migrate disable side effect is only effective on highmem systems
     and when enforced debugging is enabled. On 64bit and 32bit non-highmem
     systems the overhead is completely avoided.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
 0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
 4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
 p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
 RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
 +x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
 ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
 Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
 XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
 FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
 HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
 +jlfoJhMbtx5Gg==
 =n71I
 -----END PGP SIGNATURE-----

Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull kmap updates from Thomas Gleixner:
 "The new preemtible kmap_local() implementation:

   - Consolidate all kmap_atomic() internals into a generic
     implementation which builds the base for the kmap_local() API and
     make the kmap_atomic() interface wrappers which handle the
     disabling/enabling of preemption and pagefaults.

   - Switch the storage from per-CPU to per task and provide scheduler
     support for clearing mapping when scheduling out and restoring them
     when scheduling back in.

   - Merge the migrate_disable/enable() code, which is also part of the
     scheduler pull request. This was required to make the kmap_local()
     interface available which does not disable preemption when a
     mapping is established. It has to disable migration instead to
     guarantee that the virtual address of the mapped slot is the same
     across preemption.

   - Provide better debug facilities: guard pages and enforced
     utilization of the mapping mechanics on 64bit systems when the
     architecture allows it.

   - Provide the new kmap_local() API which can now be used to cleanup
     the kmap_atomic() usage sites all over the place. Most of the usage
     sites do not require the implicit disabling of preemption and
     pagefaults so the penalty on 64bit and 32bit non-highmem systems is
     removed and quite some of the code can be simplified. A wholesale
     conversion is not possible because some usage depends on the
     implicit side effects and some need to be cleaned up because they
     work around these side effects.

     The migrate disable side effect is only effective on highmem
     systems and when enforced debugging is enabled. On 64bit and 32bit
     non-highmem systems the overhead is completely avoided"

* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  ARM: highmem: Fix cache_is_vivt() reference
  x86/crashdump/32: Simplify copy_oldmem_page()
  io-mapping: Provide iomap_local variant
  mm/highmem: Provide kmap_local*
  sched: highmem: Store local kmaps in task struct
  x86: Support kmap_local() forced debugging
  mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
  microblaze/mm/highmem: Add dropped #ifdef back
  xtensa/mm/highmem: Make generic kmap_atomic() work correctly
  mm/highmem: Take kmap_high_get() properly into account
  highmem: High implementation details and document API
  Documentation/io-mapping: Remove outdated blurb
  io-mapping: Cleanup atomic iomap
  mm/highmem: Remove the old kmap_atomic cruft
  highmem: Get rid of kmap_types.h
  xtensa/mm/highmem: Switch to generic kmap atomic
  sparc/mm/highmem: Switch to generic kmap atomic
  powerpc/mm/highmem: Switch to generic kmap atomic
  nds32/mm/highmem: Switch to generic kmap atomic
  ...
2020-12-14 18:35:53 -08:00
Qu Wenruo b42fe98c92 btrfs: scrub: allow scrub to work with subpage sectorsize
Since btrfs scrub is utilizing its own infrastructure to submit
read/write, scrub is independent from all other routines.

This brings one very neat feature, allow us to read 4K data into offset
0 of a 64K page.  So is the writeback routine.

This makes scrub on subpage sector size much easier to implement, and
thanks to previous commits which just changed the implementation to
always do scrub based on sector size, now scrub can handle subpage
filesystem without any problem.

This patch will just remove the restriction on
(sectorsize != PAGE_SIZE), to make scrub finally work on subpage
filesystems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo b29dca44ab btrfs: scrub: support subpage data scrub
Btrfs scrub is more flexible than buffered data write path, as we can
read an unaligned subpage data into page offset 0.

This ability makes subpage support much easier, we just need to check
each scrub_page::page_len and ensure we only calculate hash for [0,
page_len) of a page.

There is a small thing to notice: for subpage case, we still do sector
by sector scrub.  This means we will submit a read bio for each sector
to scrub, resulting in the same amount of read bios, just like on the 4K
page systems.

This behavior can be considered as a good thing, if we want everything
to be the same as 4K page systems.  But this also means, we're wasting
the possibility to submit larger bio using 64K page size.  This is
another problem to consider in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo 53f3251d3b btrfs: scrub: support subpage tree block scrub
To support subpage tree block scrub, scrub_checksum_tree_block() only
needs to learn 2 new tricks:

- Follow sector size
  Now scrub_page only represents one sector, we need to follow it
  properly.

- Run checksum on all sectors
  Since scrub_page only represents one sector, we need to run checksum
  on all sectors, not only (nodesize >> PAGE_SIZE).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo d0a7a9c050 btrfs: scrub: always allocate one full page for one sector for RAID56
For scrub_pages() and scrub_pages_for_parity(), we currently allocate
one scrub_page structure for one page.

This is fine if we only read/write one sector one time.  But for cases
like scrubbing RAID56, we need to read/write the full stripe, which is
in 64K size for now.

For subpage size, we will submit the read in just one page, which is
normally a good thing, but for RAID56 case, it only expects to see one
sector, not the full stripe in its endio function.
This could lead to wrong parity checksum for RAID56 on subpage.

To make the existing code work well for subpage case, here we take a
shortcut by always allocating a full page for one sector.

This should provide the base to make RAID56 work for subpage case.

The cost is pretty obvious now, for one RAID56 stripe now we always need
16 pages. For support subpage situation (64K page size, 4K sector size),
this means we need full one megabyte to scrub just one RAID56 stripe.

And for data scrub, each 4K sector will also need one 64K page.

This is mostly just a workaround, the proper fix for this is a much
larger project, using scrub_block to replace scrub_page, and allow
scrub_block to handle multi pages, csums, and csum_bitmap to avoid
allocating one page for each sector.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo fa485d21a7 btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
Btrfs on-disk format chose to use u64 for almost everything, but there
are a other restrictions that won't let us use more than u32 for things
like extent length (the maximum length is 128MiB for non-hole extents),
or stripe length (we have device number limit).

This means if we don't have extra handling to convert u64 to u32, we
will always have some questionable operations like
"u32 = u64 >> sectorsize_bits" in the code.

This patch will try to address the problem by reducing the width for the
following members/parameters:

- scrub_parity::stripe_len
- @len of scrub_pages()
- @extent_len of scrub_remap_extent()
- @len of scrub_parity_mark_sectors_error()
- @len of scrub_parity_mark_sectors_data()
- @len of scrub_extent()
- @len of scrub_pages_for_parity()
- @len of scrub_extent_for_parity()

For members extracted from on-disk structure, like map->stripe_len, they
will be kept as is. Since that modification would require on-disk format
change.

There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
sites, extra ASSERT() is added to be extra safe for debug builds.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo 6275193ef1 btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:

- Remove the @file_offset parameter
  There are two factors making the @file_offset parameter useless:

  * For csum lookup in csum tree, file offset makes no sense
    We only need disk_bytenr, which is unrelated to file_offset

  * page_offset (file offset) of each bvec is not contiguous.
    Pages can be added to the same bio as long as their on-disk bytenr
    is contiguous, meaning we could have pages at different file offsets
    in the same bio.

  Thus passing file_offset makes no sense any more.
  The only user of file_offset is for data reloc inode, we will use
  a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into search_csum_tree()
  The new function will handle the csum search in csum tree.
  The return value is the same as btrfs_find_ordered_sum(), returning
  the number of found sectors which have checksum.

- Change how we do the main loop
  The only needed info from bio is:
  * the on-disk bytenr
  * the length

  After extracting the above info, we can do the search without bio
  at all, which makes the main loop much simpler:

	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += count * sectorsize) {

		/* Lookup csum tree */
		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
					 search_len, csum_dst);
		if (!count) {
			/* Csum hole handling */
		}
	}

- Use single variable as the source to calculate all other offsets
  Instead of all different type of variables, we use only one main
  variable, cur_disk_bytenr, which represents the current disk bytenr.

  All involved values can be calculated from that variable, and
  all those variable will only be visible in the inner loop.

The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup.  Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
Qu Wenruo 9e46458a7c btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
The function btrfs_lookup_bio_sums() is only called for read bios.
While btrfs_find_ordered_sum() is to search ordered extent sums, which
is only for write path.

This means to read a page we either:

- Submit read bio if it's not uptodate
  This means we only need to search csum tree for checksums.

- The page is already uptodate
  It can be marked uptodate for previous read, or being marked dirty.
  As we always mark page uptodate for dirty page.
  In that case, we don't need to submit read bio at all, thus no need
  to search any checksums.

Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller for
btrfs_find_ordered_sum(), also remove the implementation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo 884b07d0f4 btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.

Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.

This patch introduces two helpers to do this:

- get_eb_page_index()
  This is to calculate the index to access extent_buffer::pages.
  It's just a simple wrapper around "start >> PAGE_SHIFT".

  For sectorsize == PAGE_SIZE case, nothing is changed.
  For sectorsize < PAGE_SIZE case, we always get index as 0, and
  the existing page shift also works.

- get_eb_offset_in_page()
  This is to calculate the offset to access extent_buffer::pages.
  This needs to take extent_buffer::start into consideration.

  For sectorsize == PAGE_SIZE case, extent_buffer::start is always
  aligned to PAGE_SIZE, thus adding extent_buffer::start to
  offset_in_page() won't change the result.
  For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
  us the correct offset to access.

This patch will touch the following parts to cover all extent buffer
accessors:

- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo 4a3dc93843 btrfs: update num_extent_pages to support subpage sized extent buffer
For subpage sized extent buffer, we have ensured no extent buffer will
cross page boundary, thus we would only need one page for any extent
buffer.

Update function num_extent_pages to handle such case.  Now
num_extent_pages() returns 1 for subpage sized extent buffer.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo 1aaac38c83 btrfs: don't allow tree block to cross page boundary for subpage support
As a preparation for subpage sector size support (allowing filesystem
with sector size smaller than page size to be mounted) if the sector
size is smaller than page size, we don't allow tree block to be read if
it crosses 64K(*) boundary.

The 64K is selected because:

- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size

This ensures that tree blocks are always contained in one page for a
system with 64K page size, which can greatly simplify the handling.

Otherwise we would have to do complex multi-page handling of tree
blocks.  Currently there is no way to create such tree blocks.

In kernel we have avoided such tree blocks allocation even on 4K page
size, as it can lead to RAID56 stripe scrubbing.

While btrfs-progs have fixed its chunk allocator since 2016 for convert,
and has extra checks to do the same behavior as the kernel.

Just add such graceful checks in case of an ancient filesystem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo deb6789553 btrfs: calculate inline extent buffer page size based on page size
Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.

For a system using 64K page size, we would really have just one page.

While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.

Calculate the array size based on page size and the node size maximum.

- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page

Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo f91e0d0c4c btrfs: factor out btree page submission code to a helper
In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.

This patch will extract that part of code into a helper function,
submit_eb_page(), to do the same work.

Since submit_eb_page() now can return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:10 +01:00
Qu Wenruo f44cf41075 btrfs: make btrfs_verify_data_csum follow sector size
Currently btrfs_verify_data_csum() just passes the whole page to
check_data_csum(), which is fine since we only support sectorsize ==
PAGE_SIZE.

To support subpage, we need to properly honor per-sector
checksum verification, just like what we did in dio read path.

This patch will do the csum verification in a for loop, starts with
pg_off == start - page_offset(page), with sectorsize increase for
each loop.

For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
will only loop once.

For subpage case, we do the iterate over each sector and if we found any
error, we return error.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Qu Wenruo 7ffd27e378 btrfs: pass bio_offset to check_data_csum() directly
Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().

Both parameters are calculated values for csum lookup.

Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.

Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.

Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Qu Wenruo 1941b64b08 btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
The parameter bio_offset of extent_submit_bio_start_t is very confusing.
If it's really bio_offset (offset to bio), then it should be u32.  But
in fact, it's only utilized by dio read, and that member is used as file
offset, which must be u64.

Rename it to dio_file_offset since the only user uses it as file offset,
and add comment for who is using it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov 8a6a87cd44 btrfs: fix lockdep warning when creating free space tree
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.

The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.

The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.

Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.

The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.

  [30.480136] ======================================================
  [30.480830] WARNING: possible circular locking dependency detected
  [30.481457] 5.9.0-rc8+ #76 Not tainted
  [30.481897] ------------------------------------------------------
  [30.482500] mount/520 is trying to acquire lock:
  [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.484054]
	      but task is already holding lock:
  [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.485581]
	      which lock already depends on the new lock.

  [30.486397]
	      the existing dependency chain (in reverse order) is:
  [30.487205]
	      -> #2 (btrfs-extent-01#2){++++}-{3:3}:
  [30.487825]        down_read_nested+0x43/0x150
  [30.488306]        __btrfs_tree_read_lock+0x39/0x180
  [30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.489477]        btrfs_search_slot+0x464/0x9b0
  [30.490009]        check_committed_ref+0x59/0x1d0
  [30.490603]        btrfs_cross_ref_exist+0x65/0xb0
  [30.491108]        run_delalloc_nocow+0x405/0x930
  [30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
  [30.492203]        writepage_delalloc+0xd4/0x150
  [30.492688]        __extent_writepage+0x18d/0x3a0
  [30.493199]        extent_write_cache_pages+0x2af/0x450
  [30.493743]        extent_writepages+0x34/0x70
  [30.494231]        do_writepages+0x31/0xd0
  [30.494642]        __filemap_fdatawrite_range+0xad/0xe0
  [30.495194]        btrfs_fdatawrite_range+0x1b/0x50
  [30.495677]        __btrfs_write_out_cache+0x40d/0x460
  [30.496227]        btrfs_write_out_cache+0x8b/0x110
  [30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
  [30.497317]        btrfs_commit_transaction+0xc0/0xba0
  [30.497861]        sync_filesystem+0x71/0x90
  [30.498303]        btrfs_remount+0x81/0x433
  [30.498767]        reconfigure_super+0x9f/0x210
  [30.499261]        path_mount+0x9d1/0xa30
  [30.499722]        do_mount+0x55/0x70
  [30.500158]        __x64_sys_mount+0xc4/0xe0
  [30.500616]        do_syscall_64+0x33/0x40
  [30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.501629]
	      -> #1 (btrfs-root-00){++++}-{3:3}:
  [30.502241]        down_read_nested+0x43/0x150
  [30.502727]        __btrfs_tree_read_lock+0x39/0x180
  [30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.503903]        btrfs_search_slot+0x464/0x9b0
  [30.504405]        btrfs_insert_empty_items+0x60/0xa0
  [30.504973]        btrfs_insert_item+0x60/0xd0
  [30.505412]        btrfs_create_tree+0x1b6/0x210
  [30.505913]        btrfs_create_free_space_tree+0x54/0x110
  [30.506460]        btrfs_mount_rw+0x15d/0x20f
  [30.506937]        btrfs_remount+0x356/0x433
  [30.507369]        reconfigure_super+0x9f/0x210
  [30.507868]        path_mount+0x9d1/0xa30
  [30.508264]        do_mount+0x55/0x70
  [30.508668]        __x64_sys_mount+0xc4/0xe0
  [30.509186]        do_syscall_64+0x33/0x40
  [30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.510271]
	      -> #0 (btrfs-free-space-00){++++}-{3:3}:
  [30.510972]        __lock_acquire+0x11ad/0x1b60
  [30.511432]        lock_acquire+0xa2/0x360
  [30.511917]        down_read_nested+0x43/0x150
  [30.512383]        __btrfs_tree_read_lock+0x39/0x180
  [30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.513455]        btrfs_search_slot+0x464/0x9b0
  [30.513947]        search_free_space_info+0x45/0x90
  [30.514465]        __add_to_free_space_tree+0x92/0x39d
  [30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
  [30.515639]        btrfs_mount_rw+0x15d/0x20f
  [30.516142]        btrfs_remount+0x356/0x433
  [30.516538]        reconfigure_super+0x9f/0x210
  [30.517065]        path_mount+0x9d1/0xa30
  [30.517438]        do_mount+0x55/0x70
  [30.517824]        __x64_sys_mount+0xc4/0xe0
  [30.518293]        do_syscall_64+0x33/0x40
  [30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.519335]
	      other info that might help us debug this:

  [30.520210] Chain exists of:
		btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2

  [30.521407]  Possible unsafe locking scenario:

  [30.522037]        CPU0                    CPU1
  [30.522456]        ----                    ----
  [30.522941]   lock(btrfs-extent-01#2);
  [30.523311]                                lock(btrfs-root-00);
  [30.523952]                                lock(btrfs-extent-01#2);
  [30.524620]   lock(btrfs-free-space-00);
  [30.525068]
	       *** DEADLOCK ***

  [30.525669] 5 locks held by mount/520:
  [30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
  [30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
  [30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
  [30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
  [30.531124]
	      stack backtrace:
  [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
  [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
  [30.533215] Call Trace:
  [30.533452]  dump_stack+0x8d/0xc0
  [30.533797]  check_noncircular+0x13c/0x150
  [30.534233]  __lock_acquire+0x11ad/0x1b60
  [30.534667]  lock_acquire+0xa2/0x360
  [30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
  [30.535525]  down_read_nested+0x43/0x150
  [30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
  [30.536400]  __btrfs_tree_read_lock+0x39/0x180
  [30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
  [30.537304]  btrfs_search_slot+0x464/0x9b0
  [30.537713]  ? trace_hardirqs_on+0x1c/0xf0
  [30.538148]  search_free_space_info+0x45/0x90
  [30.538572]  __add_to_free_space_tree+0x92/0x39d
  [30.539071]  ? printk+0x48/0x4a
  [30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
  [30.539972]  btrfs_mount_rw+0x15d/0x20f
  [30.540350]  btrfs_remount+0x356/0x433
  [30.540773]  ? shrink_dcache_sb+0xd9/0x100
  [30.541203]  reconfigure_super+0x9f/0x210
  [30.541642]  path_mount+0x9d1/0xa30
  [30.542040]  do_mount+0x55/0x70
  [30.542366]  __x64_sys_mount+0xc4/0xe0
  [30.542822]  do_syscall_64+0x33/0x40
  [30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.543691] RIP: 0033:0x7f109f7ab93a
  [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
  [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
  [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
  [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
  [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov af456a2c0a btrfs: skip space_cache v1 setup when not using it
If we are not using space cache v1, we should not create the free space
object or free space inodes. This comes up when we delete the existing
free space objects/inodes when migrating to v2, only to see them get
recreated for every dirtied block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov 36b216c85e btrfs: remove free space items when disabling space cache v1
When the filesystem transitions from space cache v1 to v2 or to
nospace_cache, it removes the old cached data, but does not remove
the FREE_SPACE items nor the free space inodes they point to. This
doesn't cause any issues besides being a bit inefficient, since these
items no longer do anything useful.

To fix it, when we are mounting, and plan to disable the space cache,
destroy each block group's free space item and free space inode.
The code to remove the items is lifted from the existing use case of
removing the block group, with a light adaptation to handle whether or
not we have already looked up the free space inode.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov 2838d255cb btrfs: warn when remount will not change the free space tree
If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
clear the free space tree. This can be surprising, so print a warning
to dmesg to make the failure more visible. It is also important to
ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
consistent, so ensure those are set to properly match the current on
disk state (which won't be changing).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov 04c4155969 btrfs: use superblock state to print space_cache mount option
To make the contents of /proc/mounts better match the actual state of
the filesystem, base the display of the space cache mount options off
the contents of the super block rather than the last mount options
passed in. Since there are many scenarios where the mount will ignore a
space cache option, simply showing the passed in option is misleading.

For example, if we mount with -o remount,space_cache=v2 on a read-write
file system without an existing free space tree, we won't build a free
space tree, but /proc/mounts will read space_cache=v2 (until we mount
again and it goes away)

cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
space_cache=v2, and if neither is the case, we print nospace_cache.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov 9484622945 btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.

In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.

We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).

Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov 8b228324a8 btrfs: clear free space tree on ro->rw remount
A user might want to revert to v1 or nospace_cache on a root filesystem,
and much like turning on the free space tree, that can only be done
remounting from ro->rw. Support clearing the free space tree on such
mounts by moving it into the shared remount logic.

Since the CLEAR_CACHE option sticks around across remounts, this change
would result in clearing the tree for ever on every remount, which is
not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
clear at mount end, which has the other bonus of not cluttering the
/proc/mounts output with clear_cache.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov 8cd2908846 btrfs: clear oneshot options on mount and remount
Some options only apply during mount time and are cleared at the end
of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
fits the bill, and this is a preparation patch for also clearing that
option.

One subtlety is that the current code only resets USEBACKUPROOT on rw
mounts, but the option is meaningfully "consumed" by a ro mount, so it
feels appropriate to clear in that case as well. A subsequent read-write
remount would not go through open_ctree, which is the only place that
checks the option, so the change should be benign.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov 5011139a47 btrfs: create free space tree on ro->rw remount
When a user attempts to remount a btrfs filesystem with
'mount -o remount,space_cache=v2', that operation silently succeeds.
Unfortunately, this is misleading, because the remount does not create
the free space tree. /proc/mounts will incorrectly show space_cache=v2,
but on the next mount, the file system will revert to the old
space_cache.

For now, we handle only the easier case, where the existing mount is
read-only and the new mount is read-write. In that case, we can create
the free space tree without contending with the block groups changing
as we go.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Boris Burkov 997e3e2e71 btrfs: only mark bg->needs_free_space if free space tree is on
If we attempt to create a free space tree while any block groups have
needs_free_space set, we will double add the new free space item
and hit EEXIST. Previously, we only created the free space tree on a new
mount, so we never hit the case, but if we try to create it on a
remount, such block groups could exist and trip us up.

We don't do anything with this field unless the free space tree is
enabled, so there is no harm in not setting it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Boris Burkov 8f1c21d749 btrfs: start orphan cleanup on ro->rw remount
When we mount a rw filesystem, we start the orphan cleanup process in
tree root and filesystem tree. However, when we remount a ro file system
rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
on tree_root and fs_root to the shared rw mount routine to effectively
add them on ro->rw remount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Boris Burkov 44c0ca211a btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly setup rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.

However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.

Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.

This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Filipe Manana 47876f7cef btrfs: do not block inode logging for so long during transaction commit
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.

This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).

Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:

1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers

They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.

So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

The following script that uses dbench was used to measure the impact of
the whole patchset:

  $ cat test-dbench.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f -m single -d single $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).

Before patch set:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    11277211    0.250    85.340
 Close        8283172     0.002     6.479
 Rename        477515     1.935    86.026
 Unlink       2277936     0.770    87.071
 Deltree          256    15.732    81.379
 Mkdir            128     0.003     0.009
 Qpathinfo    10221180    0.056    44.404
 Qfileinfo    1789967     0.002     4.066
 Qfsinfo      1874399     0.003     9.176
 Sfileinfo     918589     0.061    10.247
 Find         3951758     0.341    54.040
 WriteX       5616547     0.047    85.079
 ReadX        17676028    0.005     9.704
 LockX          36704     0.003     1.800
 UnlockX        36704     0.002     0.687
 Flush         790541    14.115   676.236

Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms

After patch set:

Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    12687926    0.171    86.526
 Close        9320780     0.002     8.063
 Rename        537253     1.444    78.576
 Unlink       2561827     0.559    87.228
 Deltree          374    11.499    73.549
 Mkdir            187     0.003     0.005
 Qpathinfo    11500300    0.061    36.801
 Qfileinfo    2017118     0.002     7.189
 Qfsinfo      2108641     0.003     4.825
 Sfileinfo    1033574     0.008     8.065
 Find         4446553     0.408    47.835
 WriteX       6335667     0.045    84.388
 ReadX        19887312    0.003     9.215
 LockX          41312     0.003     1.394
 UnlockX        41312     0.002     1.425
 Flush         889233    13.014   623.259

Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms

+12.7% throughput, -8.2% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Filipe Manana 639bd575b7 btrfs: fix race leading to unnecessary transaction commit when logging inode
When logging an inode we may often have to fallback to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit or some error like an
ENOMEM happened. However after we fallback to a transaction commit, we
have a time window where we can make the next attempt to log any inode
commit the next transaction unnecessarily, adding additional overhead and
increasing latency.

A sequence of steps that leads to this issue is the following:

1) The current open transaction has a generation of 1000;

2) A new block group is allocated, and as a consequence we must make sure
   any attempts to commit a log fallback to a transaction commit, so
   btrfs_set_log_full_commit() is called from btrfs_make_block_group().
   This sets fs_info->last_trans_log_full_commit to 1000;

3) Task A is holding a handle on transaction 1000 and tries to log inode X.
   Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
   which returns true, since fs_info->last_trans_log_full_commit has a
   value of 1000. So we end up returning EAGAIN and propagating it up to
   btrfs_sync_file(), where we commit transaction 1000;

4) The transaction commit task (task A) sets the transaction state to
   unblocked (TRANS_STATE_UNBLOCKED);

5) Some other task, task B, starts a new transaction with a generation of
   1001;

6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;

7) Transaction 1000 has not fully committed yet, we are still writing all
   the extent buffers it created;

8) Some new task, task C, starts an fsync of inode Y, gets a handle for
   transaction 1001, and it gets to btrfs_log_inode_parent() which does
   the following check:

     if (fs_info->last_trans_log_full_commit > last_committed) {
         ret = 1;
         goto end_no_trans;
     }

   At that point last_trans_log_full_commit has a value of 1000 and
   last_committed (value of fs_info->last_trans_committed) has a value of
   999, since transaction 1000 has not yet committed - it is either still
   writing out dirty extent buffers, its super blocks or unpinning
   extents.

   As a consequence we return 1, which gets propagated up to
   btrfs_sync_file(), which will then call btrfs_commit_transaction()
   for transaction 1001.

   As a consequence we have an unnecessary second transaction commit, we
   previously committed transaction 1000 and now commit transaction 1001
   as well, resulting in more overhead and increased latency.

So fix this double transaction commit issue simply by removing that check,
because all we need to do is wait for the previous transaction to finish
its commit, which we already do later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex lock, which
is held by a transaction commit and only released after the transaction
commits its super blocks.

Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().

This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Filipe Manana 47d3db41e1 btrfs: fix race that makes inode logging fallback to transaction commit
When logging an inode and the previous transaction is still committing, we
have a time window where we can end up incorrectly think an inode has its
last_unlink_trans field with a value greater than the last transaction
committed, which results in the logging to fallback to a full transaction
commit, which is usually much more expensive than doing a log commit.

The race is described by the following steps:

1) We are at transaction 1000;

2) We modify an inode X (a directory) using transaction 1000 and set its
   last_unlink_trans field to 1000, because for example we removed one
   of its subdirectories;

3) We create a new inode Y with a dentry in inode X using transaction 1000,
   so its generation field is set to 1000;

4) The commit for transaction 1000 is started by task A;

5) The task committing transaction 1000 sets the transaction state to
   unblocked, writes the dirty extent buffers and the super blocks, then
   unlocks tree_log_mutex;

6) Some task starts a new transaction with a generation of 1001;

7) We do some modification to inode Y (using transaction 1001);

8) The transaction 1000 commit starts unpinning extents. At this point
   fs_info->last_trans_committed still has a value of 999;

9) Task B starts an fsync on inode Y, and gets a handle for transaction
   1001. When it gets to check_parent_dirs_for_sync() it does the checking
   of the ancestor dentries because the following check does not evaluate
   to true:

       if (S_ISREG(inode->vfs_inode.i_mode) &&
           inode->generation <= last_committed &&
           inode->last_unlink_trans <= last_committed)
               goto out;

   The generation value for inode Y is 1000 and last_committed, which has
   the value read from fs_info->last_trans_committed, has a value of 999,
   so that check evaluates to false and we proceed to check the ancestor
   inodes.

   Once we get to the first ancestor, inode X, we call
   btrfs_must_commit_transaction() on it, which evaluates to true:

   static bool btrfs_must_commit_transaction(...)
   {
       struct btrfs_fs_info *fs_info = inode->root->fs_info;
       bool ret = false;

       mutex_lock(&inode->log_mutex);
       if (inode->last_unlink_trans > fs_info->last_trans_committed) {
           /*
            * Make sure any commits to the log are forced to be full
            * commits.
            */
            btrfs_set_log_full_commit(trans);
            ret = true;
       }
    (...)

    because inode's X last_unlink_trans has a value of 1000 and
    fs_info->last_trans_committed still has a value of 999, it returns
    true to check_parent_dirs_for_sync(), making it return 1 which is
    propagated up to btrfs_sync_file(), causing it to fallback to a full
    transaction commit of transaction 1001.

    We should have not fallen back to commit transaction 1001, since inode
    X had last_unlink_trans set to 1000 and the super blocks for
    transaction 1000 were already written. So while not resulting in a
    functional problem, it leads to a lot more work and higher latencies
    for a fsync since committing a transaction is usually more expensive
    than committing a log (if other filesystem changes happened under that
    transaction).

Similar problem happens when logging directories, for the same reason as
btrfs_must_commit_transaction() returns true on an inode with its
last_unlink_trans having the generation of the previous transaction and
that transaction is still committing, unpinning its freed extents.

So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.

This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the field
last_unlink_trans of an inode) and fsyncs of files and directories.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
Filipe Manana 4d6221d7d8 btrfs: fix race that causes unnecessary logging of ancestor inodes
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.

The race is described by the following steps:

1) We are at transaction 1000;

2) Directory inode X is created, its generation is set to 1000;

3) The commit for transaction 1000 is started by task A;

4) The task committing transaction 1000 sets the transaction state to
   unblocked, writes the dirty extent buffers and the super blocks, then
   unlocks tree_log_mutex;

5) Inode Y, a regular file, is created under directory inode X, this
   results in starting a new transaction with a generation of 1001;

6) The transaction 1000 commit is unpinning extents. At this point
   fs_info->last_trans_committed still has a value of 999;

7) Task B calls fsync on inode Y and gets a handle for transaction 1001;

8) Task B ends up at log_all_new_ancestors() and then because inode Y has
   only one hard link, ends up at log_new_ancestors_fast(). There it reads
   a value of 999 from fs_info->last_trans_committed, and sees that the
   parent inode X has a generation of 1000, so we end up logging inode X:

     if (inode->generation > fs_info->last_trans_committed) {
         ret = btrfs_log_inode(trans, root, inode,
                               LOG_INODE_EXISTS, ctx);
         (...)

   which is not necessary since it was created in the past transaction,
   with a generation of 1000, and that transaction has already committed
   its super blocks - it's still unpinning extents so it has not yet
   updated fs_info->last_trans_committed from 999 to 1000.

   So this just causes us to spend more time logging and allocating and
   writing more tree blocks for the log tree.

So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction than we know it is a
new inode we need to log, otherwise don't log it.

This case is often hit when running dbench for a long enough duration.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
Filipe Manana 5f96bfb763 btrfs: fix race that results in logging old extents during a fast fsync
When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.

The following sequence of steps shows how this can happen:

1) We are at transaction 1000;

2) An ordered extent E from inode I completes, that is it has gone through
   btrfs_finish_ordered_io(), and it set the extent maps' generation to
   1000 when we unpin the extent, which is the generation of the current
   transaction;

3) The commit for transaction 1000 starts by task A;

4) The task committing transaction 1000 sets the transaction state to
   unblocked, writes the dirty extent buffers and the super blocks, then
   unlocks tree_log_mutex;

5) Some change is made to inode I, resulting in creation of a new
   transaction with a generation of 1001;

6) The transaction 1000 commit starts unpinning extents. At this point
   fs_info->last_trans_committed still has a value of 999;

7) Task B starts an fsync on inode I, and when it gets to
   btrfs_log_changed_extents() sees the extent map for extent E in the
   list of modified extents. It sees the extent map has a generation of
   1000 and fs_info->last_trans_committed has a value of 999, so it
   proceeds to logging the respective file extent item and all the
   checksums covering its range.

   So we end up wasting time since the extent was already persisted and
   is reachable through the trees pointed to by the super block committed
   by transaction 1000.

So just fix this by comparing the extent maps generation against the
generation of the transaction handle - if it is smaller then the id in the
handle, we know the extent was already persisted and we do not need to log
it.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
Filipe Manana de53d892e5 btrfs: fix race causing unnecessary inode logging during link and rename
When we are doing a rename or a link operation for an inode that was logged
in the previous transaction and that transaction is still committing, we
have a time window where we incorrectly consider that the inode was logged
previously in the current transaction and therefore decide to log it to
update it in the log. The following steps give an example on how this
happens during a link operation:

1) Inode X is logged in transaction 1000, so its logged_trans field is set
   to 1000;

2) Task A starts to commit transaction 1000;

3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;

4) Task B starts a link operation for inode X, and as a consequence it
   starts transaction 1001;

5) Task A is still committing transaction 1000, therefore the value stored
   at fs_info->last_trans_committed is still 999;

6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
   fs_info->last_trans_committed and because the logged_trans field of
   inode X has a value of 1000, the function does not return immediately,
   instead it proceeds to logging the inode, which should not happen
   because the inode was logged in the previous transaction (1000) and
   not in the current one (1001).

This is not a functional problem, just wasted time and space logging an
inode that does not need to be logged, contributing to higher latency
for link and rename operations.

So fix this by comparing the inodes' logged_trans field with the
generation of the current transaction instead of comparing with the value
stored in fs_info->last_trans_committed.

This case is often hit when running dbench for a long enough duration, as
it does lots of rename operations.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
David Sterba fa598b0696 btrfs: remove recalc_thresholds from free space ops
After removing the inode number cache that was using the free space
cache code, we can remove at least the recalc_thresholds callback from
the ops. Both code and tests use the same callback function. It's moved
before its first use.

The use_bitmaps callback is still needed by tests to create some
extents/bitmap setup.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
Nikolay Borisov f0d1219def btrfs: always set NODATASUM/NODATACOW in __create_free_space_inode
Since it's being used solely for the freespace cache unconditionally
set the flags required for it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:06 +01:00
Nikolay Borisov 7dbdb443a7 btrfs: remove crc_check logic from free space
Following removal of the ino cache io_ctl_init will be called only on
behalf of the freespace inode. In this case we always want to check
CRCs so conditional code that depended on io_ctl::check_crc can be
removed.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Nikolay Borisov 5297199a8b btrfs: remove inode number cache feature
It's been deprecated since commit b547a88ea5 ("btrfs: start
deprecation of mount option inode_cache") which enumerates the reasons.

A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps, that data stay on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the change can be implemented
in btrfs-progs if needed.

[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Nikolay Borisov abadc1fcd7 btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid
The former is going away as part of the inode map removal so switch
callers to btrfs_find_free_objectid. No functional changes since with
INODE_MAP disabled (default) find_free_objectid was called anyway.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Nikolay Borisov ec7d6dfd73 btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c
Those functions are going to be used even after inode cache is removed
so moved them to a more appropriate place.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
David Sterba 1201b58b67 btrfs: drop casts of bio bi_sector
Since commit 72deb455b5 ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs so we don't need to
typecast it.  It used to be unsigned long and the result of sector size
shifts were not guaranteed to fit in the type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Naohiro Aota 12659251ca btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones.  However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today.  So, the number of superblock and
copies is limited to be two.  Second downside is that we cannot support
devices which have no conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.

We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs.

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
  next to it

If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:04 +01:00