flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().
This remove this unnecessary bio->bi_opf setting.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.
Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull btrfs updates from David Sterba:
"The changes range through all types: cleanups, core chagnes, sanity
checks, fixes, other user visible changes, detailed list below:
- deprecated: user transaction ioctl
- mount option ssd does not change allocation alignments
- degraded read-write mount is allowed if all the raid profile
constraints are met, now based on more accurate check
- defrag: do not reset compression afterwards; the NOCOMPRESS flag
can be now overriden by defrag
- prep work for better extent reference tracking (related to the
qgroup slowness with balance)
- prep work for compression heuristics
- memory allocation reductions (may help latencies on a loaded
system)
- better accounting for io waiting states
- error handling improvements (removed BUGs)
- added more sanity checks for shared refs
- fix readdir vs pagefault deadlock under some circumstances
- fix for 'no-hole' mode, certain combination of compressed and
inline extents
- send: fix emission of invalid clone operations
- fixup file mode if setting acls fail
- more fixes from fuzzing
- oher cleanups"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
btrfs: submit superblock io with REQ_META and REQ_PRIO
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
btrfs: pass fs_info to btrfs_del_root instead of tree_root
Btrfs: add one more sanity check for shared ref type
Btrfs: remove BUG_ON in __add_tree_block
Btrfs: remove BUG() in add_data_reference
Btrfs: remove BUG() in print_extent_item
Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Btrfs: convert to use btrfs_get_extent_inline_ref_type
Btrfs: add a helper to retrive extent inline ref type
btrfs: scrub: simplify scrub worker initialization
btrfs: scrub: clean up division in scrub_find_csum
btrfs: scrub: clean up division in __scrub_mark_bitmap
btrfs: scrub: use bool for flush_all_writes
btrfs: preserve i_mode if __btrfs_set_acl() fails
btrfs: Remove extraneous chunk_objectid variable
btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs: Remove extra parentheses from condition in copy_items()
...
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.
if (start < eb->len) and (start + len > eb->len),
then
a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,
b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).
The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.
It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.
This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.
Signed-off-by: David Sterba <dsterba@suse.com>
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.
btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3ab
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit
btrfs_is_parity_mirror's mirror_num - same as above
chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c ("Btrfs:
devid subset filter") and never used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.
No impact on code, but serves as documentation that a buffer is intended
not to be modified.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull btrfs fixes from David Sterba:
"We've identified and fixed a silent corruption (introduced by code in
the first pull), a fixup after the blk_status_t merge and two fixes to
incremental send that Filipe has been hunting for some time"
* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unexpected return value of bio_readpage_error
btrfs: btrfs_create_repair_bio never fails, skip error handling
btrfs: cloned bios must not be iterated by bio_for_each_segment_all
Btrfs: fix write corruption due to bio cloning on raid5/6
Btrfs: incremental send, fix invalid memory access
Btrfs: incremental send, fix invalid path for link commands
With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.
This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.
[1] https://patchwork.kernel.org/patch/9838535/
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull percpu updates from Tejun Heo:
"These are the percpu changes for the v4.13-rc1 merge window. There are
a couple visibility related changes - tracepoints and allocator stats
through debugfs, along with __ro_after_init markings and a cosmetic
rename in percpu_counter.
Please note that the simple O(#elements_in_the_chunk) area allocator
used by percpu allocator is again showing scalability issues,
primarily with bpf allocating and freeing large number of counters.
Dennis is working on the replacement allocator and the percpu
allocator will be seeing increased churns in the coming cycles"
* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix static checker warnings in pcpu_destroy_chunk
percpu: fix early calls for spinlock in pcpu_stats
percpu: resolve err may not be initialized in pcpu_alloc
percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
percpu: add tracepoint support for percpu memory
percpu: expose statistics about percpu memory via debugfs
percpu: migrate percpu data structures to internal header
percpu: add missing lockdep_assert_held to func pcpu_free_area
mark most percpu globals as __ro_after_init
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
Commit 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.
However such warning doesn't take the following case into consideration:
0 4K 8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|
In this case, the whole 0~8K is cached, and since it's larger than
fiemap range, it break the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and caught by the
final fiemap extent sanity check, causing kernel warning.
This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.
Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph pointed out that bio allocations backed by a bioset will never
fail. As we always use a bioset for all bio allocations, we can skip
the error handling. This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.
CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5f39d397df ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.
Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.
Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it. This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead pass around the failure tree and the io tree.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.
Detected by CoverityScan, CID#1414312 ("Logically dead code")
Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.
And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.
And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.
So by this method, we can merge all fiemap extents.
It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that scrub can fix data errors with the help of parity for raid56
profile, repair during read is able to as well.
Although the mirror num in raid56 scenario has different meanings, i.e.
0 or 1: read data directly
> 1: do recover with parity,
it could be fit into how we repair bad block during read.
The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.
Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have similar code here and there, this merges them into a helper.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
The bug is a regression after commit
(da2c7009f6 "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8 "Btrfs: use helper to simplify lock/unlock pages").
So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed
-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------
This fixes the bug by returning the proper -EAGAIN.
Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.
Signed-off-by: David Sterba <dsterba@suse.com>
We know that eadpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined so we can drop the checks before we call them.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no error path in any of the instances, always return 0.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'tree' was used to call locking hook that does not exist anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (a91326679f) and page is not
needed anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This used to hold number of maximum pages to allocate, but this is now
limited by BIO_MAX_PAGES. The local are now unused and removed as well.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can embed range_changed to the extent changeset to address following
problems:
- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data
The stack consuption where extent_changeset is used slightly increases:
before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40
Which is bearable.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.
__extent_writepage_io, that is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this will cause the hang at filemap_fdatawait_range,
because it waits the writeback bit to be cleared from the failed page.
So, we have to call end_page_writeback to clear the writeback bit.
For reproducing the hang, we inject a fault like
if (should_failtest()) { // I define should_failtest()
bio = NULL;
}
else {
bio = btrfs_bio_alloc(...);
}
in submit_extent_page.
We should also check whether page has the bit before end_page_writeback,
to avoid the conflict against the other end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io,
but also in write_one_eb too, because it misses the check.
Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
This introduces a new helper which can be used to process pages bits.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@cached_state is no more required in __extent_writepage_io, also remove
the goto label.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
try_release_extent_state reduces the gfp mask to GFP_NOFS if it is
compatible. This is true for GFP_KERNEL as well. There is no real
reason to do that though. There is no new lock taken down the
the only consumer of the gfp mask which is
try_release_extent_state
clear_extent_bit
__clear_extent_bit
alloc_extent_state
So this seems just unnecessary and confusing.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
b335b0034e ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
to prevent from giving an unappropriate gfp mask to the slab allocator
deeper down the callchain (in alloc_extent_state). This is wrong for
two reasons a) GFP_NOFS might be just too restrictive for the calling
context b) it is better to tweak the gfp mask down when it needs that.
So just remove the mask tweaking from btrfs_releasepage and move it
down to alloc_extent_state where it is needed.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
here, and Christoph pitched in corrections/improvements to make btrfs
use proper helpers for bio walking instead of doing it by hand.
There are some key fixes as well, including some long standing bugs
that took forever to track down in btrfs_drop_extents and during
balance"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
btrfs: limit async_work allocation and worker func duration
Revert "Btrfs: adjust len of writes if following a preallocated extent"
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs: opencode chunk locking, remove helpers
btrfs: remove root parameter from transaction commit/end routines
btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs: take an fs_info directly when the root is not used otherwise
btrfs: simplify btrfs_wait_cache_io prototype
btrfs: convert extent-tree tracepoints to use fs_info
btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
btrfs: root->fs_info cleanup, add fs_info convenience variables
btrfs: root->fs_info cleanup, update_block_group{,flags}
btrfs: root->fs_info cleanup, lock/unlock_chunks
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
btrfs: pull node/sector/stripe sizes out of root and into fs_info
btrfs: root->fs_info cleanup, io_ctl_init
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
btrfs: struct reada_control.root -> reada_control.fs_info
btrfs: struct btrfsic_state->root should be an fs_info
btrfs: alloc_reserved_file_extent trace point should use extent_root
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using copy_extent_buffer is suitable for copying betwenn buffers from an
arbitrary offset and deals with page boundaries. This is not necessary
when doing a full extent_buffer-to-extent_buffer copy. We can utilize
the copy_page helper as well.
Signed-off-by: David Sterba <dsterba@suse.com>
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
The fsid and chunk tree uuid are always located in the first page,
we don't need the to use write_extent_buffer.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate. The cast doesn't
matter in this case, the code works as intended. It causes a static
checker warning though so let's remove it.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
bitmaps in memory and copy them directly to/from the extent buffers with
{read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
granularity, which is equivalent to a little-endian bitmap. This means
that on big-endian systems, the in-memory bitmaps will be written to
disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
memory.
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".
This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch converts printk(KERN_* style messages to use the pr_* versions.
One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
This patch unsplits user-visible strings.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.
So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node. One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.
This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nobody uses this, it makes no sense to do partial reads of extent buffers.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.
This should reduce conflict of both patchset and the effort to rebase
them.
Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
3) doesn't count them for eb->io_pages, but they are later
cleared by 2) so we has to readpage on the page, we get
the wrong eb->io_pages which results in a memory leak of
this block.
This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.
t1(readahead) t2(readahead endio) t3(the following read)
read_extent_buffer_pages end_bio_extent_readpage
for pg in eb: for page 0,1,2 in eb:
if pg is uptodate: btree_readpage_end_io_hook(pg)
num_reads++ if uptodate:
eb->io_pages = num_reads SetPageUptodate(pg) _______________
for pg in eb: for page 3 in eb: read_extent_buffer_pages
if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb:
__extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate:
clear_extent_buffer_uptodate(eb) num_reads++
for pg in eb: eb->io_pages = num_reads
ClearPageUptodate(page) _______________
for pg in eb:
if pg is NOT uptodate:
__extent_read_full_page(pg)
So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.
First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull block fixes from Jens Axboe:
"Here's the second round of block updates for this merge window.
It's a mix of fixes for changes that went in previously in this round,
and fixes in general. This pull request contains:
- Fixes for loop from Christoph
- A bdi vs gendisk lifetime fix from Dan, worth two cookies.
- A blk-mq timeout fix, when on frozen queues. From Gabriel.
- Writeback fix from Jan, ensuring that __writeback_single_inode()
does the right thing.
- Fix for bio->bi_rw usage in f2fs from me.
- Error path deadlock fix in blk-mq sysfs registration from me.
- Floppy O_ACCMODE fix from Jiri.
- Fix to the new bio op methods from Mike.
One more followup will be coming here, ensuring that we don't
propagate the block types outside of block. That, and a rename of
bio->bi_rw is coming right after -rc1 is cut.
- Various little fixes"
* 'for-linus' of git://git.kernel.dk/linux-block:
mm/block: convert rw_page users to bio op use
loop: make do_req_filebacked more robust
loop: don't try to use AIO for discards
blk-mq: fix deadlock in blk_mq_register_disk() error path
Include: blkdev: Removed duplicate 'struct request;' declaration.
Fixup direct bi_rw modifiers
block: fix bdi vs gendisk lifetime mismatch
blk-mq: Allow timeouts to run while queue is freezing
nbd: fix race in ioctl
block: fix use-after-free in seq file
f2fs: drop bio->bi_rw manual assignment
block: add missing group association in bio-cloning functions
blkcg: kill unused field nr_undestroyed_grps
writeback: Write dirty times for WB_SYNC_ALL writeback
floppy: fix open(O_ACCMODE) for ioctl-only open
Pull more btrfs updates from Chris Mason:
"This is part two of my btrfs pull, which is some cleanups and a batch
of fixes.
Most of the code here is from Jeff Mahoney, making the pointers we
pass around internally more consistent and less confusing overall. I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
Btrfs: fix __MAX_CSUM_ITEMS
btrfs: btrfs_abort_transaction, drop root parameter
btrfs: add btrfs_trans_handle->fs_info pointer
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
btrfs: convert nodesize macros to static inlines
btrfs: introduce BTRFS_MAX_ITEM_SIZE
btrfs: cleanup, remove prototype for btrfs_find_root_ref
btrfs: copy_to_sk drop unused root parameter
btrfs: simpilify btrfs_subvol_inherit_props
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
btrfs: tests, require fs_info for root
btrfs: tests, move initialization into tests/
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs: prefix fsid to all trace events
btrfs: plumb fs_info into btrfs_work
btrfs: remove obsolete part of comment in statfs
btrfs: hide test-only member under ifdef
btrfs: Ratelimit "no csum found" info message
btrfs: Add ratelimit to btrfs printing
Btrfs: fix unexpected balance crash due to BUG_ON
...
bi_rw should be using bio_set_op_attrs to set bi_rw.
Signed-off-by: Shaun Tancheff <shaun@tancheff.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.
Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.
This commit adds the missing association in bio-cloning functions.
Fixes: da2f0f74cf ("Btrfs: add support for blkio controllers")
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.
This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well. The same applies to read_cache_pages.
ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
eb->io_pages is set in read_extent_buffer_pages().
In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.
When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio,
and ends up with a memory leak eventually.
This lets __do_readpage propagate errors to callers and adds the
'atomic_dec(&eb->io_pages)'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(),
thus this aims to stop anyone to panic the whole system by using
their btrfs.
Since the error in merge_bio can only come from __btrfs_map_block()
when chunk tree mapping has something insane and __btrfs_map_block()
has already had printed the reason, we can just return errors in
merge_bio.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.
To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.
The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.
Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.
This adds a warning to let us quickly know what was happening.
Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch has btrfs's submit_one_bio users set the bio op using
bio_set_op_attrs and get the op using bio_op.
The next patches will continue to convert btrfs,
so submit_bio_hook and merge_bio_hook
related code will be modified to take only the bio. I did
not do it in this patch to try and keep it smaller.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:
btrfs_lookup_file_extent ->
btrfs_search_slot(..., ins_len=0, cow=0)
and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.
This may confuse applications using ioctl(FIEL_IOC_FIEMAP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It seems to be long time unused, since 2008 and
6885f308b5 ("Btrfs: Misc 2.6.25 updates").
Propagating the removal touches some code but has no functional effect.
Signed-off-by: David Sterba <dsterba@suse.com>
Single caller passes GFP_NOFS. We can get rid of the
gfpflags_allow_blocking checks as NOFS can block but does not recurse to
filesystem through reclaim.
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to __clear_extent_bit, do not fail if the state preallocation
fails as we might not need it. One less BUG_ON.
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.
In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.
For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.
To avoid it, we just need to continue writeback from the last written page.
This also updates btrfs to behave like what write_cache_pages() does, ie, bail
out immediately if there is an error in writepage().
<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:
fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae7 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. The branch
in end_bio_extent_writepage has been skipped since
5fd0204355 ("Btrfs: finish ordered extents in their own thread").
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the clear_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
decreases:
text data bss dec hex filename
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko.before
939651 43670 23144 1006465 f5b81 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the set_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
increases:
text data bss dec hex filename
938427 43670 23144 1005241 f56b9 fs/btrfs/btrfs.ko.before
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko
Signed-off-by: David Sterba <dsterba@suse.com>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce new function clear_record_extent_bits(), which will clear bits
for given range and record the details about which ranges are cleared
and how many bytes in total it changes.
This provides the basis for later qgroup reserve codes.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce new function set_record_extent_bits(), which will not only set
given bits, but also record how many bytes are changed, and detailed
range info.
This is quite important for later qgroup reserve framework.
The number of bytes will be used to do qgroup reserve, and detailed
range info will be used to cleanup for EQUOT case.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.
Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.
Fix this by leaving the unlock exclusively to the caller.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs fixes from Chris Mason:
"These are small and assorted. Neil's is the oldest, I dropped the
ball thinking he was going to send it in"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: support NFSv2 export
Btrfs: open_ctree: Fix possible memory leak
Btrfs: fix deadlock when finalizing block group creation
Btrfs: update fix for read corruption of compressed and shared extents
Btrfs: send, fix corner case for reference overwrite detection
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.
Signed-off-by: David Sterba <dsterba@suse.com>
My previous fix in commit 005efedf2c ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create our test file with a single extent of 64Kb that is going to
# be compressed no matter which compression algo is used (zlib/lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone the compressed extent into an adjacent file offset.
$CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
echo "File digest before unmount:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Remount the fs or clear the page cache to trigger the bug in
# btrfs. Because the extent has an uncompressed length that is a
# multiple of 16 pages, all the pages belonging to the second range
# of the file (64K to 128K), which points to the same extent as the
# first range (0K to 64K), had their contents full of zeroes instead
# of the byte 0xaa. This was a bug exclusively in the read path of
# compressed extents, the correct data was stored on disk, btrfs
# just failed to fill in the pages correctly.
_scratch_remount
echo "File digest after remount:"
# Must match the digest we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
Pull btrfs fixes from Chris Mason:
"This is an assorted set I've been queuing up:
Jeff Mahoney tracked down a tricky one where we ended up starting IO
on the wrong mapping for special files in btrfs_evict_inode. A few
people reported this one on the list.
Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion. Chandan killed off an accounting bug during DIO
that lead to WARN_ONs as we freed inodes"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: keep dropped roots in cache until transaction commit
Btrfs: Direct I/O: Fix space accounting
btrfs: skip waiting on ordered range for special files
Btrfs: fix read corruption of compressed and shared extents
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
Btrfs: don't initialize a space info as full to prevent ENOSPC
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Pull btrfs updates from Chris Mason:
"This has Jeff Mahoney's long standing trim patch that fixes corners
where trims were missing. Omar has some raid5/6 fixes, especially for
using scrub and device replace when devices are missing.
Zhao Lie continues cleaning and fixing things, this series fixes some
really hard to hit corners in xfstests. I had to pull it last merge
window due to some deadlocks, but those are now resolved.
I added support for Tejun's new blkio controllers. It seems to work
well for single devices, we'll expand to multi-device as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
btrfs: fix compile when block cgroups are not enabled
Btrfs: fix file read corruption after extent cloning and fsync
Btrfs: check if previous transaction aborted to avoid fs corruption
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
btrfs: Prevent from early transaction abort
btrfs: Remove unused arguments in tree-log.c
btrfs: Remove useless condition in start_log_trans()
Btrfs: add support for blkio controllers
Btrfs: remove unused mutex from struct 'btrfs_fs_info'
Btrfs: fix parity scrub of RAID 5/6 with missing device
Btrfs: fix device replace of a missing RAID 5/6 device
Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
Btrfs: count devices correctly in readahead during RAID 5/6 replace
Btrfs: remove misleading handling of missing device scrub
btrfs: fix clone / extent-same deadlocks
Btrfs: fix defrag to merge tail file extent
Btrfs: fix warning in backref walking
btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
btrfs: Remove root argument in extent_data_ref_count()
btrfs: Fix wrong comment of btrfs_alloc_tree_block()
...
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them. It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.
The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.
Signed-off-by: Chris Mason <clm@fb.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.
Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.
Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.
The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull btrfs updates from Chris Mason:
"Outside of our usual batch of fixes, this integrates the subvolume
quota updates that Qu Wenruo from Fujitsu has been working on for a
few releases now. He gets an extra gold star for making btrfs smaller
this time, and fixing a number of quota corners in the process.
Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal
to btrfs and it's mostly cleanups and fixes. Anand also attached some
of our sysfs objects to our internal device management structs instead
of an object off the super block. It will make device management
easier overall and it's a better fit for how the sysfs files are used.
None of the existing sysfs files are moved around.
Thanks for all the fixes everyone"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
Btrfs: Check if kobject is initialized before put
lib: export symbol kobject_move()
Btrfs: sysfs: add support to show replacing target in the sysfs
Btrfs: free the stale device
Btrfs: use received_uuid of parent during send
Btrfs: fix use-after-free in btrfs_replay_log
btrfs: wait for delayed iputs on no space
btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
btrfs: ulist: Add ulist_del() function.
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
btrfs: qgroup: Switch rescan to new mechanism.
btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
btrfs: qgroup: Add new function to record old_roots.
btrfs: qgroup: Record possible quota-related extent for qgroup.
btrfs: qgroup: Add function qgroup_update_counters().
...
Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
optimizations and cleanups, mixed with various fixes. In more detail,
this contains:
- Addition of policy specific data to blkcg for block cgroups. From
Arianna Avanzini.
- Various cleanups around command types from Christoph.
- Cleanup of the suspend block I/O path from Christoph.
- Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
- Eliminating atomic inc/dec of both remaining IO count and reference
count in a bio. From me.
- Fixes for SG gap and chunk size support for data-less (discards)
IO, so we can merge these better. From me.
- Small restructuring of blk-mq shared tag support, freeing drivers
from iterating hardware queues. From Keith Busch.
- A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
IOPS mode the default for non-rotational storage"
* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
cfq-iosched: move group scheduling functions under ifdef
cfq-iosched: fix the setting of IOPS mode on SSDs
blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
block, cgroup: implement policy-specific per-blkcg data
block: Make CFQ default to IOPS mode on SSDs
block: add blk_set_queue_dying() to blkdev.h
blk-mq: Shared tag enhancements
block: don't honor chunk sizes for data-less IO
block: only honor SG gap prevention for merges that contain data
block: fix returnvar.cocci warnings
block, dm: don't copy bios for request clones
block: remove management of bi_remaining when restoring original bi_end_io
block: replace trylock with mutex_lock in blkdev_reread_part()
block: export blkdev_reread_part() and __blkdev_reread_part()
suspend: simplify block I/O handling
block: collapse bio bit space
block: remove unused BIO_RW_BLOCK and BIO_EOF flags
block: remove BIO_EOPNOTSUPP
...
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
through free_io_failure(), we weren't waking up any tasks waiting for the
extent's state EXTENT_LOCKED bit, leading to an hang.
So make sure clear_extent_bits() ends up waking up any waiters if the
bit EXTENT_LOCKED is supplied by its callers.
Zygo Blaxell was experiencing such hangs at inode eviction time after
file unlinks. Thanks to him for a set of scripts to reproduce the issue.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us it the following BUG_ON at
btrfs_release_extent_buffer_page:
BUG_ON(extent_buffer_under_io(eb))
The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.
Here follows a sequence of steps that leads to the BUG_ON.
CPU 0 CPU 1 CPU 2
path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X
btrfs_release_path(path)
unlocks X
reads eb X
X->refs incremented to 3
locks eb X
btrfs_del_items(X)
X becomes empty
clean_tree_block(X)
clear EXTENT_BUFFER_DIRTY from X
btrfs_del_leaf(X)
unlocks X
extent_buffer_get(X)
X->refs incremented to 4
btrfs_free_tree_block(X)
X's range is not pinned
X's range added to free
space cache
free_extent_buffer_stale(X)
lock X->refs_lock
set EXTENT_BUFFER_STALE on X
release_extent_buffer(X)
X->refs decremented to 3
unlocks X->refs_lock
btrfs_release_path()
unlocks X
free_extent_buffer(X)
X->refs becomes 2
__btrfs_cow_block(Y)
btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
gets offset == X->start
btrfs_init_new_buffer(X->start)
btrfs_find_create_tree_block(X->start)
alloc_extent_buffer(X->start)
find_extent_buffer(X->start)
finds eb X in radix tree
free_extent_buffer(X)
lock X->refs_lock
test X->refs == 2
test bit EXTENT_BUFFER_STALE is set
test !extent_buffer_under_io(eb)
increments X->refs to 3
mark_extent_buffer_accessed(X)
check_buffer_tree_ref(X)
--> does nothing,
X->refs >= 2 and
EXTENT_BUFFER_TREE_REF
is set in X
clear EXTENT_BUFFER_STALE from X
locks X
btrfs_mark_buffer_dirty()
set_extent_buffer_dirty(X)
check_buffer_tree_ref(X)
--> does nothing, X->refs >= 2 and
EXTENT_BUFFER_TREE_REF is set
sets EXTENT_BUFFER_DIRTY on X
test and clear EXTENT_BUFFER_TREE_REF
decrements X->refs to 2
release_extent_buffer(X)
decrements X->refs to 1
unlock X->refs_lock
unlock X
free_extent_buffer(X)
lock X->refs_lock
release_extent_buffer(X)
decrements X->refs to 0
btrfs_release_extent_buffer_page(X)
BUG_ON(extent_buffer_under_io(X))
--> EXTENT_BUFFER_DIRTY set on X
Fix this by making find_extent buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A more clean alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.
A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:
[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):
while true; do
mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
snapshot_cmd="btrfs subvolume snapshot -r /mnt"
snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
umount /mnt
done
Which usually triggers the BUG_ON within less than 24 hours:
[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_release_extent_buffer_page() can't handle dummy extent that
allocated by btrfs_clone_extent_buffer() properly. That is because
reference count of pages that allocated by btrfs_clone_extent_buffer()
was 2, 1 by alloc_page(), and another by attach_extent_buffer_page().
Running following command repeatly can check this memory leak problem
btrfs inspect-internal inode-resolve 256 /mnt/btrfs
Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Consider the following interleaving of overlapping calls to
alloc_extent_buffer:
Call 1:
- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages
Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
mark_extent_buffer_accessed on it
mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.
The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.
Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs updates from Chris Mason:
"This pull is mostly cleanups and fixes:
- The raid5/6 cleanups from Zhao Lei fixup some long standing warts
in the code and add improvements on top of the scrubbing support
from 3.19.
- Josef has round one of our ENOSPC fixes coming from large btrfs
clusters here at FB.
- Dave Sterba continues a long series of cleanups (thanks Dave), and
Filipe continues hammering on corner cases in fsync and others
This all was held up a little trying to track down a use-after-free in
btrfs raid5/6. It's not clear yet if this is just made easier to
trigger with this pull or if its a new bug from the raid5/6 cleanups.
Dave Sterba is the only one to trigger it so far, but he has a
consistent way to reproduce, so we'll get it nailed shortly"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
Btrfs: don't remove extents and xattrs when logging new names
Btrfs: fix fsync data loss after adding hard link to inode
Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Btrfs: account for large extents with enospc
Btrfs: don't set and clear delalloc for O_DIRECT writes
Btrfs: only adjust outstanding_extents when we do a short write
btrfs: Fix out-of-space bug
Btrfs: scrub, fix sleep in atomic context
Btrfs: fix scheduler warning when syncing log
Btrfs: Remove unnecessary placeholder in btrfs_err_code
btrfs: cleanup init for list in free-space-cache
btrfs: delete chunk allocation attemp when setting block group ro
btrfs: clear bio reference after submit_one_bio()
Btrfs: fix scrub race leading to use-after-free
Btrfs: add missing cleanup on sysfs init failure
Btrfs: fix race between transaction commit and empty block group removal
btrfs: add more checks to btrfs_read_sys_array
btrfs: cleanup, rename a few variables in btrfs_read_sys_array
btrfs: add checks for sys_chunk_array sizes
btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
...
On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Helper account_page_redirty() fixes dirty pages counter for redirtied
pages. This patch puts it after dirtying and prevents temporary
underflows of dirtied pages counters on zone/bdi and current->nr_dirtied.
Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
forced data type checking in compile.
Changelog v1->v2:
Rename following by David Sterba's suggestion:
put_btrfs_bio() -> btrfs_put_bio()
get_btrfs_bio() -> btrfs_get_bio()
bbio->ref_count -> bbio->refs
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.
The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.
struct extent_state {
u64 start; /* 0 8 */
u64 end; /* 8 8 */
struct rb_node rb_node; /* 16 24 */
wait_queue_head_t wq; /* 40 24 */
/* --- cacheline 1 boundary (64 bytes) --- */
atomic_t refs; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int state; /* 72 8 */
u64 private; /* 80 8 */
/* size: 88, cachelines: 2, members: 7 */
/* sum members: 84, holes: 1, sum holes: 4 */
/* last cacheline: 24 bytes */
};
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Suppress the following warning displayed on building 32bit (i686) kernel.
===============================================================================
...
CC [M] fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
failrec = (struct io_failure_record *)state->private;
...
===============================================================================
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>
Make the extent buffer allocation interface consistent. Cloned eb will
set a valid fs_info. For dummy eb, we can drop the length parameter and
set it from fs_info.
The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.
Signed-off-by: David Sterba <dsterba@suse.cz>
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is actually inspired by Filipe's patch. When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
the corrupted data is corrected, so the failure record can be removed. And if
some errors happen on the mirrors, we also needn't worry about it because the
failure record will be recreated if we read the same place again.
Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch implement data repair function when direct read fails.
The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.
But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared. This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you. So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have. This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We've defined a 'offset' out of bio_for_each_segment_all.
This is just a clean rename, no function changes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.
On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"The biggest of these comes from Liu Bo, who tracked down a hang we've
been hitting since moving to kernel workqueues (it's a btrfs bug, not
in the generic code). His patch needs backporting to 3.16 and 3.15
stable, which I'll send once this is in.
Otherwise these are assorted fixes. Most were integrated last week
during KS, but I wanted to give everyone the chance to test the
result, so I waited for rc2 to come out before sending"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix task hang under heavy compressed write
Btrfs: fix filemap_flush call in btrfs_file_release
Btrfs: fix crash on endio of reading corrupted block
btrfs: fix leak in qgroup_subtree_accounting() error path
btrfs: Use right extent length when inserting overlap extent map.
Btrfs: clone, don't create invalid hole extent map
Btrfs: don't monopolize a core when evicting inode
Btrfs: fix hole detection during file fsync
Btrfs: ensure tmpfile inode is always persisted with link count of 0
Btrfs: race free update of commit root for ro snapshots
Btrfs: fix regression of btrfs device replace
Btrfs: don't consider the missing device when allocating new chunks
Btrfs: Fix wrong device size when we are resizing the device
Btrfs: don't write any data into a readonly device when scrub
Btrfs: Fix the problem that the replace destroys the seed filesystem
btrfs: Return right extent when fiemap gives unaligned offset and len.
Btrfs: fix wrong extent mapping for DirectIO
Btrfs: fix wrong write range for filemap_fdatawrite_range()
Btrfs: fix wrong missing device counter decrease
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
...
The crash is
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
This is in fact a regression.
It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.
The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more btrfs updates from Chris Mason:
"This has a few fixes since our last pull and a new ioctl for doing
btree searches from userland. It's very similar to the existing
ioctl, but lets us return larger items back down to the app"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix error handling in create_pending_snapshot
btrfs: fix use of uninit "ret" in end_extent_writepage()
btrfs: free ulist in qgroup_shared_accounting() error path
Btrfs: fix qgroups sanity test crash or hang
btrfs: prevent RCU warning when dereferencing radix tree slot
Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
btrfs: new ioctl TREE_SEARCH_V2
btrfs: tree_search, search_ioctl: direct copy to userspace
btrfs: new function read_extent_buffer_to_user
btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
btrfs: tree_search, search_ioctl: accept varying buffer
btrfs: tree_search: eliminate redundant nr_items check
If this condition in end_extent_writepage() is false:
if (tree->ops && tree->ops->writepage_end_io_hook)
we will then test an uninitialized "ret" at:
ret = ret < 0 ? ret : -EIO;
The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.
Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.
Signed-of-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
This new function reads the content of an extent directly to user memory.
Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
Pull btrfs updates from Chris Mason:
"The biggest change here is Josef's rework of the btrfs quota
accounting, which improves the in-memory tracking of delayed extent
operations.
I had been working on Btrfs stack usage for a while, mostly because it
had become impossible to do long stress runs with slab, lockdep and
pagealloc debugging turned on without blowing the stack. Even though
you upgraded us to a nice king sized stack, I kept most of the
patches.
We also have some very hard to find corruption fixes, an awesome sysfs
use after free, and the usual assortment of optimizations, cleanups
and other fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
Btrfs: convert smp_mb__{before,after}_clear_bit
Btrfs: fix scrub_print_warning to handle skinny metadata extents
Btrfs: make fsync work after cloning into a file
Btrfs: use right type to get real comparison
Btrfs: don't check nodes for extent items
Btrfs: don't release invalid page in btrfs_page_exists_in_range()
Btrfs: make sure we retry if page is a retriable exception
Btrfs: make sure we retry if we couldn't get the page
btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Btrfs: fix leaf corruption after __btrfs_drop_extents
Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
btrfs: free delayed node outside of root->inode_lock
btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
Btrfs: fix transaction leak during fsync call
btrfs: Avoid trucating page or punching hole in a already existed hole.
Btrfs: update commit root on snapshot creation after orphan cleanup
Btrfs: ioctl, don't re-lock extent range when not necessary
Btrfs: avoid visiting all extent items when cloning a range
...
__extent_writepage has two unrelated parts. First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.
This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.
Signed-off-by: Chris Mason <clm@fb.com>
This adds noinline_for_stack to two helpers used by
btree_write_cache_pages. It shaves us down from 424 bytes on the
stack to 280.
Signed-off-by: Chris Mason <clm@fb.com>
We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:
[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
[ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)
It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.
This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:
mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
fsstress -p 6 -d /mnt -n 100000 -x \
"btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
-f allocsp=0 \
-f bulkstat=0 \
-f bulkstat1=0 \
-f chown=0 \
-f creat=1 \
-f dread=0 \
-f dwrite=0 \
-f fallocate=1 \
-f fdatasync=0 \
-f fiemap=0 \
-f freesp=0 \
-f fsync=0 \
-f getattr=0 \
-f getdents=0 \
-f link=0 \
-f mkdir=0 \
-f mknod=0 \
-f punch=1 \
-f read=0 \
-f readlink=0 \
-f rename=0 \
-f resvsp=0 \
-f rmdir=0 \
-f setxattr=0 \
-f stat=0 \
-f symlink=0 \
-f sync=0 \
-f truncate=1 \
-f unlink=0 \
-f unresvsp=0 \
-f write=4
So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.
Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs changes from Chris Mason:
"This is a pretty long stream of bug fixes and performance fixes.
Qu Wenruo has replaced the btrfs async threads with regular kernel
workqueues. We'll keep an eye out for performance differences, but
it's nice to be using more generic code for this.
We still have some corruption fixes and other patches coming in for
the merge window, but this batch is tested and ready to go"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
Btrfs: fix a crash of clone with inline extents's split
btrfs: fix uninit variable warning
Btrfs: take into account total references when doing backref lookup
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: remove unnecessary inode generation lookup in send
Btrfs: fix race when updating existing ref head
btrfs: Add trace for btrfs_workqueue alloc/destroy
Btrfs: less fs tree lock contention when using autodefrag
Btrfs: return EPERM when deleting a default subvolume
Btrfs: add missing kfree in btrfs_destroy_workqueue
Btrfs: cache extent states in defrag code path
Btrfs: fix deadlock with nested trans handles
Btrfs: fix possible empty list access when flushing the delalloc inodes
Btrfs: split the global ordered extents mutex
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
Btrfs: reclaim delalloc metadata more aggressively
Btrfs: remove unnecessary lock in may_commit_transaction()
Btrfs: remove the unnecessary flush when preparing the pages
Btrfs: just do dirty page flush for the inode with compression before direct IO
...
When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.
Jeff Mahoney kicked in code to export filesystem info into sysfs.
Otherwise, lots of performance improvements, cleanups and bug fixes.
Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers. With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal. This will give me a way to do that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.
This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:
Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000
51.000 - 93.933: 693 ########
93.933 - 172.314: 938 ##########
172.314 - 315.408: 856 #########
315.408 - 576.646: 95 #
576.646 - 6415.830: 888 ##########
6415.830 - 11713.809: 1024 ###########
11713.809 - 21386.000: 4816 #####################################################
So traversing such trees can take some significant time that can
easily be avoided.
Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
Before this change:
sequential writes: 69.28Mb/sec (average of 5 runs)
random writes: 4.14Mb/sec (average of 5 runs)
After this change:
sequential writes: 69.91Mb/sec (average of 5 runs)
random writes: 5.69Mb/sec (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>