Previously, update_inode_page is not called under f2fs_lock_op.
Instead we should call with f2fs_write_inode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we can reuse nids as many as possible, we can mitigate producing obsolete
node pages in the page cache.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If node blocks were already moved, we don't need to move them again.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As the below comment of bio_alloc_bioset, f2fs can allocate multiple bios at the
same time. So, we can't guarantee that bio is allocated all the time.
"
* When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
* able to allocate a bio. This is due to the mempool guarantees. To make this
* work, callers must never allocate more than 1 bio at a time from this pool.
* Callers that need to allocate more than 1 bio must always submit the
* previously allocated bio for IO before attempting to allocate a new one.
* Failure to do so can cause deadlocks under memory pressure.
"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch increases the number of maximum hard links for one file.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should avoid needless checkpoints when there is no dirty and prefree segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces __count_free_nids/try_to_free_nids and registers
them in slab shrinker for shrinking under memory pressure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_delete_entry, if last dirent is remove from the dentry page,
we will try to punch that page since it has no valid date in it.
But truncate_hole which is used for punching could fail because of
no memory or IO error, if that happened, we'd better skip clearing
this valid dentry page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not write node pages when deleting orphan inodes.
In order to do that, we can eaisly set POR_DOING flag earlier before entering
orphan inode routine.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With CIFS_DEBUG_2 enabled, additional debug information is tracked inside each
mid_q_entry struct, however cifs_save_when_sent may use the mid_q_entry after it
has been freed from the appropriate callback if the transport layer has very low
latency. Holding the srv_mutex fixes this use-after-free, as cifs_save_when_sent
is called while the srv_mutex is held while the request is sent.
Signed-off-by: Christopher Oo <t-chriso@microsoft.com>
The server exports information about the share and underlying
device under an SMB3 export, including its attributes and
capabilities, which is stored by cifs.ko when first connecting
to the share.
Add ioctl to cifs.ko to allow user space smb3 helper utilities
(in cifs-utils) to display this (e.g. via smb3util).
This information is also useful for debugging and for
resolving configuration errors.
Signed-off-by: Steve French <steve.french@primarydata.com>
When read-write mount of a filesystem is requested but we find out we
can mount the filesystem only in read-only mode, we still modify
LVID in udf_close_lvid(). That is both unnecessary and contrary to
expectation that when we fall back to read-only mount we don't modify
the filesystem.
Make sure we call udf_close_lvid() only if we called udf_open_lvid() so
that filesystem gets modified only if we verified we are allowed to
write to it.
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Jan Kara <jack@suse.com>
Unlike the previous attempt, this takes into account the fact that
we may be calling it from the recovery thread itself. Detect this
by looking at what kind of open we're doing, and checking the state
of the NFS_DELEGATION_NEED_RECLAIM if it turns out we're doing a
reboot reclaim-type open.
Cc: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix CONFIG_LOCKDEP=n build, because asserts I put in to ensure we
aren't overrunning lockdep subclasses in commit 0952c81 ("xfs:
clean up inode lockdep annotations") use a define that doesn't
exist when CONFIG_LOCKDEP=n
Only check the subclass limits when lockdep is actually enabled.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.
Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_cloner
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with a single 100K extent starting at file
# offset 800K. We fsync the file here to make the fsync log tree gets
# a single csum item that covers the whole 100K extent, which causes
# the second fsync, done after the cloning operation below, to not
# leave in the log tree two csum items covering two sub-ranges
# ([0, 20K[ and [20K, 100K[)) of our extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone part of our extent into file offset 400K. This adds a file
# extent item to our inode's metadata that points to the 100K extent
# we created before, using a data offset of 20K and a data length of
# 20K, so that it refers to the sub-range [20K, 40K[ of our original
# extent.
$CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
-l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Now fsync our file to make sure the extent cloning is durably
# persisted. This fsync will not add a second csum item to the log
# tree containing the checksums for the blocks in the sub-range
# [20K, 40K[ of our extent, because there was already a csum item in
# the log tree covering the whole extent, added by the first fsync
# we did before.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Silently drop all writes and ummount to simulate a crash/power
# failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
# The fsync log replay first processes the file extent item
# corresponding to the file offset 400K (the one which refers to the
# [20K, 40K[ sub-range of our 100K extent) and then processes the file
# extent item for file offset 800K. It used to happen that when
# processing the later, it erroneously left in the csum tree 2 csum
# items that overlapped each other, 1 for the sub-range [20K, 40K[ and
# 1 for the whole range of our extent. This introduced a problem where
# subsequent lookups for the checksums of blocks within the range
# [40K, 100K[ of our extent would not find anything because lookups in
# the csum tree ended up looking only at the smaller csum item, the
# one covering the subrange [20K, 40K[. This made read requests assume
# an expected checksum with a value of 0 for those blocks, which caused
# checksum verification failure when the read operations finished.
# However those checksum failure did not result in read requests
# returning an error to user space (like -EIO for e.g.) because the
# expected checksum value had the special value 0, and in that case
# btrfs set all bytes of the corresponding pages with the value 0x01
# and produce the following warning in dmesg/syslog:
#
# "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
# 1322675045 expected csum 0"
#
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File digest after log replay:"
# Must match the same digest he had after cloning the extent and
# before the power failure happened.
md5sum $SCRATCH_MNT/foo | _filter_scratch
_unmount_flakey
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leafs that
were never durably persisted.
The following sequence diagram exemplifies how this is possible:
CPU 0 CPU 1
transaction N starts
(...)
btrfs_commit_transaction(N)
cur_trans->state = TRANS_STATE_COMMIT_START;
(...)
cur_trans->state = TRANS_STATE_COMMIT_DOING;
(...)
cur_trans->state = TRANS_STATE_UNBLOCKED;
root->fs_info->running_transaction = NULL;
btrfs_start_transaction()
--> starts transaction N + 1
btrfs_write_and_wait_transaction(trans, root);
--> starts writing all new or COWed ebs created
at transaction N
creates some new ebs, COWs some
existing ebs but doesn't COW or
deletes eb X
btrfs_commit_transaction(N + 1)
(...)
cur_trans->state = TRANS_STATE_COMMIT_START;
(...)
wait_for_commit(root, prev_trans);
--> prev_trans == transaction N
btrfs_write_and_wait_transaction() continues
writing ebs
--> fails writing eb X, we abort transaction N
and set bit BTRFS_FS_STATE_ERROR on
fs_info->fs_state, so no new transactions
can start after setting that bit
cleanup_transaction()
btrfs_cleanup_one_transaction()
wakes up task at CPU 1
continues, doesn't abort because
cur_trans->aborted (transaction N + 1)
is zero, and no checks for bit
BTRFS_FS_STATE_ERROR in fs_info->fs_state
are made
btrfs_write_and_wait_transaction(trans, root);
--> succeeds, no errors during writeback
write_ctree_super(trans, root, 0);
--> succeeds
--> we have now a superblock that points us
to some root that uses eb X, which was
never written to disk
In this scenario future attempts to read eb X from disk results in an
error message like "parent transid verify failed on X wanted Y found Z".
So fix this by aborting the current transaction if after waiting for the
previous transaction we verify that it was aborted.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
transaction but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard to not fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
still fail if the _current_ process is the OOM killer victim. Moreover
there is an attempt to move away from the default no-fail behavior and
allow these allocation to fail more eagerly. This would lead to:
[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Following arguments are not used in tree-log.c:
insert_one_name(): path, type
wait_log_commit(): trans
wait_for_writer(): trans
This patch remove them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
for start_log_trans():
fs/btrfs/tree-log.c:178 start_log_trans()
warn: we tested 'root->log_root' before and it was 'false'
fs/btrfs/tree-log.c
147 if (root->log_root) {
We test "root->log_root" here.
...
Reason:
Condition of:
fs/btrfs/tree-log.c:178: if (!root->log_root) {
is not necessary after commit: 7237f1833
It caused a smatch warning, and no functionally error.
Fix:
Deleting above condition will make smatch shut up,
but a better way is to do cleanup for start_log_trans()
to remove duplicated code and make code more readable.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Otherwise we break fstest case tests/read_write/mctime.t
Does files layout need the same fix as well?
Cc: stable@vger.kernel.org # v4.0+
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If layout is marked by NFS_LAYOUT_RETURN_BEFORE_CLOSE, we should always
send LAYOUTRETURN before close, and we don't need to do ROC drain if we
do send LAYOUTRETURN.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This reverts commit 4e379d36c0.
This commit opens up a race between the recovery code and the open code.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Cc: stable@vger.kernel # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we have an OPEN_DOWNGRADE and CLOSE race with one another, we want
to ensure that the layout is forgotten by the client, so that we
start afresh with a new layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The helper pnfs_roc() has already verified that we have no delegations,
and no further open files, hence no outstanding I/O and it has marked
all the return-on-close lsegs as being invalid.
Furthermore, it sets the NFS_LAYOUT_RETURN bit, thus serialising the
close/delegreturn with all future layoutget calls on this inode.
The checks in pnfs_roc_drain() for valid layout segments are therefore
redundant: those cannot exist until another layoutget completes.
The other check for whether or not NFS_LAYOUT_RETURN is set, actually
causes a hang, since we already know that we hold that flag.
To fix, we therefore strip out all the functionality in pnfs_roc_drain()
except the retrieval of the barrier state, and then rename the function
accordingly.
Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 5c4a79fb2b ("Don't prevent layoutgets when doing return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Filesystems are responsible to manage file coherency between the page
cache and direct I/O. The generic dio code flushes dirty pages over the
range of a dio to ensure that the dio read or a future buffered read
returns the correct data. XFS has generally followed this pattern,
though traditionally has flushed and invalidated the range from the
start of the I/O all the way to the end of the file. This changed after
the following commit:
7d4ea3ce xfs: use ranged writeback and invalidation for direct IO
... as the full file flush was no longer necessary to deal with the
strange post-eof delalloc issues that were since fixed. Unfortunately,
we have since received complaints about performance degradation due to
the increased exclusive iolock cycles (which locks out parallel dio
submission) that occur when a file has cached pages. This does not occur
on filesystems that use the generic code as it also does not incorporate
locking.
The exclusive iolock is acquired any time the inode mapping has cached
pages, regardless of whether they reside in the range of the I/O or not.
If not, the flush/inval calls do no work and the lock was cycled for no
reason.
Under consideration of the cost of the exclusive iolock, update the dio
read and write handlers to flush and invalidate the entire mapping when
cached pages exist. In most cases, this increases the cost of the
initial flush sequence but eliminates the need for further lock cycles
and flushes so long as the workload does not actively mix direct and
buffered I/O. This also more closely matches historical behavior and
performance characteristics that users have come to expect.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
struct xfs_attr_leafblock contains 'entries' array which is declared
with size 1 altough it can in fact contain much more entries. Since this
array is followed by further struct members, gcc (at least in version
4.8.3) thinks that the array has the fixed size of 1 element and thus
may optimize away all accesses beyond the end of array resulting in
non-working code. This problem was only observed with userspace code in
xfsprogs, however it's better to be safe in kernel as well and have
matching kernel and xfsprogs definitions.
cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In the dir3 data block readahead function, use the regular read
verifier to check the block's CRC and spot-check the block contents
instead of directly calling only the spot-checking routine. This
prevents corrupted directory data blocks from being read into the
kernel, which can lead to garbage ls output and directory loops (if
say one of the entries contains slashes and other junk).
cc: <stable@vger.kernel.org> # 3.12 - 4.2
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
probably the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.
The locking order for regular inodes w.r.t. page faults is io_lock
-> pagefault -> mmap_sem -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning to deeper in the page fault path and triggered on security
inode locks. Fixing the shmem issue there just moved the lockdep
reports somewhere else, and now we are getting false positives from
filesystem freezing annotations getting confused.
Further, if we enter memory reclaim in a readdir path, we now get
lockdep warning about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the ilock
for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.
Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisaton"
mechanism. Hence we need to hold the IOLOCK in the correct places to
provide this low level directory data access serialisation.
The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correct excluded while
modifications are in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Lockdep annotations are a maintenance nightmare. Locking has to be
modified to suit the limitations of the annotations, and we're
always having to fix the annotations because they are unable to
express the complexity of locking heirarchies correctly.
So, next up, we've got more issues with lockdep annotations for
inode locking w.r.t. XFS_LOCK_PARENT:
- lockdep classes are exclusive and can't be ORed together
to form new classes.
- IOLOCK needs multiple PARENT subclasses to express the
changes needed for the readdir locking rework needed to
stop the endless flow of lockdep false positives involving
readdir calling filldir under the ILOCK.
- there are only 8 unique lockdep subclasses available,
so we can't create a generic solution.
IOWs we need to treat the 3-bit space available to each lock type
differently:
- IOLOCK uses xfs_lock_two_inodes(), so needs:
- at least 2 IOLOCK subclasses
- at least 2 IOLOCK_PARENT subclasses
- MMAPLOCK uses xfs_lock_two_inodes(), so needs:
- at least 2 MMAPLOCK subclasses
- ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
- at least 5 ILOCK subclasses
- one ILOCK_PARENT subclass
- one RTBITMAP subclass
- one RTSUM subclass
For the IOLOCK, split the space into two sets of subclasses.
For the MMAPLOCK, just use half the space for the one subclass to
match the non-parent lock classes of the IOLOCK.
For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
remaining individual subclasses.
Because they are now all different, modify xfs_lock_inumorder() to
handle the nested subclasses, and to assert fail if passed an
invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
if an invalid combination of lock primitives and inode counts are
passed that would result in a lockdep subclass annotation overflow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The node directory lookup code uses a state structure that tracks the
path of buffers used to search for the hash of a filename through the
leaf blocks. When the lookup encounters a block that ends with the
requested hash, but the entry has not yet been found, it must shift over
to the next block and continue looking for the entry (i.e., duplicate
hashes could continue over into the next block). This shift mechanism
involves walking back up and down the state structure, replacing buffers
at the appropriate btree levels as necessary.
When a buffer is replaced, the old buffer is released and the new buffer
read into the active slot in the path structure. Because the buffer is
read directly into the path slot, a buffer read failure can result in
setting a NULL buffer pointer in an active slot. This throws off the
state cleanup code in xfs_dir2_node_lookup(), which expects to release a
buffer from each active slot. Instead, a BUG occurs due to a NULL
pointer dereference:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
RIP: 0010:[<ffffffffa0585063>] [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs]
...
Call Trace:
[<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs]
[<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs]
[<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs]
[<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs]
[<ffffffff8122de8d>] lookup_real+0x1d/0x50
[<ffffffff8123330e>] path_openat+0x91e/0x1490
[<ffffffff81235079>] do_filp_open+0x89/0x100
...
This has been reproduced via a parallel fsstress and filesystem shutdown
workload in a loop. The shutdown triggers the read error in the
aforementioned codepath and causes the BUG in xfs_dir2_node_lookup().
Update xfs_da3_path_shift() to update the active path slot atomically
with respect to the caller when a buffer is replaced. This ensures that
the caller always sees the old or new buffer in the slot and prevents
the NULL pointer dereference.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The sparse inodes feature is currently considered experimental. We warn
at mount time from xfs_mount_validate_sb(). This function is part of the
superblock verifier codepath, however, which means it could be invoked
repeatedly on superblock reads or writes. This is currently only
noticeable from userspace, where mkfs produces multiple warnings at
format time.
As mkfs warnings were not the intent of this change, relocate the mount
time warning to xfs_fs_fill_super(), which is only invoked once and only
in kernel space.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Once the sb_uuid is changed, the wrong uuid is stamped into new
dquots on disk. Found by inspection, verified by generic/219.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that sb_uuid can be changed by the user, we cannot use this to
validate the metadata blocks being recovered belong to this
filesystem. We must check against the sb_meta_uuid as that will
remain unchanged.
There is a complication in this code - the superblock itself. We can
not check the sb_meta_uuid unconditionally, as that may not be set
on disk. Hence we must verify the superblock sb_uuid matches between
the log record and the in-core superblock.
Found by inspection after the previous two problems were found.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Adding this simple change to xfstests:common/rc::_scratch_mkfs_xfs:
+ if [ $mkfs_status -eq 0 ]; then
+ xfs_admin -U generate $SCRATCH_DEV > /dev/null
+ fi
triggers all sorts of errors in xfstests. xfs/104 is an example,
where growfs fails with a UUID mismatch corruption detected by
xfs_agf_write_verify() when trying to write the first new AG
headers.
Fix this problem by making sure we copy the sb_meta_uuid into new
metadata written by growfs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
After changing the UUID on a v5 filesystem, xfstests fails
immediately on a debug kernel with:
XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799
This needs to check against the sb_meta_uuid, not the user visible
UUID that was changed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
It's entirely possible for userspace to ask for an xattr which
does not exist.
Normally, there is no problem whatsoever when we ask for such
a thing, but when we look at an obfuscated metadump image
on a debug kernel with selinux, we trip over this ASSERT in
xfs_da3_path_shift():
*result = -ENOENT; /* we're out of our tree */
ASSERT(args->op_flags & XFS_DA_OP_OKNOENT);
It (more or less) only shows up in the above scenario, because
xfs_metadump obfuscates attr names, but chooses names which
keep the same hash value - and xfs_da3_node_lookup_int does:
if (((retval == -ENOENT) || (retval == -ENOATTR)) &&
(blk->hashval == args->hashval)) {
error = xfs_da3_path_shift(state, &state->path, 1, 1,
&retval);
IOWS, we only get down to the xfs_da3_path_shift() ASSERT
if we are looking for an xattr which doesn't exist, but we
find xattrs on disk which have the same hash, and so might be
a hash collision, so we try the path shift. When *that*
fails to find what we're looking for, we hit the assert about
XFS_DA_OP_OKNOENT.
Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this
rather corner-case problem with no ill side effects. It's
fine for an attr name lookup to fail.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If a failure occurs after the bmap free list is populated and before
xfs_bmap_finish() completes successfully (which returns a partial
list on failure), the bmap free list must be cancelled. Otherwise,
the extent items on the list are never freed and a memory leak
occurs.
Several random error paths throughout the code suffer this problem.
Fix these up such that xfs_bmap_cancel() is always called on error.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Several areas of code duplicate a pattern where we take the AIL lock,
check whether an item is in the AIL and remove it if so. Create a new
helper for this pattern and use it where appropriate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
The btree cursor cleanup function takes an error parameter that
affects how buffers are released from the cursor. All buffers are
released in the event of error. Several callers do not specify the
XFS_BTREE_ERROR flag in the event of error, however. This can cause
buffers to hang around locked or with an elevated hold count and
thus lead to umount hangs in the event of errors.
Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if
the cursor is being torn down due to error.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The root inode is read as part of the xfs_mountfs() sequence and the
reference is dropped in the event of failure after we grab the
inode. The reference drop doesn't necessarily free the inode,
however. It marks it for reclaim and potentially kicks off the
reclaim workqueue. The workqueue is destroyed further up the error
path, which means we are subject to crash if the workqueue job runs
after this point or a memory leak which is identified if the
xfs_inode_zone is destroyed (e.g., on module removal). Both of these
outcomes are reproducible via manual instrumentation of a mount
error after the root inode xfs_iget() call in xfs_mountfs().
Update the xfs_mountfs() error path to cancel any potential reclaim
work items and to run a synchronous inode reclaim if the root inode
is marked for reclaim. This ensures that no jobs remain on the queue
before it is destroyed and that the root inode is freed before the
reclaim mechanism is torn down.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The first 4 bytes of every basic block in the physical log is stamped
with the current lsn. To support this mechanism, the log record header
(first block of each new log record) contains space for the original
first byte of each log record block before it is replaced with the lsn.
The log record header has space for 32k worth of blocks. The version 2
log adds new extended record headers for each additional 32k worth of
blocks beyond what is supported by the record header.
The log record checksum incorporates the log record header, the extended
headers and the record payload. xlog_cksum() checksums the extended
headers based on log->l_iclog_heads, which specifies the number of
extended headers in a log record based on the log buffer size mount
option. The log buffer size is variable, however, and thus means the
checksum can be calculated differently based on how a filesystem is
mounted. This is problematic if a filesystem crashes and recovery occurs
on a subsequent mount using a different log buffer size. For example,
crash an active filesystem that is mounted with the default (32k)
logbsize, attempt remount/recovery using '-o logbsize=64k' and the mount
fails on or warns about log checksum failures.
To avoid this problem, update xlog_cksum() to calculate the checksum
based on the size of the log buffer according to the log record. The
size is already included in the h_size field of the log record header
and thus is available at log recovery time. Extended log record headers
are also only written when the log record is large enough to require
them. This makes checksum calculation of log records consistent with the
extended record header mechanism as well as how on-disk records are
checksummed with various log buffer size mount options.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Inode cluster buffers are invalidated and cancelled when inode chunks
are freed to notify log recovery that previous logged updates to the
metadata buffer should be skipped. This ensures that log recovery does
not overwrite buffers that might have already been reused.
On v4 filesystems, inode chunk allocation and inode updates are logged
via the cluster buffers and thus cancellation is easily detected via
buffer cancellation items. v5 filesystems use the new icreate
transaction, which uses logical logging and ordered buffers to log a
full inode chunk allocation at once. The resulting icreate item often
spans multiple inode cluster buffers.
Log recovery checks for cancelled buffers when processing icreate log
items, but it has a couple problems. First, it uses the full length of
the inode chunk rather than the cluster size. Second, it uses the length
in FSB units rather than BB units. Either of these problems prevent
icreate recovery from identifying cancelled buffers and thus inode
initialization proceeds unconditionally.
Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
cluster sized increments and check each increment for cancellation.
Since icreate is currently only used for the minimum atomic inode chunk
allocation, we expect that either all or none of the buffers will be
cancelled. Cancel the icreate if at least one buffer is cancelled to
avoid making a bad situation worse by initializing a partial inode
chunk, but detect such anomalies and warn the user.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Various log items have recovery tracepoints to identify whether a
particular log item is recovered or cancelled. Add the equivalent
tracepoints for the icreate transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log recovery occurs in two phases at mount time. In the first phase,
EFIs and EFDs are processed and potentially cancelled out. EFIs without
EFD objects are inserted into the AIL for processing and recovery in the
second phase. xfs_mountfs() runs various other operations between the
phases and is thus subject to failure. If failure occurs after the first
phase but before the second, pending EFIs sit on the AIL, pin it and
cause the mount to hang.
Update the mount sequence to ensure that pending EFIs are cancelled in
the event of failure. Add a recovery cancellation mechanism to iterate
the AIL and cancel all EFI items when requested. Plumb cancellation
support through the log mount finish helper and update xfs_mountfs() to
invoke cancellation in the event of failure after recovery has started.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The EFI is initialized with a reference count of 2. One for the EFI to
ensure the item makes it to the AIL and one for the subsequently created
EFD to release the EFI once the EFD is committed. Log recovery uses the
EFI in a similar manner, but implements a hack to remove both references
in one call once the EFD is handled.
Update log recovery to use EFI reference counting in a manner consistent
with the log. When an EFI is encountered during recovery, an EFI item is
allocated and inserted to the AIL directly. Since the EFI reference is
typically dropped when the EFI is unpinned and this is analogous with
AIL insertion, drop the EFI reference at this point.
When a corresponding EFD is encountered in the log, this indicates that
the extents were freed, no processing is required and the EFI can be
dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
reference at this point rather than open code the AIL removal and EFI
free.
Remaining EFIs (i.e., with no corresponding EFD) are processed in
xlog_recover_finish(). An EFD transaction is allocated and the extents
are freed, which transfers ownership of the EFI reference to the EFD
item in the log.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log recovery attempts to free extents with leftover EFIs in the AIL
after initial processing. If the extent free fails (e.g., due to
unrelated fs corruption), the transaction is cancelled, though it
might not be dirtied at the time. If this is the case, the EFD does
not abort and thus does not release the EFI. This can lead to hangs
as the EFI pins the AIL.
Update xlog_recover_process_efi() to log the EFD in the transaction
before xfs_free_extent() errors are handled to ensure the
transaction is dirty, aborts the EFD and releases the EFI on error.
Since this is a requirement for EFD processing (and consistent with
xfs_bmap_finish()), update the EFD logging helper to do the extent
free and unconditionally log the EFD. This encodes the required EFD
logging behavior into the helper and reduces the likelihood of
errors down the road.
[dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
failure.]
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Freeing an extent in XFS involves logging an EFI (extent free
intention), freeing the actual extent, and logging an EFD (extent
free done). The EFI object is created with a reference count of 2:
one for the current transaction and one for the subsequently created
EFD. Under normal circumstances, the first reference is dropped when
the EFI is unpinned and the second reference is dropped when the EFD
is committed to the on-disk log.
In event of errors or filesystem shutdown, there are various
potential cleanup scenarios depending on the state of the EFI/EFD.
The cleanup scenarios are confusing and racy, as demonstrated by the
following test sequence:
# mount $dev $mnt
# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
-f punch=1 -f creat=1 -f unlink=1 &
# sleep 5
# killall -9 fsstress; wait
# godown -f $mnt
# umount
... in which the final umount can hang due to the AIL being pinned
indefinitely by one or more EFI items. This can occur due to several
conditions. For example, if the shutdown occurs after the EFI is
committed to the on-disk log and the EFD committed to the CIL, but
before the EFD committed to the log, the EFD iop_committed() abort
handler does not drop its reference to the EFI. Alternatively,
manual error injection in the xfs_bmap_finish() codepath shows that
if an error occurs after the EFI transaction is committed but before
the EFD is constructed and logged, the EFI is never released from
the AIL.
Update the EFI/EFD item handling code to use a more straightforward
and reliable approach to error handling. If an error occurs after
the EFI transaction is committed and before the EFD is constructed,
release the EFI explicitly from xfs_bmap_finish(). If the EFI
transaction is cancelled, release the EFI in the unlock handler.
Once the EFD is constructed, it is responsible for releasing the EFI
under any circumstances (including whether the EFI item aborts due
to log I/O error). Update the EFD item handlers to release the EFI
if the transaction is cancelled or aborts due to log I/O error.
Finally, update xfs_bmap_finish() to log at least one EFD extent to
the transaction before xfs_free_extent() errors are handled to
ensure the transaction is dirty and EFD item error handling is
triggered.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Some callers need to make error handling decisions based on whether
the current transaction successfully committed or not. Rename
xfs_trans_roll(), add a new parameter and provide a wrapper to
preserve existing callers.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Release of the EFI either occurs based on the reference count or the
extent count. The extent count used is either the count tracked in
the EFI or EFD, depending on the particular situation. In either
case, the count is initialized to the final value and thus always
matches the current efi_next_extent value once the EFI is completely
constructed. For example, the EFI extent count is increased as the
extents are logged in xfs_bmap_finish() and the full free list is
always completely processed. Therefore, the count is guaranteed to
be complete once the EFI transaction is committed. The EFD uses the
efd_nextents counter to release the EFI. This counter is initialized
to the count of the EFI when the EFD is created. Thus the EFD, as
currently used, has no concept of partial EFI release based on
extent count.
Given that the EFI extent count is always released in whole, use of
the extent count for reference counting is unnecessary. Remove this
level of the API and release the EFI based on the core reference
count. The efi_next_extent counter remains because it is still used
to track the slot to log the next extent to free.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The following tracepoints are updated to report the cgroup used during
cgroup writeback.
* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]
Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it. Tracepoints which take
bdi are updated to take bdi_writeback instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a function to determine the path length of a kernfs node. This
for now will be used by writeback tracepoint updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this. bdi_split_work_to_wbs() can simply use on-stack
fallback wb_work item and separate wb_completion to wait for it.
This patch removes wb_work->single_wait/done and the related code and
make bdi_split_work_to_wbs() use on-stack fallback wb_work and
wb_completion instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
earlier implementation, wb's were keyed by blkcg ID.
bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
allows iterations to start from an arbitrary ID which is used to
interrupt and resume iterations.
Unfortunately, while changing wb to be keyed by memcg ID instead of
blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
are keyed by blkcg ID. This doesn't affect iterations which don't get
interrupted but bdi_split_work_to_wbs() makes use of iteration
resuming on allocation failures and thus may incorrectly skip or
repeat wb's.
Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around this call might not be needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Consider eCryptfs dcache entries to be stale when the corresponding
lower inode's i_nlink count is zero. This solves a problem caused by the
lower inode being directly modified, without going through the eCryptfs
mount, leaving stale eCryptfs dentries cached and the eCryptfs inode's
i_nlink count not being cleared.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
On a box with a lot of ram (148gb) I can make the box softlockup after running
an fs_mark job that creates hundreds of millions of empty files. This is
because we never generate enough memory pressure to keep the number of inodes on
our unused list low, so when we go to unmount we have to evict ~100 million
inodes. This makes one processor a very unhappy person, so add a cond_resched()
in dispose_list() and if we need a resched when processing the s_inodes list do
that and run dispose_list() on what we've currently culled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!
So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.
Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Dave Chinner <dchinner@redhat.com>
Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.
IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes. Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.
Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.
Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
Test:
$ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
/mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
/mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
/mnt/scratch/6 -d /mnt/scratch/7
Result:
wall sys create rate Physical write IO
time CPU (avg files/s) IOPS Bandwidth
----- ----- ------------ ------ ---------
unpatched 6m56s 15m47s 24,000+/-500 26,000 130MB/s
patched 5m06s 13m28s 32,800+/-600 1,500 180MB/s
improvement -26.44% -14.68% +36.67% -94.23% +38.46%
If I use zero length files, this workload at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.
The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
generic_file_write_iter() will already do an fsync on our behalf
if the file descriptor is O_SYNC or the file is marked as IS_SYNC.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There are cases on which lowcomms_connect_sock() is called directly,
which caused the CF_WRITE_PENDING flag to not bet set upon reconnect,
specially on send_to_sock() error handling. On this last, the flag was
already cleared and no further attempt on transmitting would be done.
As dlm tends to connect when it needs to transmit something, it makes
sense to always mark this flag right after the connect.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
BUG_ON() is a severe action for this case, specially now that DLM with
SCTP will use 1 socket per association. Instead, we can just close the
socket on this error condition and return from the function.
Also move the check to an earlier stage as it won't change and thus we
can abort as soon as possible.
Although this issue was reported when still using SCTP with 1-to-many
API, this cleanup wouldn't be that simple back then because we couldn't
close the socket and making sure such event would cease would be hard.
And actually, previous code was closing the association, yet SCTP layer
is still raising the new data event. Probably a bug to be fixed in SCTP.
Reported-by: <tan.hu@zte.com.cn>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
DLM is using 1-to-many API but in a 1-to-1 fashion. That is, it's not
needed but this causes it to use sctp_do_peeloff() to mimic an
kernel_accept() and this causes a symbol dependency on sctp module.
By switching it to 1-to-1 API we can avoid this dependency and also
reduce quite a lot of SCTP-specific code in lowcomms.c.
The caveat is that now DLM won't always use the same src port. It will
choose a random one, just like TCP code. This allows the peers to
attempt simultaneous connections, which now are handled just like for
TCP.
Even more sharing between TCP and SCTP code on DLM is possible, but it
is intentionally left for a later commit.
Note that for using nodes with this commit, you have to have at least
the early fixes on this patchset otherwise it will trigger some issues
on old nodes.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
If we don't clear that bit, lowcomms_connect_sock() will not schedule
another attempt, and no further attempt will be done.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When a connection have issues DLM may need to close it. Therefore we
should also cancel pending workqueues for such connection at that time,
and not just when dlm is not willing to use this connection anymore.
Also, if we don't clear CF_CONNECT_PENDING flag, the error handling
routines won't be able to re-connect as lowcomms_connect_sock() will
check for it.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When using SCTP and accepting a new connection, DLM currently validates
if the peer trying to connect to it is one of the cluster nodes, but it
doesn't check if it already has a connection to it or not.
If it already had a connection, it will be overwritten, and the new one
will be used for writes, possibly causing the node to leave the cluster
due to communication breakage.
Still, one could DoS the node by attempting N connections and keeping
them open.
As said, but being explicit, both situations are only triggerable from
other cluster nodes, but are doable with only user-level perms.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Chuck reports seeing cases where a GETATTR that happens to race
with an asynchronous WRITE is overriding the file size, despite
the attribute barrier being set by the writeback code.
The culprit turns out to be the check in nfs_ctime_need_update(),
which sees that the ctime is newer than the cached ctime, and
assumes that it is safe to override the attribute barrier.
This patch removes that override, and ensures that attribute
barriers are always respected.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: a08a8cd375 ("NFS: Add attribute update barriers to NFS writebacks")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* bugfixes:
SUNRPC: Fix a thinko in xs_connect()
NFSv4.1/pNFS: Fix borken function _same_data_server_addrs_locked()
NFS: nfs_set_pgio_error sometimes misses errors
These patches improve both client performance and scalability, most notably
by increasing the maixmum allowed rsize and wsize and by increasing the number
of RDMA "credits". There are also several bugfixes, such as correcting how
WRITE compounds are encoded and fixing large NFS symlink operations.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVy5O3AAoJENfLVL+wpUDr4FsP/RyDj8wZiHmLy5LGlcuIK7lJ
mOyblxvA1DdADNzHUk9ORG3T6+I2Be6K6gXgD4va+9tmcBAvUEI25OhnKnmau4U+
mK9ZYUQWISKdd9FHkX33PmgnspxBP0DdS/rO9GeFRh9EhjGmpHQzOb6BbUDe04wA
PEll9LKZM+NTzOzJL8X5kDA9rXHGLo1T+aN5OMA0EQqxRDurXgLvZXZ5YzLPzgzW
52KdBIRr0ijFtXeIB87B/ATqEetcDgMRkxB2vMuTbF1Jsc1Fu2xGj2KyncCda+X7
6rBMB139zB3rBGG2szWfXU733jZmID09esvSHiC5oM6KYVPMxlF9ktvs+dV8g07k
t+svp4Y1LhHEojfuOKA9arEG7URdlBOUvqwF/UfwFOfcdYR2d2hyfzO/VtsdxhkB
O+kNvrnu2WsWgiyzEVPw2K0d0I5pAQSSxR2lZKzOYFjpwwjqjuvX+xYeC35BvJHv
raGvIvdnPBqd9r9DO/v7kFv4xSNLpyi8o0CLtRowWN29LRVJqhz8gWOttBWOJ/1r
IJYRDMB+R3kWFKe99/ev3IUslV8hekqbH4iYUGuPwFQNurSb69AdrdktsMWHP/8P
1bMCidNCcNUjui2kTX7HwTRWjjjXwK3+RTSToe6t9iKxMYtLSif4m3K7n/0dAzdK
ip8Qpx18MuJHphNiXb79
=7PAr
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.3' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFS over RDMA Client Side Changes
These patches improve both client performance and scalability, most notably
by increasing the maixmum allowed rsize and wsize and by increasing the number
of RDMA "credits". There are also several bugfixes, such as correcting how
WRITE compounds are encoded and fixing large NFS symlink operations.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
And call nfs_file_clear_open_context() directly. This makes it obvious
that nfs_file_release() will always return 0.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All nfs_write_inode() does is pass its arguments to
nfs_commit_unstable_pages(). Let's cut out the middle man and have
nfs_write_pages() do the work directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
All these functions do is call nfs41_ping_server() without adding
anything. Let's remove them and give nfs41_ping_server() a better name
instead.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The idmap_init() and idmap_quit() functions only exist to call the
_keyring() version. Let's just call the keyring() functions directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
They already exist and do the exact same thing. Let's save ourselves
several lines of code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_readdir_xdr_to_array() uses both a cache array and an array of
pages, so I rename these functions to make it clearer how the code
works. nfs_readdir_large_page() becomes nfs_readdir_alloc_pages()
because this function has absolutely nothing to do with setting up a
large page. nfs_readdir_free_pagearray() becomes
nfs_readdir_free_pages() to stay consistent with the new alloc_pages()
function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This variable is initialized to NULL and is never modified before being
passed to nfs_readdir_free_large_page(). But that's okay, because
nfs_readdir_free_large_page() only seems to exist as a way of calling
nfs_readdir_free_pagearray() without this parameter. Let's simplify by
removing pages_ptr and nfs_readdir_free_pagearray().
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We already know that pg_lseg is NULL here.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We generally want to read and write to a block device that's used by
the pNFS block layout client (and even if it's read only the server
has no way of telling us). Add FMODE_WRITE to the mode argument
so that we don't incorrectly tell the block driver that we want a
read-only open.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of overwriting kernel memory reject too long signatures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to replace the __be32 with a void pointer to do proper arithmentics
on the virtual addresses so that we can get the right page pointers.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to include the first u32 for the number of entries. Add a helper
for the calculation instead of opencoding it so that it's in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
- Switch back to using list_for_each_entry(). Fixes an incorrect test
for list NULL termination.
- Do not assume that lists are sorted.
- Finally, consider an existing entry to match if it consists of a subset
of the addresses in the new entry.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should ensure that we always set the pgio_header's error field
if a READ or WRITE RPC call returns an error. The current code depends
on 'hdr->good_bytes' always being initialised to a large value, which
is not always done correctly by callers.
When this happens, applications may end up missing important errors.
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
leading to a type confusion issue. Fix it by checking file->f_op.
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Static checkers complain that the format string should be "%s". It does
not make a difference for the current code.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
My static check complains because we have:
if (!*bh)
return -ENOMEM;
if (*bh) {
The second check is unnecessary.
I've simplified this code by moving the "if (!*bh)" checks around. Also
Andreas Dilger says we should probably print a warning if sb_getblk()
fails.
[ Restructured the code so that we print a warning message as well if
the mmp block doesn't check out, and to print the error code to
disambiguate between the error cases. - TYT ]
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At some point along this sequence of changes:
f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems
ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen. This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.
(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).
To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.
Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.
This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.
This change tries to address the following problems:
- Firstly, __sb_start_write() looks simply buggy. It does
__sb_end_write() if it sees ->frozen, but if it migrates
to another CPU before percpu_counter_dec(), sb_wait_write()
can wrongly succeed if there is another task which holds
the same "semaphore": sb_wait_write() can miss the result
of the previous percpu_counter_inc() but see the result
of this percpu_counter_dec().
- As Dave Hansen reports, it is suboptimal. The trivial
microbenchmark that writes to a tmpfs file in a loop runs
12% faster if we change this code to rely on RCU and kill
the memory barriers.
- This code doesn't look simple. It would be better to rely
on the generic locking code.
According to Dave, this change adds the same performance
improvement.
Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:
- This will be "fixed" by the rcu_sync changes we are going
to merge. After that freeze_super()->percpu_down_write()
will use synchronize_sched(), and thaw_super() won't use
synchronize() at all.
This doesn't need any changes in fs/super.c.
- Once we merge rcu_sync changes, we can also change super.c
so that all wb_write->rw_sem's will share the single ->rss
in struct sb_writes, then freeze_super() will need only one
synchronize_sched().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.
We do not even need this change right now, percpu_free_rwsem()
is fine in atomic context. But we are going to change this, it
will be might_sleep() after we merge the rcu_sync() patches.
And even after that we do not really need destroy_super_work(),
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.
So this is just the temporary kludge which helps us to avoid the
conflicts with the changes which will be (hopefully) routed via
rcu tree.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
Not only we need to avoid the warning from lockdep_sys_exit(), the
caller of freeze_super() can never release this lock. Another thread
can do this, so there is another reason for rwsem_release().
Plus the comment should explain why we have to fool lockdep.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
1. wait_event(frozen < level) without rwsem_acquire_read() is just
wrong from lockdep perspective. If we are going to deadlock
because the caller is buggy, lockdep can't detect this problem.
2. __sb_start_write() can race with thaw_super() + freeze_super(),
and after "goto retry" the 2nd acquire_freeze_lock() is wrong.
3. The "tell lockdep we are doing trylock" hack doesn't look nice.
I think this is correct, but this logic should be more explicit.
Yes, the recursive read_lock() is fine if we hold the lock on a
higher level. But we do not need to fool lockdep. If we can not
deadlock in this case then try-lock must not fail and we can use
use wait == F throughout this code.
Note: as Dave Chinner explains, the "trylock" hack and the fat comment
can be probably removed. But this needs a separate change and it will
be trivial: just kill __sb_start_write() and rename do_sb_start_write()
back to __sb_start_write().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
In recover_orphan_inode, whenever f2fs_iget fail, we will make kernel panic,
but it's not reasonable, because f2fs_iget can fail due to a lot of reasons
including out of memory.
So we change error handling method as below:
a) when finding no entry for the orphan inode, bug_on for catching bugs;
b) for other reasons, report it to caller.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is not enough free segment, we should not assign a new segment
explicitly. Otherwise, we can run out of free segment.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Call pre-defined helper bio_add_page() instead of open coding for
iterating through bi_io_vec[]. Doing that, it's possible to make some
parts in filesystems and mm/page_io.c simpler than before.
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
NLM locks don't conflict with NFSv4 share reservations, so we're not
going to learn anything new by watiting for them.
They do conflict with NFSv4 locks and with delegations.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
export.h refers to the pnfs_layouttype enum, which is defined there.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Nfsd has implement a site of seq_operations functions as sunrpc's cache.
Just exports sunrpc's codes, and remove nfsd's redundant codes.
v8, same as v6
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to Christoph's advice, this patch introduce a new helper
nfsd4_cb_sequence_done() for processing more callback errors, following
the example of the client's nfs41_sequence_done().
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Before the make_empty_dir_inode calls were introduce into proc, sysfs,
and sysctl those directories when stated reported an i_size of 0.
make_empty_dir_inode started reporting an i_size of 2. At least one
userspace application depended on stat returning i_size of 0. So
modify make_empty_dir_inode to cause an i_size of 0 to be reported for
these directories.
Cc: stable@vger.kernel.org
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
pnfs_clear_layoutreturn_waitbit() should already be calling
rpc_wake_up(&NFS_SERVER(ino)->roc_rpcwaitq) for us.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there is an outstanding return-on-close, then we just want new
layoutget requests to wait rather than fail.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should always test for outstanding layout returns, whether or not
pnfs_should_retry_layoutget() is true.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there are no valid layout segments, then we should already have
checked in pnfs_update_layout() whether or not this is the first
layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I'm not aware of any bugreports around this issue, but the locking
around the pnfs_commit_bucket is inconsistent at best. This patch
tightens it up by ensuring that the 'bucket->committing' list is always
changed atomically w.r.t. the 'bucket->clseg' layout segment tracking.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The xprt created by svc_create_xprt have be added to serv->sv_permsocks.
So putting the xprt directly is useless.
Otherwise, there is a more svc_xprt_put after the xprt be freed.
v2, same as v1.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
It is unusual to combine the open flags O_RDONLY and O_EXCL, but
it appears that libre-office does just that.
[pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
[pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>
NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
probably to reset the timestamps.
When it was an O_RDONLY open, the SETATTR command does not
identify any actual attributes to change.
If no delegation was provided to the open, the SETATTR uses the
all-zeros stateid and the request is accepted (at least by the
Linux NFS server - no harm, no foul).
If a read-delegation was provided, this is used in the SETATTR
request, and a Netapp filer will justifiably claim
NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
to retry - indefinitely.
So only treat O_EXCL specially if O_CREAT was also given.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 1d3d4437ea "vmscan: per-node deferred work" have made
register_shrinker can return an intergater error.
If register_shrinker() fail, the later unregister_shrinker() will
cause a NULL pointer access.
v2, same as v1.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Turned out I misinterpreted the spec...
Cc: Tom Haynes <thomas.haynes@primarydata.com>
Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Previously, we use radix tree to index all registered page entries for
atomic file, but now we only use radix tree to see whether current page
is indexed or not, since the other user of radix tree is gone in commit
042b7816aa ("f2fs: remove unnecessary call to invalidate inmemory pages").
So in this patch, we try to use one more efficient way:
Introducing a macro ATOMIC_WRITTEN_PAGE, and setting it as page private
value to indicate page indexing status. By using this way, we can save
memory and lookup time.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We run ltp testcase with f2fs and obtain a TFAIL in diotest4, the result in
detail is as fallow:
dio04
<<<test_start>>>
tag=dio04 stime=1432278894
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TFAIL : diotest4.c:129: write allows odd count.returns 1: Success
diotest4 4 TFAIL : diotest4.c:183: Odd count of read and write
diotest4 5 TPASS : Read beyond the file size
......
the result of ext4 with same environment:
dio04
<<<test_start>>>
tag=dio04 stime=1432259643
cmdline="diotest4"
contacts=""
analysis=exit
<<<test_output>>>
diotest4 1 TPASS : Negative Offset
diotest4 2 TPASS : removed
diotest4 3 TPASS : Odd count of read and write
diotest4 4 TPASS : Read beyond the file size
......
The reason is that when triggering DIO in f2fs, we will return zero value
in ->direct_IO if writer's buffer offset, file offset and transfer size is
not alignment to block size of filesystem, resulting in falling back into
buffered write instead of returning -EINVAL.
This patch fixes that problem by returning correct error number for above
case, and removing the judgement condition in check_direct_IO to make sure
the verification will be enabled for direct reader too.
Besides, Jaegeuk Kim pointed out that there is expectional cases we should
always make direct-io falling back into buffered write, such as dio in
encrypted file.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
[Chao Yu make small change and add detail description in commit message]
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We know "ret" is zero here so we can remove this condition.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
pnfs_layout_mark_request_commit() needs to ensure that it adds the
request to the commit list atomically with all the other updates
in order to prevent corruption to buckets[ds_commit_idx].wlseg
due to races with pnfs_generic_clear_request_commit().
Fixes: 338d00cfef ("pnfs: Refactor the *_layout_mark_request_commit...")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 294ac32e99 "nfsd: protect clid and verifier generation with
client_lock" moved gen_confirm() to gen_clid().
After that commit, setclientid will return a bad reply with all-zero
verifier after copy_clid().
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If using clientid_counter, it seems possible that gen_confirm could
generate the same verifier for the same client in some situations.
Add a new counter for client confirm verifier to make sure gen_confirm
generates a different verifier on each call for the same clientid.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
v2, new helper nfs4_free_stateowner for freeing so_owner.data and sop
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Layout is a state resource, nfsd should check it too.
v2, drop unneeded updating in nfsd4_renew()
v3, fix compile error without CONFIG_NFSD_PNFS
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Add an operation that will do setup of the service. In the case of a
classic thread-based service that means starting up threads. In the case
of a workqueue-based service, the setup will do something different.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirliey.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For now, all services use svc_xprt_do_enqueue, but once we add
workqueue-based service support, we'll need to do something different.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...not technically an operation, but it's more convenient and cleaner
to pass the module pointer in this struct.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Since we now have a container for holding svc_serv operations, move the
sv_function into it as well.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In later patches we'll need to abstract out more operations on a
per-service level, besides sv_shutdown and sv_function.
Declare a new svc_serv_ops struct to hold these operations, and move
sv_shutdown into this struct.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
fill_zero can fail due to a lot of reason, but previously we do not handle
its return value, so its callers such as punch_hole/f2fs_zero_range may
report success, but actually can fail because of error occurs inside
fill_zero.
This patch fixes to report correct return value of fill_zero.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull RCU pathwalk fix from Al Viro:
"Another racy use of nd->path.dentry in RCU mode"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
may_follow_link() should use nd->inode
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.
Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.
Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.
The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.
Signed-off-by: Chris Mason <clm@fb.com>
The code using 'ordered_extent_flush_mutex' mutex has removed by below
commit.
- 8d875f95da
btrfs: disable strict file flushes for renames and truncates
But the mutex still lives in struct 'btrfs_fs_info'.
So, this patch removes the mutex from struct 'btrfs_fs_info' and its
initialization code.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When testing the previous patch, Zhao Lei reported a similar bug when
attempting to scrub a degraded RAID 5/6 filesystem with a missing
device, leading to NULL pointer dereferences from the RAID 5/6 parity
scrubbing code.
The first cause was the same as in the previous patch: attempting to
call bio_add_page() on a missing block device. To fix this,
scrub_extent_for_parity() can just mark the sectors on the missing
device as errors instead of attempting to read from it.
Additionally, the code uses scrub_remap_extent() to map the extent of
the corresponding data stripe, but the extent wasn't already mapped. If
scrub_remap_extent() finds a missing block device, it doesn't initialize
extent_dev, so we're left with a NULL struct btrfs_device. The solution
is to use btrfs_map_block() directly.
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original implementation of device replace on RAID 5/6 seems to have
missed support for replacing a missing device. When this is attempted,
we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
crashes when we try to dereference it. This happens because
btrfs_map_block() has no choice but to return us the missing device
because RAID 5/6 don't have any alternate mirrors to read from, and a
missing device has a NULL bdev.
The idea implemented here is to handle the missing device case
separately, which better only happen when we're replacing a missing RAID
5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
reconstruct the data from parity, check it with
scrub_recheck_block_checksum(), and write it out with
scrub_write_block_to_dev_replace().
Reported-by: Philip <bugzilla@philip-seeger.de>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current RAID 5/6 recovery code isn't quite prepared to handle
missing devices. In particular, it expects a bio that we previously
attempted to use in the read path, meaning that it has valid pages
allocated. However, missing devices have a NULL blkdev, and we can't
call bio_add_page() on a bio with a NULL blkdev. We could do manual
manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
a separate path that allows us to manually add pages to the rbio.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Commit 5fbc7c59fd ("Btrfs: fix unfinished readahead thread for raid5/6
degraded mounting") fixed a problem where we would skip a missing device
when we shouldn't have because there are no other mirrors to read from
in RAID 5/6. After commit 2c8cdd6ee4 ("Btrfs, replace: write dirty
pages into the replace target device"), the fix doesn't work when we're
doing a missing device replace on RAID 5/6 because the replace device is
counted as a mirror so we're tricked into thinking we can safely skip
the missing device. The fix is to count only the real stripes and decide
based on that.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
scrub_submit() claims that it can handle a bio with a NULL block device,
but this is misleading, as calling bio_add_page() on a bio with a NULL
->bi_bdev would've already crashed. Delete this, as we're about to
properly handle a missing block device.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Clone and extent same lock their source and target inodes in opposite order.
In addition to this, the range locking in clone doesn't take ordering into
account. Fix this by having clone use the same locking helpers as
btrfs-extent-same.
In addition, I do a small cleanup of the locking helpers, removing a case
(both inodes being the same) which was poorly accounted for and never
actually used by the callers.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The file layout is
[extent 1]...[extent n][4k extent][HOLE][extent x]
extent 1~n and 4k extent can be merged during defrag, and the whole
defrag bytes is larger than our defrag thresh(256k), 4k extent as a
tail is left unmerged since we check if its next extent can be merged
(the next one is a hole, so the check will fail), the layout thus can
be
[new extent][4k extent][HOLE][extent x]
(1~n)
To fix it, beside looking at the next one, this also looks at the
previous one by checking @defrag_end, which is set to 0 when we
decide to stop merging contiguous extents, otherwise, we can merge
the previous one with our extent.
Also, this makes btrfs behave consistent with how xfs and ext4 do.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we do backref walking, we search firstly in queued delayed refs
and then the on-disk backrefs, but we parse differently for shared
references, for delayed refs we also add 'ref->root' while for on-disk
backrefs we don't, this can prevent us from merging refs indexed
by the same bytenr and cause find_parent_nodes() to throw a warning at
'WARN_ON(ref->count < 0)', for example, when we have a shared data extent
with 'ref_cnt=1' and a delayed shared data with a BTRFS_DROP_DELAYED_REF,
that happens.
For shared references, no matter if it's delayed or on-disk, ref->root is
not at all used, instead it's ref->parent that really matters, so this has
delayed refs handled as the same way as on-disk refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When a task trying to double lock a extent buffer, there are no
lockdep warning about it because this lock may be in "blocking_lock"
state, and make us hard to debug.
This patch add a WARN_ON() for above condition, it can not report
all deadlock cases(as lock between tasks), but at least helps us
some.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
These wrong comment was copyed from another function(expired) from
init, this patch fixed them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When btrfs_reloc_cow_block() failed in __btrfs_cow_block(), current
code just return a err-value to caller, but leave new_created extent
buffer exist and locked.
Then subsequent code (in relocate) try to lock above eb again,
and caused deadlock without any dmesg.
(eb lock use wait_event(), so no lockdep message)
It is hard to do recover work in __btrfs_cow_block() at this error
point, but we can abort transaction to avoid deadlock and operate on
unstable state.a
It also helps developer to find wrong place quickly.
(better than a frozen fs without any dmesg before patch)
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
These arguments are not used in functions, remove them for cleanup
and make kernel stack happy.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove chunk_objectid argument from btrfs_relocate_chunk() because
it is not necessary, it can also cleanup some code in caller for
prepare its value.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
objectid's init-value is not used in any case, remove it.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We need error checking code for get_ref_objectid_v0() in
relocate_block_group(), to avoid unpredictable result, especially
for accessing uninitialized value(when function failed) after
this line.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
xfstests btrfs/070 sometimes failed.
In my test machine, its fail rate is about 30%.
In another vm(vmware), its fail rate is about 50%.
Reason:
btrfs/070 do replace and defrag with fsstress simultaneously,
after above operation, checksum error is found by scrub.
Actually, it have no relationship with defrag operation, only
replace with fsstress can trigger this bug.
New data writen to target device have possibility rewrited by
old data from source device by replace code in debug, to avoid
above problem, we can set target block group to readonly in
replace period, so new data requested by other operation will
not write to same place with replace code.
Before patch(4.1-rc3):
30% failed in 100 xfstests.
After patch:
0% failed in 300 xfstests.
It also happened in btrfs/071 as it's another scrub with IO load tests.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Use new intruduced scrub_pause_on/off() can make this code block
clean and more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
It can reduce current duplicated code which is similar to
scrub_blocked_if_needed() but can not call it because little
different.
It also used by my next patch which is in same case.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
More than one code call set_block_group_ro() and restore rw in fail.
Old code use bool bit to save blockgroup's ro state, it can not
support parallel case(it is confirmd exist in my debug log).
This patch use ref count to store ro state, and rename
set_block_group_ro/set_block_group_rw
to
inc_block_group_ro/dec_block_group_ro.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we access extent_root in scrub_stripe() and
scrub_raid56_parity(), we need bypass unrelated tree item firstly
before using its contents to do other condition.
It is not a bug fix, only making code sequence in logic.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We need not load csum of whole strip in scrub because strip is trimed
before use, it is to say, what we really need to calculate csum is
data between [extent_logical, extent_len).
This patch changed to use above segment for btrfs_lookup_csums_range()
in scrub_stripe()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
For example, in scrub_raid56_parity(), following lines are used
to judge is all data processed:
place1: if (key.objectid > logic_end) ...
place2: if (logic_start >= logic_end) ...
...
(place2 is typo, is should be ">", it is copied from other
place, where logic_end's meaning is different, long story...)
We can fix above typo directly, but the root reason is ambiguous
meaning of logic_end in scrub raid56 parity.
In other place, XXX_end is pointed to data which is not included,
and we need to process segment of [XXX_start, XXX_end).
But for scrub raid56 parity, logic_end is pointed to lattest data
need to process, and introduced many "+ 1" and "- 1" in code as
below:
length = sparity->logic_end - sparity->logic_start + 1
logic_end - logic_start + 1
stripe_logical + increment - 1
This patch changed logic_end's meaning to make it in normal understanding
in raid56 parity functions and data struct alone with above bugfix.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When scrub_extent() failed, we need to free previois created
checksum list.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Old code checking cancel and pause request inside scrub stripe
operation, like:
loop() {
if (parity) {
scrub_parity_stripe();
continue;
}
check_cancel_and_pause()
scrub_normal_stripe();
}
Reason is when introduce raid56 stripe scrub, new code is inserted
simplely to front of loop.
Better to:
loop() {
check_cancel_and_pause()
if (parity)
scrub_parity_stripe();
else
scrub_normal_stripe();
}
This patch adjusted code place to realize above sequence.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When mount failed because missing device, we can see following
dmesg:
[ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed
[ 1060.273158] BTRFS: open_ctree failed
This patch add missing_device_number and tolerated_missing_device_number
to above output, to let user know what really happened, and helps
bug-report and debug.
dmesg after patch:
[ 127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed
[ 127.056099] BTRFS: open_ctree failed
Changelog v1->v2:
1: Changed to more clear description, suggested-by:
Anand Jain <anand.jain@oracle.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Scrub panic in following operation:
mkfs.ext4 /dev/vdh
btrfs-convert /dev/vdh
mount /dev/vdh /mnt/tmp1
btrfs scrub start -B /dev/vdh
(panic)
Reason:
1: In some case, leaf created by btrfs-convert was splited into 2
strips.
2: Scrub bypassed part of above wrong leaf data, but remain data
caused panic in scrub_checksum_tree_block().
For reason 1:
we can get following information after some simple operation.
a. mkfs.ext4 /dev/vdh
btrfs-convert /dev/vdh
b. btrfs-debug-tree /dev/vdh
we can see following item in extent tree:
item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
Its logical address is [27054080, 27070464)
and acrossed 2 strips:
[27000832, 27066368)
[27066368, 27131904)
Will be fixed in btrfs-progs(btrfs-convert, btrfsck, ...)
For reason 2:
Scrub is trying to do a "bypass" in this case, but the result is
"panic", because current code lacks of some condition in bypass,
and let some wrong leaf data escaped.
This patch fixed above scrub code.
Before patch:
# btrfs scrub start -B /dev/vdh
(panic)
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
...
# dmesg
...
[ 59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
#
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have one more case where after a log tree is replayed we get
inconsistent metadata leading to stale directory entries, due to
some directories having entries pointing to some inode while the
inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.
To trigger the problem we need to have a file with multiple hard links
belonging to different parent directories. Then if one of those hard
links is removed and we fsync the file using one of its other links
that belongs to a different parent directory, we end up not logging
the fact that the removed hard link doesn't exists anymore in the
parent directory.
Simple reproducer:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test directory and file.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/foo
ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3
# Make sure everything done so far is durably persisted.
sync
# Now we remove one of our file's hardlinks in the directory testdir.
unlink $SCRATCH_MNT/testdir/foo3
# We now fsync our file using the "foo" link, which has a parent that
# is not the directory "testdir".
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Silently drop all writes and unmount to simulate a crash/power
# failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger journal/log replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After the journal/log is replayed we expect to not see the "foo3"
# link anymore and we should be able to remove all names in the
# directory "testdir" and then remove it (no stale directory entries
# left after the journal/log replay).
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
status=0
exit
The test fails with:
$ ./check generic/107
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
--- tests/generic/107.out 2015-08-01 01:39:45.807462161 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
@@ -1,3 +1,5 @@
QA output created by 107
Entries in testdir:
foo2
+foo3
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
(see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
_check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
Ran: generic/107
Failures: generic/107
Failed 1 of 1 tests
$ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
(...)
checking fs roots
root 5 inode 257 errors 200, dir isize wrong
unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
(...)
And produces the following warning in dmesg:
[127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257
[127298.762081] ------------[ cut here ]------------
[127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]()
[127298.767327] BTRFS: Transaction aborted (error -2)
(...)
[127298.788611] Call Trace:
[127298.789137] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[127298.790090] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[127298.791157] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[127298.792323] [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs]
[127298.793633] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[127298.794699] [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs]
[127298.797640] [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
[127298.798876] [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs]
[127298.800154] [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed
[127298.801303] [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb
[127298.802450] [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14
[127298.803797] [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b
[127298.805017] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[127298.806310] ---[ end trace bbfddacb7aaada7b ]---
[127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry).
So fix this by logging all parent inodes, current and old ones, to make
sure we do not get stale entries after log replay. This is not a simple
solution such as triggering a full transaction commit because it would
imply full transaction commit when an inode is fsynced in the same
transaction that modified it and reloaded it after eviction (because its
last_unlink_trans is set to the same value as its last_trans as of the
commit with the title "Btrfs: fix stale dir entries after unlink, inode
eviction and fsync"), and it would also make fstest generic/066 fail
since one of the fsyncs triggers a full commit and the next fsync will
not find the inode in the log anymore (therefore not removing the xattr).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The search key advancing condition used in copy_to_sk() is loose. It can
advance the key even if it reaches sk->max_*: e.g. when the max key = (512,
1024, -1) and the current key = (512, 1025, 10), it increments the
offset by 1, continues hopeless search from (512, 1025, 11). This issue
make ioctl() to take unexpectedly long time scanning all the leaf a blocks
one by one.
This commit fix the problem using standard way of key comparison:
btrfs_comp_cpu_keys()
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When cloning/deduplicating file extents (through the clone and extent_same
ioctls) we can get data back references with offset values that are a
result of an unsigned integer arithmetic underflow, that is, values that
are much larger then they could be otherwise.
This is not a problem when decrementing or dropping the back references
(happens when we overwrite the extents or punch a hole for example, through
__btrfs_drop_extents()), since we compute the same too large offset value,
but it is a problem for the backref walking code, used by an incremental
send and the ioctls that are used by the btrfs tool "inspect-internal"
commands, as it makes it miss the corresponding file extent items because
the search key is set for an extent item that starts at an offset matching
the exceptionally large offset value of the data back reference. For an
incremental send this causes the send ioctl to fail with -EIO.
So teach the backref walking code to deal with these cases by setting the
search key's offset to 0 if the backref's offset value is larger than
LLONG_MAX (the largest possible file offset). This makes sure the backref
walking code finds the corresponding file extent items at the expense of
scanning more items and leafs in the btree.
Fixing the clone/dedup ioctls to not produce such underflowed results would
require major changes breaking backward compatibility, updating user space
tools, etc.
Simple reproducer case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $send_files_dir
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
_need_to_be_root
send_files_dir=$TEST_DIR/btrfs-test-$seq
rm -f $seqres.full
rm -fr $send_files_dir
mkdir $send_files_dir
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a single extent of 64K starting at file
# offset 128K.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
| _filter_xfs_io
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/mysnap1
# Now clone parts of the original extent into lower offsets of the file.
#
# The first clone operation adds a file extent item to file offset 0
# that points to our initial extent with a data offset of 16K. The
# corresponding data back reference in the extent tree has an offset of
# 18446744073709535232, which is the result of file_offset - data_offset
# = 0 - 16K.
#
# The second clone operation adds a file extent item to file offset 16K
# that points to our initial extent with a data offset of 48K. The
# corresponding data back reference in the extent tree has an offset of
# 18446744073709518848, which is the result of file_offset - data_offset
# = 16K - 48K.
#
# Those large back reference offsets (result of unsigned arithmetic
# underflow) confused the back reference walking code (used by an
# incremental send and the multiple inspect-internal ioctls) and made it
# miss the back references, which for the case of an incremental send it
# made it fail with -EIO and print a message like the following to
# dmesg:
#
# "BTRFS error (device sdc): did not find backref in send_root. \
# inode=257, offset=0, disk_byte=12845056 found extent=12845056"
#
$CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
$CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
-l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
_run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
$SCRATCH_MNT/mysnap2
_run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
_run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
-f $send_files_dir/2.snap
echo "File digest in the original filesystem:"
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
# Now recreate the filesystem by receiving both send streams and verify
# we get the same file contents that the original filesystem had.
_scratch_unmount
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
_run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap
echo "File digest in the new filesystem:"
md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
status=0
exit
The test's expected golden output is:
wrote 65536/65536 bytes at offset 131072
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File digest in the original filesystem:
6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo
File digest in the new filesystem:
6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo
But it failed with:
(...)
@@ -1,7 +1,5 @@
QA output created by 097
wrote 65536/65536 bytes at offset 131072
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-File digest in the original filesystem:
-6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo
-File digest in the new filesystem:
-6c6079335cff141b8a31233ead04cbff SCRATCH_MNT/mysnap2/foo
...
$ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
(...)
ERROR: send ioctl failed with -5: Input/output error
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.
This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach but it always ensures
correctness with the trade off of ocassional full transaction commits
when an fsync is done against the inode in the same transaction where
it was evicted and reloaded when our inode is a directory and often
logging its parent unnecessarily when our inode is not a directory.
The following test case for fstests triggers the problem:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with 2 hard links.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Now remove one of the links, trigger inode eviction and then fsync
# our inode.
unlink $SCRATCH_MNT/testdir/bar
echo 2 > /proc/sys/vm/drop_caches
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
# Silently drop all writes on our scratch device to simulate a power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify our directory entries.
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
# If we remove our inode, its parent should become empty and therefore we should
# be able to remove the parent.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test failed on btrfs with:
generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
--- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100
@@ -1,3 +1,6 @@
QA output created by 098
Entries in testdir:
+bar
foo
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
(Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
$ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
Checking filesystem on /dev/sdc
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have another case where after an fsync log replay we get an inode with
a wrong link count (smaller than it should be) and a number of directory
entries greater than its link count. This happens when we add a new link
hard link to our inode A and then we fsync some other inode B that has
the side effect of logging the parent directory inode too. In this case
at log replay time we add the new hard link to our inode (the item with
key BTRFS_INODE_REF_KEY) when processing the parent directory but we
never adjust the link count of our inode A. As a result we get stale dir
entries for our inode A that can never be deleted and therefore it makes
it impossible to remove the parent directory (as its i_size can never
decrease back to 0).
A simple reproducer for fstests that triggers this issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test directory and files.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
touch $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Create one hard link for file foo and another one for file bar. After
# that fsync only the file bar.
ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
# Silently drop all writes on scratch device to simulate power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify both our files have a link count of 2.
echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"
# We should be able to remove all the links of our files in testdir, and
# after that the parent directory should become empty and therefore
# possible to remove it.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test fails with:
-Link count for file foo: 2
+Link count for file foo: 1
Link count for file bar: 2
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
(...)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
And fsck's output:
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
Checking filesystem on /dev/sdc
(...)
So fix this by marking inodes for link count fixup at log replay time
whenever a directory entry is replayed if the entry was created in the
transaction where the fsync was made and if it points to a non-directory
inode.
This isn't a new problem/regression, the issue exists for a long time,
possibly since the log tree feature was added (2008).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fix from Chris Mason:
"We have a btrfs quota regression fix.
I merged this one on Thursday and have run it through tests against
current master.
Normally I wouldn't have sent this while you were finalizing rc6, but
I'm feeding mosquitoes in the adirondacks next week, so I wanted to
get this one out before leaving. I'll leave longer tests running and
check on things during the week, but I don't expect any problems"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: qgroup: Fix a regression in qgroup reserved space.
Looks like the word "contiguous" is often mistyped.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
This patch fix spelling typo inv various part of sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments. As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.
This was motivated by the following lockdep warning:
======================================================
[ INFO: possible circular locking dependency detected ]
4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
-------------------------------------------------------
httpd/1597 is trying to acquire lock:
(&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&mm->mmap_sem){++++++}:
lock_acquire+0xc7/0x270
__might_fault+0x7a/0xa0
filldir+0x9e/0x130
xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
xfs_readdir+0x1b4/0x330 [xfs]
xfs_file_readdir+0x2b/0x30 [xfs]
iterate_dir+0x97/0x130
SyS_getdents+0x91/0x120
entry_SYSCALL_64_fastpath+0x12/0x76
-> #2 (&xfs_dir_ilock_class){++++.+}:
lock_acquire+0xc7/0x270
down_read_nested+0x57/0xa0
xfs_ilock+0x167/0x350 [xfs]
xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
xfs_attr_get+0xbd/0x190 [xfs]
xfs_xattr_get+0x3d/0x70 [xfs]
generic_getxattr+0x4f/0x70
inode_doinit_with_dentry+0x162/0x670
sb_finish_set_opts+0xd9/0x230
selinux_set_mnt_opts+0x35c/0x660
superblock_doinit+0x77/0xf0
delayed_superblock_init+0x10/0x20
iterate_supers+0xb3/0x110
selinux_complete_init+0x2f/0x40
security_load_policy+0x103/0x600
sel_write_load+0xc1/0x750
__vfs_write+0x37/0x100
vfs_write+0xa9/0x1a0
SyS_write+0x58/0xd0
entry_SYSCALL_64_fastpath+0x12/0x76
...
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Morten Stevens <mstevens@fedoraproject.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using a large volume, for example 9T volume with 2T already used,
frequent creation of small files with O_DIRECT when the IO is not
cluster aligned may clear sectors in the wrong place. This will cause
filesystem corruption.
This is because p_cpos is a u32. When calculating the corresponding
sector it should be converted to u64 first, otherwise it may overflow.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org> [4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fsnotify_clear_marks_by_group_flags() can race with
fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked()
drops mark_mutex, a mark from the list iterated by
fsnotify_clear_marks_by_group_flags() can be freed and thus the next
entry pointer we have cached may become stale and we dereference free
memory.
Fix the problem by first moving marks to free to a special private list
and then always free the first entry in the special list. This method
is safe even when entries from the list can disappear once we drop the
lock.
Signed-off-by: Jan Kara <jack@suse.com>
Reported-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function may copy the si_addr_lsb field to user mode when it hasn't
been initialized, which can leak kernel stack data to user mode.
Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals. This is solved by
checking the value of si_signo in addition to si_code.
Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "BUG_ON(list_empty(&osb->blocked_lock_list))" in
ocfs2_downconvert_thread_do_work can be triggered in the following case:
ocfs2dc has firstly saved osb->blocked_lock_count to local varibale
processed, and then processes the dentry lockres. During the dentry
put, it calls iput and then deletes rw, inode and open lockres from
blocked list in ocfs2_mark_lockres_freeing. And this causes the
variable `processed' to not reflect the number of blocked lockres to be
processed, which triggers the BUG.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen reported the following;
My laptop has been behaving strangely with 4.2-rc2. Once I log
in to my X session, I start getting all kinds of strange errors
from applications and see this in my dmesg:
VFS: file-max limit 8192 reached
The problem is that the file-max is calculated before memory is fully
initialised and miscalculates how much memory the kernel is using. This
patch recalculates file-max after deferred memory initialisation. Note
that using memory hotplug infrastructure would not have avoided this
problem as the value is not recalculated after memory hot-add.
4.1: files_stat.max_files = 6582781
4.2-rc2: files_stat.max_files = 8192
4.2-rc2 patched: files_stat.max_files = 6562467
Small differences with the patch applied and 4.1 but not enough to matter.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Alex Ng <alexng@microsoft.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the change to new btrfs extent-oriented qgroup implement, due to
it doesn't use the old __qgroup_excl_accounting() for exclusive extent,
it didn't free the reserved bytes.
The bug will cause limit function go crazy as the reserved space is
never freed, increasing limit will have no effect and still cause
EQOUT.
The fix is easy, just free reserved bytes for newly created exclusive
extent as what it does before.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When testing with generic/101 in xfstests, error message outputed as below:
--- tests/generic/101.out
+++ results//generic/101.out.bad
@@ -10,10 +10,14 @@
File foo content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
-0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0372000
...
(Run 'diff -u tests/generic/101.out results/generic/101.out.bad' to see the entire diff)
The test flow is like below:
1. pwrite foo -S 0xaa 0 64K
2. pwrite foo -S 0xbb 64K 61K
3. sync
4. truncate foo 64K
5. truncate foo 125K
6. fsync foo
7. flakey drop writes
8. umount
After this test, we expect the data of recovered file will have the first
64k of data filling with value 0xaa and the next 61k of data filling with
value 0x00 because we have fsynced it before dropping writes in dm.
In f2fs, during recovering, we will only recover the valid block address
in direct node page if it is marked as a fsynced dnode, but block address
which means invalid/reserved (with value NULL_ADDR/NEW_ADDR) will not be
recovered. So, the file recovered shows its incorrect data 0xbb in range of
[61k, 125k].
In this patch, we fix to recover invalid/reserved block during recover flow.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In some cases, we only need the block address when we call
f2fs_reserve_block,
other fields of struct dnode_of_data aren't necessary.
We can try extent cache first for such cases in order to speed up the
process.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The current documentation for unregister_chrdev_region says that it return
a range of device numbers which is incorrect. Instead it unregister a
range of device numbers. Fix the documentation to make this clear.
Signed-off-by: Partha Pratim Mukherjee <ppm.floss@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.
Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.
Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.
Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With well over 200+ users of this api, there are a mere 12 users that
actually checked the return value of this function. And all of them
really didn't do anything with that information as the system or module
was shutting down no matter what.
So stop pretending like it matters, and just return void from
misc_deregister(). If something goes wrong in the call, you will get a
WARNING splat in the syslog so you know how to fix up your driver.
Other than that, there's nothing that can go wrong.
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To avoid meeting garbage data in next free node block at the end of warm
node chain when doing recovery, we will try to zero out that invalid block.
If the device is not support discard, our way for zeroing out block is:
grabbing a temporary zeroed page in meta inode, then, issue write request
with this page.
But, we forget to release that temporary page, so our memory usage will
increase without gaining any hit ratio benefit, so it's better to free it
for saving memory.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In following call path, we will pass a locked and referenced ipage
pointer to get_new_data_page:
- init_inode_metadata
- make_empty_dir
- get_new_data_page
There are two exit paths in get_new_data_page when error occurs:
1) grab_cache_page fails, ipage will not be released;
2) f2fs_reserve_block fails, ipage will be released in callee.
So, it's not consistent for error handling in get_new_data_page.
For f2fs_reserve_block, it's not very easy to change the rule
of error handling, since it's already complicated.
Here we deside to choose an easy way to fix this issue:
If any error occur in get_new_data_page, we will ensure releasing
ipage in this function.
The same issue is in f2fs_convert_inline_dir, fix that too.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Replace BUG_ON with f2fs_bug_on to deal with
block and segment validity check failed.
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In get_meta_page, we guarantee no failure for the returned page,
but sometimes, IO error from device will incur returning an
non-updated page.
Then, we still use this page as updated one, exception could happen
when using this kind of page.
So in this condition, we'd better freeze fs by making fs readonly and
and stop doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
some backing devices need pages to be stable during writeback. It doesn't
matter if
the page is completely overwritten or already uptodate, it needs to wait
before write.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to handle error cases in commit_inmem_pages.
If an error occurs, it stops to write the pages and return the error right
away.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When there is no enough free nids in free nid cache, we will try to
readahead FREE_NID_PAGES:4 nat pages into page cache of meta_inode,
then, reading nat entries in nat page for adding free nids to free nid
cache.
But when traversing all nat pages we readaheaded in a circulation,
our exit condition is not set right, one more nat page will be scanned
without readaheading, resulting worse read performance.
This patch fixes to read the correct number nat pages to avoid bad
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we clear inline data/dentry flag in handle_failed_inode, we will fail
to decline the stat count of inline data/dentry in f2fs_evict_inode due
to no flag in inode. So remove the wrong clearing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_ioc_start_{atomic,volatile}_write, if we failed in converting
inline data, we will report error to user, but still remain atomic/volatile
flag in inode, it will impact further writes for this file. Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes the incorrect range (0, LONG_MAX) which is used
in ranged fsync. If we use LONG_MAX as the parameter for indicating
the end of file we want to synchronize, in 32-bits architecture
machine, these datas after 4GB offset may not be persisted in
storage after ->fsync returned.
Here, we alter LONG_MAX to LLONG_MAX to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When flushing comes from background, if there is no dirty page in the
mapping of inode, we'd better to skip seeking dirty page from mapping
for writebacking.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The if statement "goto continue_unlock" is exactly the same when
each if condition is true that is depended on the value of both
"step" and "is_cold_data(page)" are 0 or 1. That means when the
value of "step" equals to "is_cold_data(page)", the if condition
is true and the if statement "goto continue_unlock" appears only
once, so it can be optimized to reduce the duplicated code.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In handle_failed_inode, there is a potential deadlock which can happen
in below call path:
- f2fs_create
- f2fs_lock_op down_read(cp_rwsem)
- f2fs_add_link
- __f2fs_add_link
- init_inode_metadata
- f2fs_init_security failed
- truncate_blocks failed
- handle_failed_inode
- f2fs_truncate
- truncate_blocks(..,true)
- write_checkpoint
- block_operations
- f2fs_lock_all down_write(cp_rwsem)
- f2fs_lock_op down_read(cp_rwsem)
So in this path, we pass parameter to f2fs_truncate to make sure
cp_rwsem in truncate_blocks will not be locked again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_do_collapse, region cp_rwsem covered is large, since it will be
held until all blocks are left shifted, so if we try to collapse small
area at the beginning of large file, checkpoint who want to grab writer's
lock of cp_rwsem will be delayed for long time.
In order to avoid this condition, altering to lock/unlock cp_rwsem each
shift operation.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a lookup and a insertion interface for extent tree.
The new lookup return the insert position and the prev/next
extents closest to the offset we lookup when find no match.
The new insertion uses above parameters to improve performance.
There are three possible insertions after the lookup in
f2fs_update_extent_tree, two of them insert parts of removed extent
back to tree, since no merge happens during this process, new insertion
skips the merge check in this scanario; the another insertion inserts a
new extent to tree, new insertion uses prev/next extent and insert
position to insert this extent directly, and save the time of searching
down the tree.
As long as tree remains unchanged between lookup and insertion, this
would work fine. And the new lookup would be useful when add
multi-blocks extent support for insertion interface.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch changes for a caller to handle the page after its bio gets an error.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Variables for recording extent cache ratio info were updated without
protection, this patch tries to alter them to atomic_t type for more
accurate stat.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds to stat the number of inline xattr inode for
showing in debugfs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
That encrypted page is used temporarily, so we don't need to mark it accessed.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull nfsd fixes from Bruce Fields.
* 'for-4.2' of git://linux-nfs.org/~bfields/linux:
nfsd: do nfs4_check_fh in nfs4_check_file instead of nfs4_check_olstateid
nfsd: Fix a file leak on nfsd4_layout_setlease failure
nfsd: Drop BUG_ON and ignore SECLABEL on absent filesystem
Now that we can get there in RCU mode, we shouldn't play with
nd->path.dentry->d_inode - it's not guaranteed to be stable.
Use nd->inode instead.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If there are gced dirty pages and normal dirty pages in the mapping
of one inode, we might writeback them alternately with discontinuous
block address, resulting in low performance.
This patch introduces f2fs_write_cache_pages with codes copied from
write_cache_pages in mm/page-writeback.c.
In this function, we refactor flow with two steps:
1) writeback all cold type pages.
2) writeback all non-cold type pages.
By using this method, f2fs will writeback dirty pages with the same
temperature in bunch mode, it makes writeouted block being with
more continuous address, so they can be merged as much as possible
in f2fs bio cache, and also it will reduce the chance of submiting
small IO from block layer.
Test environment: 8g nokia sd card (very old sd card, but it shows
better effect when testing with this patch, and with a 32g kingston
sd card, I didn't see much more improvement).
Test step:
1. touch testfile;
2. truncate -s 512K testfile;
3. write all pages with odd index;
4. trigger gc by ioctl;
5. write all pages with even index;
6. time fsync testfile.
before:
real 0m0.402s
user 0m0.000s
sys 0m0.000s
after:
real 0m0.143s
user 0m0.004s
sys 0m0.004s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to return correct error number of ->setxattr, which
is reported by xfstest tests/generic/026 as below:
generic/026 - output mismatch
--- tests/generic/026.out
+++ results/generic/026.out.bad
@@ -4,6 +4,6 @@
1 below acl max
acl max
1 above acl max
-chacl: cannot set access acl on "largeaclfile": Argument list too long
+chacl: cannot set access acl on "largeaclfile": Numerical result out of range
use 16 aces
use 17 aces
...
Ran: generic/026
Failures: generic/026
Failed 1 of 1 tests
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, since 'commit 4531929e39 ("f2fs: move grabing orphan
pages out of protection region")' was committed, in write_orphan_inodes(),
we will grab all meta page in a batch before we use them under spinlock,
so that we can avoid large time delay of grabbing meta pages under
spinlock.
Now, 'commit d6c67a4fee ("f2fs: revmove spin_lock for
write_orphan_inodes")' remove the spinlock in write_orphan_inodes,
so there is no issue we describe above, we'd better recover to move
the grab operation to original place for readability.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With cost-benifit method, background gc will consider old section with
fewer valid blocks as candidate victim, these old blocks in section will
be treated as cold data, and laterly will be moved into cold segment.
But if the gcing page is attached by user through buffered or mmaped
write, we should reset the page as non-cold one, because this page may
have more opportunity for further updating.
So fix to add clearing code for the missed 'mmap' case.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When background gc is off, the only way to trigger gc is executing
a force gc in some operations who wants to grab space in disk.
The executing condition is limited: to execute force gc, we should
wait for the time when there is almost no more free section for LFS
allocation. This seems not reasonable for our user who wants to
control triggering gc by himself.
This patch introduces F2FS_IOC_GARBAGE_COLLECT interface for
triggering garbage collection by using ioctl. It provides our users
one more option to trigger gc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves extent cache related code from data.c into extent_cache.c
since extent cache is independent feature, and its codes are not relate to
others in data.c, it's better for us to maintain them in separated place.
There is no functionality change, but several small coding style fixes
including:
* rename __drop_largest_extent to f2fs_drop_largest_extent for exporting;
* rename misspelled word 'untill' to 'until';
* remove unneeded 'return' in the end of f2fs_destroy_extent_tree().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since only parts of extents longer than F2FS_MIN_EXTENT_LEN will
be kept in extent cache after split, extents already shorter than
F2FS_MIN_EXTENT_LEN don't need to try split at all.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update page flag (e.g. Uptodate/cold flag) in
->write_begin.
Otherwise, page will be non-uptodate when we try to write entire
page, and cold data flag in page will not be clean when gced page
is being rewritten.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an extent_tree entry has a zero reference count, we can drop it from the
cache in higher priority rather than currently referencing entries.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In ->writepages, we use writepages mutex lock to serialize all block
address allocation and page submitting pairs from different inodes.
This method makes our delayed dirty pages of one inode being written
continously as many as possible.
But there is one problem that we did not submit current cached bio in
protection region of writepages mutex lock, so there is a small chance
that we submit the one of other thread's as below, resulting in
splitting more bios.
thread 1 thread 2
->writepages
lock(writepages)
->write_cache_pages
unlock(writepages)
lock(writepages)
->write_cache_pages
->f2fs_submit_merged_bio
->writepage
unlock(writepages)
fs_mark-6535 [002] .... 2242.270230: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5766152, size = 524288
fs_mark-6536 [000] .... 2242.270361: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767176, size = 4096
fs_mark-6536 [000] .... 2242.270370: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, NODE, sector = 8138112, size = 4096
fs_mark-6535 [002] .... 2242.270776: f2fs_submit_write_bio: dev = (1,0), WRITE_SYNC, DATA, sector = 5767184, size = 516096
This may really increase time of block layer works, and may cause
larger IO lantency.
This patch moves the submitting operation into region of writepages
mutex lock to avoid bio splits when concurrently writebacking is
intensive.
my test environment: virtual machine,
intel cpu i5 2500, 8GB size memory, 4GB size ramdisk
time fs_mark -t 16 -L 1 -s 524288 -S 1 -d /mnt/f2fs/
before:
real 0m4.244s
user 0m0.088s
sys 0m12.336s
after:
real 0m3.822s
user 0m0.072s
sys 0m10.760s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with fs_mark, some blocks were written out as cold
data which were mixed with warm data, resulting in splitting more
bios.
This is because fs_mark will create file with random filename as
below:
559551ee~~~~~~~~15Z29OCC05JCKQP60JQ42MKV
559551ee~~~~~~~~NZAZ6X8OA8LHIIP6XD0L58RM
559551ef~~~~~~~~B15YDSWAK789HPSDZKYTW6WM
559551f1~~~~~~~~2DAE5DPS79785BUNTFWBEMP3
559551f1~~~~~~~~1MYDY0BKSQCJPI32Q8C514RM
559551f1~~~~~~~~YQOTMAOMN5CVRFOUNI026MP4
559551f3~~~~~~~~1WF42LPRTQJNPPGR3EINKMPE
559551f3~~~~~~~~8Y2NRK7CEPPAA02LY936PJPG
They are regarded as cold file since their filename are ended with
multimedia files' extension, but this should be wrong as we only
match the extension of filename, not the whole one.
In this patch, we try to fix the format of multimedia filename to:
"filename + '.' + extension", then we set cold file only its
filename matches the format.
So after this change, it will reduce the probability we set the
wrong cold file, also it helps a little for fs_mark's performance
on f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This makes the function check_dnode have a return type of bool
due to this particular function only ever returning either one
or zero as its return value and changes the name of the function
to is_alive in order to better explain this function's intended
work of checking if a dnode is still in use by the filesystem.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
[Jaegeuk Kim: change the return value check for the renamed function]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Because of the extent shrinker or other -ENOMEM scenarios, it cannot guarantee
that the largest extent would be cached in the tree all the time.
Instead of relying on extent_tree, we can simply check the cached one in extent
tree accordingly.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We don't need to handle the duplicate extent information.
The integrated rule is:
- update on-disk extent with largest one tracked by in-memory extent_cache
- destroy extent_tree for the truncation case
- drop per-inode extent_cache by shrinker
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a shrinker targeting to reduce memory footprint consumed
by a number of in-memory f2fs data structures.
In addition, it newly adds:
- sbi->umount_mutex to avoid data races on shrinker and put_super
- sbi->shruinker_run_no to not revisit objects
Note that the basic implementation was copied from fs/ubifs/shrinker.c
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch relocates cached_en not only to be covered by spin_lock, but also
to set once after checking out completely.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, f2fs_update_extent_cache() updates in-memory extent_cache all the
time, and then finally preserves its up-to-date extent into on-disk one during
f2fs_evict_inode.
But, in the following scenario:
1. mount
2. open & write an extent X
3. f2fs_evict_inode; on-disk extent is X
4. open & update the extent X with Y
5. sync; trigger checkpoint
6. power-cut
after power-on, f2fs should serve extent Y, but we have an on-disk extent X.
This causes a failure on xfstests/311.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes wrong calculation on block address field when an extent is
split.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For newly added fallocate types, it should convert inline_data before handling
block swapping.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Before iput is called, the inode number used by a bad inode can be reassigned
to other new inode, resulting in any abnormal behaviors on the new inode.
This should not happen for the new inode.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The write_checkpoint can update stat information, so we should destroy the stat
structure after it.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Dirty page can be exist in mapping of newly created symlink, but previously
we did not maintain the counting of dirty page for symlink like we maintained
for regular/directory, so the counting we lookuped should be wrong.
This patch adds missed dirty page counting for symlink to fix this issue.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The key_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently there is no limitation on number of reserved credits we can
ask for. If we ask for more reserved credits than 1/2 of maximum
transaction size, or if total number of credits exceeds the maximum
transaction size per operation (which is currently only possible with
the former) we will spin forever in start_this_handle().
Fix this by adding this limitation at the start of start_this_handle().
This patch also removes the credit limitation 1/2 of maximum transaction
size, since we really only want to limit the number of reserved credits.
There is not much point to limit the credits if there is still space in
the journal.
This accidentally also fixes the online resize, where due to the
limitation of the journal credits we're unable to grow file systems with
1k block size and size between 16M and 32M. It has been partially fixed
by 2c869b262a, but not entirely.
Thanks Jan Kara for helping me getting the correct fix.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull Ceph fixes from Sage Weil:
"There are two critical regression fixes for CephFS from Zheng, and an
RBD completion fix for layered images from Ilya"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix copyup completion race
ceph: always re-send cap flushes when MDS recovers
ceph: fix ceph_encode_locks_to_buffer()
Pull VFS fix from Al Viro:
"Spurious ENOTDIR fix"
This should fix the problems reported by Dominique Martinet and Hugh
Dickins.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
link_path_walk(): be careful when failing with ENOTDIR
In RCU mode we might end up with dentry evicted just we check
that it's a directory. In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.
Breakage had been introduced in commit b18825a - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid. That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory. The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.
Note that all branches since 3.12 have that problem...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs fixes from Chris Mason:
"Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix quick exhaustion of the system array in the superblock
btrfs: its btrfs_err() instead of btrfs_error()
btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
Currently, preprocess_stateid_op calls nfs4_check_olstateid which
verifies that the open stateid corresponds to the current filehandle in the
call by calling nfs4_check_fh.
If the stateid is a NFS4_DELEG_STID however, then no such check is done.
This could cause incorrect enforcement of permissions, because the
nfsd_permission() call in nfs4_check_file uses current the current
filehandle, but any subsequent IO operation will use the file descriptor
in the stateid.
Move the call to nfs4_check_fh into nfs4_check_file instead so that it
can be done for all stateid types.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
[bfields: moved fh check to avoid NULL deref in special stateid case]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
commit e548e9b93d makes the kclient
only re-send cap flush once during MDS failover. If the kclient sends
a cap flush after MDS enters reconnect stage but before MDS recovers.
The kclient will skip re-sending the same cap flush when MDS recovers.
This causes problem for newly created inode. The MDS handles cap
flushes before replaying unsafe requests, so it's possible that MDS
find corresponding inode is missing when handling cap flush. The fix
is reverting to old behaviour: always re-send when MDS recovers
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- remote attribute log recovery corruption fixes
- DAX page faults need to use direct mappings, not a page cache
mapping.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJVutbbAAoJEK3oKUf0dfode50QANILLE7Rq9rKTp9ogHZKxTfv
Y3SJLCpwsuZ9GQuvqgRk2MbZDgNsQxbKg4lCVlf+lbXWDoKINkLw73g1qPmEMA8q
3EQ+GNYcvsoGKlyIRod6HMuqIa9zIfremj+qk4/PCDhD50IjI/1QH5LqVz9lnaas
3N1zuU1t5dJvmUMPdLUvj46s51O9stiKTN0gbLnv5CEbTbyH1PiyqGubecJkcFVU
3oEdp0PyxeoxUgUWLh24tgQXzeLQTR/95viJsUILcdPBd3geCBbW3/pgke5Rg/3+
G/8aowNtDvNrg8sT6FuDgSR8kYudEssleQnfWO7rX7yigKLzSnNpD7m4779ZmEXP
ey32NVbXwslg92AYwM9A8EqzGInm+cMDWcRu22WpHKzWs1wpVlpdkJXbP1SE9fKm
pXVw1bT4XNArPjew/cclLMI2T0Tc6LMTUR5WxcF1N8RCTvdTm2qf+l43zXYFcBvX
IxaD6U8z7tyShbgG2xllM3D/ANznvIs4pAb9zK6vJK7ZcfCgTVY/2YkHZeoKGKdG
2Mo1SUQpaG8Xhm53IDfSLtEmTp9eQAdDooHCwjugsLXe9A65787EIabTK9oTsNG1
W90CBQzyrIliQ9v291eAB/e5gRmExVxyV9VnVPMw3jXJ9mwxyfKOEd9mW+pv8RfH
x3mj0CxjO14vQrrfCLjd
=kjas
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"There are a couple of recently found, long standing remote attribute
corruption fixes caused by log recovery getting confused after a
crash, and the new DAX code in XFS (merged in 4.2-rc1) needs to
actually use the DAX fault path on read faults.
Summary:
- remote attribute log recovery corruption fixes
- DAX page faults need to use direct mappings, not a page cache
mapping"
* tag 'xfs-for-linus-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: remote attributes need to be considered data
xfs: remote attribute headers contain an invalid LSN
xfs: call dax_fault on read page faults for DAX
When we clear the dirty bits in btrfs_delete_unused_bgs for extents
in the empty block group, it results in btrfs_finish_extent_commit being
unable to discard the freed extents.
The block group removal patch added an alternate path to forget extents
other than btrfs_finish_extent_commit. As a result, any extents that
would be freed when the block group is removed aren't discarded. In my
test run, with a large copy of mixed sized files followed by removal, it
left nearly 2/3 of extents undiscarded.
To clean up the block groups, we add the removed block group onto a list
that will be discarded after transaction commit.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The cleaner thread may already be sleeping by the time we enter
close_ctree. If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard. We also
explicitly remove unused block groups in the ro-remount path
for the same reason.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.
This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used. This is
a change for btrfs but is consistent with other file systems.
We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock. That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls. We release it
between searches to allow the file system to perform more-or-less
normally. Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search. This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Btrfs doesn't track superblocks with extent records so there is nothing
persistent on-disk to indicate that those blocks are in use. We track
the superblocks in memory to ensure they don't get used by removing them
from the free space cache when we load a block group from disk. Prior
to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
was fine since the block group would never be reclaimed so the superblock
was always safe. Once we started removing the empty block groups, we
were protected by the fact that discards weren't being properly issued
for unused space either via FITRIM or -odiscard. The block groups were
still being released, but the blocks remained on disk.
In order to properly discard unused block groups, we need to filter out
the superblocks from the discard range. Superblocks are located at fixed
locations on each device, so it makes sense to filter them out in
btrfs_issue_discard, which is used by both -odiscard and FITRIM.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
It's possible, though unexpected, to pass unaligned offsets and lengths
to btrfs_issue_discard. We then shift the offset/length values to sector
units. If an unaligned offset has been passed, it will result in the
entire sector being discarded, possibly losing data. An unaligned
length is safe but we'll end up returning an inaccurate number of
discarded bytes.
This patch aligns the offset to the 512B boundary, adjusts the length,
and warns, since we shouldn't be discarding on an offset that isn't
aligned with our sector size.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Initially this will just be the length argument passed to it,
but the following patches will adjust that to reflect re-alignment
and skipped blocks.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.
It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a new superblock field, sb_meta_uuid. If set, along with
a new incompat flag, the code will use that field on a V5 filesystem
to compare to metadata UUIDs, which allows us to change the user-
visible UUID at will. Userspace handles the setting and clearing
of the incompat flag as appropriate, as the UUID gets changed; i.e.
setting the user-visible UUID back to the original UUID (as stored in
the new field) will remove the incompatible feature flag.
If the incompat flag is not set, this copies the user-visible UUID into
into the meta_uuid slot in memory when the superblock is read from disk;
the meta_uuid field is not written back to disk in this case.
The remainder of this patch simply switches verifiers, initializers,
etc to use the new sb_meta_uuid field.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The header side of xfs_bit.c is already in libxfs, and the sparse
inode code requires the xfs_next_bit() function so pull in the
xfs_bit.c file so that a sparse inode enabled libxfs compiles
cleanly in userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_create() and xfs_create_tmpfile() have useless jumps to identical
labels. Simplify them.
Signed-off-by: Jan Kara <jack@suse.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The second and subsequent lines of multi-line logging messages
are not prefixed with the same information as the first line.
Separate messages with newlines into multiple calls to ensure
consistent prefixing and allow easier grep use.
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When log recovery hits a new transaction, it copies the transaction
header from the expected location in the log to the in-core structure
using the length from the op record header. This length is validated to
ensure it doesn't exceed the length of the record, but not against the
expected size of a transaction header (and thus the size of the in-core
structure). If the on-disk length is corrupted, the associated memcpy()
can overflow, write to unrelated memory and lead to crashes. This has
been reproduced via filesystem fuzzing.
The code currently handles the possibility that the transaction header
is split across two op records. Neither instance accounts for corruption
where the op record length might be larger than the in-core transaction
header. Update both sites to detect such corruption, warn and return an
error from log recovery. Also add some comments and assert that if the
record is split, the copy of the second portion is less than a full
header. Otherwise, this suggests the copy of the second portion could
have overwritten bits from the first and thus that something could be
wrong.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We have seen somewhat rare reports of the following assert from
xlog_cil_push_background() failing during ltp tests or somewhat
innocuous desktop root fs workloads (e.g., virt operations, initramfs
construction):
ASSERT(!list_empty(&cil->xc_cil));
The reasoning behind the assert is that the transaction has inserted
items to the CIL and hit background push codepath all with
cil->xc_ctx_lock held for reading. This locks out background commit from
emptying the CIL, which acquires the lock for writing. Therefore, the
reasoning is that the items previously inserted in the CIL should still
be present.
The cil->xc_ctx_lock read lock is not sufficient to protect the xc_cil
list, however, due to how CIL insertion is handled.
xlog_cil_insert_items() inserts and reorders the dirty transaction items
to the tail of the CIL under xc_cil_lock. It uses list_move_tail() to
achieve insertion and reordering in the same block of code. This
function removes and reinserts an item to the tail of the list. If a
transaction commits an item that was already logged and thus already
resides in the CIL, and said item is the sole item on the list, the
removal and reinsertion creates a temporary state where the list is
actually empty.
This state is not valid and thus should never be observed by concurrent
transaction commit-side checks in the circumstances outlined above. We
do not want to acquire the xc_cil_lock in all of these instances as it
was previously removed and replaced with a separate push lock for
performance reasons. Therefore, close any races with list_empty() on the
insertion side by ensuring that the list is never in a transient empty
state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_bunmapi() doesn't care what type of extent is being freed and
does not look at the XFS_BMAPI_METADATA flag at all. As such we can
remove the XFS_BMAPI_METADATA from all callers that use it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We don't log remote attribute contents, and instead write them
synchronously before we commit the block allocation and attribute
tree update transaction. As a result we are writing to the allocated
space before the allcoation has been made permanent.
As a result, we cannot consider this allocation to be a metadata
allocation. Metadata allocation can take blocks from the free list
and so reuse them before the transaction that freed the block is
committed to disk. This behaviour is perfectly fine for journalled
metadata changes as log recovery will ensure the free operation is
replayed before the overwrite, but for remote attribute writes this
is not the case.
Hence we have to consider the remote attribute blocks to contain
data and allocate accordingly. We do this by dropping the
XFS_BMAPI_METADATA flag from the block allocation. This means the
allocation will not use blocks that are on the busy list without
first ensuring that the freeing transaction has been committed to
disk and the blocks removed from the busy list. This ensures we will
never overwrite a freed block without first ensuring that it is
really free.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In recent testing, a system that crashed failed log recovery on
restart with a bad symlink buffer magic number:
XFS (vda): Starting recovery (logdev: internal)
XFS (vda): Bad symlink block magic!
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060
On examination of the log via xfs_logprint, none of the symlink
buffers in the log had a bad magic number, nor were any other types
of buffer log format headers mis-identified as symlink buffers.
Tracing was used to find the buffer the kernel was tripping over,
and xfs_db identified it's contents as:
000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
.....
This is a remote attribute buffer, which are notable in that they
are not logged but are instead written synchronously by the remote
attribute code so that they exist on disk before the attribute
transactions are committed to the journal.
The above remote attribute block has an invalid LSN in it - cycle
0xd002000, block 0 - which means when log recovery comes along to
determine if the transaction that writes to the underlying block
should be replayed, it sees a block that has a future LSN and so
does not replay the buffer data in the transaction. Instead, it
validates the buffer magic number and attaches the buffer verifier
to it. It is this buffer magic number check that is failing in the
above assert, indicating that we skipped replay due to the LSN of
the underlying buffer.
The problem here is that the remote attribute buffers cannot have a
valid LSN placed into them, because the transaction that contains
the attribute tree pointer changes and the block allocation that the
attribute data is being written to hasn't yet been committed. Hence
the LSN field in the attribute block is completely unwritten,
thereby leaving the underlying contents of the block in the LSN
field. It could have any value, and hence a future overwrite of the
block by log recovery may or may not work correctly.
Fix this by always writing an invalid LSN to the remote attribute
block, as any buffer in log recovery that needs to write over the
remote attribute should occur. We are protected from having old data
written over the attribute by the fact that freeing the block before
the remote attribute is written will result in the buffer being
marked stale in the log and so all changes prior to the buffer stale
transaction will be cancelled by log recovery.
Hence it is safe to ignore the LSN in the case or synchronously
written, unlogged metadata such as remote attribute blocks, and to
ensure we do that correctly, we need to write an invalid LSN to all
remote attribute blocks to trigger immediate recovery of metadata
that is written over the top.
As a further protection for filesystems that may already have remote
attribute blocks with bad LSNs on disk, change the log recovery code
to always trigger immediate recovery of metadata over remote
attribute blocks.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.
Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Commit 3da40c7b08 ("ext4: only call ext4_truncate when size <= isize")
introduced a bug that c/mtime is not updated on truncate up.
Fix the issue by setting c/mtime explicitly in the truncate up case.
Note that ftruncate(2) is not affected, so you won't see this bug using
truncate(1) and xfs_io(1).
Signed-off-by: Zirong Lang <zorro.lang@gmail.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 6f6a6fda29 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes logic in
jbd2_log_do_checkpoint() bail out which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
a progress in cleaning the journal. Without it jbd2_journal_destroy()
just loops in an infinite loop.
Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of
jbd2_log_do_checkpoint() fails with error.
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 6f6a6fda29
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Highlights include:
Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce
Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory availability
- Don't revalidate the mapping if both size and change attr are up to date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits correctly
- Several pNFS layout return bugfixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
+5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
+ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
dyqHwpCjoUtmYAUr166U
=wlUD
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable patches:
- Fix a situation where the client uses the wrong (zero) stateid.
- Fix a memory leak in nfs_do_recoalesce
Bugfixes:
- Plug a memory leak when ->prepare_layoutcommit fails
- Fix an Oops in the NFSv4 open code
- Fix a backchannel deadlock
- Fix a livelock in sunrpc when sendmsg fails due to low memory
availability
- Don't revalidate the mapping if both size and change attr are up to
date
- Ensure we don't miss a file extension when doing pNFS
- Several fixes to handle NFSv4.1 sequence operation status bits
correctly
- Several pNFS layout return bugfixes"
* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
nfs: plug memory leak when ->prepare_layoutcommit fails
SUNRPC: Report TCP errors to the caller
sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
NFSv4.2: handle NFS-specific llseek errors
NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
NFS: Fix a memory leak in nfs_do_recoalesce
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
NFS: Don't revalidate the mapping if both size and change attr are up to date
NFSv4/pnfs: Ensure we don't miss a file extension
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
SUNRPC: xprt_complete_bc_request must also decrement the free slot count
SUNRPC: Fix a backchannel deadlock
pNFS: Don't throw out valid layout segments
pNFS: pnfs_roc_drain() fix a race with open
pNFS: Fix races between return-on-close and layoutreturn.
pNFS: pnfs_roc_drain should return 'true' when sleeping
pNFS: Layoutreturn must invalidate all existing layout segments.
...
"data" is currently leaked when the prepare_layoutcommit operation
returns an error. Put the cred before taking the spinlock in that
case, take the lock and then goto out_unlock which will drop the
lock and then free "data".
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Handle NFS-specific llseek errors instead of letting them leak out to
userspace.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Recoalescing does not affect whether or not we've already sent off
I/O, and doing so means that we end up sending a bunch of synchronous
for cases where we actually need to be using unstable writes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the function exits early, then we must put those requests that were
not processed back onto the &mirror->pg_list so they can be cleaned up
by nfs_pgio_error().
Fixes: a7d42ddb30 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We should release "sd" before returning.
Fixes: 0fa12ad1b285 ('ext4: Handle error from dquot_initialize()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
Pull block fixes from Jens Axboe:
"Four smaller fixes for the current series. This contains:
- A fix for clones of discard bio's, that can cause data corruption.
From Martin.
- A fix for null_blk, where in certain queue modes it could access a
request after it had been freed. From Mike Krinkin.
- An error handling leak fix for blkcg, from Tejun.
- Also from Tejun, export of the functions that a file system needs
to implement cgroup writeback support"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Do a full clone when splitting discard bios
block: export bio_associate_*() and wbc_account_io()
blkcg: fix gendisk reference leak in blkg_conf_prep()
null_blk: fix use-after-free problem