Create a helper method to remove extents from the CoW fork without
any of the side effects (rmapbt/bmbt updates) of the regular extent
deletion routine. We'll eventually use this to clear out the CoW fork
during ioend processing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Modify the writepage handler to find and convert pending delalloc
extents to real allocations. Furthermore, when we're doing non-cow
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can get
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.
Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate(). In a subsequent
patch, we'll modify the writepage handler to call this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Wire up iomap_begin to detect shared extents and create delayed allocation
extents in the CoW fork:
1) Check if we already have an extent in the COW fork for the area.
If so nothing to do, we can move along.
2) Look up block number for the current extent, and if there is none
it's not shared move along.
3) Unshare the current extent as far as we are going to write into it.
For this we avoid an additional COW fork lookup and use the
information we set aside in step 1) above.
4) Goto 1) unless we've covered the whole range.
Last but not least, this updates the xfs_reflink_reserve_cow_range calling
convention to pass a byte offset and length, as that is what both callers
expect anyway. This patch has been refactored considerably as part of the
iomap transition.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow the creation of delayed allocation extents in the CoW fork. In
a subsequent patch we'll wire up iomap_begin to actually do this via
reflink helper functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only non-rt files can be reflinked, so check that when we load an
inode. Also, don't leak the attr fork if there's a failure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Report the reflink feature in the XFS geometry so that xfs_info and
friends know the filesystem has this feature.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Define all the tracepoints we need to inspect the runtime operation
of reflink/dedupe/copy-on-write.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Return the range of file blocks that bunmapi didn't free. This hint
is used by CoW and reflink to figure out what part of an extent
actually got freed so that it can set up the appropriate atomic
remapping of just the freed range.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Log recovery will iget an inode to replay BUI items and iput the inode
when it's done. Unfortunately, if the inode was unlinked, the iput
will see that i_nlink == 0 and decide to truncate & free the inode,
which prevents us from replaying subsequent BUIs. We can't skip the
BUIs because we have to replay all the redo items to ensure that
atomic operations complete.
Since unlinked inode recovery will reap the inode anyway, we can
safely introduce a new inode flag to indicate that an inode is in this
'unlinked recovery' state and should not be auto-reaped in the
drop_inode path.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Implement deferred versions of the inode block map/unmap functions.
These will be used in subsequent patches to make reflink operations
atomic.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend
BMAPI_REMAP (which means "don't touch the allocator or the quota
accounting") to apply to bunmapi as well. This will be used to
implement the unmap operation, which will be used by swapext.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Teach the bmap routine to know how to map a range of file blocks to a
specific range of physical blocks, instead of simply allocating fresh
blocks. This enables reflink to map a file to blocks that are already
in use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Provide a mechanism for higher levels to create BUI/BUD items, submit
them to the log, and a stub function to deal with recovered BUI items.
These parts will be connected to the rmapbt in a later patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create bmbt update intent/done log items to record redo information in
the log. Because we roll transactions multiple times for reflink
operations, we also have to track the status of the metadata updates
that will be recorded in the post-roll transactions in case we crash
before committing the final transaction. This mechanism enables log
recovery to finish what was already started.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
These functions will be used by the other reflink functions to find
the maximum length of a range of shared blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reduce the max AG usable space size so that we always have space for
the refcount btree root.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Identify refcountbt blocks in the log correctly so that we can
validate them during log recovery.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When we're unmapping blocks from a reflinked file, decrease the
refcount of the affected blocks and free the extents that are no
longer in use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Plumb in the upper level interface to schedule and finish deferred
refcount operations via the deferred ops mechanism.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Provide functions to adjust the reference counts for an extent of
physical blocks stored in the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Provide a mechanism for higher levels to create CUI/CUD items, submit
them to the log, and a stub function to deal with recovered CUI items.
These parts will be connected to the refcountbt in a later patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create refcount update intent/done log items to record redo
information in the log. Because we need to roll transactions between
updating the bmbt mapping and updating the reverse mapping, we also
have to track the status of the metadata updates that will be recorded
in the post-roll transactions, just in case we crash before committing
the final transaction. This mechanism enables log recovery to finish
what was already started.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Implement the generic btree operations required to manipulate refcount
btree blocks. The implementation is similar to the bmapbt, though it
will only allocate and free blocks from the AG.
Since the refcount root and level fields are separate from the
existing roots and levels array, they need a separate logging flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: fix logging of AGF refcount btree fields]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Every time we allocate or free a data extent, we might need to split
the refcount btree. Reserve some blocks in the transaction to handle
this possibility. Even though the deferred refcount code can roll a
transaction to avoid overloading the transaction, we can still exceed
the reservation.
Certain pathological workloads (1k blocks, no cowextsize hint, random
directio writes), cause a perfect storm wherein a refcount adjustment
of a large range of blocks causes full tree splits in two separate
extents in two separate refcount tree blocks; allocating new refcount
tree blocks causes rmap btree splits; and all the allocation activity
causes the freespace btrees to split, blowing the reservation.
(Reproduced by generic/167 over NFS atop XFS)
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick.wong@oracle.com: add commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Modify the growfs code to initialize new refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Since XFS reserves a small amount of space in each AG as the minimum
free space needed for an operation, save some more space in case we
touch the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add new per-AG refcount btree definitions to the per-AG structures.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Define all the tracepoints we need to inspect the refcount btree
runtime operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the size of an inline directory is so small that it doesn't
even cover the required header size, return an error to userspace
instead of ASSERTing and returning 0 like everything's ok.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
After the call to __blkdev_direct_IO the final reference to the file
might have been dropped by aio_complete already, and the call to
file_accessed might cause a use after free.
Instead update the access time before the I/O, similar to how we
update the time stamps before writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.
Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Log recovery has particular rules around buffer submission along with
tricky corner cases where independent transactions can share an LSN. As
such, it can be difficult to follow when/why buffers are submitted
during recovery.
Add a couple tracepoints to post the current LSN of a record when a new
record is being processed and when a buffer is being skipped due to LSN
ordering. Also, update the recover item class to include the LSN of the
current transaction for the item being processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log recovery is currently broken for v5 superblocks in that it never
updates the metadata LSN of buffers written out during recovery. The
metadata LSN is recorded in various bits of metadata to provide recovery
ordering criteria that prevents transient corruption states reported by
buffer write verifiers. Without such ordering logic, buffer updates can
be replayed out of order and lead to false positive transient corruption
states. This is generally not a corruption vector on its own, but
corruption detection shuts down the filesystem and ultimately prevents a
mount if it occurs during log recovery. This requires an xfs_repair run
that clears the log and potentially loses filesystem updates.
This problem is avoided in most cases as metadata writes during normal
filesystem operation update the metadata LSN appropriately. The problem
with log recovery not updating metadata LSNs manifests if the system
happens to crash shortly after log recovery itself. In this scenario, it
is possible for log recovery to complete all metadata I/O such that the
filesystem is consistent. If a crash occurs after that point but before
the log tail is pushed forward by subsequent operations, however, the
next mount performs the same log recovery over again. If a buffer is
updated multiple times in the dirty range of the log, an earlier update
in the log might not be valid based on the current state of the
associated buffer after all of the updates in the log had been replayed
(before the previous crash). If a verifier happens to detect such a
problem, the filesystem claims corruption and immediately shuts down.
This commonly manifests in practice as directory block verifier failures
such as the following, likely due to directory verifiers being
particularly detailed in their checks as compared to most others:
...
Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \
file fs/xfs/libxfs/xfs_dir2_data.c. Caller xfs_dir3_data_verify ...
...
Update log recovery to update the metadata LSN of recovered buffers.
Since metadata LSNs are already updated by write verifer functions via
attached log items, attach a dummy log item to the buffer during
validation and explicitly set the LSN of the current transaction. This
ensures that the metadata LSN of a buffer is updated based on whether
the recovery I/O actually completes, and if so, that subsequent recovery
attempts identify that the buffer is already up to date with respect to
the current transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The log recovery buffer validation function is invoked in cases where a
buffer update may be skipped due to LSN ordering. If the validation
function happens to come across directory conversion situations (e.g., a
dir3 block to data conversion), it may warn about seeing a buffer log
format of one type and a buffer with a magic number of another.
This warning is not valid as the buffer update is ultimately skipped.
This is indicated by a current_lsn of NULLCOMMITLSN provided by the
caller. As such, update xlog_recover_validate_buf_type() to only warn in
such cases when a buffer update is expected.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The current LSN must be available to the buffer validation function to
provide the ability to update the metadata LSN of the buffer. Pass the
current_lsn value down to xlog_recover_validate_buf_type() in
preparation.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The fix to log recovery to update the metadata LSN in recovered buffers
introduces the requirement that a buffer is submitted only once per
current LSN. Log recovery currently submits buffers on transaction
boundaries. This is not sufficient as the abstraction between log
records and transactions allows for various scenarios where multiple
transactions can share the same current LSN. If independent transactions
share an LSN and both modify the same buffer, log recovery can
incorrectly skip updates and leave the filesystem in an inconsisent
state.
In preparation for proper metadata LSN updates during log recovery,
update log recovery to submit buffers for write on LSN change boundaries
rather than transaction boundaries. Explicitly track the current LSN in
a new struct xlog field to handle the various corner cases of when the
current LSN may or may not change.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Recently we've had a number of reports where log recovery on a v5
filesystem has reported corruptions that looked to be caused by
recovery being re-run over the top of an already-recovered
metadata. This has uncovered a bug in recovery (fixed elsewhere)
but the vector that caused this was largely unknown.
A kdump test started tripping over this problem - the system
would be crashed, the kdump kernel and environment would boot and
dump the kernel core image, and then the system would reboot. After
reboot, the root filesystem was triggering log recovery and
corruptions were being detected. The metadumps indicated the above
log recovery issue.
What is happening is that the kdump kernel and environment is
mounting the root device read-only to find the binaries needed to do
it's work. The result of this is that it is running log recovery.
However, because there were unlinked files and EFIs to be processed
by recovery, the completion of phase 1 of log recovery could not
mark the log clean. And because it's a read-only mount, the unmount
process does not write records to the log to mark it clean, either.
Hence on the next mount of the filesystem, log recovery was run
again across all the metadata that had already been recovered and
this is what triggered corruption warnings.
To avoid this problem, we need to ensure that a read-only mount
always updates the log when it completes the second phase of
recovery. We already handle this sort of issue with rw->ro remount
transitions, so the solution is as simple as quiescing the
filesystem at the appropriate time during the mount process. This
results in the log being marked clean so the mount behaviour
recorded in the logs on repeated RO mounts will change (i.e. log
recovery will no longer be run on every mount until a RW mount is
done). This is a user visible change in behaviour, but it is
harmless.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.
Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.
To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.
As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.
The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.
It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
To avoid clearing of capabilities or security related extended
attributes too early, inode_change_ok() will need to take dentry instead
of inode. Propagate dentry down to functions calling inode_change_ok().
This is rather straightforward except for xfs_set_mode() function which
does not have dentry easily available. Luckily that function does not
call inode_change_ok() anyway so we just have to do a little dance with
function prototypes.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2). Fix that.
References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Another users of buffer_heads bytes the dust.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Rename the current function to __xfs_setfilesize and add a non-static
wrapper that also takes care of creating the transaction. This new
helper will be used by the new iomap-based DAX path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We always just read the extent first, and will later lock exlusively
after first dropping the lock in case we actually allocate blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
So far DAX writes inherited the locking from direct I/O writes, but
the direct I/O model of using shared locks for writes is actually
wrong for DAX. For direct I/O we're out of any standards and don't
have to provide the Posix required exclusion between writers, but
for DAX which gets transparently enable on applications without any
knowledge of it we can't simply drop the requirement. Even worse
this only happens for aligned writes and thus doesn't show up for
many typical use cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Currently xfs_iomap_write_delay does up to lookups in the inode
extent tree, which is rather costly especially with the new iomap
based write path and small write sizes.
But it turns out that the low-level xfs_bmap_search_extents gives us
all the information we need in the regular delalloc buffered write
path:
- it will return us an extent covering the block we are looking up
if it exists. In that case we can simply return that extent to
the caller and are done
- it will tell us if we are beyoned the last current allocated
block with an eof return parameter. In that case we can create a
delalloc reservation and use the also returned information about
the last extent in the file as the hint to size our delalloc
reservation.
- it can tell us that we are writing into a hole, but that there is
an extent beyoned this hole. In this case we can create a
delalloc reservation that covers the requested size (possible
capped to the next existing allocation).
All that can be done in one single routine instead of bouncing up
and down a few layers. This reduced the CPU overhead of the block
mapping routines and also simplified the code a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
For long growing file writes we will usually already have the
eofblocks tag set when adding more speculative preallocations. Add
a flag in the inode to allow us to skip the the fairly expensive
AG-wide spinlocks and multiple radix tree operations in that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
And drop the pointless mp argument to xfs_iomap_eof_align_last_fsb,
while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We'll need it earlier in the file soon, so the unchanged function to
the top of xfs_iomap.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
One unfortunate quirk of the reference count and reverse mapping
btrees -- they can expand in size when blocks are written to *other*
allocation groups if, say, one large extent becomes a lot of tiny
extents. Since we don't want to start throwing errors in the middle
of CoWing, we need to reserve some blocks to handle future expansion.
The transaction block reservation counters aren't sufficient here
because we have to have a reserve of blocks in every AG, not just
somewhere in the filesystem.
Therefore, create two per-AG block reservation pools. One feeds the
AGFL so that rmapbt expansion always succeeds, and the other feeds all
other metadata so that refcountbt expansion never fails.
Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG. Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.
In other words, play some accounting tricks in-core to make sure that
we always have blocks available. On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When xfs_defer_finish calls ->finish_item, it's possible that
(refcount) won't be able to finish all the work in a single
transaction. When this happens, the ->finish_item handler should
shorten the log done item's list count, update the work item to
reflect where work should continue, and return -EAGAIN so that
defer_finish knows to retain the pending item on the pending list,
roll the transaction, and restart processing where we left off.
Plumb in the code and document how this mechanism is supposed to work.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Provide a helper method to count the number of blocks in a short form
btree. The refcount and rmap btrees need to know the number of blocks
already in use to set up their per-AG block reservations during mount.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create a helper to generate AG btree height calculator functions.
This will be used (much) later when we get to the refcount btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Remove the xfs_btree_bigkey mess and simply make xfs_btree_key big enough
to hold both keys in-core.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use variable length array declarations for RUI log items,
and replace the open coded sizeof formulae with a single function.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
As it stands today, the "fail immediately" vs. "retry forever"
values for max_retries and retry_timeout_seconds in the xfs metadata
error configurations are not consistent.
A retry_timeout_seconds of 0 means "retry forever," but a
max_retries of 0 means "fail immediately."
retry_timeout_seconds < 0 is disallowed, while max_retries == -1
means "retry forever."
Make this consistent across the error configs, such that a value of
0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
and a value of -1 always means "retry forever."
This makes retry_timeout a signed long to accommodate the -1, even
though it stores jiffies. Given our limit of a 1 day maximum
timeout, this should be sufficient even at much higher HZ values
than we have available today.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent"
changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is
insufficient space in the buffer assuming that setting context->count to -1
would be enough, but all of the ->put_listent callers only check seen_enough.
This results in a failed assertion:
XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175
in insufficient buffer size case.
This is only reproducible with at least 2 xattrs and only when the buffer
gets depleted before the last one.
Furthermore if buffersize is such that it is enough to hold the last xattr's
name, but not enough to hold the sum of preceeding xattr names listxattr won't
fail with ERANGE, but will suceed returning last xattr's name without the
first character. The first character end's up overwriting data stored at
(context->alist - 1).
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
"blocks" should be added back to fdblocks at undo time, not taken
away, i.e. the minus sign should not be used.
This is a regression introduced by commit 0d485ada40 ("xfs: use
generic percpu counters for free block counter"). And it's found by
code inspection, I didn't it in real world, so there's no
reproducer.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Christoph reports slab corruption when a deferred refcount update
aborts during _defer_finish(). The cause of this was broken log item
state tracking in xfs_defer_pending -- upon an abort,
_defer_trans_abort() will call abort_intent on all intent items,
including the ones that have already had a done item attached.
This is incorrect because each intent item has 2 refcount: the first
is released when the intent item is committed to the log; and the
second is released when the _done_ item is committed to the log, or
by the intent creator if there is no done item. In other words, once
we log the done item, responsibility for releasing the intent item's
second refcount is transferred to the done item and /must not/ be
performed by anything else.
The dfp_committed flag should have been tracking whether or not we had
a done item so that _defer_trans_abort could decide if it needs to
abort the intent item, but due to a thinko this was not the case. Rip
it out and track the done item directly so that we do the right thing
w.r.t. intent item freeing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_wait_buftarg() waits for all pending I/O, drains the ioend
completion workqueue and walks the LRU until all buffers in the cache
have been released. This is traditionally an unmount operation` but the
mechanism is also reused during filesystem freeze.
xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
which is intended more for a shutdown sequence in that it indicates to
the queue that new operations are not expected once the drain has begun.
New work jobs after this point result in a WARN_ON_ONCE() and are
otherwise dropped.
With filesystem freeze, however, read operations are allowed and can
proceed during or after the workqueue drain. If such a read occurs
during the drain sequence, the workqueue infrastructure complains about
the queued ioend completion work item and drops it on the floor. As a
result, the buffer remains on the LRU and the freeze never completes.
Despite the fact that the overall buffer cache cleanup is not necessary
during freeze, fix up this operation such that it is safe to invoke
during non-unmount quiesce operations. Replace the drain_workqueue()
call with flush_workqueue(), which runs a similar serialization on
pending workqueue jobs without causing new jobs to be dropped. This is
safe for unmount as unmount independently locks out new operations by
the time xfs_wait_buftarg() is invoked.
cc: <stable@vger.kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
From inspection, the superblock sb_inprogress check is done in the
verifier and triggered only for the primary superblock via a
"bp->b_bn == XFS_SB_DADDR" check.
Unfortunately, the primary superblock is an uncached buffer, and
hence it is configured by xfs_buf_read_uncached() with:
bp->b_bn = XFS_BUF_DADDR_NULL; /* always null for uncached buffers */
And so this check never triggers. Fix it.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If the initial LOOKUP_LE in the simple query range fails to find
anything, we should attempt to increment the btree cursor to see
if there actually /are/ records for what we're trying to find.
Without this patch, a bnobt range query of (0, $agsize) returns
no results because the leftmost record never has a startblock
of zero.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We only need the record's high key for the first record that we look
at; for all records, we /definitely/ need the regular record key.
Therefore, fix how the simple range query function gets its keys.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're logging the last non-spare field in the AGF, we don't
need to log the spare fields, so plumb in a new AGF logging flag
to help us avoid that.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since the kernel doesn't currently support the realtime rmapbt,
don't allow such filesystems to be mounted. Support will appear
in a future release.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If the caller passes in a cursor to a zero-height btree (which is
impossible), we never set block to anything but NULL, which causes the
later dereference of it to crash. Instead, just return -EFSCORRUPTED.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we're really tight on space, xfs_alloc_ag_vextent_small() can
allocate a block from the AGFL and give it to the caller. Since the
caller is never the AGFL-fixing method, we must remove the OWN_AG
reverse mapping because it will clash with whatever rmap the caller
wants to set up. This bug was discovered by running generic/299
repeatedly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use a special read-only iomap_ops implementation to support fiemap on
the attr fork.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We'll never get nimap == 0 for a successful return from xfs_bmapi_read,
so don't try to handle it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Track the number of blocks used for the rmapbt in the AGF. When we
get to the AG reservation code we need this counter to quickly
make our reservation during mount.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we do DAX IO, we try to invalidate the entire page cache held
on the file. This is incorrect as it will trash the entire mapping
tree that now tracks dirty state in exceptional entries in the radix
tree slots.
What we are trying to do is remove cached pages (e.g from reads
into holes) that sit in the radix tree over the range we are about
to write to. Hence we should just limit the invalidation to the
range we are about to overwrite.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The space reservations was without an explaination in commit
"Add error reporting calls in error paths that return EFSCORRUPTED"
back in 2003. There is no reason to reserve disk blocks in the
transaction when allocating blocks for delalloc space as we already
reserved the space when creating the delalloc extent.
With this fix we stop running out of the reserved pool in
generic/229, which has happened for long time with small blocksize
file systems, and has increased in severity with the new buffered
write path.
[ dchinner: we still need to pass the block reservation into
xfs_bmapi_write() to ensure we don't deadlock during AG selection.
See commit dbd5c8c ("xfs: pass total block res. as total
xfs_bmapi_write() parameter") for more details on why this is
necessary. ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The buffer I/O accounting mechanism tracks async buffers under I/O. As
an optimization, the buffer I/O count is incremented only once on the
first async I/O for a given hold cycle of a buffer and decremented once
the buffer is released to the LRU (or freed).
xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but
we have one or two corner cases where a buffer can be submitted for I/O
multiple times via different methods in a single hold cycle. If an async
I/O occurs first, the I/O count is incremented. If a sync I/O occurs
before the hold count drops, XBF_ASYNC is cleared by the time the I/O
count is decremented.
Remove the async assert check from xfs_buf_ioacct_dec() as this is a
perfectly valid scenario. For the purposes of I/O accounting, we really
only care about the buffer async state at I/O submission time.
Discovered-and-analyzed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In most cases, EPERM is returned on immutable inode, and there're only a
few places returning EACCES. I noticed this when running LTP on
overlayfs, setxattr03 failed due to unexpected EACCES on immutable
inode.
So converting all EACCES to EPERM on immutable inode.
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXpRdVAAoJEK3oKUf0dfod2tkP/24f1Znl9OQEPHoSZty9nXF0
dSjOzE2lHbR4xjjuYjbn1siFnIX0A5nPPqleBYmt3gatiO+24vE1BiNWjM6Y/y7r
3KHENRqmfSj26ha6wl/TUNaKnuFooBcQ0BaHI1IExFROitOSvZgPJPSrk29AH/Er
OVJkaoi3N3o9mrfUpF9/M55Yi/DhQiPBYxkqcXvaqcakbL91EIj5TLZ72MJqgfje
d6og33zxb21EDx9eIJEA0cWX4MLO2UQqFAuiJLzk2RkSAm6vRjbRJyYGG9jv81tP
9ZX1gAw47v0qk3nPVyAgbi862ukYCYzmr1g2b4S2b0UKLXxQb8Fw8D2mRbFXl2wg
wq0nKLg9jwsd8Yo7k8qOrUI9nl/E9Ytmj8t92Y49XvPjtsVFZREoCw3ojyjmlyZA
9BywL5BzMHF6SsXe6LBGJpoebrxCnq5176FREBnpmH7UHM0BcWa4YSekQShwg3DW
PFlBOxk5saz4Ktr5V3YUY+G6XgZ/AXWKlDox5+dESLIOgG0hyzbiVbPNSTQgDrnR
m9yUJPef1NQj2JWSZbqKn7FSZDO6/IT2aeokn1KuoaDJww5HC80juyB1VThmpZnl
QJGN6nmsYDVCLYjbT6scAzyGMYw9ZVhTM7eEk3kqAtCBf/nEyqJM+H0HYUDjfg9B
cG5cRtZNDDkc30lFezJX
=nXKv
-----END PGP SIGNATURE-----
Merge tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull more xfs updates from Dave Chinner:
"This is the second part of the XFS updates for this merge cycle, and
contains the new reverse block mapping feature for XFS.
Reverse mapping allows us to track the owner of a specific block on
disk precisely. It is implemented as a set of btrees (one per
allocation group) that track the owners of allocated extents.
Effectively it is a "used space tree" that is updated when we allocate
or free extents. i.e. it is coherent with the free space btrees we
already maintain and never overlaps with them.
This reverse mapping infrastructure is the building block of several
upcoming features - reflink, copy-on-write data, dedupe, online
metadata and data scrubbing, highly accurate bad sector/data loss
reporting to users, and significantly improved reconstruction of
damaged and corrupted filesystems. There's a lot of new stuff coming
along in the next couple of cycles,a nd it all builds in the rmap
infrastructure.
As such, it's a huge chunk of new code with new on-disk format
features and internal infrastructure. It warns at mount time as an
experimental feature and that it may eat data (as we do with all new
on-disk features until they stabilise). We have not released
userspace suport for it yet - userspace support currently requires
download from Darrick's xfsprogs repo and build from source, so the
access to this feature is really developer/tester only at this point.
Initial userspace support will be released at the same time kernel
with this code in it is released.
The new rmap enabled code regresses 3 xfstests - all are ENOSPC
related corner cases, one of which Darrick posted a fix for a few
hours ago. The other two are fixed by infrastructure that is part of
the upcoming reflink patchset. This new ENOSPC infrastructure
requires a on-disk format tweak required to keep mount times in
check - we need to keep an on-disk count of allocated rmapbt blocks so
we don't have to scan the entire btrees at mount time to count them.
This is currently being tested and will be part of the fixes sent in
the next week or two so users will not be exposed to this change"
* tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits)
xfs: move (and rename) the deferred bmap-free tracepoints
xfs: collapse single use static functions
xfs: remove unnecessary parentheses from log redo item recovery functions
xfs: remove the extents array from the rmap update done log item
xfs: in btree_lshift, only allocate temporary cursor when needed
xfs: remove unnecesary lshift/rshift key initialization
xfs: remove the get*keys and update_keys btree ops pointers
xfs: enable the rmap btree functionality
xfs: don't update rmapbt when fixing agfl
xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled
xfs: add rmap btree block detection to log recovery
xfs: add rmap btree geometry feature flag
xfs: propagate bmap updates to rmapbt
xfs: enable the xfs_defer mechanism to process rmaps to update
xfs: log rmap intent items
xfs: create rmap update intent log items
xfs: add rmap btree insert and delete helpers
xfs: convert unwritten status of reverse mappings
xfs: remove an extent from the rmap btree
xfs: add an extent to the rmap btree
...
Trond made a change to the server's tcp logic that allows a fast
client to better take advantage of high bandwidth networks, but
may increase the risk that a single client could starve other
clients; a new sunrpc.svc_rpc_per_connection_limit parameter
should help mitigate this in the (hopefully unlikely) event this
becomes a problem in practice.
Tom Haynes added a minimal flex-layout pnfs server, which is of
no use in production for now--don't build it unless you're doing
client testing or further server development.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXo7HNAAoJECebzXlCjuG+zqUP/RxO5jZjBhNI8/ayGdDW/Jnq
s0Fu6B+aNRV3GnugmIeI4tWNGnPyERNzFtjLKlnwaasz/oW4qBLqGbNUWC5xKARS
erODs0hM/1aCYWwNBEc5qXP2u23HrWVuQ+B5fg42ACyliKFGq5faDRmf6XGU/1kB
8unXGWPAiLiNZD/bWP91fYhThlLgpfHBFZ7M3G2IqmzWZTSELPzwp1bpRWt7yWQQ
z1oYtXToycbwz3yPVk3cXtaoqpjDUVZf2Guqgqi1BwEyEtYOSaYo1VHNsKDf4OId
QXQh64AqIK4uszpvtNhvsEaAECN7IiB+N4n2laFiQVmAf8Hfl3AnV/gKeD4lKmTj
TY6knnjZO/X88wn80MB7JR1H1WXvvzNIHwNR95qfub/lVKX+C+0AORRtYhi5F9ec
ixNs/z1ImLpYxAjiP/T5anD5xcX2S+LcSv7kRjhEufqNFtRAIqBZO9ZWbCdXAAyE
tcH9Cru4jeIlFO/y6O61EVrn9FFj2+0uu+7urefNRQ2Y9pmKeculJrLF6WO8WHms
4IzXMmjZK+358RVdX2Ji5Hw6rBDvfgP+LjB8Jn8CeIiNRONEjT+2/AYQcfk61aLb
INUbk6G6Vfd8iMO4aaRI9tmW+vKCOZa0IbnrNE1oHKp/AKBDr25i5YPSCsnl3r4Q
iR7rRe9FIkfqBpbfjVFv
=mo54
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Trond made a change to the server's tcp logic that allows a fast
client to better take advantage of high bandwidth networks, but may
increase the risk that a single client could starve other clients;
a new sunrpc.svc_rpc_per_connection_limit parameter should help
mitigate this in the (hopefully unlikely) event this becomes a
problem in practice.
- Tom Haynes added a minimal flex-layout pnfs server, which is of no
use in production for now--don't build it unless you're doing
client testing or further server development"
* tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd: remove some dead code in nfsd_create_locked()
nfsd: drop unnecessary MAY_EXEC check from create
nfsd: clean up bad-type check in nfsd_create_locked
nfsd: remove unnecessary positive-dentry check
nfsd: reorganize nfsd_create
nfsd: check d_can_lookup in fh_verify of directories
nfsd: remove redundant zero-length check from create
nfsd: Make creates return EEXIST instead of EACCES
SUNRPC: Detect immediate closure of accepted sockets
SUNRPC: accept() may return sockets that are still in SYN_RECV
nfsd: allow nfsd to advertise multiple layout types
nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock
nfsd/blocklayout: Make sure calculate signature/designator length aligned
xfs: abstract block export operations from nfsd layouts
SUNRPC: Remove unused callback xpo_adjust_wspace()
SUNRPC: Change TCP socket space reservation
SUNRPC: Add a server side per-connection limit
SUNRPC: Micro optimisation for svc_data_ready
SUNRPC: Call the default socket callbacks instead of open coding
SUNRPC: lock the socket while detaching it
...
Rename the deferred bmap-free to extent_free and make them only
trigger when we're really running deferred ops.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Nothing ever uses the extent array in the rmap update done redo
item, so remove it before it is fixed in the on-disk log format.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We only need the temporary cursor in _btree_lshift if we're shifting
in an overlapped btree. Therefore, factor that into a single block
of code so we avoid unnecessary cursor duplication.
Also fix use of the wrong cursor when checking for corruption in
xfs_btree_rshift().
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In the lshift/rshift functions we don't use the key variable for
anything now, so remove the variable and its initializer. The
update_keys functions figure out the key for a block on their own.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These are internal btree functions; we don't need them to be
dispatched via function pointers. Make them static again and
just check the overlapped flag to figure out what we need to
do. The strategy behind this patch was suggested by Christoph.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
Add the feature flag to the supported matrix so that the kernel can
mount and use rmap btree enabled filesystems
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick.wong@oracle.com: move the experimental tag]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates
when fixing the AG freelist. xfs_repair needs this during phase 5
to be able to adjust the freelist while it's reconstructing the rmap
btree; the missing entries will be added back at the very end of
phase 5 once the AGFL contents settle down.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Swapping extents between two inodes requires the owner to be updated
in the rmap tree for all the extents that are swapped. This code
does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until
support has been implemented. This will need to be done before the
rmap btree code can have the experimental tag removed.
This functionality will be provided in a (much) later patch, using
some of the reflink deferred block remapping functionality to
accomlish extent swapping with rmap updates.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
So such blocks can be correctly identified and have their operations
structures attached to validate recovery has not resulted in a
correct block.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Originally-From: Dave Chinner <dchinner@redhat.com>
So xfs_info and other userspace utilities know the filesystem is
using this feature.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we map, unmap, or convert an extent in a file's data or attr
fork, schedule a respective update in the rmapbt. Previous versions
of this patch required a 1:1 correspondence between bmap and rmap,
but this is no longer true as we now have ability to make interval
queries against the rmapbt.
We use the deferred operations code to handle redo operations
atomically and deadlock free. This plumbs in all five rmap actions
(map, unmap, convert extent, alloc, free); we'll use the first three
now for file data, and reflink will want the last two. We also add
an error injection site to test log recovery.
Finally, we need to fix the bmap shift extent code to adjust the
rmaps correctly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>