Make the flag match the mount option and usage.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rework xfs_parseargs to fill out the default value and then parse the
option directly into the mount structure, similar to what we do for
other updates, and open code the now trivial updates based on on the
on-disk superblock directly into xfs_mountfs.
Note that this change rejects the allocsize=0 mount option that has been
documented as invalid for a long time instead of just ignoring it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the allocsize name to match the mount option and usage instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
m_readio_blocks is entirely unused, and m_readio_blocks is only used in
xfs_stat_blksize in a max statements that is a no-op as it always has
the same value as m_writeio_log.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The -o wsync allocsize overwrite overwrite was part of a special hack
for NFSv2 servers in IRIX and has no real purpose in modern Linux, so
remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move xfs_preferred_iosize to xfs_iops.c, unobsfucate it and also handle
the realtime special case in the helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no real need for the local variables here - either they
are applied to the mount structure, or if the noalign mount option
is set the mount will fail entirely if either is set. Removing
them helps cleaning up the mount API conversion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
It appears the biosize mount option hasn't been documented as a valid
option since 2005, remove it.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Stop using the deprecated bio_set_op_attrs helper, and use a single
argument to xfs_buf_ioapply_map for the operation and flags.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_iread_extents open-codes everything in xfs_btree_visit_blocks, so
refactor the btree helper to be able to iterate only the records on
level 0, then port iread_extents to use it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, this function open-codes walking a bmbt to count the extents
and blocks in use by a particular inode fork. Since we now have a
function to tally extent records from the incore extent tree and a btree
helper to count every block in a btree, replace all that with calls to
the helpers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are a few places where we return -EIO instead of -EFSCORRUPTED
when we find corrupt metadata. Fix those places.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Actually call namecheck on directory entry names before we hand them
over to userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Actually call namecheck on attribute names before we hand them over to
userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Add missing structure checks in the attribute leaf verifier.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Remove xfs_zero_file_space and reorganize xfs_file_fallocate so that a
single call to xfs_alloc_file_space covers all modes that preallocate
blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we always have to write out of place preallocating blocks is
pointless. We already check for this in the normal falloc path, but
the check was missig in the legacy ALLOCSP path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These use the same scheme as the pre-existing mapping of the XFS
RESVP ioctls to ->falloc, so just extend it and remove the XFS
implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: fix compile error on s390]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These ioctls are implemented by the VFS and mapped to ->fallocate now,
so this code won't ever be reached.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Add a new xfs_inode_buftarg helper that gets the data I/O buftarg for a
given inode. Replace the existing xfs_find_bdev_for_inode and
xfs_find_daxdev_for_inode helpers with this new general one and cleanup
some of the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Flags passed to Q_XQUOTARM were not sanity checked for invalid values.
Fix that.
Fixes: 9da93f9b7c ("xfs: fix Q_XQUOTARM ioctl")
Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_pnfs.c file is missing an include of xfs_pnfs.h to
add the prototypes of the functions it exports. Include this
file to fix the following sparse warnings:
fs/xfs/xfs_pnfs.c:27:1: warning: symbol 'xfs_break_leased_layouts' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:52:1: warning: symbol 'xfs_fs_get_uuid' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:77:1: warning: symbol 'xfs_fs_map_blocks' was not declared. Should it be static?
fs/xfs/xfs_pnfs.c:226:1: warning: symbol 'xfs_fs_commit_blocks' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_bmapi_write() takes a total block requirement parameter that is
passed down to the block allocation code and is used to specify the
total block requirement of the associated transaction. This is used
to try and select an AG that can not only satisfy the requested
extent allocation, but can also accommodate subsequent allocations
that might be required to complete the transaction. For example,
additional bmbt block allocations may be required on insertion of
the resulting extent to an inode data fork.
While it's important for callers to calculate and reserve such extra
blocks in the transaction, it is not necessary to pass the total
value to xfs_bmapi_write() in all cases. The latter automatically
sets minleft to ensure that sufficient free blocks remain after the
allocation attempt to expand the format of the associated inode
(i.e., such as extent to btree conversion, btree splits, etc).
Therefore, any callers that pass a total block requirement of the
bmap mapping length plus worst case bmbt expansion essentially
specify the additional reservation requirement twice. These callers
can pass a total of zero to rely on the bmapi minleft policy.
Beyond being superfluous, the primary motivation for this change is
that the total reservation logic in the bmbt code is dubious in
scenarios where minlen < maxlen and a maxlen extent cannot be
allocated (which is more common for data extent allocations where
contiguity is not required). The total value is based on maxlen in
the xfs_bmapi_write() caller. If the bmbt code falls back to an
allocation between minlen and maxlen, that allocation will not
succeed until total is reset to minlen, which essentially throws
away any additional reservation included in total by the caller. In
addition, the total value is not reset until after alignment is
dropped, which means that such callers drop alignment far too
aggressively than necessary.
Update all callers of xfs_bmapi_write() that pass a total block
value of the mapping length plus bmbt reservation to instead pass
zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
requirement. This trades off slightly less conservative AG selection
for the ability to preserve alignment in more scenarios.
xfs_bmapi_write() callers that incorporate unrelated or additional
reservations in total beyond what is already included in minleft
must continue to use the former.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cap longest extent to the largest we can allocate based on limits
calculated at mount time. Dynamic state (such as finobt blocks)
can result in the longest free extent exceeding the size we can
allocate, and that results in failure to align full AG allocations
when the AG is empty.
Result:
xfs_io-4413 [003] 426.412459: xfs_alloc_vextent_loopfailed: dev 8:96 agno 0 agbno 32 minlen 243968 maxlen 244000 mod 0 prod 1 minleft 1 total 262148 alignment 32 minalignslop 0 len 0 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x5 firstblock 0xffffffffffffffff
minlen and maxlen are now separated by the alignment size, and
allocation fails because args.total > free space in the AG.
[bfoster: Added xfs_bmap_btalloc() changes.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_bumplink() call has set the inode log fieldmask XFS_ILOG_CORE,
so the next xfs_trans_log_inode() call is not necessary.
Signed-off-by: kaixuxia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Only bail out once we know that a COW allocation is actually required,
similar to how we handle normal data fork allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move more checks into the helpers that determine if we need a COW
operation or allocation and split the return path for when an existing
data for allocation has been found versus a new allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Renaming whichfork to allocfork in xfs_buffered_write_iomap_begin makes
the usage of this variable a little more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of lots of magic conditionals in the main write_begin
handler this make the intent very clear. Thing will become even
better once we support delayed allocations for extent size hints
and realtime allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move xfs_file_iomap_begin_delay near the end of the file next to the
other iomap functions to prepare for additional refactoring.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Start untangling xfs_file_iomap_begin by splitting out the read-only
case into its own set of iomap_ops with a very simply iomap_begin
helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We have lots of places that want to calculate the final fsb for
a offset + count in bytes and check that the result fits into
s_maxbytes. Factor out a helper for that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Replace our local hacks to report the source block in the main iomap
with the proper scrmap reporting.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Rejuggle the return path to prepare for filling out a source iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_reflink_allocate_cow consumes the source data fork imap, and
potentially returns the COW fork imap. Split the arguments in two
to clear up the calling conventions and to prepare for returning
a source iomap from ->iomap_begin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that xfs_file_unshare is not completely dumb we can just call it
directly without iterating the extent and reflink btrees ourselves.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no reason not to punch out stale delalloc blocks for zeroing
operations, as they otherwise behave exactly like normal writes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[commit message is verbose for discussion purposes - will trim it
down later. Some questions about implementation details at the end.]
Zorro Lang recently ran a new test to stress single inode extent
counts now that they are no longer limited by memory allocation.
The test was simply:
# xfs_io -f -c "falloc 0 40t" /mnt/scratch/big-file
# ~/src/xfstests-dev/punch-alternating /mnt/scratch/big-file
This test uncovered a problem where the hole punching operation
appeared to finish with no error, but apparently only created 268M
extents instead of the 10 billion it was supposed to.
Further, trying to punch out extents that should have been present
resulted in success, but no change in the extent count. It looked
like a silent failure.
While running the test and observing the behaviour in real time,
I observed the extent coutn growing at ~2M extents/minute, and saw
this after about an hour:
# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next ; \
> sleep 60 ; \
> xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 127657993
fsxattr.nextents = 129683339
#
And a few minutes later this:
# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4177861124
#
Ah, what? Where did that 4 billion extra extents suddenly come from?
Stop the workload, unmount, mount:
# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 166044375
#
And it's back at the expected number. i.e. the extent count is
correct on disk, but it's screwed up in memory. I loaded up the
extent list, and immediately:
# xfs_io -f -c "stat" /mnt/scratch/big-file |grep next
fsxattr.nextents = 4192576215
#
It's bad again. So, where does that number come from?
xfs_fill_fsxattr():
if (ip->i_df.if_flags & XFS_IFEXTENTS)
fa->fsx_nextents = xfs_iext_count(&ip->i_df);
else
fa->fsx_nextents = ip->i_d.di_nextents;
And that's the behaviour I just saw in a nutshell. The on disk count
is correct, but once the tree is loaded into memory, it goes whacky.
Clearly there's something wrong with xfs_iext_count():
inline xfs_extnum_t xfs_iext_count(struct xfs_ifork *ifp)
{
return ifp->if_bytes / sizeof(struct xfs_iext_rec);
}
Simple enough, but 134M extents is 2**27, and that's right about
where things went wrong. A struct xfs_iext_rec is 16 bytes in size,
which means 2**27 * 2**4 = 2**31 and we're right on target for an
integer overflow. And, sure enough:
struct xfs_ifork {
int if_bytes; /* bytes in if_u1 */
....
Once we get 2**27 extents in a file, we overflow if_bytes and the
in-core extent count goes wrong. And when we reach 2**28 extents,
if_bytes wraps back to zero and things really start to go wrong
there. This is where the silent failure comes from - only the first
2**28 extents can be looked up directly due to the overflow, all the
extents above this index wrap back to somewhere in the first 2**28
extents. Hence with a regular pattern, trying to punch a hole in the
range that didn't have holes mapped to a hole in the first 2**28
extents and so "succeeded" without changing anything. Hence "silent
failure"...
Fix this by converting if_bytes to a int64_t and converting all the
index variables and size calculations to use int64_t types to avoid
overflows in future. Signed integers are still used to enable easy
detection of extent count underflows. This enables scalability of
extent counts to the limits of the on-disk format - MAXEXTNUM
(2**31) extents.
Current testing is at over 500M extents and still going:
fsxattr.nextents = 517310478
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
XLOG_STATE_DO_CALLBACK is only entered through XLOG_STATE_DONE_SYNC
and just used in a single debug check. Remove the flag and thus
simplify the calling conventions for xlog_state_do_callback and
xlog_state_iodone_process_iclog.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
ic_state really is a set of different states, even if the values are
encoded as non-conflicting bits and we sometimes use logical and
operations to check for them. Switch all comparisms to check for
exact values (and use switch statements in a few places to make it
more clear) and turn the values into an implicitly enumerated enum
type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
XFSERRORDEBUG is never set and the code isn't all that useful, so remove
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
All but one caller of xlog_state_release_iclog hold l_icloglock and need
to drop and reacquire it to call xlog_state_release_iclog. Switch the
xlog_state_release_iclog calling conventions to expect the lock to be
held, and open code the logic (using a shared helper) in the only
remaining caller that does not have the lock (and where not holding it
is a nice performance optimization). Also move the refactored code to
require the least amount of forward declarations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor whitespace cleanup]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
This will allow optimizing various locking cycles in the following
patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
ic_io_size is only used inside xlog_write_iclog, where we can just use
the count parameter intead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
xlog_write_iclog expects a bool for the second argument. While any
non-0 value happens to work fine this makes all calls consistent.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The near mode fallback algorithm consists of a left/right scan of
the bnobt. This algorithm has very poor breakdown characteristics
under worst case free space fragmentation conditions. If a suitable
extent is far enough from the locality hint, each allocation may
scan most or all of the bnobt before it completes. This causes
pathological behavior and extremely high allocation latencies.
While locality is important to near mode allocations, it is not so
important as to incur pathological allocation latency to provide the
asolute best available locality for every allocation. If the
allocation is large enough or far enough away, there is a point of
diminishing returns. As such, we can bound the overall operation by
including an iterative cntbt lookup in the broader search. The cntbt
lookup is optimized to immediately find the extent with best
locality for the given size on each iteration. Since the cntbt is
indexed by extent size, the lookup repeats with a variably
aggressive increasing search key size until it runs off the edge of
the tree.
This approach provides a natural balance between the two algorithms
for various situations. For example, the bnobt scan is able to
satisfy smaller allocations such as for inode chunks or btree blocks
more quickly where the cntbt search may have to search through a
large set of extent sizes when the search key starts off small
relative to the largest extent in the tree. On the other hand, the
cntbt search more deterministically covers the set of suitable
extents for larger data extent allocation requests that the bnobt
scan may have to search the entire tree to locate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Lift the btree fixup path into a helper function.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In preparation to enhance the near mode allocation bnobt scan algorithm, lift
it into a separate function. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The bnobt "find best" helper implements a simple btree walker
function. This general pattern, or a subset thereof, is reused in
various parts of a near mode allocation operation. For example, the
bnobt left/right scans are each iterative btree walks along with the
cntbt lastblock scan.
Rework this function into a generic btree walker, add a couple
parameters to control termination behavior from various contexts and
reuse it where applicable.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Both algorithms duplicate the same btree allocation code. Eliminate
the duplication and reuse the fallback algorithm codepath.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The near mode bnobt scan searches left and right in the bnobt
looking for the closest free extent to the allocation hint that
satisfies minlen. Once such an extent is found, the left/right
search terminates, we search one more time in the opposite direction
and finish the allocation with the best overall extent.
The left/right and find best searches are currently controlled via a
combination of cursor state and local variables. Clean up this code
and prepare for further improvements to the near mode fallback
algorithm by reusing the allocation cursor best extent tracking
mechanism. Update the tracking logic to deactivate bnobt cursors
when out of allocation range and replace open-coded extent checks to
calls to the common helper. In doing so, rename some misnamed local
variables in the top-level near mode allocation function.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The cntbt lastblock scan checks the size, alignment, locality, etc.
of each free extent in the block and compares it with the current
best candidate. This logic will be reused by the upcoming optimized
cntbt algorithm, so refactor it into a separate helper. Note that
acur->diff is now initialized to -1 (unsigned) instead of 0 to
support the more granular comparison logic in the new helper.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If the size lookup lands in the last block of the by-size btree, the
near mode algorithm scans the entire block for the extent with best
available locality. In preparation for similar best available
extent tracking across both btrees, extend the allocation cursor
with best extent data and lift the associated state from the cntbt
last block scan code. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Extend the allocation cursor to track extent busy state for an
allocation attempt. No functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce a new allocation cursor data structure to encapsulate the
various states and structures used to perform an extent allocation.
This structure will eventually be used to track overall allocation
state across different search algorithms on both free space btrees.
To start, include the three btree cursors (one for the cntbt and two
for the bnobt left/right search) used by the near mode allocation
algorithm and refactor the cursor setup and teardown code into
helpers. This slightly changes cursor memory allocation patterns,
but otherwise makes no functional changes to the allocation
algorithm.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix sparse complaints]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The upcoming allocation algorithm update searches multiple
allocation btree cursors concurrently. As such, it requires an
active state to track when a particular cursor should continue
searching. While active state will be modified based on higher level
logic, we can define base functionality based on the result of
allocation btree lookups.
Define an active flag in the private area of the btree cursor.
Update it based on the result of lookups in the existing allocation
btree helpers. Finally, provide a new helper to query the current
state.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no point in applying extent size hints for always COW inodes,
as we would just have to COW any extra allocation beyond the data
actually written.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit d03a2f1b9f ("xfs: include WARN, REPAIR build options in
XFS_BUILD_OPTIONS"), Eric pointed out that the XFS_BUILD_OPTIONS string,
shown at module init time and in modinfo output, does not currently
include all available build options. So, he added in CONFIG_XFS_WARN and
CONFIG_XFS_REPAIR. However, this is not enough, add in CONFIG_XFS_QUOTA
and CONFIG_XFS_ASSERT_FATAL.
Signed-off-by: yu kuai <yukuai3@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The srcmap is used to identify where the read is to be performed from.
It is passed to ->iomap_begin, which can fill it in if we need to read
data for partially written blocks from a different location than the
write target. The srcmap is only supported for buffered writes so far.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as
srcmap if not set, adjust length down to srcmap end as well]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Use the existing iomap write_begin code to read the pages unshared
by iomap_file_unshare. That avoids the extra ->readpage call and
extent tree lookup currently done by read_mapping_page.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
That keeps the function a little easier to understand, and easier to
modify for pending enhancements.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_file_dirty is used to unshare reflink blocks. Rename the function
to xfs_file_unshare to better document that purpose, and skip iomaps
that are not shared and don't need zeroing. This will allow to simplify
the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All callers pass AOP_FLAG_NOFS, so lift that flag to iomap_write_begin
to allow reusing the flags arguments for an internal flags namespace
soon. Also remove the local index variable that is only used once.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If we encounter an IO error during writeback, log the inode, offset, and
sector number of the failure, instead of forcing the user to do some
sort of reverse mapping to figure out which file is affected.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No need to pass the full bio_vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Move the initialization of ia and ib to the declaration line and remove
a superflous else.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that all the writepage code is in the iomap code there is no
need to keep this structure public.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
And inline mapping should never mark the page dirty and thus never end up
in writepages. Add a check for that condition and warn if it happens.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Take the xfs writeback code and move it to fs/iomap. A new structure
with three methods is added as the abstraction from the generic writeback
code to the file system. These methods are used to map blocks, submit an
ioend, and cancel a page that encountered an error before it was added to
an ioend.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: rename ->submit_ioend to ->prepare_ioend to clarify what it
does]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Lift the xfs code for tracing address space operations to the iomap
layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
File systems like gfs2 don't support delayed allocations or unwritten
extents and thus allocate normal mapped blocks to fill holes. To
cover the case of such file systems allocating new blocks to fill holes
also zero out mapped blocks with the new flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In preparation for moving the writeback code to iomap.c, replace the
XFS-specific COW fork concept with the iomap IOMAP_F_SHARED flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In preparation for moving the ioend structure to common code we need
to get rid of the xfs-specific xfs_trans type. Just make it a file
system private void pointer instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Introduce two nicely abstracted helper, which can be moved to the iomap
code later. Also use list_first_entry_or_null to simplify the code a
bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In preparation for moving the XFS writeback code to fs/iomap.c, switch
it to use struct iomap instead of the XFS-specific struct xfs_bmbt_irec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Don't set IOMAP_F_NEW if we COW over an existing allocated range, as
these aren't strictly new allocations. This is required to be able to
use IOMAP_F_NEW to zero newly allocated blocks, which is required for
the iomap code to fully support file systems that don't do delayed
allocations or use unwritten extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Currently we don't overwrite the flags field in the iomap in
xfs_bmbt_to_iomap. This works fine with 0-initialized iomaps on stack,
but is harmful once we want to be able to reuse an iomap in the
writeback code. Replace the shared parameter with a set of initial
flags an thus ensures the flags field is always reinitialized.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.
However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there is IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion
when O_DSYNC writes are being used and the hardware supports FUA.
Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync()
to flush the inode size update to stable storage correctly.
Fixes: 3460cac1ca ("iomap: Use FUA for pure data O_DSYNC DIO writes")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: removed the ext4 part; they'll handle it separately]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use iomap_dio_rw() to wait for unaligned direct IO instead of opencoding
the wait.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Filesystems do not support doing IO as asynchronous in some cases. For
example in case of unaligned writes or in case file size needs to be
extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
in such cases, add argument to iomap_dio_rw() which makes the function
wait for IO completion. This also results in executing
iomap_dio_complete() inline in iomap_dio_rw() providing its return value
to the caller as for ordinary sync IO.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
- Removed locked down from tracefs itself and moved it to the trace
directory. Having the open functions there do the lockdown checks.
- Fixed a few races with opening an instance file and the instance being
deleted (Discovered during the locked down updates). Kept separate
from the clean up code such that they can be backported to stable
easier.
- Cleaned up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.
- Fixed a regression in the record mcount code.
- Small hw_lat detector tracer fixes.
- A trace_pipe read fix due to not initializing trace_seq.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXaNhphQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quDIAP4v08ARNdIh+r+c4AOBm3xsOuE/d9GB
I56ydnskm+x2JQD6Ap9ivXe9yDBIErFeHNtCoq7pM8YDI4YoYIB30N0GfwM=
=7oAu
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few tracing fixes:
- Remove lockdown from tracefs itself and moved it to the trace
directory. Have the open functions there do the lockdown checks.
- Fix a few races with opening an instance file and the instance
being deleted (Discovered during the lockdown updates). Kept
separate from the clean up code such that they can be backported to
stable easier.
- Clean up and consolidated the checks done when opening a trace
file, as there were multiple checks that need to be done, and it
did not make sense having them done in each open instance.
- Fix a regression in the record mcount code.
- Small hw_lat detector tracer fixes.
- A trace_pipe read fix due to not initializing trace_seq"
* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
tracing/hwlat: Report total time spent in all NMIs during the sample
recordmcount: Fix nop_mcount() function
tracing: Do not create tracefs files if tracefs lockdown is in effect
tracing: Add locked_down checks to the open calls of files created for tracefs
tracing: Add tracing_check_open_get_tr()
tracing: Have trace events system open call tracing_open_generic_tr()
tracing: Get trace_array reference for available_tracers files
ftrace: Get a reference counter for the trace_array on filter files
tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2iBvAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvokEACaUozc0Y2L99CqJcRtoH8cwIAimjKTFSmi
UKijO/w11uc9xCIOospkM8iseRbpP/oCIwfYyVROi4LniPB1TSyQJoLF/MELeqIi
lOSg8RrXzC22h5+QG3BR5VO55me3MwIXmJLMMWXbh7FAuUg+wepVwHwzZBurd1Pz
C1wG8NU758AmqJdSGclUbdLLwTBkR+fXoi2EHfdbsBMqGmdZ5VRzCLx70grDcxq2
S9gt+2+diME6o6p1wko+m9pdhUdEy7epPE50K4SA+DU06I77OZpTdSAPilPyQHei
+3HDslzjCyos5/Ay+7hWl42jvyZ+x6GI5VcDLMTcMZsmxLy/WObDB4zcAixJrDV0
c7XWnwCCs2wMXYMsu0ASbkU32mcQ/Ik8CrX0W3/tP1aZBW2LZnOHvXYbybzpbgbo
QY47t3jz21uzF5kplBGxqaY1sPmLxp5V0vQmj9eN+Y7T+hiR/UgWKO/5B4CGNJe0
BQkXNl5cZAtqxmSE8AxlpOla7z6beZUljdhFc7K8cMDdhoatRgm2MBh2G9BB1ayo
VwHc1qLcvInpQrGiC4l5H/F8jrEs/PglknFSs1wlsav1gXA69J0cHA7LY6Czweaw
mkiKr7jA6duBBn6MRUThF5sBMFWYuxp3B/BA4wfDJsSfCmyhWN+OZlsDaEVrzk7o
/XK2Ew9R2Q==
=PUeL
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20191012' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Single small fix for a regression in the sequence logic for linked
commands"
* tag 'for-linus-20191012' of git://git.kernel.dk/linux-block:
io_uring: fix sequence logic for timeout requests
If on boot up, lockdown is activated for tracefs, don't even bother creating
the files. This can also prevent instances from being created if lockdown is
in effect.
Link: http://lkml.kernel.org/r/CAHk-=whC6Ji=fWnjh2+eS4b15TnbsS4VPVtvBOwCy1jjEG_JHQ@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running the latest kernel through my "make instances" stress tests, I
triggered the following bug (with KASAN and kmemleak enabled):
mkdir invoked oom-killer:
gfp_mask=0x40cd0(GFP_KERNEL|__GFP_COMP|__GFP_RECLAIMABLE), order=0,
oom_score_adj=0
CPU: 1 PID: 2229 Comm: mkdir Not tainted 5.4.0-rc2-test #325
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
Call Trace:
dump_stack+0x64/0x8c
dump_header+0x43/0x3b7
? trace_hardirqs_on+0x48/0x4a
oom_kill_process+0x68/0x2d5
out_of_memory+0x2aa/0x2d0
__alloc_pages_nodemask+0x96d/0xb67
__alloc_pages_node+0x19/0x1e
alloc_slab_page+0x17/0x45
new_slab+0xd0/0x234
___slab_alloc.constprop.86+0x18f/0x336
? alloc_inode+0x2c/0x74
? irq_trace+0x12/0x1e
? tracer_hardirqs_off+0x1d/0xd7
? __slab_alloc.constprop.85+0x21/0x53
__slab_alloc.constprop.85+0x31/0x53
? __slab_alloc.constprop.85+0x31/0x53
? alloc_inode+0x2c/0x74
kmem_cache_alloc+0x50/0x179
? alloc_inode+0x2c/0x74
alloc_inode+0x2c/0x74
new_inode_pseudo+0xf/0x48
new_inode+0x15/0x25
tracefs_get_inode+0x23/0x7c
? lookup_one_len+0x54/0x6c
tracefs_create_file+0x53/0x11d
trace_create_file+0x15/0x33
event_create_dir+0x2a3/0x34b
__trace_add_new_event+0x1c/0x26
event_trace_add_tracer+0x56/0x86
trace_array_create+0x13e/0x1e1
instance_mkdir+0x8/0x17
tracefs_syscall_mkdir+0x39/0x50
? get_dname+0x31/0x31
vfs_mkdir+0x78/0xa3
do_mkdirat+0x71/0xb0
sys_mkdir+0x19/0x1b
do_fast_syscall_32+0xb0/0xed
I bisected this down to the addition of the proxy_ops into tracefs for
lockdown. It appears that the allocation of the proxy_ops and then freeing
it in the destroy_inode callback, is causing havoc with the memory system.
Reading the documentation about destroy_inode and talking with Linus about
this, this is buggy and wrong. When defining the destroy_inode() method, it
is expected that the destroy_inode() will also free the inode, and not just
the extra allocations done in the creation of the inode. The faulty commit
causes a memory leak of the inode data structure when they are deleted.
Instead of allocating the proxy_ops (and then having to free it) the checks
should be done by the open functions themselves, and not hack into the
tracefs directory. First revert the tracefs updates for locked_down and then
later we can add the locked_down checks in the kernel/trace files.
Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home
Fixes: ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Stable bugfixes:
- Fix O_DIRECT accounting of number of bytes read/written # v4.1+
Other fixes:
- Fix nfsi->nrequests count error on nfs_inode_remove_request()
- Remove redundant mirror tracking in O_DIRECT
- Fix leak of clp->cl_acceptor string
- Fix race to sk_err after xs_error_report
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl2g7gAACgkQ18tUv7Cl
QOtMSxAAyQj2BjzUviR+clFe4X6betnqse6wHr4gnA3EEyo5OiwYk/U/bevejOXW
mYevz+VkYVVaN4l8SLT/ZJEH7cJCbJUjy/cf6vvHdL0E+sE+41s0Qrl0wrX2NOyT
cgm90VTzwFCZe9e9i2jOehgyJ0zkc0+9H2YySYIsiw0MPHVOzR+t8lgoD9mVsYAn
d4L5VLOo0hP/dhYfS2e/SkESR75rZFR2tZbL2ClKmGTYVHoLpliAtCUepqV9kUpw
FkAA5PXosa0xewXCUg6Lvhac7Urh37OrnLtedIe4fa7qGGqB2U3CavB6W6ojQhKJ
Brgk1/wSVhag3vVCCplwscB5jpOly6azUbs2mcYhdKZ5zWzTkPL1F/KZkZSnkZU6
LpZPk2/Lltko2TUviSCwDJwVzWqqRMlvz7OyXv1tVw53yFP1Fr7tNTxEe4XOnbxG
8pbLqBjwHp8Iyerh0JSJ21quPJSVIfgTKjWbMTjrH4yh/FdzUkeAktvwZT4LDEMx
uKFH8FQPrp/oqQ4wc49gpxLGqNCYjK51Hk3ceym47d1xcDww8yFeaana5D3VERmF
CCuJfqkUxFmeFle8TGBHlmvVrb8K/W1WZPC/dcmMAEQSq07fIlZYvUHTHO8pGOkp
ZZqNtbLyH2yIRR9FuzlrpmEqJPZMGNYySthcEXrYzVKDDDbVaII=
=XDan
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
- Fix O_DIRECT accounting of number of bytes read/written # v4.1+
Other fixes:
- Fix nfsi->nrequests count error on nfs_inode_remove_request()
- Remove redundant mirror tracking in O_DIRECT
- Fix leak of clp->cl_acceptor string
- Fix race to sk_err after xs_error_report"
* tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: fix race to sk_err after xs_error_report
NFSv4: Fix leak of clp->cl_acceptor string
NFS: Remove redundant mirror tracking in O_DIRECT
NFS: Fix O_DIRECT accounting of number of bytes read/written
nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl2gzCMACgkQiiy9cAdy
T1F5aAv+M5OLmFX6OOuCQNYWqku8N9K78C3+bvdpDINYceozhPgKX1/+RO3z8rcR
v+cK0b+R6tXVmpTbyo1t7QW3UVzHiJLEaGKjafbHNBjOM87wYkW7D9Uc6u6GSYDp
7cgXqmMDM+26FftBwxaxLjbjY6YAN/3V66j5Zt6I+lTb+8YC2DO1ihprwaMmGAm2
hDGYpvrW9KPjTB23+dC5Z9/YfG6EGXOvXl4z20JajCVoKD/70KdpDwVLNo2JuPsL
hsgV38xmUd7PtFXeFtStPjN9VpNOf4sp29xmWnjJVGaVx4hnhzcyiKmiW2o5VkKM
14PEUd/N26/+E6VXuY1eP6L8O9EwkZISNZUXWBI/zyyiVedBfSJVeSXTU4NyqB0g
B7WMnWYbJ73AeOsq9iVrcribrRPDtAcyK5UKMWEkrYhB/BIylg1vWUhn6B5bPqv0
3Q9TR+pSmEifkAk3ZSu5CjKNEgpNj6lpFoIcmkdp3Gg/qzeWvcFUTJTnxZZbxSqP
qtmZ3d8z
=mXpP
-----END PGP SIGNATURE-----
Merge tag '5.4-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Eight small SMB3 fixes, four for stable, and important fix for the
recent regression introduced by filesystem timestamp range patches"
* tag '5.4-rc2-smb3' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Force reval dentry if LOOKUP_REVAL flag is set
CIFS: Force revalidate inode when dentry is stale
smb3: Fix regression in time handling
smb3: remove noisy debug message and minor cleanup
CIFS: Gracefully handle QueryInfo errors during open
cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic
fs: cifs: mute -Wunused-const-variable message
smb3: cleanup some recent endian errors spotted by updated sparse
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completions
We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a rounding error in the fallocate code
- Minor code cleanups
- Make sure to zero memory buffers before formatting metadata blocks
- Fix a few places where we forgot to log an inode metadata update
- Remove broken error handling that tried to clean up after a failure
but still got it wrong
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2eA4EACgkQ+H93GTRK
tOsosxAAgOgEvFrIZQDnzio+yQ9OysiFWCiiT3b+91f8v1v3MNBXP2EEyLA4zZEK
TlgaFvNCoTO2Q16r+PKCSDFVBwEw/8t2YEylS6X6/GKBUGc/BDBZ5ROZdzQH3Wng
a98M7GaQektTUqQuKlv3fHiL1Fucp0FonCG3wiuNw6/vzH6RxxwOTTkd2F67Zkn6
9N7bRoEPh7d6h1Ah7myf/ONA1f3mGEiFuvyb3/KCZPr1WKDFFflzhUboJnMq8Br5
N55Bq3uoc4+btS4eVxr3XjEXD1zImBiXq7gcEoCRDZNfv+/2VcaDYRkXQu40NbOI
psH+1xy9lQvLSSbCm6zI1enpfVAm3qxUVktp9G4i4dKtjJQL7HzuvW7ckdzcRaav
QKbkTmgGQFFI5yLpNHVN+zs+exdkwG6whNyY3Frt5yNh8bH8yRZxrb7YMzfIkNOl
8O1Tl0ZFGzJtjggXPhAKCoGz2yz8oO2G2JzejfWvdyrku2heyGLktnm98Uk/Sx0g
90hHTC8KYfLm8r0PuH+lv5c/AwTDr4prJO2wQkU2ZDCCLxXjNsHa3vuxMDmK6PRV
Y+ryOP6OcyQkB+BkY3/HOlxjen71Z91hdMFNVajRmVwNuFzHCK3yHhsMitBPDEnx
PnotQQihRmVKP8hw7/3zPfLoDEwO+hPa19JbqUFLXp7xl/OzF6A=
=MFMd
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"A couple of small code cleanups and bug fixes for rounding errors,
metadata logging errors, and an extra layer of safeguards against
leaking memory contents.
- Fix a rounding error in the fallocate code
- Minor code cleanups
- Make sure to zero memory buffers before formatting metadata blocks
- Fix a few places where we forgot to log an inode metadata update
- Remove broken error handling that tried to clean up after a failure
but still got it wrong"
* tag 'xfs-5.4-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: move local to extent inode logging into bmap helper
xfs: remove broken error handling on failed attr sf to leaf change
xfs: log the inode on directory sf to block format change
xfs: assure zeroed memory buffers for certain kmem allocations
xfs: removed unused error variable from xchk_refcountbt_rec
xfs: remove unused flags arg from xfs_get_aghdr_buf()
xfs: Fix tail rounding in xfs_alloc_file_space()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2fQkoACgkQxWXV+ddt
WDsXgA/+OpORcroqswa5V/AB5NgSMv08QfBtL7en7cUA2cGbrTt0lAoixIesCeGA
7y4umlx7zWDhrG0NFv8E21xxkMzK72/YQnN0C6ZwRCi+ZbPSDlpMCwSNV9b6oVt4
t9mJQ2kNZ80wj9W7jtoyiiWZ2OBccWywKBYr+BXybha5BliSd1XbpWMv90lODQHc
1oH1FOphPAf+nSEtW4g0BV8UHy1n1+7TqjEHDilj0TuHlO0MHsR2FQcsRnMBv5H/
CcEH3+870pUOjvm/l1OAdIrzrnH6UsXKHnOmJpYZIAQdG4xtHq/d6O9sfQMZK/Af
VMXuju558kGOvrYdgffYa0HuW3P6c8q5BeEtGAZwjGsQS5GxmW63et/x+HIEl4RU
4SWMfD592GK1/yQn31NWGav7Olkr4L0Eh9iqfKk2nWayTV+pqs8NgkGumkoNB66n
WJs2m2WwKvZzj1Ys9ilRzo17qt2U/m4eLQtbut3m5eYCpzvo38PDrqOVqKTZG/dc
nG1JBJNk+yjV+RkOO0gnQ8/zzooEGvMhVIx5iq52tWjvxnvOGSVv9QAAkKrbMoOX
kn/b3heL57Z+kilUU3Zd0GvVif+awIgOj+PmAevR+w3MLoJ2gLfWb7qcONdHyFPk
jK7rsuP737fyHpZ1tvJyfTt4Lk/u1UbApKrO4QIku1r9IJmmu20=
=zpDj
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stabitly fixes, one build warning fix.
- fix inode allocation under NOFS context
- fix leak in fiemap due to concurrent append writes
- fix log-root tree updates
- fix balance convert of single profile on 32bit architectures
- silence false positive warning on old GCCs (code moved in rc1)"
* tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: silence maybe-uninitialized warning in clone_range
btrfs: fix uninitialized ret in ref-verify
btrfs: allocate new inode in NOFS context
btrfs: fix balance convert to single on 32-bit host CPUs
btrfs: fix incorrect updating of log root tree
Btrfs: fix memory leak due to concurrent append writes with fiemap
Pull dcache_readdir() fixes from Al Viro:
"The couple of patches you'd been OK with; no hlist conversion yet, and
cursors are still in the list of children"
[ Al is referring to future work to avoid some nasty O(n**2) behavior
with the readdir cursors when you have lots of concurrent readdirs.
This is just a fix for a race with a trivial cleanup - Linus ]
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
libfs: take cursors out of list when moving past the end of directory
Fix the locking in dcache_readdir() and friends
Pull mount fixes from Al Viro:
"A couple of regressions from the mount series"
* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: add missing blkdev_put() in get_tree_bdev()
shmem: fix LSM options parsing