Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
Also don't bother to set up a .get_acl method for symlinks as we do not
support access control (ACLs or even mode bits) for symlinks in Linux,
and create inodes with the proper mode instead of fixing it up later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some time ago, mkfs.xfs started picking the storage physical
sector size as the default filesystem "sector size" in order
to avoid RMW costs incurred by doing IOs at logical sector
size alignments.
However, this means that for a filesystem made with i.e.
a 4k sector size on an "advanced format" 4k/512 disk,
512-byte direct IOs are no longer allowed. This means
that XFS has essentially turned this AF drive into a hard
4K device, from the filesystem on up.
XFS's mkfs-specified "sector size" is really just controlling
the minimum size & alignment of filesystem metadata.
There is no real need to tightly couple XFS's minimal
metadata size to the minimum allowed direct IO size;
XFS can continue doing metadata in optimal sizes, but
still allow smaller DIOs for apps which issue them,
for whatever reason.
This patch adds a new field to the xfs_buftarg, so that
we now track 2 sizes:
1) The metadata sector size, which is the minimum unit and
alignment of IO which will be performed by metadata operations.
2) The device logical sector size
The first is used internally by the file system for metadata
alignment and IOs.
The second is used for the minimum allowed direct IO alignment.
This has passed xfstests on filesystems made with 4k sectors,
including when run under the patch I sent to ignore
XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly
tested end of block behavior on preallocated, sparse, and
existing files when we do a 512 IO into a 4k file on a
4k-sector filesystem, to be sure there were no unexpected
behaviors.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In preparation for adding new members to the structure,
give these old ones more descriptive names:
bt_ssize -> bt_meta_sectorsize
bt_smask -> bt_meta_sectormask
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Clean up the xfs_buftarg structure a bit:
- remove bt_bsize which is never used
- replace bt_sshift with bt_ssize; we only ever shift it back
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 6f96b3063c)
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 85dd0707f0)
A set of fixes which makes sure we are taking the ilock whenever accessing the
extent list. This was associated with "Access to block zero" messages which
may result in extent list corruption.
In case CONFIG_DEBUG_OBJECTS_WORK is defined, it is needed to
call destroy_work_on_stack() which frees the debug object to pair
with INIT_WORK_ONSTACK().
Signed-off-by: Liu, Chuansheng <chuansheng.liu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The VFS allows an anonymous temporary file to be named at a later
time via a linkat() syscall. The inodes for O_TMPFILE files are
are marked with a special flag I_LINKABLE and have a zero link count.
To support this in XFS, xfs_link() detects if this flag I_LINKABLE
is set and behaves appropriately when detected. So in this case,
its transaciton reservation takes into account the additional
overhead of removing the inode from the unlinked list. Then the
inode is removed from the unlinked list and the directory entry
is added. Finally its link count is bumped accordingly.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add two functions xfs_create_tmpfile() and xfs_vn_tmpfile()
to support O_TMPFILE file creation.
In contrast to xfs_create(), xfs_create_tmpfile() has a different
log reservation to the regular file creation because there is no
directory modification, and doesn't check if an entry can be added
to the directory, but the reservation quotas is required appropriately,
and finally its inode is added to the unlinked list.
xfs_vn_tmpfile() add one O_TMPFILE method to VFS interface and directly
invoke xfs_create_tmpfile().
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
It will be reused by the O_TMPFILE creation function.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
With CRC check is enabled, if trying to set an attributes value just
equal to the maximum size of XATTR_SIZE_MAX would cause the v3 remote
attr write verification procedure failure, which would yield the back
trace like below:
<snip>
XFS (sda7): Internal error xfs_attr3_rmt_write_verify at line 191 of file fs/xfs/xfs_attr_remote.c
<snip>
Call Trace:
[<ffffffff816f0042>] dump_stack+0x45/0x56
[<ffffffffa0d99c8b>] xfs_error_report+0x3b/0x40 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d99ce5>] xfs_corruption_error+0x55/0x80 [xfs]
[<ffffffffa0dbef6b>] xfs_attr3_rmt_write_verify+0x14b/0x1a0 [xfs]
[<ffffffffa0d96edd>] ? _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d96edd>] _xfs_buf_ioapply+0x6d/0x390 [xfs]
[<ffffffff81184cda>] ? vm_map_ram+0x31a/0x460
[<ffffffff81097230>] ? wake_up_state+0x20/0x20
[<ffffffffa0d97315>] ? xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d9726b>] xfs_buf_iorequest+0x6b/0xc0 [xfs]
[<ffffffffa0d97315>] xfs_bdstrat_cb+0x55/0xb0 [xfs]
[<ffffffffa0d97906>] xfs_bwrite+0x46/0x80 [xfs]
[<ffffffffa0dbfa94>] xfs_attr_rmtval_set+0x334/0x490 [xfs]
[<ffffffffa0db84aa>] xfs_attr_leaf_addname+0x24a/0x410 [xfs]
[<ffffffffa0db8893>] xfs_attr_set_int+0x223/0x470 [xfs]
[<ffffffffa0db8b76>] xfs_attr_set+0x96/0xb0 [xfs]
[<ffffffffa0db13b2>] xfs_xattr_set+0x42/0x70 [xfs]
[<ffffffff811df9b2>] generic_setxattr+0x62/0x80
[<ffffffff811e0213>] __vfs_setxattr_noperm+0x63/0x1b0
[<ffffffff81307afe>] ? evm_inode_setxattr+0xe/0x10
[<ffffffff811e0415>] vfs_setxattr+0xb5/0xc0
[<ffffffff811e054e>] setxattr+0x12e/0x1c0
[<ffffffff811c6e82>] ? final_putname+0x22/0x50
[<ffffffff811c708b>] ? putname+0x2b/0x40
[<ffffffff811cc4bf>] ? user_path_at_empty+0x5f/0x90
[<ffffffff811bdfd9>] ? __sb_start_write+0x49/0xe0
[<ffffffff81168589>] ? vm_mmap_pgoff+0x99/0xc0
[<ffffffff811e07df>] SyS_setxattr+0x8f/0xe0
[<ffffffff81700c2d>] system_call_fastpath+0x1a/0x1f
Tests:
setfattr -n user.longxattr -v `perl -e 'print "A"x65536'` testfile
This patch fix it to check the remote EA size is greater than the
XATTR_SIZE_MAX rather than more than or equal to it, because it's
valid if the specified EA value size is equal to the limitation as
per VFS setxattr interface.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Make sure that xfs_bmapi_read has the ilock held in some way, and that
xfs_bmapi_write, xfs_bmapi_delay, xfs_bunmapi and xfs_iread_extents are
called with the ilock held exclusively.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We might not have read in the extent list at this point, so make sure we
take the ilock exclusively if we have to do so.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Although it was removed in commit 051e7cd44a, ilock needs to be taken in
xfs_readdir because we might have to read the extent list in from disk. This
keeps other threads from reading from or writing to the extent list while it is
being read in and is still in a transitional state.
This has been associated with "Access to block zero" messages on directories
with large numbers of extents resulting from excessive filesytem fragmentation,
as well as extent list corruption. Unfortunately no test case at this point.
Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Equivalent to xfs_ilock_data_map_shared, except for the attribute fork.
Make xfs_getbmap use it if called for the attribute fork instead of
xfs_ilock_data_map_shared.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Make it clear that we're only locking against the extent map on the data
fork. Also clean the function up a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We can just use xfs_iunlock without any loss of clarity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Both the inode number and the generation do not change on a live inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
If we are doing aysnc writeback of metadata, we can get write errors
but have nobody to report them to. At the moment, we simply attempt
to reissue the write from io completion in the hope that it's a
transient error.
When it's not a transient error, the buffer is stuck forever in
this loop, and we cannot break out of it. Eventually, unmount will
hang because the AIL cannot be emptied and everything goes downhill
from them.
To solve this problem, only retry the write IO once before aborting
it. We don't throw the buffer away because some transient errors can
last minutes (e.g. FC path failover) or even hours (thin
provisioned devices that have run out of backing space) before they
go away. Hence we really want to keep trying until we can't try any
more.
Because the buffer was not cleaned, however, it does not get removed
from the AIL and hence the next pass across the AIL will start IO on
it again. As such, we still get the "retry forever" semantics that
we currently have, but we allow other access to the buffer in the
mean time. Meanwhile the filesystem can continue to modify the
buffer and relog it, so the IO errors won't hang the log or the
filesystem.
Now when we are pushing the AIL, we can see all these "permanent IO
error" buffers and we can issue a warning about failures before we
retry the IO. We can also catch these buffers when unmounting an
issue a corruption warning, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
When swalloc is specified as a mount option, allocations are
supposed to be aligned to the stripe width rather than the stripe
unit of the underlying filesystem. However, it does not do this.
What the implementation does is round up the allocation size to a
stripe width, hence ensuring that all allocations span a full stripe
width. It does not, however, ensure that that allocation is aligned
to a stripe width, and hence the allocations can span multiple
underlying stripes and so still see RMW cycles for things like
direct IO on MD RAID.
So, if the swalloc mount option is set, change the allocation
alignment in xfs_bmap_btalloc() to use the stripe width rather than
the stripe unit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
handles the case of a shut down filesystem. Most of the users have private,
uncached buffers that can just be freed in this case, but the complex error
handling in xfs_bioerror_relse messes up the case when it's called without
a locked buffer.
Remove xfsbdstrat and opencode the error handling in the callers. All but
one can simply return an error and don't need to deal with buffer state,
and the one caller that cares about the buffer state could do with a major
cleanup as well, but we'll defer that to later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.
Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.
Fix it by considering allocation into empty files as requiring
aligned allocation again.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9b395a8ef)
xfs_quota(8) will hang up if trying to turn group/project quota off
before the user quota is off, this could be 100% reproduced by:
# mount -ouquota,gquota /dev/sda7 /xfs
# mkdir /xfs/test
# xfs_quota -xc 'off -g' /xfs <-- hangs up
# echo w > /proc/sysrq-trigger
# dmesg
SysRq : Show Blocked State
task PC stack pid father
xfs_quota D 0000000000000000 0 27574 2551 0x00000000
[snip]
Call Trace:
[<ffffffff81aaa21d>] schedule+0xad/0xc0
[<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
[<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
[<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
[<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
[<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
[<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
[<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
[<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
[<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
[<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
[<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
[<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
[<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
[<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
[<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
[<ffffffff812fba4b>] ? iput+0x5b/0x90
[<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
[<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b
It's fine if we turn user quota off at first, then turn off other
kind of quotas if they are enabled since the group/project dquot
refcount is decreased to zero once the user quota if off. Otherwise,
those dquots refcount is non-zero due to the user dquot might refer
to them as hint(s). Hence, above operation cause an infinite loop
at xfs_qm_dquot_walk() while trying to purge dquot cache.
This problem has been around since Linux 3.4, it was introduced by:
[ b84a3a9675 xfs: remove the per-filesystem list of dquots ]
Originally we will release the group dquot pointers because the user
dquots maybe carrying around as a hint via xfs_qm_detach_gdquots().
However, with above change, there is no such work to be done before
purging group/project dquot cache.
In order to solve this problem, this patch introduces a special routine
xfs_qm_dqpurge_hints(), and it would release the group/project dquot
pointers the user dquots maybe carrying around as a hint, and then it
will proceed to purge the user dquot cache if requested.
Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit df8052e7da)
For CRC enabled v5 super block, change a file's ownership can simply
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e,
[ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[ 305.383939] Call Trace:
[ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350
[ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110
[ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]
This fix adjust the assertion to check if the super block support both
quota inodes or not.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 5a01dd54f4)
After the previous fix, there still has another ASSERT failure if turning
off any type of quota while fsstress is running at the same time.
Backtrace in this case:
[ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[ 50.867924] ------------[ cut here ]------------
... <snip>
[ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[ 50.867999] invalid opcode: 0000 [#1] SMP
[ 50.869407] Call Trace:
[ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[ 50.870050] RSP <ffff88002941fd60>
[ 50.879251] ---[ end trace c93a2b342341c65b ]---
We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
however the assertion itself is not right IMHO. While performing quota off, we
firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee
that the desired quota is still active.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 37eb9706eb)
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit ef701600fd)
Use xfs_icluster_size_fsb() in xfs_imap().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use xfs_icluster_size_fsb() in xfs_ifree_cluster(), rename variable
ninodes to inodes_per_cluster, the latter is more meaningful.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use xfs_icluster_size_fsb() in xfs_ialloc_inode_init(), rename variable
ninodes to inodes_per_cluster, the latter is more meaningful.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use xfs_icluster_size_fsb() in xfs_bulkstat(), make the related
variables more meaningful and remove an unused variable nimask
from it.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Introduce a common routine xfs_icluster_size_fsb() to calculate
and return the number of file system blocks per inode cluster.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Get rid of XFS_IALLOC_BLOCKS() marcos, use mp->m_ialloc_blks directly.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Get rid of XFS_INODE_CLUSTER_SIZE() macros, use mp->m_inode_cluster_size
directly.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Get rid of XFS_IALLOC_INODES() marcos, use mp->m_ialloc_inos directly.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This one doesn't save a whole lot of memory, but still makes the
code simpler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
No need to keep the dquot log format around all the time, we can
easily generate it at iop_format time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
No need to keep the inode log format around all the time, we can
easily generate it at iop_format time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
With the new iop_format scheme there is no need to have a temporary buffer
to format logged extents into, we can do so directly into the CIL. This
also allows to remove the shortcut for big endian systems that probably
hasn't gotten a lot of test coverage for a long time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Instead of setting up pointers to memory locations in iop_format which then
get copied into the CIL linear buffer after return move the copy into
the individual inode items. This avoids the need to always have a memory
block in the exact same layout that gets written into the log around, and
allow the log items to be much more flexible in their in-memory layouts.
The only caveat is that we need to properly align the data for each
iovec so that don't have structures misaligned in subsequent iovecs.
Note that all log item format routines now need to be careful to modify
the copy of the item that was placed into the CIL after calls to
xlog_copy_iovec instead of the in-memory copy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add a helper to abstract out filling the log iovecs in the log item
format handlers. This will allow us to change the way we do the log
item formatting more easily.
The copy in the name is a bit confusing for now as it just assigns a
pointer and lets the CIL code perform the copy, but that will change
soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Split out a function to handle the data and attr fork, as well as a
helper for the really old v1 inodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Split out two helpers to size the data and attribute to make the
function more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Add two helpers to make the code more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Share code that was previously duplicated in two branches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.
Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is cause write performance to drop in
alignment sensitive configurations.
Fix it by considering allocation into empty files as requiring
aligned allocation again.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
rec.ir_startino is an agino rather than an ino. Use the correct macro
when dealing with it in xfs_difree.
Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we are using a large directory block size, and memory becomes
fragmented, we can get memory allocation failures trying to
kmem_alloc(64k) for a temporary buffer. However, there is not need
for a directory buffer sized allocation, as the end result ends up
in the inode literal area. This is, at most, slightly less than 2k
of space, and hence we don't need an allocation larger than that
fora temporary buffer.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:
for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.
This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.
Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit b7d961b35b)
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed
This issue can be triggered via xfstests/generic/288.
Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit f9fd013561)
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.
This can only be triggered with CAP_SYS_ADMIN.
Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit 071c529eb6)
xfs_quota(8) will hang up if trying to turn group/project quota off
before the user quota is off, this could be 100% reproduced by:
# mount -ouquota,gquota /dev/sda7 /xfs
# mkdir /xfs/test
# xfs_quota -xc 'off -g' /xfs <-- hangs up
# echo w > /proc/sysrq-trigger
# dmesg
SysRq : Show Blocked State
task PC stack pid father
xfs_quota D 0000000000000000 0 27574 2551 0x00000000
[snip]
Call Trace:
[<ffffffff81aaa21d>] schedule+0xad/0xc0
[<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0
[<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0
[<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0
[<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs]
[<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30
[<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs]
[<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs]
[<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs]
[<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs]
[<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs]
[<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs]
[<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs]
[<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs]
[<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs]
[<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0
[<ffffffff812fba4b>] ? iput+0x5b/0x90
[<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0
[<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b
It's fine if we turn user quota off at first, then turn off other
kind of quotas if they are enabled since the group/project dquot
refcount is decreased to zero once the user quota if off. Otherwise,
those dquots refcount is non-zero due to the user dquot might refer
to them as hint(s). Hence, above operation cause an infinite loop
at xfs_qm_dquot_walk() while trying to purge dquot cache.
This problem has been around since Linux 3.4, it was introduced by:
[ b84a3a9675 xfs: remove the per-filesystem list of dquots ]
Originally we will release the group dquot pointers because the user
dquots maybe carrying around as a hint via xfs_qm_detach_gdquots().
However, with above change, there is no such work to be done before
purging group/project dquot cache.
In order to solve this problem, this patch introduces a special routine
xfs_qm_dqpurge_hints(), and it would release the group/project dquot
pointers the user dquots maybe carrying around as a hint, and then it
will proceed to purge the user dquot cache if requested.
Cc: stable@vger.kernel.org
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For CRC enabled v5 super block, change a file's ownership can simply
trigger an ASSERT failure at xfs_setattr_nonsize() if both group and
project quota are enabled, i.e,
[ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621
[ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable]
[ 305.383939] Call Trace:
[ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs]
[ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs]
[ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350
[ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110
[ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110
[ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30
[ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs]
This fix adjust the assertion to check if the super block support both
quota inodes or not.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Split out a xfs_setattr_time helper to share code between truncate and
regular setattr similar to xfs_setattr_mode. I might also have another
caller growing for this in the near future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Remove the pointless tp argument, and properly align the local variable
declarations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
After the previous fix, there still has another ASSERT failure if turning
off any type of quota while fsstress is running at the same time.
Backtrace in this case:
[ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118
[ 50.867924] ------------[ cut here ]------------
... <snip>
[ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable]
[ 50.867999] invalid opcode: 0000 [#1] SMP
[ 50.869407] Call Trace:
[ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs]
[ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs]
[ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs]
[ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs]
[ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130
[ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0
[ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20
[ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f
[ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip>
[ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs]
[ 50.870050] RSP <ffff88002941fd60>
[ 50.879251] ---[ end trace c93a2b342341c65b ]---
We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(),
however the assertion itself is not right IMHO. While performing quota off, we
firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking
any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee
that the desired quota is still active.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
The xfs_quota_priv header file is only included by xfs_qm header and
there is no much users for its contents, hence we can move those stuff
to xfs_qm header file and kill it.
This patch also remove an unused macro DQFLAGTO_TYPESTR.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
In xfs_qm_scall_trunc_qfiles(), we ignore the error if failed to remove
the users quota metadata and proceed to remove groups and projects if
they are being there. However, in user space, the remove operation will
break and return if failed to remove any kind of quota.
Also for v5 super block, we can enabled both group and project quota at
the same time, in this case the current error handling will cover the
group error with projects but they might failed due to different reasons.
It seems we'd better the error handling consistent to the user space and
don't trying to remove another kind of quota metadata if the previous
operation is failed.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Fix the leak of kernel memory in xfs_dir2_node_removename()
when xfs_dir2_leafn_remove() returns an error code.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Recovery builds a list of items on the transaction's
r_itemq head. Normally these items are committed and freed.
But in the event of a recovery error, these allocations
are leaked.
If the error occurs during item reordering, then reconstruct
the r_itemq list before deleting the list to avoid leaking
the entries that were on one of the temporary lists.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:
for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.
This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.
Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed
This issue can be triggered via xfstests/generic/288.
Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.
This can only be triggered with CAP_SYS_ADMIN.
Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The "verbose" argument to xfs_setsize_buftarg_flags() has been
unused since:
ffe37436 xfs: stop using the page cache to back the buffer cache
Remove it, and fold the function into xfs_setsize_buftarg()
now that there's no need for different types of callers.
Fix inconsistent comment spacing while we're at it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Here we have a performance fix for inode iversion, increased inode cluster size
for v5 superblock filesystems, a fix for error handling in
xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJSjjbHAAoJENaLyazVq6ZO/l4QAIajqdOy56MDDCaE1ocNRMej
oPXqxMZpi+YAFRXIWQK/evjXjYE4hcCjiLI0tAKzQM0FGRHd1GgQQJvWKjvHBrLG
rlrHWDcKq+3bsJ6KIw+JvLnhsOxxKXbRovIG1PI4frOeFtS3DCtzMZ4FGBErh+BV
PkFjf5Lxe7VnegDL4eNpkAfSM/2/pEtn2gIMMj8eeKemTPAbDplMAk4UUxp18R7F
FCr6mOKETWthHdBgkLB1xA6OVGUuYcleWvluFc5PONJN1VPSutEHJaRw01lY1VLY
4COUt7MjLAqlAu24LTD1aNszhUlajYq0AL+nmd4gULZI2fpu+meUIGCHQG4BpVZl
ds9isn80SmKiLT4ZQCbJAv4XMEXn6p41+uxdKP6ZkYXso4zJmVw0TyGLg/ZOJpw0
9mNcaJG1s57ronN07dMCSMqsyNFLuLtX7+mVa5liO3sEkXv7hEK7EB9qojQ9Qn/p
xC2r4jrE0//xgcbOE+uKOyIad3L0IBM6TXuy58xVPi5l0dpM69z4LCyUGZ+L1G8x
0QhkA7NL3xI2tNgehzJBsBuxjtEePtm/jCIS1HfDSIRQZR3y+PxVEjgO4naS4vm2
xKwo5dBodsxgXeSMUY3miMuEX7ZQse+xVDK1q0Zk4VXZyC2+erNS5hxs313yzeLD
7X2ZESfIJaMrkLSKOVUe
=w0M3
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs
Pull second xfs update from Ben Myers:
"There are a couple of patches that I wasn't quite sure about in time
for our initial 3.13 pull request, a bugfix, and an update to add Dave
to MAINTAINERS:
Here we have a performance fix for inode iversion, increased inode
cluster size for v5 superblock filesystems, a fix for error handling
in xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave"
* tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs:
xfs: open code inc_inode_iversion when logging an inode
xfs: increase inode cluster size for v5 filesystems
xfs: fix unlock in xfs_bmap_add_attrfork
xfs: update maintainers
Michael L Semon reported that generic/069 runtime increased on v5
superblocks by 100% compared to v4 superblocks. his perf-based
analysis pointed directly at the timestamp updates being done by the
write path in this workload. The append writers are doing 4-byte
writes, so there are lots of timestamp updates occurring.
The thing is, they aren't being triggered by timestamp changes -
they are being triggered by the inode change counter needing to be
updated. That is, every write(2) system call needs to bump the inode
version count, and it does that through the timestamp update
mechanism. Hence for v5 filesystems, test generic/069 is running 3
orders of magnitude more timestmap update transactions on v5
filesystems due to the fact it does a huge number of *4 byte*
write(2) calls.
This isn't a real world scenario we really need to address - anyone
doing such sequential IO should be using fwrite(3), not write(2).
i.e. fwrite(3) buffers the writes in userspace to minimise the
number of write(2) syscalls, and the problem goes away.
However, there is a small change we can make to improve the
situation - removing the expensive lock operation on the change
counter update. All inode version counter changes in XFS occur
under the ip->i_ilock during a transaction, and therefore we
don't actually need the spin lock that provides exclusive access to
it through inc_inode_iversion().
Hence avoid the lock and just open code the increment ourselves when
logging the inode.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
v5 filesystems use 512 byte inodes as a minimum, so read inodes in
clusters that are effectively half the size of a v4 filesystem with
256 byte inodes. For v5 fielsystems, scale the inode cluster size
with the size of the inode so that we keep a constant 32 inodes per
cluster ratio for all inode IO.
This only works if mkfs.xfs sets the inode alignment appropriately
for larger inode clusters, so this functionality is made conditional
on mkfs doing the right thing. xfs_repair needs to know about
the inode alignment changes, too.
Wall time:
create bulkstat find+stat ls -R unlink
v4 237s 161s 173s 201s 299s
v5 235s 163s 205s 31s 356s
patched 234s 160s 182s 29s 317s
System time:
create bulkstat find+stat ls -R unlink
v4 2601s 2490s 1653s 1656s 2960s
v5 2637s 2497s 1681s 20s 3216s
patched 2613s 2451s 1658s 20s 3007s
So, wall time same or down across the board, system time same or
down across the board, and cache hit rates all improve except for
the ls -R case which is a pure cold cache directory read workload
on v5 filesystems...
So, this patch removes most of the performance and CPU usage
differential between v4 and v5 filesystems on traversal related
workloads.
Note: while this patch is currently for v5 filesystems only, there
is no reason it can't be ported back to v4 filesystems. This hasn't
been done here because bringing the code back to v4 requires
forwards and backwards kernel compatibility testing. i.e. to
deterine if older kernels(*) do the right thing with larger inode
alignments but still only using 8k inode cluster sizes. None of this
testing and validation on v4 filesystems has been done, so for the
moment larger inode clusters is limited to v5 superblocks.
(*) a current default config v4 filesystem should mount just fine on
2.6.23 (when lazy-count support was introduced), and so if we change
the alignment emitted by mkfs without a feature bit then we have to
make sure it works properly on all kernels since 2.6.23. And if we
allow it to be changed when the lazy-count bit is not set, then it's
all kernels since v2 logs were introduced that need to be tested for
compatibility...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_trans_ijoin() activates the inode in a transaction and
also can specify which lock to free when the transaction is
committed or canceled.
xfs_bmap_add_attrfork call locks and adds the lock to the
transaction but also manually removes the lock. Change the
routine to not add the lock to the transaction and manually
remove lock on completion.
While here, clean up the xfs_trans_cancel flags and goto names.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
refactoring. Bugfixes that stand out are the fix for the AGF/AGI
deadlock, incore extent list fixes, verifier fixes for v4 superblocks
and growfs, and memory leaks. There are some asserts, warnings, and
strings that were cleaned up. There was further rearrangement of code
to make libxfs and the kernel sync up more easily, differences between
v2 and v3 directory code were abstracted using an ops vector,
xfs_inactive was reworked, and the preallocation/hole punching code was
refactored.
- simplify kmem_zone_zalloc
- add traces for AGF/AGI read ops
- add additional AIL traces
- fix xfs_remove AGF vs AGI deadlock
- fix the extent count of new incore extent page in the indirection array
- don't fail bad secondary superblocks verification on v4 filesystems
due to unzeroed bits after v4 fields
- fix possible NULL dereference in xlog_verify_iclog
- remove redundant assert in xfs_dir2_leafn_split
- prevent stack overflows from page cache allocation
- fix some sparse warnings
- fix directory block format verifier to check the leaf entry count
- abstract the differences in dir2/dir3 via an ops vector
- continue process of reorganization to make libxfs/kernel code merges easier
- refactor the preallocation and hole punching code
- fix for growfs and verifiers
- remove unnecessary scary corruption error when probing non-xfs filesystems
- remove extra newlines from strings passed to printk
- prevent deadlock trying to cover an active log
- rework xfs_inactive()
- add the inode directory type support to XFS_IOC_FSGEOM
- cleanup (remove) usage of is_bad_inode
- fix miscalculation in xfs_iext_realloc_direct which results in oversized
direct extent list
- remove unnecessary count arg to xfs_iomap_write_allocate
- fix memory leak in xlog_recover_add_to_trans
- check superblock instead of block magic to determine if dtype field
is present
- fix lockdep annotation due to project quotas
- fix regression in xfs_node_toosmall which can lead to incorrect directory
btree node collapse
- make log recovery verify filesystem uuid of recovering blocks
- fix XFS_IOC_FREE_EOFBLOCKS definition
- remove invalid assert in xfs_inode_free
- fix for AIL lock regression
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJShBL7AAoJENaLyazVq6ZOaRwP/1B3QkWxFArRSD4wl15oBxpN
Zv6D7woTmAvON87OIG4m67gyTr5/yNrPy8bg6Qw4YoL6lHTgle+RDUaKhgnsODoX
Gd/oOiKCBqGfe93zs2fzIQzZ+yn+xdXr+q8uyEwEe8QHK6/wg6lEHNNae8VXEBlO
20ec4b0U9dxoOJyG8nJNdytI++jp3TWzmZGpmLwisRogt4b86JM+QRhKFOe18AeI
c9ky0uQmOQ6gX6h1VKN1L1u66GpTtFgj8XqPp/V6D8xHb1XGNiutDAKD7Mt/Rcgf
njQsky2lXSQIuOnhyS1+lPvR8x19srs6UdnxcWdJvvwsICb14ZHEsZQA7M1bkLnw
zNYtwn5RSneVSdjUZ+55dU1oDfTw2fxRHmKcm3bKrJCG7aOcH5vhEKs6HS0eVZAW
4AcjThA43UpcEv47sghd7WJ+hFc4tKDVh9BOLUNi9zlkltVdP6WmWduMco0mRNeJ
gd++CFRv9R3cQ0UUNsNMGQ9a8k/TW5uHYRsfX2IRBcgXQD2Ip1HBGLGSft2/JA4G
U53mM08RntInGKctp1PjJea74QPJrYT7wBMlBl917tmnZ59i20nDs/OfeD2Dsnod
9ekK5J7cMGHdWnQ3+o2b9Awuypcl+d9vdNKgNmNVTPlptfkI5OjJ5+BhqScyDw7m
LJ1JmPIPIJF7vdIqBJWL
=XMd/
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"For 3.13-rc1 we have an eclectic assortment of bugfixes, cleanups, and
refactoring. Bugfixes that stand out are the fix for the AGF/AGI
deadlock, incore extent list fixes, verifier fixes for v4 superblocks
and growfs, and memory leaks. There are some asserts, warnings, and
strings that were cleaned up. There was further rearrangement of code
to make libxfs and the kernel sync up more easily, differences between
v2 and v3 directory code were abstracted using an ops vector,
xfs_inactive was reworked, and the preallocation/hole punching code
was refactored.
- simplify kmem_zone_zalloc
- add traces for AGF/AGI read ops
- add additional AIL traces
- fix xfs_remove AGF vs AGI deadlock
- fix the extent count of new incore extent page in the indirection
array
- don't fail bad secondary superblocks verification on v4 filesystems
due to unzeroed bits after v4 fields
- fix possible NULL dereference in xlog_verify_iclog
- remove redundant assert in xfs_dir2_leafn_split
- prevent stack overflows from page cache allocation
- fix some sparse warnings
- fix directory block format verifier to check the leaf entry count
- abstract the differences in dir2/dir3 via an ops vector
- continue process of reorganization to make libxfs/kernel code
merges easier
- refactor the preallocation and hole punching code
- fix for growfs and verifiers
- remove unnecessary scary corruption error when probing non-xfs
filesystems
- remove extra newlines from strings passed to printk
- prevent deadlock trying to cover an active log
- rework xfs_inactive()
- add the inode directory type support to XFS_IOC_FSGEOM
- cleanup (remove) usage of is_bad_inode
- fix miscalculation in xfs_iext_realloc_direct which results in
oversized direct extent list
- remove unnecessary count arg to xfs_iomap_write_allocate
- fix memory leak in xlog_recover_add_to_trans
- check superblock instead of block magic to determine if dtype field
is present
- fix lockdep annotation due to project quotas
- fix regression in xfs_node_toosmall which can lead to incorrect
directory btree node collapse
- make log recovery verify filesystem uuid of recovering blocks
- fix XFS_IOC_FREE_EOFBLOCKS definition
- remove invalid assert in xfs_inode_free
- fix for AIL lock regression"
* tag 'xfs-for-linus-v3.13-rc1' of git://oss.sgi.com/xfs/xfs: (49 commits)
xfs: simplify kmem_{zone_}zalloc
xfs: add tracepoints to AGF/AGI read operations
xfs: trace AIL manipulations
xfs: xfs_remove deadlocks due to inverted AGF vs AGI lock ordering
xfs: fix the extent count when allocating an new indirection array entry
xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields
xfs: fix possible NULL dereference in xlog_verify_iclog
xfs:xfs_dir2_node.c: pointer use before check for null
xfs: prevent stack overflows from page cache allocation
xfs: fix static and extern sparse warnings
xfs: validity check the directory block leaf entry count
xfs: make dir2 ftype offset pointers explicit
xfs: convert directory vector functions to constants
xfs: convert directory vector functions to constants
xfs: vectorise encoding/decoding directory headers
xfs: vectorise DA btree operations
xfs: vectorise directory leaf operations
xfs: vectorise directory data operations part 2
xfs: vectorise directory data operations
xfs: vectorise remaining shortform dir2 ops
...
When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
(e.g. causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.
The following script reproduces the problem:
function run_writers
{
for (( i = 0; i < 10; i++ )); do
mkdir $1/dir$i
for (( j = 0; j < 40000; j++ )); do
dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
done &
done
}
for dir in "$@"; do
run_writers $dir
done
sleep 40
time sync
Fix the problem by disregarding inodes dirtied after sync(2) was called
in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
a time stamp when sync has started which is used for setting up work for
flusher threads.
To give some numbers, when above script is run on two ext4 filesystems
on simple SATA drive, the average sync time from 10 runs is 267.549
seconds with standard deviation 104.799426. With the patched kernel,
the average sync time from 10 runs is 2.995 seconds with standard
deviation 0.096.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce flag KM_ZERO which is used to alloc zeroed entry, and convert
kmem_{zone_}zalloc to call kmem_{zone_}alloc() with KM_ZERO directly,
in order to avoid the setting to zero step.
And following Dave's suggestion, make kmem_{zone_}zalloc static inline
into kmem.h as they're now just a simple wrapper.
V2:
Make kmem_{zone_}zalloc static inline into kmem.h as Dave suggested.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To help track down AGI/AGF lock ordering issues, I added these
tracepoints to tell us when an AGI or AGF is read and locked. With
these we can now determine if the lock ordering goes wrong from
tracing captures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
I debugging a log tail issue on a RHEL6 kernel, I added these trace
points to trace log items being added, moved and removed in the AIL
and how that affected the log tail LSN that was written to the log.
They were very helpful in that they immediately identified the cause
of the problem being seen. Hence I'd like to always have them
available for use.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Removing an inode from the namespace involves removing the directory
entry and dropping the link count on the inode. Removing the
directory entry can result in locking an AGF (directory blocks were
freed) and removing a link count can result in placing the inode on
an unlinked list which results in locking an AGI.
The big problem here is that we have an ordering constraint on AGF
and AGI locking - inode allocation locks the AGI, then can allocate
a new extent for new inodes, locking the AGF after the AGI.
Similarly, freeing the inode removes the inode from the unlinked
list, requiring that we lock the AGI first, and then freeing the
inode can result in an inode chunk being freed and hence freeing
disk space requiring that we lock an AGF.
Hence the ordering that is imposed by other parts of the code is AGI
before AGF. This means we cannot remove the directory entry before
we drop the inode reference count and put it on the unlinked list as
this results in a lock order of AGF then AGI, and this can deadlock
against inode allocation and freeing. Therefore we must drop the
link counts before we remove the directory entry.
This is still safe from a transactional point of view - it is not
until we get to xfs_bmap_finish() that we have the possibility of
multiple transactions in this operation. Hence as long as we remove
the directory entry and drop the link count in the first transaction
of the remove operation, there are no transactional constraints on
the ordering here.
Change the ordering of the operations in the xfs_remove() function
to align the ordering of AGI and AGF locking to match that of the
rest of the code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
At xfs_iext_add(), if extent(s) are being appended to the last page in
the indirection array and the new extent(s) don't fit in the page, the
number of extents(erp->er_extcount) in a new allocated entry should be
the minimum value between count and XFS_LINEAR_EXTS, instead of count.
For now, there is no existing test case can demonstrates a problem with
the er_extcount being set incorrectly here, but it obviously like a bug.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Today, if xfs_sb_read_verify encounters a v4 superblock
with junk past v4 fields which includes data in sb_crc,
it will be treated as a failing checksum and a significant
corruption.
There are known prior bugs which leave junk at the end
of the V4 superblock; we don't need to actually fail the
verification in this case if other checks pan out ok.
So if this is a secondary superblock, and the primary
superblock doesn't indicate that this is a V5 filesystem,
don't treat this as an actual checksum failure.
We should probably check the garbage condition as
we do in xfs_repair, and possibly warn about it
or self-heal, but that's a different scope of work.
Stable folks: This can go back to v3.10, which is what
introduced the sb CRC checking that is tripped up by old,
stale, incorrect V4 superblocks w/ unzeroed bits.
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In xlog_verify_iclog a debug check of the incore log buffers prints an
error if icptr is null and then goes on to dereference the pointer
regardless. Convert this to an assert so that the intention is clear.
This was reported by Coverty.
Signed-off-by: Ben Myers <bpm@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
ASSERT on args takes place after args dereference.
This assertion is redundant since we are going to panic anyway.
Found by Linux Driver Verification project (linuxtesting.org) -
PVS-Studio analyzer.
Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Page cache allocation doesn't always go through ->begin_write and
hence we don't always get the opportunity to set the allocation
context to GFP_NOFS. Failing to do this means we open up the direct
relcaim stack to recurse into the filesystem and consume a
significant amount of stack.
On RHEL6.4 kernels we are seeing ra_submit() and
generic_file_splice_read() from an nfsd context recursing into the
filesystem via the inode cache shrinker and evicting inodes. This is
causing truncation to be run (e.g EOF block freeing) and causing
bmap btree block merges and free space btree block splits to occur.
These btree manipulations are occurring with the call chain already
30 functions deep and hence there is not enough stack space to
complete such operations.
To avoid these specific overruns, we need to prevent the page cache
allocation from recursing via direct reclaim. We can do that because
the allocation functions take the allocation context from that which
is stored in the mapping for the inode. We don't set that right now,
so the default is GFP_HIGHUSER_MOVABLE, which is effectively a
GFP_KERNEL context. We need it to be the equivalent of GFP_NOFS, so
when we initialise an inode, set the mapping gfp mask appropriately.
This makes the use of AOP_FLAG_NOFS redundant from other parts of
the XFS IO path, so get rid of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
The kbuild test robot indicated that there were some new sparse
warnings in fs/xfs/xfs_dquot_buf.c. Actually, there were a lot more
that is wasn't warning about, so fix them all up.
Reported-by: kbuild test robot
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The directory block format verifier fails to check that the leaf
entry count is in a valid range, and so if it is corrupted then it
can lead to derefencing a pointer outside the block buffer. While we
can't exactly validate the count without first walking the directory
block, we can ensure the count lands in the valid area within the
directory block and hence avoid out-of-block references.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Rather than hiding the ftype field size accounting inside the dirent
padding for the ".." and first entry offset functions for v2
directory formats, add explicit functions that calculate it
correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Many of the vectorised function calls now take no parameters and
return a constant value. There is no reason for these to be vectored
functions, so convert them to constants
Binary sizes:
text data bss dec hex filename
794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig
792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1
792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2
789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3
789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4
789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5
789733 96802 1096 887631 d8b4f fs/xfs/xfs.o.p6
791421 96802 1096 889319 d91e7 fs/xfs/xfs.o.p7
791701 96802 1096 889599 d92ff fs/xfs/xfs.o.p8
791205 96802 1096 889103 d91cf fs/xfs/xfs.o.p9
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Next step in the vectorisation process is the directory free block
encode/decode operations. There are relatively few of these, though
there are quite a number of calls to them.
Binary sizes:
text data bss dec hex filename
794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig
792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1
792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2
789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3
789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4
789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5
789733 96802 1096 887631 d8b4f fs/xfs/xfs.o.p6
791421 96802 1096 889319 d91e7 fs/xfs/xfs.o.p7
791701 96802 1096 889599 d92ff fs/xfs/xfs.o.p8
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>