Split out the entire lookup loop from lookup_metapath and
fillup_metapath. Make both functions return the actual height in
mp->mp_aheight, and return 0 on success. Handle lookup errors properly
in trunc_dealloc.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
First, this function truncates the file in chunks. When the original
file size isn't block aligned, each chunk that is truncated will remain
be misaligned. This is inefficient.
Second, this function doesn't recognize where holes are, so it loops
through them. For each chunk of a hole, it creates a new transaction.
At least avoid creating another transactions whe the current one is
still empty. (An better fix would be to skip large holes, of course.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The current transaction is being dereferenced before asserting that is
not NULL; that isn't going to help.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Document when to use gfs2_blk2rgrpd for "inexact" resource group
matching. Based on that, fix an incorrect use of gfs2_blk2rgrpd in
sweep_bh_for_rgrps.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
We iterate through the entire ordered writes list in
gfs2_ordered_write() to write out inodes. It's a good
place to try and shrink the list by throwing out inodes
that don't have any pages.
Signed-off-by: Abhi Das <adas@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, there was a lot of code redundancy between functions
log_write_header (which uses bio) and clean_journal (which uses
buffer_head). This patch reduces the redundancy to simplify the code
and make log header writing more consistent. We want more consistency
and reduced redundancy because we plan to add a bunch of new fields
to improve performance (by eliminating the local statfs and quota files)
improve metadata integrity (by adding new crcs and such) and for better
debugging (by adding new fields to track when and where metadata was
pushed through the journals.) We don't want to duplicate setting these
new fields, nor allow for human error in the process.
This reduction in code redundancy is accomplished by introducing a new
helper function, gfs2_write_log_header which uses bio rather than bh.
That simplifies recovery function clean_journal() to use the new helper
function and iomap rather than redundancy and block_map (and eventually
we can maybe remove block_map). It also reduces our dependency on
buffer_heads.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add the rg_crc field to store a crc32 of the gfs2_rgrp structure. This
allows us to check resource group headers' integrity and removes the
requirement to check them against the rindex entries in fsck. If this
field is found to be zero, it should be ignored (or updated with an
accurate value).
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add rg_data0, rg_data and rg_bitbytes to struct gfs2_rgrp. The fields
are identical to their counterparts in struct gfs2_rindex and are
intended to reduce the use of the rindex. For now the fields are only
written back as the in-memory equivalents in struct gfs2_rgrpd are set
using values from the rindex. However, they are needed at this point so
that userspace can make use of them, allowing a migration away from the
rindex over time.
The new fields take up previously reserved space which was explicitly
zeroed on write so, in clusters with mixed kernels, these fields could
get zeroed after being set and this should not be treated as an error.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a new rg_skip field to struct gfs2_rgrp, replacing __pad. The
rg_skip field has the following meaning:
- If rg_skip is zero, it is considered unset and not useful.
- If rg_skip is non-zero, its value will be the number of blocks between
this rgrp's address and the next rgrp's address. This can be used as a
hint by fsck.gfs2 when rebuilding a bad rindex, for example.
This will provide less dependency on the rindex in future, and allow
tools such as fsck.gfs2 to iterate the resource groups without keeping
the rindex around.
The field is updated in gfs2_rgrp_out() so that existing file systems
will have it set. This means that any resource groups that aren't ever
written will not be updated. The final rgrp is a special case as there
is no next rgrp, so it will always have a rg_skip of 0 (unless the fs is
extended).
Before this patch, gfs2_rgrp_out() zeroes the __pad field explicitly, so
the rg_skip field can get set back to 0 in cases where nodes with and
without this patch are mixed in a cluster. In some cases, the field may
bounce between being set by one node and then zeroed by another which
may harm performance slightly, e.g. when two nodes create many small
files. In testing this situation is rare but it becomes more likely as
the filesystem fills up and there are fewer resource groups to choose
from. The problem goes away when all nodes are running with this patch.
Dipping into the space currently occupied by the rg_reserved field would
have resulted in the same problem as it is also explicitly zeroed, so
unfortunately there is no other way around it.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
like this is common:
ret = rhashtable_walk_start(rhiter);
if (ret && ret != -EAGAIN)
goto out;
Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.
This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in mulitple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.
Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a follow-up to commit d2bc5b3c67, remove the end parameter which is
now unused.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
init_gfs2_fs() is calling e.g. calling unregister_shrinker() without
register_shrinker() when an error occurred during initialization.
Rename goto labels and call appropriate undo function.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, function gfs2_free_di was 4 lines of code, and
one of those lines was to call gfs2_free_uninit_di. Although
unlikely, if function gfs2_free_uninit_di encountered an error
finding the block to be freed, the error was silently ignored by the
caller, which went ahead and improperly did a quota-change operation
and meta_wipe despite the error. This patch combines the two
functions into one to make the code more readable and fixes the bug
by returning from the combined function before it takes those next
incorrect steps.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in gfs2_write_cache_jdata(). Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
patches are basically in three categories: (1) patches related to
broken xfstest cases, (2) patches related to improving iomap and
start using it in GFS2, and (3) general typos and clarifications.
Please note that one of the iomap patches extends beyond GFS2 and
affects other file systems, but it was publically reviewed by a
variety of file system people in the community.
1. Andreas has a patch that simply renames variable 'bsize' to 'factor'
to clarify the logic related to gfs2_block_map.
2. He also has a patch to correctly set ctime in the setflags ioctl,
which fixes broken xfstests test 277.
3. He also fixed broken xfstest 258, due to an atime initialization
problem.
4. He also fixed broken xfstest 307, in which GFS2 was not setting
ctime when setting acls.
5. He has a patch to switch general iomap code from blkno to disk
offset for a variety of file systems.
6. He has a patch to add a new IOMAP_F_DATA_INLINE flag for iomap
to indicate blocks that have data mixed with metadata.
7. I contributed a patch to make inode height info part of the
'metapath' data structure to facilitate using iomap in GFS2.
8. I have a patch to start using iomap inside GFS2 and switch GFS2's
block_map functions to use iomap under the covers.
9. I have a patch to switch GFS2's fiemap implementation from using
block_map to using iomap under the covers.
10. Andreas has a patch to implement SEEK_HOLE and SEEK_DATA via
iomap in GFS2.
11. I have a patch related to journaled data pages not being properly
synced to media when writing inodes. This was caught with xfstests.
12. I have a patch to fix another failing xfstest case in which
switching a file from ordered_write to journaled data via set_flags
caused a deadlock.
13. Andreas has a patch to fix failing xfstest case 066, which was
due to not properly syncing dirty inodes when changing extended
attributes.
14. Andreas fixed a minor typo in a comment.
15. Andreas contributed a patch to partially fix xfstest 424, which
involved GET_FLAGS and SET_FLAGS ioctl. This is also a cleanup
and simplification of the translation of flags from fs flags to
gfs2 flags.
16. He also added support for STATX_ATTR_ in statx, which fixed broken
xfstest 424.
17. He also contributed a fix for failing xfstest 093 which fixes a
recursive glock problem with gfs2_xattr_get and _set.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaCx5uAAoJENeLYdPf93o7Nb8H/RLJ2CsEbTSJQ82RH4eptoxe
XbQ4HVig9Hm8k5teSTH9DdVypkxjPtbJZY9k1Y4mEtddtCZ/yS407aTdr/pP0C5r
3W8Ouu2JXmqPKWg0sp3wC/Pji2ThCYssQXNyBSDPADsF2C8XEuT7aL/YPzMitIdm
Lxa9JHo1tKgdFnkloNyaTt4MdBGNF5M5UBr6KgRfwhgooHWbxM0rNyZIXJtySb0I
vsaNNOA7a4VQp1Fo1DkHQomNbOG5hpVKfswUOOZvk2RdAewTPN+jXiOAmIhNjQ3Y
/PkJLjRCf8Ob/VIYmt2BTs16+07mODGv1d6DuhgXzH/dfiVihVGvVo71DxXx5uw=
=i8b2
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've got a total of 17 GFS2 patches for this merge window. The
patches are basically in three categories: (1) patches related to
broken xfstest cases, (2) patches related to improving iomap and start
using it in GFS2, and (3) general typos and clarifications.
Please note that one of the iomap patches extends beyond GFS2 and
affects other file systems, but it was publically reviewed by a
variety of file system people in the community.
From Andreas Gruenbacher:
- rename variable 'bsize' to 'factor' to clarify the logic related to
gfs2_block_map.
- correctly set ctime in the setflags ioctl, which fixes broken
xfstests test 277.
- fix broken xfstest 258, due to an atime initialization problem.
- fix broken xfstest 307, in which GFS2 was not setting ctime when
setting acls.
- switch general iomap code from blkno to disk offset for a variety
of file systems.
- add a new IOMAP_F_DATA_INLINE flag for iomap to indicate blocks
that have data mixed with metadata.
- implement SEEK_HOLE and SEEK_DATA via iomap in GFS2.
- fix failing xfstest case 066, which was due to not properly syncing
dirty inodes when changing extended attributes.
- fix a minor typo in a comment.
- partially fix xfstest 424, which involved GET_FLAGS and SET_FLAGS
ioctl. This is also a cleanup and simplification of the translation
of flags from fs flags to gfs2 flags.
- add support for STATX_ATTR_ in statx, which fixed broken xfstest
424.
- fix for failing xfstest 093 which fixes a recursive glock problem
with gfs2_xattr_get and _set
From me:
- make inode height info part of the 'metapath' data structure to
facilitate using iomap in GFS2.
- start using iomap inside GFS2 and switch GFS2's block_map functions
to use iomap under the covers.
- switch GFS2's fiemap implementation from using block_map to using
iomap under the covers.
- fix journaled data pages not being properly synced to media when
writing inodes. This was caught with xfstests.
- fix another failing xfstest case in which switching a file from
ordered_write to journaled data via set_flags caused a deadlock"
* tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Allow gfs2_xattr_set to be called with the glock held
gfs2: Add support for statx inode flags
gfs2: Fix and clean up {GET,SET}FLAGS ioctl
gfs2: Fix a harmless typo
gfs2: Fix xattr fsync
GFS2: Take inode off order_write list when setting jdata flag
GFS2: flush the log and all pages for jdata as we do for WB_SYNC_ALL
gfs2: Implement SEEK_HOLE / SEEK_DATA via iomap
GFS2: Switch fiemap implementation to use iomap
GFS2: Implement iomap for block_map
GFS2: Make height info part of metapath
gfs2: Always update inode ctime in set_acl
gfs2: Support negative atimes
gfs2: Update ctime in setflags ioctl
gfs2: Clarify gfs2_block_map
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On the following call path:
gfs2_setattr -> setattr_prepare -> ... ->
cap_inode_killpriv -> ... ->
gfs2_xattr_set
the glock is locked in gfs2_setattr, so check for recursive locking in
gfs2_xattr_set as gfs2_xattr_get already does. While at it, get rid of
need_unlock in gfs2_xattr_get.
Fixes xfstest generic/093.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add support for the STATX_ATTR_ flags in statx. (Compression,
encryption, and the nodump flag are not supported by gfs2.)
Partially fixes xfstest generic/424.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Switch to a simple array for mapping between the FS_*_FL and GFS_DIF_*
flags. Clarify how the mapping between FS_JOURNAL_DATA_FL and the
filesystem flags works. The GFS2_DIF_SYSTEM flag cannot be set from
user space, so remove it from GFS2_FLAGS_USER_SET. Fail with -EINVAL
when trying to set flags that are not supported instead of silently
ignoring those flags.
Partially fixes xfstest generic/424.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Make sure that changing xattrs marks the corresponding inode dirty so
that a subsequent fsync will sync those changes to disk. We set
I_DIRTY_SYNC as well as I_DIRTY_DATASYNC so that both fsync and
fdatasync will sync xattr changes: xattrs can contain information
critical to how the data can be accessed, so we don't want fdatasync
to skip them.
Fixes xfstest generic/066.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch fixes a deadlock caused when the jdata flag is set for
inodes that are already on the ordered write list. Since it is
on the ordered write list, log_flush calls gfs2_ordered_write which
calls filemap_fdatawrite. But since the inode had the jdata flag
set, that calls gfs2_jdata_writepages, which tries to start a new
transaction. A new transaction cannot be started because it tries
to acquire the log_flush rwsem which is already locked by the log
flush operation.
The bottom line is: We cannot switch an inode from ordered to jdata
until we eliminate any ordered data pages (via log flush) or any
log_flush operation afterward will create the circular dependency
above. So we need to flush the log before setting the diskflags to
switch the file mode, then we need to remove the inode from the
ordered writes list.
Before this patch, the log flush was done for jdata->ordered, but
that's wrong. If we're going from jdata to ordered, we don't need
to call gfs2_log_flush because the call to filemap_fdatawrite will
do it for us:
filemap_fdatawrite() -> __filemap_fdatawrite_range()
__filemap_fdatawrite_range() -> do_writepages()
do_writepages() -> gfs2_jdata_writepages()
gfs2_jdata_writepages() -> gfs2_log_flush()
This patch modifies function do_gfs2_set_flags so that if a file
has its jdata flag set, and it's already on the ordered write list,
the log will be flushed and it will be removed from the list
before setting the flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
In function gfs2_write_inode, starting with patch a9185b41a4, we
only flush the log and call filemap_fdatawait if we're passed in a
wbc sync_mode of WB_SYNC_ALL. We also need to do these things if
we're evicting a jdata inode, because we might have jdata pages
still attached to bufdata descriptors that need to be revoked, but
by the time it gets to evict() it's too late to start a new
transaction. This patch changes it to treat jdata inodes as if
WB_SYNC_ALL had been specified.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Abhijith Das <adas@redhat.com>
This patch switches GFS2's implementation of fiemap from the old
block_map code to the new iomap interface.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch implements iomap for block mapping, and switches the
block_map function to use it under the covers.
The additional IOMAP_F_BOUNDARY iomap flag indicates when iomap has
reached a "metadata boundary" and fetching the next mapping is likely to
incur an additional I/O. This flag is used for setting the bh buffer
boundary flag.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch eliminates height parameters from function gfs2_bmap_alloc.
Function find_metapath determines the metapath's "find height", also
known as the desired height. Function lookup_metapath determines the
metapath's "actual height", previously known as starting height or
sheight. Function gfs2_bmap_alloc now gets both height values from
the metapath. This simplification was done as a step toward switching
the block_map functions to using iomap. The bh_map responsibilities
are also removed from function gfs2_bmap_alloc for the same reason.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This tag is meant for pulling a patch called "gfs2: Fix
debugfs glocks dump" which fixes a regression introduced
by commit 88ffbf3e03. The regression caused the glock
dump in debugfs to not report all the glocks, which makes
debugging extremely difficult.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZyUI8AAoJENeLYdPf93o7iq4IAKhb9wJ8kmpu7LZ5k6Fl8BCy
GFztPe2bKsFG8cul1o1gZx8c/GWORaCHe3ZDI6pxl16/E+AvWoA1pKbBLYB1GSvD
90a7/m6+hx02ZXR/MHxBUQLWYXBtBrVMVcZDCmFMHWYCRUIiX2etPZL8wOXeJLTl
lNCSGdd1+3y6IJbthaIKTt1ctzsR8ZqV4QN786d2C3L9dxZ63FnAV43p3rUBzBLX
B5uT5LTmdWSLRqe0A9rnrPga/BfEnA8GDtIYUMic9Yz0Hq2a3vEnCC3P3Myp0DJZ
PGposwqL/emRhXkC4+ICrGsTOIy1BzwMXLF47GQaB/k+2Rd3/l9r/hU5ESjQOgA=
=taQL
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Bob Peterson:
"GFS2: Fix an old regression in GFS2's debugfs interface
This fixes a regression introduced by commit 88ffbf3e03 ("GFS2: Use
resizable hash table for glocks"). The regression caused the glock dump
in debugfs to not report all the glocks, which makes debugging
extremely difficult"
* tag 'gfs2-for-linus-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix debugfs glocks dump
Three-entry POSIX ACLs can be stored in the file mode permission bits,
with no need to store them in extended attributes. When a process sets
such a minimal ACL, the kernel updates the file mode like chmod does,
and removes any existing extended attributes for that ACL. Make sure
the ctime is always updated in that case.
Fixes xfstest generic/307.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When inodes are read from disk, GFS2 will only update in-memory atimes
older than the on-disk atimes; this prevents atimes from going
backwards. The atimes of newly allocated inodes are initialized to 0.
This means that when an atime is explicitly set to a negative value,
this value will not persist.
Fix by setting the atime of newly allocated inodes to the lowest
possible value instead of 0.
Fixes xfstest generic/258.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The FS_IOC_SETFLAGS ioctl is supposed to update the inode ctime.
Fixes xfstests generic/277.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a comment about the logical block size for directories. Rename
"bsize" in gfs2_block_map to "factor". Fix a typo in the description of
metaptr1.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The switch to rhashtables (commit 88ffbf3e03) broke the debugfs glock
dump (/sys/kernel/debug/gfs2/<device>/glocks) for dumps bigger than a
single buffer: the right function for restarting an rhashtable iteration
from the beginning of the hash table is rhashtable_walk_enter;
rhashtable_walk_stop + rhashtable_walk_start will just resume from the
current position.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.3+
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZrTy3AAoJEAAOaEEZVoIVaucP/ApBAj2S5wzvlV1u6l8E6ae7
ZeEEZfcWwzRYlKjZAkTWqj9XvGpDGO5gLq4wsZK2edFAq++/MJF8ZVtN4tdZ1kUZ
DUvRodtVOrT08Kp9wZXGT7JOFrf6U/6gMcR6p0MuWnHndeKYvlpcFi9NPT4EC9/z
Zm9V7gtlPdSOha7eaSjUS0+vLERkxqXLBW3Av9QUOBP/lbI3lqIroGKeHDYnVdya
2P/k5EcRRJMyJP6TqyYxmmJl+UWjJFMLvnlUDBslHnD/u3mIUhw3JLHYBjn5dZRE
Xjq56IDPoXDUvzlBhtn/Uqyx+/wtwsNsylpmKv6K5G1JfdeuSsPVsCey+A1cqV64
LpE5896wf9TmnmI9LNyh6vDn925xPSGBiF45UEp5f9aO7jXeY0MaEZ8g+ENqFIDK
v4gtZdS9FhYHV+/l4qEwYMKrqSbwKEs1r1FT+f4wnABby1ojfdA57ZPlp5PV2Vjp
szTp88Zkb7cMvZwEnWwxWofcJNmgS7uNahvnQF3IJ4ITsioEkuyYR3K4ZQMaaaV9
wCp6G0FhXZaK3OI7o9WiDwaO2elp9Hxc8bnqKpiBbHZkY0NLh7/++5VxpeNbTHFy
AGijQiiKGNNyYqNj93wq9jpVdMNjB0pXrHRxfav8v7MtQ+WfbEoAENF4T7hN7iXn
UuF6eSWEC5O1UCRUk1A+
=LLY3
-----END PGP SIGNATURE-----
Merge tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull writeback error handling updates from Jeff Layton:
"This pile continues the work from last cycle on better tracking
writeback errors. In v4.13 we added some basic errseq_t infrastructure
and converted a few filesystems to use it.
This set continues refining that infrastructure, adds documentation,
and converts most of the other filesystems to use it. The main
exception at this point is the NFS client"
* tag 'wberr-v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
ecryptfs: convert to file_write_and_wait in ->fsync
mm: remove optimizations based on i_size in mapping writeback waits
fs: convert a pile of fsync routines to errseq_t based reporting
gfs2: convert to errseq_t based writeback error reporting for fsync
fs: convert sync_file_range to use errseq_t based error-tracking
mm: add file_fdatawait_range and file_write_and_wait
fuse: convert to errseq_t based error tracking for fsync
mm: consolidate dax / non-dax checks for writeback
Documentation: add some docs for errseq_t
errseq: rename __errseq_set to errseq_set
because we held some back from the previous merge window until we
could get them perfected and well tested. We have a couple patch
sets, including my patch set for protecting glock gl_object and
Andreas Gruenbacher's patch set to fix the long-standing shrink-
slab hang, plus a bunch of assorted bugs and cleanups:
1. I fixed a bug whereby an IO error would lead to a double-brelse.
2. Andreas Gruenbacher made a minor cleanup to call his relatively
new function, gfs2_holder_initialized, rather than doing it
manually. This was just missed by a previous patch set.
3. Jan Kara fixed a bug whereby the SGID was being cleared when
inheriting ACLs.
4. Andreas found a bug and fixed it in his previous patch,
"Get rid of flush_delayed_work in gfs2_evict_inode". A call to
flush_delayed_work was deleted from *gfs2_inode_lookup and added
to gfs2_create_inode.
5. Wang Xibo found and fixed a list_add call in inode_go_lock
that specified the parameters in the wrong order.
6. Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
metadata reads that were accidentally missing them.
7 - 10. I submitted a 4-patch set to protect the glock gl_object
field. GFS2 was setting and checking gl_object with no locking
mechanism, so the value was occasionally stomped on, which caused
file system corruption.
11. I submitted a small cleanup to function gfs2_clear_rgrpd.
It was needlessly adding rgrp glocks to the lru list, then pulling
them back off immediately. The rgrp glocks don't use the lru list
anyway, so doing so was just a waste of time.
12. I submitted a patch that checks the GLOF_LRU flag on a glock
before trying to remove it from the lru_list. This avoids a lot
of unnecessary spin_lock contention.
13. I submitted a patch to delete GFS2's debugfs files only after
we evict all the glocks. Before this patch, GFS2 would delete the
debugfs files, and if unmount hung waiting for a glock, there was
no way to debug the problem. Now, if a hang occurs during umount,
we can examine the debugfs files to figure out why it's hung.
14. Andreas Gruenbacher submitted a patch to fix some trivial typos.
15 - 19. Andreas also submitted a five-part patch set to fix the
longstanding hang involving the slab shrinker: dlm requires
memory, calls the inode shrinker, which calls gfs2's evict, which
calls back into DLM before it can evict an inode.
20. Abhi Das submitted a patch to forcibly flush the active items
list to relieve memory pressure. This fixes a long-standing bug
whereby GFS2 was getting hung permanently in balance_dirty_pages.
21. Thomas Tai submitted a patch to fix a slab corruption problem
due to a residual pointer left in the lock_dlm lockstruct.
22. I submitted a patch to withdraw the file system if IO errors
are encountered while writing to the journals or statfs system
file which were previously not being sent back up. Before, some
IO errors were sometimes not be detected for several hours, and
at recovery time, the journal errors made journal replay
impossible.
23. Andreas has a patch to fix an annoying format-truncation compiler
warning so GFS2 compiles cleanly.
24. I have a patch that fixes a handful of sparse compiler warnings.
25. Andreas fixed up an useless gl_object warning caused by an
earlier patch.
26. Arvind Yadav added a patch to properly constify our rhashtable
params declare.
27. I added a patch to fix a regression caused by the non-recursive
delete and truncate patch that caused file system blocks to not
be properly freed.
28. Ernesto A. Fernández added a patch to fix a place where GFS2
would send back the wrong return code setting extended attributes.
29. Ernesto also added a patch to fix a case in which GFS2 was
improperly setting an inode's i_mode, potentially granting access
to the wrong users.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZrMC2AAoJENeLYdPf93o7PIQIAKY4hdC2pMM5tiiIHx5fPAAr
tjpVuFkDQzyEaTb9sArVLxEdva3ShKERQKoYq/VVxqbAEwPgXbzJFNNil1WTJi1t
J2gE4wE4G5x1+A7XDzCdPI8KAcF+yX63AaFYlVKyuZSq5w7njIRc1Vk+TFiIexxC
xb0nP0g9L6Zt114rE8kfi0/GLjTO9vOKM3XsJgG612I3/cs3RUx4gJ+nSUG0bYLA
qoBIXEJ3SFHw2Zr/LgHZ9QDHnlPVl3bjg03sRQaWZms7XbLegDBYsDSvS1HLZ300
gjTc0Dgz/6KwzDVJ7cZ/fPNYtIFY58tKs6aqqDTrCncsX9nPjcTAxYkBNWsFyZM=
=tXJ8
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got a whopping 29 GFS2 patches for this merge window, mainly
because we held some back from the previous merge window until we
could get them perfected and well tested. We have a couple patch sets,
including my patch set for protecting glock gl_object and Andreas
Gruenbacher's patch set to fix the long-standing shrink- slab hang,
plus a bunch of assorted bugs and cleanups.
Summary:
- I fixed a bug whereby an IO error would lead to a double-brelse.
- Andreas Gruenbacher made a minor cleanup to call his relatively new
function, gfs2_holder_initialized, rather than doing it manually.
This was just missed by a previous patch set.
- Jan Kara fixed a bug whereby the SGID was being cleared when
inheriting ACLs.
- Andreas found a bug and fixed it in his previous patch, "Get rid of
flush_delayed_work in gfs2_evict_inode". A call to
flush_delayed_work was deleted from *gfs2_inode_lookup and added to
gfs2_create_inode.
- Wang Xibo found and fixed a list_add call in inode_go_lock that
specified the parameters in the wrong order.
- Coly Li submitted a patch to add the REQ_PRIO to some of GFS2's
metadata reads that were accidentally missing them.
- I submitted a 4-patch set to protect the glock gl_object field.
GFS2 was setting and checking gl_object with no locking mechanism,
so the value was occasionally stomped on, which caused file system
corruption.
- I submitted a small cleanup to function gfs2_clear_rgrpd. It was
needlessly adding rgrp glocks to the lru list, then pulling them
back off immediately. The rgrp glocks don't use the lru list
anyway, so doing so was just a waste of time.
- I submitted a patch that checks the GLOF_LRU flag on a glock before
trying to remove it from the lru_list. This avoids a lot of
unnecessary spin_lock contention.
- I submitted a patch to delete GFS2's debugfs files only after we
evict all the glocks. Before this patch, GFS2 would delete the
debugfs files, and if unmount hung waiting for a glock, there was
no way to debug the problem. Now, if a hang occurs during umount,
we can examine the debugfs files to figure out why it's hung.
- Andreas Gruenbacher submitted a patch to fix some trivial typos.
- Andreas also submitted a five-part patch set to fix the
longstanding hang involving the slab shrinker: dlm requires memory,
calls the inode shrinker, which calls gfs2's evict, which calls
back into DLM before it can evict an inode.
- Abhi Das submitted a patch to forcibly flush the active items list
to relieve memory pressure. This fixes a long-standing bug whereby
GFS2 was getting hung permanently in balance_dirty_pages.
- Thomas Tai submitted a patch to fix a slab corruption problem due
to a residual pointer left in the lock_dlm lockstruct.
- I submitted a patch to withdraw the file system if IO errors are
encountered while writing to the journals or statfs system file
which were previously not being sent back up. Before, some IO
errors were sometimes not be detected for several hours, and at
recovery time, the journal errors made journal replay impossible.
- Andreas has a patch to fix an annoying format-truncation compiler
warning so GFS2 compiles cleanly.
- I have a patch that fixes a handful of sparse compiler warnings.
- Andreas fixed up an useless gl_object warning caused by an earlier
patch.
- Arvind Yadav added a patch to properly constify our rhashtable
params declare.
- I added a patch to fix a regression caused by the non-recursive
delete and truncate patch that caused file system blocks to not be
properly freed.
- Ernesto A. Fernández added a patch to fix a place where GFS2 would
send back the wrong return code setting extended attributes.
- Ernesto also added a patch to fix a case in which GFS2 was
improperly setting an inode's i_mode, potentially granting access
to the wrong users"
* tag 'gfs2-4.14.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (29 commits)
gfs2: preserve i_mode if __gfs2_set_acl() fails
gfs2: don't return ENODATA in __gfs2_xattr_set unless replacing
GFS2: Fix non-recursive truncate bug
gfs2: constify rhashtable_params
GFS2: Fix gl_object warnings
GFS2: Fix up some sparse warnings
gfs2: Silence gcc format-truncation warning
GFS2: Withdraw for IO errors writing to the journal or statfs
gfs2: fix slab corruption during mounting and umounting gfs file system
gfs2: forcibly flush ail to relieve memory pressure
gfs2: Clean up waiting on glocks
gfs2: Defer deleting inodes under memory pressure
gfs2: gfs2_evict_inode: Put glocks asynchronously
gfs2: Get rid of gfs2_set_nlink
gfs2: gfs2_glock_get: Wait on freeing glocks
gfs2: Fix trivial typos
GFS2: Delete debugfs files only after we evict the glocks
GFS2: Don't waste time locking lru_lock for non-lru glocks
GFS2: Don't bother trying to add rgrps to the lru list
GFS2: Clear gl_object when deleting an inode in gfs2_delete_inode
...
When changing a file's acl mask, __gfs2_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.
Prevent this by only changing the inode mode after the acl has been set.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The function __gfs2_xattr_set() will return -ENODATA when called to
remove a xattr that does not exist. The result is that setfacl will
show an exit status of 1 when called to set only a file's mode bits
(on a file with no ACLs), despite succeeding. A "No data available"
error will be printed as well.
To fix this return 0 instead, except when the XATTR_REPLACE flag is
set, in which case -ENODATA is appropriate. This is consistent with
how most other xattr setting functions work, in other filesystems.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch if you truncated a file to a smaller size it
wasn't freeing all the blocks properly. There are two reasons.
First, the metapath comparison was not comparing previous heights.
I added a function, mp_eq_to_hgt, which checks the metapath at
all heights prior to the target height.
Second, in function find_nonnull_ptr, it needed to zero out all
pointers for heights following the target height. Translated into
decimal integer terms, this way a number like 299, when incremented,
becomes 300, not 399. The 2 gets incremented to 3, and the following
digits need to be reset.
These two things allow the truncate state machine to properly find
the blocks it needs to delete.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
rhashtable_params are not supposed to change at runtime. All
Functions rhashtable_* working with const rhashtable_params
provided by <linux/rhashtable.h>. So mark the non-const structs
as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The following cleanup is needed to avoid spilling the syslog with
false warnings.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch cleans up various pieces of GFS2 to avoid sparse errors.
This doesn't fix them all, but it fixes several. The first error,
in function glock_hash_walk was a genuine bug where the rhashtable
could be started and not stopped.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Enlarge sd_fsname to be big enough for the longest long lock table name
and an arbitrary journal number. This silences two -Wformat-truncation
warnings with gcc 7.1.1.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, if GFS2 encountered IO errors while writing to
the journal, it would not report the problem, so they would go
unnoticed, sometimes for many hours. Sometimes this would only be
noticed later, when recovery tried to do journal replay and failed
due to invalid metadata at the blocks that resulted in IO errors.
This patch makes GFS2's log daemon check for IO errors. If it
encounters one, it withdraws from the file system and reports
why in dmesg. A similar action is taken when IO errors occur when
writing to the system statfs file.
These errors are also reported back to any callers of fsync, since
that requires the journal to be flushed. Therefore, any IO errors
that would previously go unnoticed are now noticed and the file
system is withdrawn as early as possible, thus preventing further
file system damage.
Also note that this reintroduces superblock variable sd_log_error,
which Christoph removed with commit f729b66fca.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using cman-3.0.12.1 and gfs2-utils-3.0.12.1, mounting and
unmounting GFS2 file system would cause kernel to hang. The slab
allocator suggests that it is likely a double free memory corruption.
The issue is traced back to v3.9-rc6 where a patch is submitted to
use kzalloc() for storing a bitmap instead of using a local variable.
The intention is to allocate memory during mount and to free memory
during unmount. The original patch misses a code path which has
already freed the memory and caused memory corruption. This patch sets
the memory pointer to NULL after the memory is freed, so that double
free memory corruption will not happen.
gdlm_mount()
'-- set_recover_size() which use kzalloc()
'-- if dlm does not support ops callbacks then
'--- free_recover_size() which use kfree()
gldm_unmount()
'-- free_recover_size() which use kfree()
Previous patch which introduced the double free issue is
commit 57c7310b8e ("GFS2: use kmalloc for lvb bitmap")
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
On systems with low memory, it is possible for gfs2 to infinitely
loop in balance_dirty_pages() under heavy IO (creating sparse files).
balance_dirty_pages() attempts to write out the dirty pages via
gfs2_writepages() but none are found because these dirty pages are
being used by the journaling code in the ail. Normally, the journal
has an upper threshold which when hit triggers an automatic flush
of the ail. But this threshold can be higher than the number of
allowable dirty pages and result in the ail never being flushed.
This patch forces an ail flush when gfs2_writepages() fails to write
anything. This is a good indication that the ail might be holding
some dirty pages.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The prepare_to_wait_on_glock and finish_wait_on_glock functions introduced in
commit 56a365be "gfs2: gfs2_glock_get: Wait on freeing glocks" are
better removed, resulting in cleaner code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When under memory pressure and an inode's link count has dropped to
zero, defer deleting the inode to the delete workqueue. This avoids
calling into DLM under memory pressure, which can deadlock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
gfs2_evict_inode is called to free inodes under memory pressure. The
function calls into DLM when an inode's last cluster-wide reference goes
away (remote unlink) and to release the glock and associated DLM lock
before finally destroying the inode. However, if DLM is blocked on
memory to become available, calling into DLM again will deadlock.
Avoid that by decoupling releasing glocks from destroying inodes in that
case: with gfs2_glock_queue_put, glocks will be dequeued asynchronously
in work queue context, when the associated inodes have likely already
been destroyed.
With this change, inodes can end up being unlinked, remote-unlink can be
triggered, and then the inode can be reallocated before all
remote-unlink callbacks are processed. To detect that, revalidate the
link count in gfs2_evict_inode to make sure we're not deleting an
allocated, referenced inode.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Remove gfs2_set_nlink which prevents the link count of an inode from
becoming non-zero once it has reached zero. The next commit reduces the
amount of waiting on glocks when an inode is evicted from memory. With
that, an inode can become reallocated before all the remote-unlink
callbacks from a previous delete are processed, which causes the link
count to change from zero to non-zero.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Keep glocks in their hash table until they are freed instead of removing
them when their last reference is dropped. This allows to wait for any
previous instances of a glock to go away in gfs2_glock_get before
creating a new glocks.
Special thanks to Andy Price for finding and fixing a problem which also
required us to delete the rcu_read_unlock from the error case in function
gfs2_glock_get.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch moves the call to gfs2_delete_debugfs_file so that it
comes after the glock hash table has been cleared. This way we
can query the debugfs files if umount hangs.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, glock_dq would call gfs2_glock_remove_from_lru.
For glocks that are never put on the LRU, such as the transaction
glock, this just takes the spin_lock, determines there's nothing to
be done because the list is empty, then unlocks again. This was
causing unnecessary lock contention on the lru_lock spin_lock.
This patch adds a check for GLOF_LRU in the glops before taking
the spin_lock.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch removes a call to gfs2_glock_add_to_lru from function
gfs2_clear_rgrpd. The call is just a waste of time because as soon
as it adds it to the lru_list, the call to gfs2_glock_put takes it
back off again.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch adds some calls to clear gl_object in function
gfs2_delete_inode. Since we are deleting the inode, and the glock
typically outlives the inode in core, we must clear gl_object
so subsequent use of the glock (e.g. for a new inode in its place)
will not have the old pointer sitting there. In error cases we
need to tidy up after ourselves. In non-error cases, we need to
clear gl_object before we set the block free in the bitmap so
residules aren't left for potential inode creators.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
If function gfs2_create_inode fails after the inode has been
created (for example, if the inode_refresh fails for some reason)
the function was setting gl_object but never clearing it again.
The glocks are left pointing to a freed inode. This patch adds
the calls to clear gl_object in the appropriate error paths.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the inode glock's gl_object was set after a
reference was acquired, but before the block type was verified.
In cases where the block was unlinked, then freed and reused on
another node, a residule delete callback (delete_work) would try
to look up the inode, eventually failing the block check, but
only after it overwrites gl_object with a pointer to the wrong
inode. This patch moves the assignment of gl_object after the
block check so it won't be improperly overwritten.
Likewise, at the end of the function, gfs2_inode_lookup was
clearing gl_object after it unlocked the glock, which meant
another process might free the glock in the meantime. This
patch guards against that case.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch introduces a new helper function in glock.h that
clears gl_object, with an added integrity check. An additional
integrity check has been added to glock_set_object, plus comments.
This is step 1 in a series to ensure gl_object integrity.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of
the bio. But flag REQ_META is just a hint for block trace, not for block
layer code to handle a bio as metadata request.
For some of metadata I/Os of gfs2, A REQ_PRIO flag on the metadata bio
would be very informative to block layer code. For example, if bcache is
used as a I/O cache for gfs2, it will be possible for bcache code to get
the hint and cache the pre-fetched metadata blocks on cache device. This
behavior may be helpful to improve metadata I/O performance if the
following requests hit the cache.
Here are the locations in gfs2 code where a REQ_PRIO flag should be added,
- All places where REQ_READAHEAD is used, gfs2 code uses this flag for
metadata read ahead.
- In gfs2_meta_rq() where the first metadata block is read in.
- In gfs2_write_buf_to_page(), read in quota metadata blocks to have them
up to date.
These metadata blocks are probably to be accessed again in future, adding
a REQ_PRIO flag may have bcache to keep such metadata in fast cache
device. For system without a cache layer, REQ_PRIO can still provide hint
to block layer to handle metadata requests more properly.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In inode_go_lock() function, the parameter order of list_add() is error.
According to the define of list_add(), the first parameter is new entry
and the second is the list head, so ip->i_trunc_list should be the
first parameter and the sdp->sd_trunc_list should be second.
Signed-off-by: Wang Xibo<wang.xibo@zte.com.cn>
Signed-off-by: Xiao Likun<xiao.likun@zte.com.cn>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When commit 4fd1a57952 moved the call to flush_delayed_work from
gfs2_evict_inode to gfs2_inode_lookup to avoid calling into DLM during
evict, a similar call should have been added to gfs2_create_inode:
that's another code path in which glocks of previous inodes may be
reused.
The flush of the iopen glock work queue added by 4fd1a57952, on the
other hand, is unnecessary and can be removed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__gfs2_set_acl() into gfs2_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.
Fixes: 073931017b
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_holder_initialized should be used in do_flock as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, problems reading in indirect buffers would send
an IO error back to the caller, and release the buffer_head with
brelse() in function gfs2_meta_indirect_buffer, however, it would
still return the address of the buffer_head it released. After the
error was discovered, function gfs2_block_map would call function
release_metapath to free all buffers. That checked:
if (mp->mp_bh[i] == NULL) but since the value was set after the
error, it was non-zero, so brelse was called a second time. This
resulted in the following error:
kernel: WARNING: at fs/buffer.c:1224 __brelse+0x3a/0x40() (Tainted: G W -- ------------ )
kernel: Hardware name: RHEV Hypervisor
kernel: VFS: brelse: Trying to free free buffer
This patch changes gfs2_meta_indirect_buffer so it only sets
the buffer_head pointer in cases where it isn't released.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull ->s_options removal from Al Viro:
"Preparations for fsmount/fsopen stuff (coming next cycle). Everything
gets moved to explicit ->show_options(), killing ->s_options off +
some cosmetic bits around fs/namespace.c and friends. Basically, the
stuff needed to work with fsmount series with minimum of conflicts
with other work.
It's not strictly required for this merge window, but it would reduce
the PITA during the coming cycle, so it would be nice to have those
bits and pieces out of the way"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
isofs: Fix isofs_show_options()
VFS: Kill off s_options and helpers
orangefs: Implement show_options
9p: Implement show_options
isofs: Implement show_options
afs: Implement show_options
affs: Implement show_options
befs: Implement show_options
spufs: Implement show_options
bpf: Implement show_options
ramfs: Implement show_options
pstore: Implement show_options
omfs: Implement show_options
hugetlbfs: Implement show_options
VFS: Don't use save/replace_mount_options if not using generic_show_options
VFS: Provide empty name qstr
VFS: Make get_filesystem() return the affected filesystem
VFS: Clean up whitespace in fs/namespace.c and fs/super.c
Provide a function to create a NUL-terminated string from unterminated data
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
yKWjj9wfYRQ0vSUqhsI5
=3Z93
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.
The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.
For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.
Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.
But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.
This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing
2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.
3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.
To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.
Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:
1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.
2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.
This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:
https://lwn.net/Articles/718734/https://lwn.net/Articles/724307/"
* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty
Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
glocks were freed via call_rcu to allow reading the glock hashtable
locklessly using rcu. This was then changed to free glocks immediately,
which made reading the glock hashtable unsafe. Bring back the original
code for freeing glocks via call_rcu.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # 4.3+
I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.
The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.
Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.
This sets the error in the mapping earlier, at the time that it's first
detected.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Provide an empty name (ie. "") qstr for general use.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1. Andreas Gruenbacher has four patches related to cleaning up the GFS2
inode evict process. This is about half of his patches designed to
fix a long-standing GFS2 hang related to the inode shrinker.
(Shrinker calls gfs2 evict, evict calls DLM, DLM requires memory
and blocks on the shrinker.) These 4 patches have been well tested.
His second set of patches are still being tested, so I plan to hold
them until the next merge window, after we have more weeks of testing.
The first patch eliminates the flush_delayed_work, which can block.
2. Andreas's second patch protects setting of gl_object for rgrps with
a spin_lock to prevent proven races.
3. His third patch introduces a centralized mechanism for queueing glock
work with better reference counting, to prevent more races.
4. His fourth patch retains a reference to inode glocks when an error
occurs while creating an inode. This keeps the subsequent evict from
needing to reacquire the glock, which might call into DLM and block
in low memory conditions.
5. Arvind Yadav has a patch to add const to attribute_group structures.
6. I have a patch to detect directory entry inconsistencies and withdraw
the file system if any are found. Better that than silent corruption.
7. I have a patch to remove a vestigial variable from glock structures,
saving some slab space.
8. I have another patch to remove a vestigial variable from the GFS2
in-core superblock structure.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZXOIfAAoJENeLYdPf93o7RVcH/jLEK3hmZOd94pDTYg3Damuo
KI3xjyutDgQT83uwg8p5UBPwRYCDnyiOLwOWGBJJvjPEI1S4syrXq/FzOmxmX6cV
nE28ARL/OXCoFEXBMUVHvHL3nK+zEUr8rO6Xz51B1ifVq7GV8iVK+ZgxzRhx0PWP
f+0SVHiQtU0HKyxR5y9p43oygtHZaGbjy4WL0YbmFZM59y5q9A8rBHFACn2JyPBm
/zXN6gF/Orao+BDXLT6OM3vNXZcOQ7FUPWwctguHsAO/bLzWiISyfJxLWJsHvSdW
tzFTN1DByjXvqAhs4HTSuh9JfBDAyxcXkmczXJyATBkCTEJv42Iev+ILmre+wwQ=
=YTwn
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got eight GFS2 patches for this merge window:
- Andreas Gruenbacher has four patches related to cleaning up the
GFS2 inode evict process. This is about half of his patches
designed to fix a long-standing GFS2 hang related to the inode
shrinker: Shrinker calls gfs2 evict, evict calls DLM, DLM requires
memory and blocks on the shrinker.
These four patches have been well tested. His second set of patches
are still being tested, so I plan to hold them until the next merge
window, after we have more weeks of testing. The first patch
eliminates the flush_delayed_work, which can block.
- Andreas's second patch protects setting of gl_object for rgrps with
a spin_lock to prevent proven races.
- His third patch introduces a centralized mechanism for queueing
glock work with better reference counting, to prevent more races.
-His fourth patch retains a reference to inode glocks when an error
occurs while creating an inode. This keeps the subsequent evict
from needing to reacquire the glock, which might call into DLM and
block in low memory conditions.
- Arvind Yadav has a patch to add const to attribute_group
structures.
- I have a patch to detect directory entry inconsistencies and
withdraw the file system if any are found. Better that than silent
corruption.
- I have a patch to remove a vestigial variable from glock
structures, saving some slab space.
- I have another patch to remove a vestigial variable from the GFS2
in-core superblock structure"
* tag 'gfs2-4.13.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: constify attribute_group structures.
gfs2: gfs2_create_inode: Keep glock across iput
gfs2: Clean up glock work enqueuing
gfs2: Protect gl->gl_object by spin lock
gfs2: Get rid of flush_delayed_work in gfs2_evict_inode
GFS2: Eliminate vestigial sd_log_flush_wrapped
GFS2: Remove gl_list from glock structure
GFS2: Withdraw when directory entry inconsistencies are detected
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
5259 1344 8 6611 19d3 fs/gfs2/sys.o
File size After adding 'const':
text data bss dec hex filename
5371 1216 8 6595 19c3 fs/gfs2/sys.o
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
On failure, keep the inode glock across the final iput of the new inode
so that gfs2_evict_inode doesn't have to re-acquire the glock. That
way, gfs2_evict_inode won't need to revalidate the block type.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch adds a standardized queueing mechanism for glock work
with spin_lock protection to prevent races.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Put all remaining accesses to gl->gl_object under the
gl->gl_lockref.lock spinlock to prevent races.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
So far, gfs2_evict_inode clears gl->gl_object and then flushes the glock
work queue to make sure that inode glops which dereference gl->gl_object
have finished running before the inode is destroyed. However, flushing
the work queue may do more work than needed, and in particular, it may
call into DLM, which we want to avoid here. Use a bit lock
(GIF_GLOP_PENDING) to synchronize between the inode glops and
gfs2_evict_inode instead to get rid of the flushing.
In addition, flush the work queues of existing glocks before reusing
them for new inodes to get those glocks into a known state: the glock
state engine currently doesn't handle glock re-appropriation correctly.
(We may be able to fix the glock state engine instead later.)
Based on a patch by Steven Whitehouse <swhiteho@redhat.com>.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch prints an inode consistency error and withdraws the file
system when directory entry counts are mismatched.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers. More to come..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
Fixes: b685d3d65a
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
CC: stable@vger.kernel.org
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
1. Andreas Gruenbacher wrote a patch to replace the deprecated
call to rhashtable_walk_init with rhashtable_walk_enter.
2. Andreas also wrote a patch to eliminate redundant code in
two of our debugfs sequence files.
3. Andreas also cleaned up the rhashtable key ugliness Linus
pointed out during this cycle, following Linus's suggestions.
4. Andreas also wrote a patch to take advantage of his new
function rhashtable_lookup_get_insert_fast. This makes glock
lookup faster and more bullet-proof.
5. Andreas also wrote a patch to revert a patch in the evict
path that caused occasional deadlocks, and is no longer
needed.
6. Andrew Price wrote a patch to re-enable fallocate for the
rindex system file to enable gfs2_grow to grow properly on
secondary file system grow operations.
7. I wrote a patch to initialize an inode number field to make
certain kernel trace points more understandable.
8. I also wrote a patch that makes GFS2 file system "withdraw"
work more like it should by ignoring operations after a
withdraw that would formerly cause a BUG() and kernel panic.
9. I also reworked the entire truncate/delete algorithm,
scrapping the old recursive algorithm in favor of a new
non-recursive algorithm. This was done for performance:
This way, GFS2 no longer needs to lock multiple resource
groups while doing truncates and deletes of files that cross
multiple resource group boundaries, allowing for better
parallelism. It also solves a problem whereby deleting large
files would request a large chunk of kernel memory, which
resulted in a get_page_from_freelist warning.
10. Due to a regression found during testing, I added a new
patch to correct "GFS2: Prevent BUG from occurring when
normal Withdraws occur".
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZDNnaAAoJENeLYdPf93o7B7kIAJzwz7vVDVg2TpWVhMmXIWhf
rZx3Gth5F0h+ZHddW7HzTLg+64XQ5//GyDD3UDtCpkhl5SJH+nt3juHyPJlRwioT
0ua4SjyKLQSoJJVAEgAwu42QjORTXab7NjYn5LEhvRc0Gg/El9WGU+ZgmP2/aAvf
KE2u/IEYNDkoJNS3Oqc7shajAyLYda6wCAASs/1ZGt9u48m/o/I23Zd7wr7EOkzw
rd3gB0x80cJqDAB5IcymGOm111Tg4g34LwsRuyMnWE3H1jOgV+J515FVHEIvZuPq
Wl9X7V8CzktI7nyLKVnZhpuv5JzyMq/vOPiD01tTFx8Oy1JCRezjmATXFjW/zIo=
=MX3c
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got ten GFS2 patches for this merge window.
- Andreas Gruenbacher wrote a patch to replace the deprecated call to
rhashtable_walk_init with rhashtable_walk_enter.
- Andreas also wrote a patch to eliminate redundant code in two of
our debugfs sequence files.
- Andreas also cleaned up the rhashtable key ugliness Linus pointed
out during this cycle, following Linus's suggestions.
- Andreas also wrote a patch to take advantage of his new function
rhashtable_lookup_get_insert_fast. This makes glock lookup faster
and more bullet-proof.
- Andreas also wrote a patch to revert a patch in the evict path that
caused occasional deadlocks, and is no longer needed.
- Andrew Price wrote a patch to re-enable fallocate for the rindex
system file to enable gfs2_grow to grow properly on secondary file
system grow operations.
- I wrote a patch to initialize an inode number field to make certain
kernel trace points more understandable.
- I also wrote a patch that makes GFS2 file system "withdraw" work
more like it should by ignoring operations after a withdraw that
would formerly cause a BUG() and kernel panic.
- I also reworked the entire truncate/delete algorithm, scrapping the
old recursive algorithm in favor of a new non-recursive algorithm.
This was done for performance: This way, GFS2 no longer needs to
lock multiple resource groups while doing truncates and deletes of
files that cross multiple resource group boundaries, allowing for
better parallelism. It also solves a problem whereby deleting large
files would request a large chunk of kernel memory, which resulted
in a get_page_from_freelist warning.
- Due to a regression found during testing, I added a new patch to
correct 'GFS2: Prevent BUG from occurring when normal Withdraws
occur'."
* tag 'gfs2-4.12.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Allow glocks to be unlocked after withdraw
GFS2: Non-recursive delete
gfs2: Re-enable fallocate for the rindex
Revert "GFS2: Wait for iopen glock dequeues"
gfs2: Switch to rhashtable_lookup_get_insert_fast
GFS2: Temporarily zero i_no_addr when creating a dinode
gfs2: Don't pack struct lm_lockname
gfs2: Deduplicate gfs2_{glocks,glstats}_open
gfs2: Replace rhashtable_walk_init with rhashtable_walk_enter
GFS2: Prevent BUG from occurring when normal Withdraws occur
This bug fixes a regression introduced by patch 0d1c7ae9d8.
The intent of the patch was to stop promoting glocks after a
file system is withdrawn due to a variety of errors, because doing
so results in a BUG(). (You should be able to unmount after a
withdraw rather than having the kernel panic.)
Unfortunately, it also stopped demotions, so glocks could not be
unlocked after withdraw, which means the unmount would hang.
This patch allows function do_xmote to demote locks to an
unlocked state after a withdraw, but not promote them.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Now that all bdi structures filesystems use are properly refcounted, we
can remove the SB_I_DYNBDI flag.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similarly to set_bdev_super() GFS2 just used block device reference to
bdi. Convert it to properly getting bdi reference. The reference will
get automatically dropped on superblock destruction.
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: Bob Peterson <rpeterso@redhat.com>
CC: cluster-devel@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>