We have observed a BUG() crash in fs/attr.c:notify_change(). The crash
occurs during an rsync into a filesystem that is exported via NFS.
1.) fs/attr.c:notify_change() modifies the caller's version of attr.
2.) 6de0ec00ba ("VFS: make notify_change pass ATTR_KILL_S*ID to
setattr operations") introduced a BUG() restriction such that "no
function will ever call notify_change() with both ATTR_MODE and
ATTR_KILL_S*ID set". Under some circumstances though, it will have
assisted in setting the caller's version of attr to this very
combination.
3.) 27ac0ffeac ("locks: break delegations on any attribute
modification") introduced code to handle breaking
delegations. This can result in notify_change() being re-called. attr
_must_ be explicitly reset to avoid triggering the BUG() established
in #2.
4.) The path that that triggers this is via fs/open.c:chmod_common().
The combination of attr flags set here and in the first call to
notify_change() along with a later failed break_deleg_wait()
results in notify_change() being called again via retry_deleg
without resetting attr.
Solution is to move retry_deleg in chmod_common() a bit further up to
ensure attr is completely reset.
There are other places where this seemingly could occur, such as
fs/utimes.c:utimes_common(), but the attr flags are not initially
set in such a way to trigger this.
Fixes: 27ac0ffeac ("locks: break delegations on any attribute modification")
Reported-by: Eric Meddaugh <etmsys@rit.edu>
Tested-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On a distributed filesystem it's possible for lookup to discover that a
directory it just found is already cached elsewhere in the directory
heirarchy. The dcache won't let us keep the directory in both places,
so we have to move the dentry to the new location from the place we
previously had it cached.
If the parent has changed, then this requires all the same locks as we'd
need to do a cross-directory rename. But we're already in lookup
holding one parent's i_mutex, so it's too late to acquire those locks in
the right order.
The (unreliable) solution in __d_unalias is to trylock() the required
locks and return -EBUSY if it fails.
I see no particular reason for returning -EBUSY, and -ESTALE is already
the result of some other lookup races on NFS. I think -ESTALE is the
more helpful error return. It also allows us to take advantage of the
logic Jeff Layton added in c6a9428401 "vfs: fix renameat to retry on
ESTALE errors" and ancestors, which hopefully resolves some of these
errors before they're returned to userspace.
I can reproduce these cases using NFS with:
ssh root@$client '
mount -olookupcache=pos '$server':'$export' /mnt/
mkdir /mnt/TO
mkdir /mnt/DIR
touch /mnt/DIR/test.txt
while true; do
strace -e open cat /mnt/DIR/test.txt 2>&1 | grep EBUSY
done
'
ssh root@$server '
while true; do
mv $export/DIR $export/TO/DIR
mv $export/TO/DIR $export/DIR
done
'
It also helps to add some other concurrent use of the directory on the
client (e.g., "ls /mnt/TO"). And you can replace the server-side mv's
by client-side mv's that are repeatedly killed. (If the client is
interrupted while waiting for the RENAME response then it's left with a
dentry that has to go under one parent or the other, but it doesn't yet
know which.)
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For one thing, LOOKUP_DIRECTORY will be dealt with in do_last().
For another, name can be an empty string, but not NULL - no callers
pass that and it would oops immediately if they would.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just make const char iname[] the last member and compare name->name with
name->iname instead of checking name->separate
We need to make sure that out-of-line name doesn't end up allocated adjacent
to struct filename refering to it; fortunately, it's easy to achieve - just
allocate that struct filename with one byte in ->iname[], so that ->iname[0]
will be inside the same object and thus have an address different from that
of out-of-line name [spotted by Boqun Feng <boqun.feng@gmail.com>]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
"ocfs2 syncs the wrong range" had been broken; prior to it the
code was doing the wrong thing in case of O_APPEND, all right,
but _after_ it we were syncing the wrong range in 100% cases.
*ppos, aka iocb->ki_pos is incremented prior to that point,
so we are always doing sync on the area _after_ the one we'd
written to.
Spotted by Joseph Qi <joseph.qi@huawei.com> back in January;
unfortunately, I'd missed his mail back then ;-/
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
generic_file_direct_write() already does that. Broken by
"ocfs2: do not fallback to buffer I/O write if appending"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
quotad periodically syncs in-memory quotas to the ondisk quota file
and sets the QDF_REFRESH flag so that a subsequent read of a synced
quota is re-read from disk.
gfs2_quota_lock() checks for this flag and sets a 'force' bit to
force re-read from disk if requested. However, there is a race
condition here. It is possible for gfs2_quota_lock() to find the
QDF_REFRESH flag unset (i.e force=0) and quotad comes in immediately
after and syncs the relevant quota and sets the QDF_REFRESH flag.
gfs2_quota_lock() resumes with force=0 and uses the stale in-memory
quota usage values that result in miscalculations.
This patch fixes this race by moving the check for the QDF_REFRESH
flag check further out into the gfs2_quota_lock() process, i.e, in
do_glock(), under the protection of the quota glock.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-04-04
Here's what's probably the last bluetooth-next pull request for 4.1:
- Fixes for LE advertising data & advertising parameters
- Fix for race condition with HCI_RESET flag
- New BNEPGETSUPPFEAT ioctl, needed for certification
- New HCI request callback type to get the resulting skb
- Cleanups to use BIT() macro wherever possible
- Consolidate Broadcom device entries in the btusb HCI driver
- Check for valid flags in CMTP, HIDP & BNEP
- Disallow local privacy & OOB data combo to prevent a potential race
- Expose SMP & ECDH selftest results through debugfs
- Expose current Device ID info through debugfs
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If we fail past the aio_setup_ring(), we need to destroy the
mapping. We don't need to care about anybody having found ctx,
or added requests to it, since the last failure exit is exactly
the failure to make ctx visible to lookups.
Reproducer (based on one by Joe Mario <jmario@redhat.com>):
void count(char *p)
{
char s[80];
printf("%s: ", p);
fflush(stdout);
sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
system(s);
}
int main()
{
io_context_t *ctx;
int created, limit, i, destroyed;
FILE *f;
count("before");
if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
perror("opening aio-max-nr");
else if (fscanf(f, "%d", &limit) != 1)
fprintf(stderr, "can't parse aio-max-nr\n");
else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
perror("allocating aio_context_t array");
else {
for (i = 0, created = 0; i < limit; i++) {
if (io_setup(1000, ctx + created) == 0)
created++;
}
for (i = 0, destroyed = 0; i < created; i++)
if (io_destroy(ctx[i]) == 0)
destroyed++;
printf("created %d, failed %d, destroyed %d\n",
created, limit - created, destroyed);
count("after");
}
}
Found-by: Joe Mario <jmario@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
teach ->mremap() method to return an error and have it fail for
aio mappings in process of being killed
Note that in case of ->mremap() failure we need to undo move_page_tables()
we'd already done; we could call ->mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ ->mremap()
has done, which would be a lot more headache in general.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is needed if user space wants to know supported bnep features
by kernel, e.g. if kernel supports sending response to bnep setup
control message. By now there is no possibility to know supported
features by kernel in case of bnep. Ioctls allows only to add connection,
delete connection, get connection list, get connection info. Adding
connection if it's possible (establishing network device connection) is
equivalent to starting bnep session. Bnep session handles data queue of
transmit, receive messages over bnep channel. It means that if we add
connection the received/transmitted data will be parsed immediately. In
case of get bnep features we want to know before session start, if we
should leave setup data on socket queue and let kernel to handle with it,
or in case of no setup handling support, if we should pull this message
and handle setup response within user space.
Signed-off-by: Grzegorz Kolodziejczyk <grzegorz.kolodziejczyk@tieto.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Pull CIFS fixes from Steve French:
"A set of small cifs fixes fixing a memory leak, kernel oops, and
infinite loop (and some spotted by Coverity)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
Fix warning
Fix another dereference before null check warning
CIFS: session servername can't be null
Fix warning on impossible comparison
Fix coverity warning
Fix dereference before null check warning
Don't ignore errors on encrypting password in SMBTcon
Fix warning on uninitialized buftype
cifs: potential memory leaks when parsing mnt opts
cifs: fix use-after-free bug in find_writable_file
cifs: smb2_clone_range() - exit on unhandled error
If something went wrong with creating a debugfs file/symlink/directory,
that value could be passed down into debugfs again as a parent dentry.
To make caller code simpler, just error out if this happens, and don't
crash the kernel.
Reported-by: Alex Elder <elder@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Alex Elder <elder@linaro.org>
This should cover the set emitted by viced and the volume server.
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVGx6pAAoJEPL5WVaVDYGjh5cIAKQAyGST92IbTkxRZsxMgqnH
7LQI+fbNn6oHGEjSSnsWLxl6CpwT4WrCmj8WhVmpAoTLU958nBbF7iZAaaeQCGeS
3EqaNOlKvuOK9M5PKK7a5AWO04uJuj+t6s536OqHyB1zRb1yYMsywllPzu63eigA
jxu2yZxkFIKjo2ohSaTDRONVCsQGlqgZ2Aq/Ho5vy5QffVJKTN1G/3Kf33xukUyr
SAnndaax23jMqcFJE3gePYXc3W8EuGoloehKyo04qFeNNVMmSoytXAwMzcTmHn+H
biOTN5ezSKbYzv1aevRg7UuSPv17/yIo3aEberfLBgsn5O4wJGDdS+LajaI5/x8=
=0k0d
-----END PGP SIGNATURE-----
Merge tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull lazytime fixes from Ted Ts'o:
"This fixes a problem in the lazy time patches, which can cause
frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours"
* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs: add dirtytime_expire_seconds sysctl
fs: make sure the timestamps for lazytime inodes eventually get written
Pull nfsd fixes from Bruce Fields:
"Two main issues:
- We found that turning on pNFS by default (when it's configured at
build time) was too aggressive, so we want to switch the default
before the 4.0 release.
- Recent client changes to increase open parallelism uncovered a
serious bug lurking in the server's open code.
Also fix a krb5/selinux regression.
The rest is mainly smaller pNFS fixes"
* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
sunrpc: make debugfs file creation failure non-fatal
nfsd: require an explicit option to enable pNFS
NFSD: Fix bad update of layout in nfsd4_return_file_layout
NFSD: Take care the return value from nfsd4_encode_stateid
NFSD: Printk blocklayout length and offset as format 0x%llx
nfsd: return correct lockowner when there is a race on hash insert
nfsd: return correct openowner when there is a race to put one in the hash
NFSD: Put exports after nfsd4_layout_verify fail
NFSD: Error out when register_shrinker() fail
NFSD: Take care the return value from nfsd4_decode_stateid
NFSD: Check layout type when returning client layouts
NFSD: restore trace event lost in mismerge
afs_send_empty_reply() doesn't require an iovec array with which to initialise
the msghdr, but can pass NULL instead.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Coverity reports a warning due to unitialized attr structure in one
code path.
Reported by Coverity (CID 728535)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
null tcon is not possible in these paths so
remove confusing null check
Reported by Coverity (CID 728519)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
remove impossible check
Pointed out by Coverity (CID 115422)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
workstation_RFC1001_name is part of the struct and can't be null,
remove impossible comparison (array vs. null)
Pointed out by Coverity (CID 140095)
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Coverity reports a warning for referencing the beginning of the
SMB2/SMB3 frame using the ProtocolId field as an array. Although
it works the same either way, this patch should quiet the warning
and might be a little clearer.
Reported by Coverity (CID 741269)
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
null tcon is not likely in these paths in current
code, but obviously it does clarify the code to
check for null (if at all) before derefrencing
rather than after.
Reported by Coverity (CID 1042666)
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Although unlikely to fail (and tree connect does not commonly send
a password since SECMODE_USER is the default for most servers)
do not ignore errors on SMBNTEncrypt in SMB Tree Connect.
Reported by Coverity (CID 1226853)
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Pointed out by coverity analyzer. resp_buftype is
not initialized in one path which can rarely log
a spurious warning (buf is null so there will
not be a problem with freeing data, but if buf_type
were randomly set to wrong value could log a warning)
Reported by Coverity (CID 1269144)
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Turns out sending out layouts to any client is a bad idea if they
can't get at the storage device, so require explicit admin action
to enable pNFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
debugfs_create_dir and debugfs_create_file may return -ENODEV when debugfs
is not configured, so the return value should be checked against ERROR_VALUE
as well, otherwise the later dereference of the dentry pointer would crash
the kernel.
Signed-off-by: Chengyu Song <csong84@gatech.edu>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
locks_delete_lock_ctx() is called inside the loop, so we
should use list_for_each_entry_safe.
Fixes: 8634b51f6c (locks: convert lease handling to file_lock_context)
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
With return layout as, (seg is return layout, lo is record layout)
seg->offset <= lo->offset and layout_end(seg) < layout_end(lo),
nfsd should update lo's offset to seg's end,
and,
seg->offset > lo->offset and layout_end(seg) >= layout_end(lo),
nfsd should update lo's end to seg's offset.
Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When testing pnfs with nfsd_debug on, nfsd print a negative number
of layout length and foff in nfsd4_block_proc_layoutget as,
"GET: -xxxx:-xxx 2"
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
alloc_init_lock_stateowner can return an already freed entry if there is
a race to put openowners in the hashtable.
Noticed by inspection after Jeff Layton fixed the same bug for open
owners. Depending on client behavior, this one may be trickier to
trigger in practice.
Fixes: c58c6610ec "nfsd: Protect adding/removing lock owners using client_lock"
Cc: <stable@vger.kernel.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
alloc_init_open_stateowner can return an already freed entry if there is
a race to put openowners in the hashtable.
In commit 7ffb588086, we changed it so that we allocate and initialize
an openowner, and then check to see if a matching one got stuffed into
the hashtable in the meantime. If it did, then we free the one we just
allocated and take a reference on the one already there. There is a bug
here though. The code will then return the pointer to the one that was
allocated (and has now been freed).
This wasn't evident before as this race almost never occurred. The Linux
kernel client used to serialize requests for a single openowner. That
has changed now with v4.0 kernels, and this race can now easily occur.
Fixes: 7ffb588086
Cc: <stable@vger.kernel.org> # v3.17+
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert(). In this case a hfs_brec_update_parent() is
called to update the parent index node (if exists) and it is passed
hfs_find_data with a search_key containing a newly inserted key instead
of the key to be updated. This results in an inconsistent index node.
The bug reproduces on my machine after an extents overflow record for
the catalog file (CNID=4) is inserted into the extents overflow B-tree.
Because of a low (reserved) value of CNID=4, it has to become the first
record in the first leaf node.
The resulting first leaf node is correct:
----------------------------------------------------
| key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... |
----------------------------------------------------
But the parent index key0 still contains the previous key CNID=123:
-----------------------
| key0.CNID=123 | ... |
-----------------------
A change in hfs_brec_insert() makes hfs_brec_update_parent() work
correctly by preventing it from getting fd->record=-1 value from
__hfs_brec_find().
Along the way, I removed duplicate code with unification of the if
condition. The resulting code is equivalent to the original code
because node is never 0.
Also hfs_brec_update_parent() will now return an error after getting a
negative fd->record value. However, the return value of
hfs_brec_update_parent() is not checked anywhere in the file and I'm
leaving it unchanged by this patch. brec.c lacks error checking after
some other calls too, but this issue is of less importance than the one
being fixed by this patch.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When affs_bread_ino() fails, correctly unlock the page and release the
page cache with proper error value. All write_end() should
unlock/release the page that was locked by write_beg().
Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For sysfs file attributes, only read and write permissions make sense.
Mask provided attribute permissions accordingly and send a warning
to the console if invalid permission bits are set.
This patch is originally from Guenter [1] and includes the fixup
explained in the thread, that is printing permissions in octal format
and limiting the scope of attributes to SYSFS_PREALLOC | 0664.
[1] https://lkml.org/lkml/2015/1/19/599
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Up to now, is_visible can only be used to either remove visibility
of a file entirely or to add permissions, but not to reduce permissions.
This makes it impossible, for example, to use DEVICE_ATTR_RW to define
file attributes and reduce permissions to read-only.
This behavior is undesirable and unnecessarily complicates code which
needs to reduce permissions; instead of just returning the desired
permissions, it has to ensure that the permissions in the attribute
variable declaration only reflect the minimal permissions ever needed.
Change semantics of is_visible to only use the permissions returned
from it instead of oring the returned value with the hard-coded
permissions.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Here are two bugfixes for things reported. One regression in kernfs,
and another issue fixed in the LZ4 code that was fixed in the "upstream"
codebase that solves a reported kernel crash.
Both have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlUOl6YACgkQMUfUDdst+ylRjQCfT46/we0U/S38/sTNYtElQXUI
ZEsAnAwozBj6nIYOydxaKnA0BcujsSGz
=unYd
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two bugfixes for things reported. One regression in kernfs,
and another issue fixed in the LZ4 code that was fixed in the
"upstream" codebase that solves a reported kernel crash
Both have been in linux-next for a while"
* tag 'driver-core-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
LZ4 : fix the data abort issue
kernfs: handle poll correctly on 'direct_read' files.
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
Pull nfsd bufix from Bruce Fields:
"This is a fix for a crash easily triggered by 4.1 activity to a server
built with CONFIG_NFSD_PNFS.
There are some more bugfixes queued up that I intend to pass along
next week, but this is the most critical"
* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
For example, when mount opt is redundently specified
(e.g., "user=A,user=B,user=C"), kernel kept allocating new key/val
with kstrdup() and overwrite previous ptr (to be freed).
Althouhg mount.cifs in userspace performs a bit of sanitization
(e.g., forcing one user option), current implementation is not
robust. Other options such as iocharset and domainanme are similarly
vulnerable.
Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Under intermittent network outages, find_writable_file() is susceptible
to the following race condition, which results in a user-after-free in
the cifs_writepages code-path:
Thread 1 Thread 2
======== ========
inv_file = NULL
refind = 0
spin_lock(&cifs_file_list_lock)
// invalidHandle found on openFileList
inv_file = open_file
// inv_file->count currently 1
cifsFileInfo_get(inv_file)
// inv_file->count = 2
spin_unlock(&cifs_file_list_lock);
cifs_reopen_file() cifs_close()
// fails (rc != 0) ->cifsFileInfo_put()
spin_lock(&cifs_file_list_lock)
// inv_file->count = 1
spin_unlock(&cifs_file_list_lock)
spin_lock(&cifs_file_list_lock);
list_move_tail(&inv_file->flist,
&cifs_inode->openFileList);
spin_unlock(&cifs_file_list_lock);
cifsFileInfo_put(inv_file);
->spin_lock(&cifs_file_list_lock)
// inv_file->count = 0
list_del(&cifs_file->flist);
// cleanup!!
kfree(cifs_file);
spin_unlock(&cifs_file_list_lock);
spin_lock(&cifs_file_list_lock);
++refind;
// refind = 1
goto refind_writable;
At this point we loop back through with an invalid inv_file pointer
and a refind value of 1. On second pass, inv_file is not overwritten on
openFileList traversal, and is subsequently dereferenced.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@samba.org>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
While attempting to clone a file on a samba server, we receive a
STATUS_INVALID_DEVICE_REQUEST. This is mapped to -EOPNOTSUPP which
isn't handled in smb2_clone_range(). We end up looping in the while loop
making same call to the samba server over and over again.
The proposed fix is to exit and return the error value when encountered
with an unhandled error.
Cc: <stable@vger.kernel.org>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Fix commit 9cf514ccfa (nfsd: implement pNFS operations).
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Return status after nfsd4_decode_stateid failed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
According to RFC5661:
" When lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used
to identify the file system and all layouts matching the client ID,
the fsid of the file system, lora_layout_type, and lora_iomode are
returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts
matching the client ID, lora_layout_type, and lora_iomode are
returned and the current filehandle is not used. "
When returning client layouts, always check layout type.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
31ef83dc05 "nfsd: add trace events" had a typo that dropped a trace
event and replaced it by an incorrect recursive call to
nfsd4_cb_layout_fail. 133d558216 "Subject: nfsd: don't recursively
call nfsd4_cb_layout_fail" fixed the crash, this restores the
tracepoint.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull overlayfs fixes from Miklos Szeredi:
"This fixes minor issues with the multi-layer update in v4.0"
* 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: upper fs should not be R/O
ovl: check lowerdir amount for non-upper mount
ovl: print error message for invalid mount options
The misc subsystem (which is used for /dev/fuse) initializes private_data to
point to the misc device when a driver has registered a custom open file
operation, and initializes it to NULL when a custom open file operation has
*not* been provided.
This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc device
structure. And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.
So to simplify things in the misc subsystem, a patch [1] has been proposed to
*always* set the private_data to point to the misc device, instead of only
doing this when a custom open file operation has been registered.
But before this patch can be applied we need to modify drivers that make the
assumption that a misc device file's private_data is initialized to NULL
because they didn't register a custom open file operation, so they don't rely
on this assumption anymore. FUSE uses private_data to store the fuse_conn and
errors out if this is not initialized to NULL at mount time.
Hence, we now set a file's private_data to NULL explicitly, to be independent
of whatever value the misc subsystem initializes it to by default.
[1] https://lkml.org/lkml/2014/12/4/939
Reported-by: Giedrius Statkevicius <giedriuswork@gmail.com>
Reported-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We can quickly get an estimate of how many blocks are available
for allocation restricted by quota and fs size respectively, using
the ap->allowed field in the gfs2_alloc_parms structure.
gfs2_quota_check() and gfs2_inplace_reserve() provide these values.
Once we have the total number of blocks available to us, we can
compute how many bytes of data can be written using those blocks
instead of guessing inefficiently.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
struct gfs2_alloc_parms is passed to gfs2_quota_check() and
gfs2_inplace_reserve() with ap->target containing the number of
blocks being requested for allocation in the current operation.
We add a new field to struct gfs2_alloc_parms called 'allowed'.
gfs2_quota_check() and gfs2_inplace_reserve() return the max
blocks allowed by quota and the max blocks allowed by the chosen
rgrp respectively in 'allowed'.
A new field 'min_target', when non-zero, tells gfs2_quota_check()
and gfs2_inplace_reserve() to not return -EDQUOT/-ENOSPC when
there are atleast 'min_target' blocks allowable/available. The
assumption is that the caller is ok with just 'min_target' blocks
and will likely proceed with allocating them.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Use struct gfs2_alloc_parms as an argument to gfs2_quota_check()
and gfs2_quota_lock_check() to check for quota violations while
accounting for the new blocks requested by the current operation
in ap->target.
Previously, the number of new blocks requested during an operation
were not accounted for during quota_check and would allow these
operations to exceed quota. This was not very apparent since most
operations allocated only 1 block at a time and quotas would get
violated in the next operation. i.e. quota excess would only be by
1 block or so. With fallocate, (where we allocate a bunch of blocks
at once) the quota excess is non-trivial and is addressed by this
patch.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
This patch moves function gfs2_file_splice_write so it's not
conditionally compiled.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds a GFS2-specific function for splice_write which
first calls function gfs2_rs_alloc to make sure a reservation
structure has been allocated before attempting to reserve blocks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
When removing a default acl or setting an access acl that is entirely
represented in the file mode, we end up with acl == NULL in gfs2_set_acl(). In
that case, bring gfs2 in line with other file systems and cache the NULL acl
with set_cached_acl() instead of invalidating the cache with
forget_cached_acl().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
After importing multi-lower layer support, users could mount a r/o
partition as the left most lowerdir instead of using it as upperdir.
And a r/o upperdir may cause an error like
overlayfs: failed to create directory ./workdir/work
during mount.
This patch check the *s_flags* of upper fs and return an error if
it is a r/o partition. The checking of *upper_mnt->mnt_sb->s_flags*
can be removed now.
This patch also remove
/* FIXME: workdir is not needed for a R/O mount */
from ovl_fill_super() because:
1) for upper fs r/o case
Setting a r/o partition as upper is prevented, no need to care about
workdir in this case.
2) for "mount overlay -o ro" with a r/w upper fs case
Users could remount overlayfs to r/w in this case, so workdir should
not be omitted.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Recently multi-lower layer mount support allow upperdir and workdir
to be omitted, then cause overlayfs can be mount with only one
lowerdir directory. This action make no sense and have potential risk.
This patch check the total number of lower directories to prevent
mounting overlayfs with only one directory.
Also, an error message is added to indicate lower directories exceed
OVL_MAX_STACK limit.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Overlayfs should print an error message if an incorrect mount option
is caught like other filesystems.
After this patch, improper option input could be clearly known.
Reported-by: Fabian Sturm <fabian.sturm@aduu.de>
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested. This isn't quite right since we will
be limited to our max extent size. So instead lets do something horrible! Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent. If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly. Thanks,
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
I introduced a regression wrt outstanding_extents accounting. These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time. So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
My fix
Btrfs: fix merge delalloc logic
only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent. We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent. If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation. This shows up with a case like this
[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation. But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations. The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.
This disallows anybody without CAP_SYS_ADMIN to read the pagemap.
[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
[ Eventually we might want to do anything more finegrained, but for now
this is the simple model. - Linus ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Seaborn <mseaborn@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a tuning knob so we can adjust the dirtytime expiration timeout,
which is very useful for testing lazytime.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode->dirtied_when is constantly getting
updated. We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.
In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out. We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff. To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Kernfs supports two styles of read: direct_read and seqfile_read.
The latter supports 'poll' correctly thanks to the update of
'->event' in kernfs_seq_show.
The former does not as '->event' is never updated on a read.
So add an appropriate update in kernfs_file_direct_read().
This was noticed because some 'md' sysfs attributes were
recently changed to use direct reads.
Reported-by: Prakash Punnoor <prakash@punnoor.de>
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Fixes: 750f199ee8
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the function ramoops_probe, the console_size, pmsg_size,
ftrace_size may be update because the value is not the power
of two. We should update the module parameter variables
as well so they are visible through /sys/module/ramoops/parameters
correctly.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Acked-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
It's possible that "fl" won't point at a valid lock at this point, so
use "victim" instead which is either a valid lock or NULL.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Dave could hit this assert consistently running btrfs/078. This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root. For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case. This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty. This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas. We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic. This fixes the regression that was reported by a user on
list. Thanks,
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.
Fsync() can report success when it actually doesn't. When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.
It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return. Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.
This add the necessary check and it now passes the test.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch fixes mips compilation warning:
fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>
Most callers in the kernel want to perform synchronous file I/O, but
still have to bloat the stack with a full struct kiocb. Split out
the parts needed in filesystem code from those in the aio code, and
only allocate those needed to pass down argument on the stack. The
aio code embedds the generic iocb in the one it allocates and can
easily get back to it by using container_of.
Also add a ->ki_complete method to struct kiocb, this is used to call
into the aio code and thus removes the dependency on aio for filesystems
impementing asynchronous operations. It will also allow other callers
to substitute their own completion callback.
We also add a new ->ki_flags field to work around the nasty layering
violation recently introduced in commit 5e33f6 ("usb: gadget: ffs: add
eventfd notification about ffs events").
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The AIO interface is fairly complex because it tries to allow
filesystems to always work async and then wakeup a synchronous
caller through aio_complete. It turns out that basically no one
was doing this to avoid the complexity and context switches,
and we've already fixed up the remaining users and can now
get rid of this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Based on a patch from Maxim Patlasov <MPatlasov@parallels.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no need to pass the total request length in the kiocb, as
we already get passed in through the iov_iter argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With FAN_ONDIR set, the user can end up getting events, which it hasn't
marked. This was revealed with fanotify04 testcase failure on
Linux-4.0-rc1, and is a regression from 3.19, revealed with 66ba93c0d7
("fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask").
# /opt/ltp/testcases/bin/fanotify04
[ ... ]
fanotify04 7 TPASS : event generated properly for type 100000
fanotify04 8 TFAIL : fanotify04.c:147: got unexpected event 30
fanotify04 9 TPASS : No event as expected
The testcase sets the adds the following marks : FAN_OPEN | FAN_ONDIR for
a fanotify on a dir. Then does an open(), followed by close() of the
directory and expects to see an event FAN_OPEN(0x20). However, the
fanotify returns (FAN_OPEN|FAN_CLOSE_NOWRITE(0x10)). This happens due to
the flaw in the check for event_mask in fanotify_should_send_event() which
does:
if (event_mask & marks_mask & ~marks_ignored_mask)
return true;
where, event_mask == (FAN_ONDIR | FAN_CLOSE_NOWRITE),
marks_mask == (FAN_ONDIR | FAN_OPEN),
marks_ignored_mask == 0
Fix this by masking the outgoing events to the user, as we already take
care of FAN_ONDIR and FAN_EVENT_ON_CHILD.
Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
during recovery at mount time. The code path that caused the deadlock was
as follows:
nilfs_fill_super()
load_nilfs()
nilfs_salvage_orphan_logs()
* Do roll-forwarding, attach segment constructor for recovery,
and kick it.
nilfs_segctor_thread()
nilfs_segctor_thread_construct()
* A lock is held with nilfs_transaction_lock()
nilfs_segctor_do_construct()
nilfs_segctor_drop_written_files()
iput()
iput_final()
write_inode_now()
writeback_single_inode()
__writeback_single_inode()
do_writepages()
nilfs_writepage()
nilfs_construct_dsync_segment()
nilfs_transaction_lock() --> deadlock
This can happen if commit 7ef3ff2fea ("nilfs2: fix deadlock of segment
constructor over I_SYNC flag") is applied and roll-forward recovery was
performed at mount time. The roll-forward recovery can happen if datasync
write is done and the file system crashes immediately after that. For
instance, we can reproduce the issue with the following steps:
< nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
# dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
# dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
count=1 && reboot -nfh
< the system will immediately reboot >
# mount -t nilfs2 /dev/sdb1 /nilfs
The deadlock occurs because iput() can run segment constructor through
writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags. The
above commit changed segment constructor so that it calls iput()
asynchronously for inodes with i_nlink == 0, but that change was
imperfect.
This fixes the another deadlock by deferring iput() in segment constructor
even for the case that mount is not finished, that is, for the case that
MS_ACTIVE flag is not set.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turns out that making this feature ro_compat isn't quite enough to
prevent accidental corruption on mount from older kernels. Ocfs2 (like
other file systems) will process orphaned inodes even when the user mounts
in 'ro' mode. So for the case of a filesystem not knowing the append_dio
feature, mounting the filesystem could result in orphaned-for-dio files
being deleted, which we clearly don't want.
So instead, turn this into an incompat flag.
Btw, this is kind of my fault - initially I asked that we add a flag to
cover the feature and even suggested that we use an ro flag. It wasn't
until I was looking through our commits for v4.0-rc1 that I realized we
actually want this to be incompat.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use %ps for actual addresses, otherwise you'll get bad output
on arches like ppc64 where %pf expects a function descriptor.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: jfs-discussion@lists.sourceforge.net