Commit 09e05d48 introduced a wait for transaction commit into
journal_unmap_buffer() in the case where we are truncating a buffer undergoing
commit in the page straddling i_size on a filesystem with blocksize < pagesize.
Sadly we forgot to drop the buffer lock before waiting for transaction commit
and thus a deadlock is possible when kjournald wants to lock the buffer.
Fix the problem by dropping the buffer lock before waiting for transaction
commit. Since we are still holding the page lock (and that is OK), the buffer
cannot disappear under us.
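A minimal sketch of the fixed ordering (simplified from the
journal_unmap_buffer() path; the journal head handling around it is omitted):

	tid_t tid = journal->j_committing_transaction->t_tid;

	/* Drop every lock, including the buffer lock, before waiting;
	 * the page lock we still hold keeps the buffer from going away. */
	spin_unlock(&journal->j_list_lock);
	jbd_unlock_bh_state(bh);
	spin_unlock(&journal->j_state_lock);
	unlock_buffer(bh);
	log_wait_commit(journal, tid);
	lock_buffer(bh);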
CC: stable@vger.kernel.org # Wherever commit 09e05d48 was taken
Signed-off-by: Jan Kara <jack@suse.cz>
Pull reiserfs and ext3 fixes from Jan Kara:
"Fixes of reiserfs deadlocks when quotas are enabled (locking there was
completely busted by BKL conversion) and also one small ext3 fix in
the trim interface."
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext3: Avoid underflow in ext3_trim_fs()
reiserfs: Move quota calls out of write lock
reiserfs: Protect reiserfs_quota_write() with write lock
reiserfs: Protect reiserfs_quota_on() with write lock
reiserfs: Fix lock ordering during remount
Currently, if the len argument in ext3_trim_fs() is smaller than one block,
the 'end' variable underflows. Avoid that by returning EINVAL if len is
smaller than the file system block size.
Also remove a useless unlikely().
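A sketch of the added check, assuming it sits near the top of ext3_trim_fs()
before 'end' is computed ('range' is the fstrim_range passed in):

	/* Refuse requests shorter than one file system block so that the
	 * 'end' computation below can never underflow. */
	if (range->len < sb->s_blocksize)
		return -EINVAL;

	start = range->start >> sb->s_blocksize_bits;
	end = start + (range->len >> sb->s_blocksize_bits) - 1;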
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Calls into the high-level quota code cannot happen under the write lock. These
calls take dqio_mutex, which ranks above the write lock. So drop the write
lock before calling back into the quota code.
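The fix follows the usual drop/retake pattern, roughly (a sketch;
dquot_free_block_nodirty() stands in for whichever quota call a given path
makes):

	/* dqio_mutex ranks above the reiserfs write lock, so release the
	 * write lock across the call into the quota code. */
	reiserfs_write_unlock(inode->i_sb);
	dquot_free_block_nodirty(inode, blocks);
	reiserfs_write_lock(inode->i_sb);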
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
Calls into the reiserfs journalling code and reiserfs_get_block() need to
be protected by the write lock. We remove the write lock around calls to the
high-level quota code in the next patch, so these paths would suddenly become
unprotected.
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
In reiserfs_quota_on() we do quite a bit of work - for example unpacking
the tail of the quota file. Thus we have to hold the write lock until the
moment we call back into the quota code.
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
When remounting reiserfs, dquot_suspend() or dquot_resume() can be called.
These functions take dqonoff_mutex, which ranks above the write lock, so we
have to drop it before calling into the quota code.
CC: stable@vger.kernel.org # >= 3.0
Signed-off-by: Jan Kara <jack@suse.cz>
If the FAN_Q_OVERFLOW bit is set in event->mask, the fanotify event
metadata will not contain a valid file descriptor, but
copy_event_to_user() didn't check for that, and unconditionally did an
fd_install() on the file descriptor.
That in turn will cause a BUG_ON() in __fd_install().
Introduced by commit 352e3b2492 ("fanotify: sanitize failure exits in
copy_event_to_user()")
Mea culpa - missed that path ;-/
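A sketch of the guard inside copy_event_to_user() (FAN_NOFD is what the
metadata carries for overflow events; the exact placement is an assumption):

	/* Overflow events carry no open file, so there is no descriptor
	 * to publish into the fd table. */
	if (fd != FAN_NOFD)
		fd_install(fd, f);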
Reported-by: Alex Shi <lkml.alex@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc VFS fixes from Al Viro:
"Remove a bogus BUG_ON() that can trigger spuriously + alpha bits of
do_mount() constification I'd missed during the merge window."
This pull request came in a week ago, I missed it for some reason.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill bogus BUG_ON() in do_close_on_exec()
missing const in alpha callers of do_mount()
Merge tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
- fix attr tree double split corruption
- fix broken error handling in xfs_vm_writepage
- drop buffer io reference when a bad bio is built
* tag 'for-linus-v3.7-rc7' of git://oss.sgi.com/xfs/xfs:
xfs: drop buffer io reference when a bad bio is built
xfs: fix broken error handling in xfs_vm_writepage
xfs: fix attr tree double split corruption
Error handling in xfs_buf_ioapply_map() does not handle IO reference
counts correctly. We increment the b_io_remaining count before
building the bio, but then fail to decrement it in the failure case.
This leads to the buffer never running IO completion and releasing
the reference that the IO holds, so at unmount we can leak the
buffer. This leak is captured by this assert failure during unmount:
XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273
This is not a new bug - the b_io_remaining accounting has had this
problem for a long, long time - it's just very hard to get a
zero length bio being built by this code...
Further, the buffer IO error can be overwritten on a multi-segment
buffer by subsequent bio completions for partial sections of the
buffer. Hence we should only set the buffer error status if the
buffer is not already carrying an error status. This ensures that a
partial IO error on a multi-segment buffer will not be lost. This
part of the problem is a regression, however.
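Sketched, the two adjustments look like this (placed in the bio-build failure
path of xfs_buf_ioapply_map() and in the bio completion handler respectively;
the names around them are assumptions):

	/* Zero length bio: undo the I/O reference taken for it so the
	 * buffer can still run I/O completion and be released. */
	atomic_dec(&bp->b_io_remaining);
	xfs_buf_ioerror(bp, EIO);
	bio_put(bio);

	/* In the completion handler: don't let a later partial completion
	 * overwrite the first error recorded on a multi-segment buffer. */
	if (!bp->b_error)
		xfs_buf_ioerror(bp, error);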
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we shut down the filesystem, it might first be detected in
writeback when we are allocating an inode size transaction. This
happens after we have moved all the pages into the writeback state
and unlocked them. Unfortunately, if we fail to set up the
transaction we then abort writeback and try to invalidate the
current page. This then triggers a BUG() in block_invalidatepage()
because we are trying to invalidate an unlocked page.
Fixing this is a bit of a chicken and egg problem - we can't
allocate the transaction until we've clustered all the pages into
the IO and we know the size of it (i.e. whether the last block of
the IO is beyond the current EOF or not). However, we don't want to
hold pages locked for long periods of time, especially while we lock
other pages to cluster them into the write.
To fix this, we need to make a clear delineation in writeback where
errors can only be handled by IO completion processing. That is,
once we have marked a page for writeback and unlocked it, we have to
report errors via IO completion because we've already started the
IO. We may not have submitted any IO, but we've changed the page
state to indicate that it is under IO so we must now use the IO
completion path to report errors.
To do this, add an error field to xfs_submit_ioend() to pass it the
error that occurred during the building on the ioend chain. When
this is non-zero, mark each ioend with the error and call
xfs_finish_ioend() directly rather than building bios. This will
immediately push the ioends through completion processing with the
error that has occurred.
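Roughly, the error plumbing has this shape (a sketch of the idea, not the
exact xfs_aops.c hunk):

	/* Writeback state is already set on the pages, so once setup has
	 * failed, route each remaining ioend straight to completion
	 * processing instead of building bios for it. */
	if (fail) {
		ioend->io_error = fail;
		xfs_finish_ioend(ioend);
		continue;
	}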
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In certain circumstances, a double split of an attribute tree is
needed to insert or replace an attribute. In rare situations, this
can go wrong, leaving the attribute tree corrupted. In this case,
the attr being replaced is the last attr in a leaf node, and the
replacement is larger so doesn't fit in the same leaf node.
We have the initial condition of a node format attribute btree
with two leaves at index 1 and 2; call them L1 and L2. The
leaf L1 is completely full, there is not a single byte of free space
in it. L2 is mostly empty. The attribute being replaced - call it X
- is the last attribute in L1.
The way an attribute replace is executed is that the replacement
attribute - call it Y - is first inserted into the tree, but has an
INCOMPLETE flag set on it so that list traversals ignore it. Once
this transaction is committed, a second transaction is run to
atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
traversal will now find Y and skip X. Once that transaction is
committed, attribute X is then removed.
So, the initial condition is:
+--------+ +--------+
| L1 | | L2 |
| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |
| fsp: 0 | | fsp: N |
|--------| |--------|
| attr A | | attr 1 |
|--------| |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr X | | attr n |
+--------+ +--------+
So now we go to replace X, and see that L1:fsp = 0 - it is full so
we can't insert Y in the same leaf. So we record the location of
attribute X so we can track it for later use, then we split L1 into
L1 and L3 and rebalance across the two leaves. We end with:
+--------+ +--------+ +--------+
| L1 | | L3 | | L2 |
| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: N |
|--------| |--------| |--------|
| attr A | | attr X | | attr 1 |
|--------| +--------+ |--------|
| attr B | | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And we track that the original attribute is now at L3:0.
We then try to insert Y into L1 again, and find that there isn't
enough room because the new attribute is larger than the old one.
Hence we have to split again to make room for Y. We end up with
this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| + INCOMP + +--------+ |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
And now we have the new (incomplete) attribute @ L4:0, and the
original attribute at L3:0. At this point, the first transaction is
committed, and we move to the flipping of the flags.
This is where we are supposed to end up with this:
+--------+ +--------+ +--------+ +--------+
| L1 | | L4 | | L3 | | L2 |
| fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
| bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
| fsp: M | | fsp: J | | fsp: J | | fsp: N |
|--------| |--------| |--------| |--------|
| attr A | | attr Y | | attr X | | attr 1 |
|--------| +--------+ + INCOMP + |--------|
| attr B | +--------+ | attr 2 |
|--------| |--------|
.......... ..........
|--------| |--------|
| attr W | | attr n |
+--------+ +--------+
But that doesn't happen properly - the attribute tracking indexes
are not pointing to the right locations. What we end up with is both
the old attribute to be removed pointing at L4:0 and the new
attribute at L4:1. On a debug kernel, this assert fails like so:
XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725
because the new attribute location does not exist. On a production
kernel, this goes unnoticed and the code proceeds ahead merrily and
removes L4 because it thinks that is the block that is no longer
needed. This leaves the hash index node pointing to entries
L1, L4 and L2, while only blocks L1, L3 and L2 actually exist. Further, the
leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
space, and so everything is busted. This corruption is caused by the
removal of the old attribute triggering a join - it joins everything
correctly but then frees the wrong block.
xfs_repair will report something like:
bad sibling back pointer for block 4 in attribute fork for inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 8 for inode 131, would reset to 3
bad anextents 4 for inode 131, would reset to 0
The problem lies in the assignment of the old/new blocks for
tracking purposes when the double leaf split occurs. The first split
tries to place the new attribute inside the current leaf (i.e.
"inleaf == true") and moves the old attribute (X) to the new block.
This sets up the old block/index to L1:X, and newly allocated
block to L3:0. It then moves attr X to the new block and tries to
insert attr Y at the old index. That fails, so it splits again.
With the second split, the rebalance ends up placing the new attr in
the second new block - L4:0 - and this is where the code goes wrong.
What it does is set both the new and old block index to the
second new block. Hence it inserts attr Y at the right place (L4:0)
but overwrites the current location of the attr to replace that is
held in the new block index (currently L3:0). It overwrites it with
L4:1 - the index we later assert fail on.
Hopefully this table will show this in a format that is a bit easier
to understand:
Split        old attr index    new attr index
             vanilla  patched  vanilla  patched
before 1st   L1:26    L1:26    N/A      N/A
after 1st    L3:0     L3:0     L1:26    L1:26
after 2nd    L4:0     L3:0     L4:1     L4:0
             ^^^^              ^^^^
             wrong             wrong
The fix is surprisingly simple, for all this analysis - just stop
the rebalance on the out-of-leaf case from overwriting the new attr
index - it's already correct for the double split case.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This is mostly a revert of 01dc52ebdf ("oom: remove deprecated oom_adj")
from Davidlohr Bueso.
It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier
kernels. It simply scales the value linearly when /proc/pid/oom_score_adj
is written.
The major difference is that its scheduled removal is no longer included
in Documentation/feature-removal-schedule.txt. We do warn users with a
single printk, though, to suggest the more powerful and supported
/proc/pid/oom_score_adj interface.
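The write-side scaling is roughly the following (OOM_ADJUST_MAX, OOM_DISABLE
and OOM_SCORE_ADJ_MAX are the existing constants, 15, -17 and 1000; treat the
exact placement as an assumption):

	if (oom_adj == OOM_ADJUST_MAX)
		oom_score_adj = OOM_SCORE_ADJ_MAX;
	else
		oom_score_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;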
Reported-by: Artem S. Tashkinov <t.artem@lycos.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs
Pull UBIFS fixes from Artem Bityutskiy:
"Two patches which fix a problem reported by several people in the
past, but only fixed now because no one gave enough material for
debugging.
Anyway, these fix the problem that sometimes after a power cut the
file-system is not mountable with the following symptom:
grab_empty_leb: could not find an empty LEB
The fixes make the file-system mountable again."
* tag 'upstream-3.7-rc6' of git://git.infradead.org/linux-ubifs:
UBIFS: fix mounting problems after power cuts
UBIFS: introduce categorized lprops counter
Passing a NULL id causes a NULL pointer dereference in writers such as
erst_writer and efi_pstore_write because they expect to update this id.
Pass a dummy id instead.
This avoids a cascade of oopses caused when the initial
pstore_console_write passes a NULL, which in turn causes writes to the
console, causing further oopses in subsequent pstore_console_write calls.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
It can be legitimately triggered via procfs access. Now, at least
2 of 3 of get_files_struct() callers in procfs are useless, but
when and if we get rid of those we can always add WARN_ON() here.
BUG_ON() at that spot is simply wrong.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cifs fixes from Jeff Layton.
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not lookup hashed negative dentry in cifs_atomic_open
cifs: fix potential buffer overrun in cifs.idmap handling code
jffs2_write_begin() first acquires the page lock, then f->sem. This
causes an AB-BA deadlock with jffs2_garbage_collect_live(), which first
acquires f->sem, then the page lock:
jffs2_garbage_collect_live
  mutex_lock(&f->sem)                (A)
  jffs2_garbage_collect_dnode
    jffs2_gc_fetch_page
      read_cache_page_async
        do_read_cache_page
          lock_page(page)            (B)
jffs2_write_begin
  grab_cache_page_write_begin
    find_lock_page
      lock_page(page)                (B)
  mutex_lock(&f->sem)                (A)
We fix this by restructuring jffs2_write_begin() to take f->sem before
the page lock. However, we make sure that f->sem is not held when
calling jffs2_reserve_space(), as this is not permitted by the locking
rules.
The deadlock above was observed multiple times on an SoC with a dual
ARMv7 (Cortex-A9), running the long-term 3.4.11 kernel; it occurred
when using scp to copy files from a host system to the ARM target
system. The fix was heavily tested on the same target system.
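The resulting order at the top of jffs2_write_begin(), sketched (error
handling trimmed; jffs2_reserve_space() is only ever called with f->sem not
held, per the rule above):

	mutex_lock(&f->sem);                                      /* (A) */
	pg = grab_cache_page_write_begin(mapping, index, flags);  /* (B) */
	if (!pg) {
		mutex_unlock(&f->sem);
		return -ENOMEM;
	}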
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Merge misc fixes from Andrew Morton:
"Five fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (5 patches)
h8300: add missing L1_CACHE_SHIFT
mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()
fanotify: fix missing break
revert "epoll: support for disabling items, and a self-test app"
checkpatch: improve network block comment style checking
Merge tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs
Pull xfs bugfixes from Ben Myers:
- fix for large transactions spanning multiple iclog buffers
- zero the allocation_args structure on the stack before using it to
determine whether to use a worker for allocation
- move allocation stack switch to xfs_bmapi_allocate in order to
prevent deadlock on AGF buffers
- growfs no longer reads in garbage for new secondary superblocks
- silence a build warning
- ensure that invalid buffers never get written to disk while on free
list
- don't vmap inode cluster buffers during free
- fix buffer shutdown reference count mismatch
- fix reading of wrapped log data
* tag 'for-linus-v3.7-rc5' of git://oss.sgi.com/xfs/xfs:
xfs: fix reading of wrapped log data
xfs: fix buffer shutdown reference count mismatch
xfs: don't vmap inode cluster buffers during free
xfs: invalidate allocbt blocks moved to the free list
xfs: silence uninitialised f.file warning.
xfs: growfs: don't read garbage for new secondary superblocks
xfs: move allocation stack switch up to xfs_bmapi_allocate
xfs: introduce XFS_BMAPI_STACK_SWITCH
xfs: zero allocation_args on the kernel stack
xfs: only update the last_sync_lsn when a transaction completes
Anders Blomdell noted in 2010 that Fanotify lost events and provided a
test case. Eric Paris confirmed it was a bug and posted a fix to the
list
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/RrJfTfyW2BE
but never applied it. Repeated attempts over time to actually get him
to apply it have never had a reply from anyone who has raised it.
So apply it anyway.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 03a7beb55b ("epoll: support for disabling items, and a
self-test app") pending resolution of the issues identified by Michael
Kerrisk, copied below.
We'll revisit this for 3.8.
: I've taken a look at this patch as it currently stands in 3.7-rc1, and
: done a bit of testing. (By the way, the test program
: tools/testing/selftests/epoll/test_epoll.c does not compile...)
:
: There are one or two places where the behavior seems a little strange,
: so I have a question or two at the end of this mail. But other than
: that, I want to check my understanding so that the interface can be
: correctly documented.
:
: Just to go though my understanding, the problem is the following
: scenario in a multithreaded application:
:
: 1. Multiple threads are performing epoll_wait() operations,
: and maintaining a user-space cache that contains information
: corresponding to each file descriptor being monitored by
: epoll_wait().
:
: 2. At some point, a thread wants to delete (EPOLL_CTL_DEL)
: a file descriptor from the epoll interest list, and
: delete the corresponding record from the user-space cache.
:
: 3. The problem with (2) is that some other thread may have
: previously done an epoll_wait() that retrieved information
: about the fd in question, and may be in the middle of using
: information in the cache that relates to that fd. Thus,
: there is a potential race.
:
: 4. The race can't be solved purely in user space, because doing
: so would require applying a mutex across the epoll_wait()
: call, which would of course blow thread concurrency.
:
: Right?
:
: Your solution is the EPOLL_CTL_DISABLE operation. I want to
: confirm my understanding about how to use this flag, since
: the description that has accompanied the patches so far
: has been a bit sparse
:
: 0. In the scenario you're concerned about, deleting a file
: descriptor means (safely) doing the following:
: (a) Deleting the file descriptor from the epoll interest list
: using EPOLL_CTL_DEL
: (b) Deleting the corresponding record in the user-space cache
:
: 1. It's only meaningful to use this EPOLL_CTL_DISABLE in
: conjunction with EPOLLONESHOT.
:
: 2. Using EPOLL_CTL_DISABLE without using EPOLLONESHOT in
: conjunction is a logical error.
:
: 3. The correct way to code multithreaded applications using
: EPOLL_CTL_DISABLE and EPOLLONESHOT is as follows:
:
: a. All EPOLL_CTL_ADD and EPOLL_CTL_MOD operations should
: specify EPOLLONESHOT.
:
: b. When a thread wants to delete a file descriptor, it
: should do the following:
:
: [1] Call epoll_ctl(EPOLL_CTL_DISABLE)
: [2] If the return status from epoll_ctl(EPOLL_CTL_DISABLE)
: was zero, then the file descriptor can be safely
: deleted by the thread that made this call.
: [3] If the epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY,
: then the descriptor is in use. In this case, the calling
: thread should set a flag in the user-space cache to
: indicate that the thread that is using the descriptor
: should perform the deletion operation.
:
: Is all of the above correct?
:
: The implementation depends on checking on whether
: (events & ~EP_PRIVATE_BITS) == 0
: This relies on the fact that EPOLL_CTL_ADD and EPOLL_CTL_MOD always
: set EPOLLHUP and EPOLLERR in the 'events' mask, and EPOLLONESHOT
: causes those flags (as well as all others in ~EP_PRIVATE_BITS) to be
: cleared.
:
: A corollary to the previous paragraph is that using EPOLL_CTL_DISABLE
: is only useful in conjunction with EPOLLONESHOT. However, as things
: stand, one can use EPOLL_CTL_DISABLE on a file descriptor that does
: not have EPOLLONESHOT set in 'events'. This results in the following
: (slightly surprising) behavior:
:
: (a) The first call to epoll_ctl(EPOLL_CTL_DISABLE) returns 0
: (the indicator that the file descriptor can be safely deleted).
: (b) The next call to epoll_ctl(EPOLL_CTL_DISABLE) fails with EBUSY.
:
: This doesn't seem particularly useful, and in fact is probably an
: indication that the user made a logic error: they should only be using
: epoll_ctl(EPOLL_CTL_DISABLE) on a file descriptor for which
: EPOLLONESHOT was set in 'events'. If that is correct, then would it
: not make sense to return an error to user space for this case?
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paton J. Lewis" <palewis@adobe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4439647 ("xfs: reset buffer pointers before freeing them") in
3.0-rc1 introduced a regression when recovering log buffers that
wrapped around the end of log. The second part of the log buffer at
the start of the physical log was being read into the header buffer
rather than the data buffer, and hence recovery was seeing garbage
in the data buffer when it got to the region of the log buffer that
was incorrectly read.
Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x 3.6.x
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we shut down the filesystem, we have to unpin and free all the
buffers currently active in the CIL. To do this we unpin and remove
them in one operation as a result of a failed iclogbuf write. For
buffers, we do this removal via a simulated IO completion after
marking the buffer stale.
At the time we do this, we have two references to the buffer - the
active LRU reference and the buf log item. The LRU reference is
removed by marking the buffer stale, and the active CIL reference is
removed by the xfs_buf_iodone() callback that is run by
xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
callback).
However, ioend processing requires one more reference - that of the
IO that it is completing. We don't have this reference, so we free
the buffer prematurely and use it after it is freed. For buffers
marked with XBF_ASYNC, this leads to assert failures in
xfs_buf_rele() on debug kernels because the b_hold count is zero.
Fix this by making sure we take the necessary IO reference before
starting IO completion processing on the stale buffer, and set the
XBF_ASYNC flag to ensure that IO completion processing removes all
the active references from the buffer to ensure it is fully torn
down.
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Inode buffers do not need to be mapped as inodes are read or written
directly from/to the pages underlying the buffer. This fixes a
regression introduced by commit 611c994 ("xfs: make XBF_MAPPED the
default behaviour").
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When we free a block from the alloc btree, we move it to the
freelist held in the AGFL and mark it busy in the busy extent tree.
This typically happens when we merge btree blocks.
Once the transaction is committed and checkpointed, the block can
remain on the free list for an indefinite amount of time. Now, this
isn't the end of the world at this point - if the free list is
shortened, the buffer is invalidated in the transaction that moves
it back to free space. If the buffer is allocated as metadata from
the free list, then all the modifications get logged, and we have
no issues, either. And if it gets allocated as userdata direct from
the freelist, it gets invalidated and so will never get written.
However, during the time it sits on the free list, pressure on the
log can cause the AIL to be pushed and the buffer that covers the
block gets pushed for write. IOWs, we end up writing a freed
metadata block to disk. Again, this isn't the end of the world
because we know from the above we are only writing to free space.
The problem, however, is for validation callbacks. If the block was
an old btree root block, then the level of the block is going to be
higher than the current tree root, and so will fail validation.
There may be other inconsistencies in the block as well, and
currently we don't care because the block is in free space. Shutting
down the filesystem because a freed block doesn't pass write
validation, OTOH, is rather unfriendly.
So, make sure we always invalidate buffers as they move from the
free space trees to the free list so that we guarantee they never
get written to disk while on the free list.
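The change amounts to invalidating the buffer at the point the block is
handed back to the free list, roughly (a sketch, assumed to sit in the
allocbt free_block path):

	/* The btree block is going onto the AGFL: invalidate its buffer so
	 * a stale copy can never be pushed to disk from the AIL while the
	 * block sits on the free list. */
	xfs_trans_binval(cur->bc_tp, bp);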
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phil White <pwhite@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
An uninitialised variable build warning was introduced by 2903ff0 ("switch
simple cases of fget_light to fdget"): gcc is not smart enough to
work out that the variable is not used uninitialised, and the commit
removed the initialisation at declaration that the old variable had.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When updating new secondary superblocks in a growfs operation, the
superblock buffer is read from the newly grown region of the
underlying device. This is not guaranteed to be zero, so violates
the underlying assumption that the unused parts of superblocks are
zero filled. Get a new buffer for these secondary superblocks to
ensure that the unused regions are zero filled correctly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Switching stacks at xfs_alloc_vextent can cause deadlocks when we
run out of worker threads on the allocation workqueue. This can
occur because xfs_bmap_btalloc can make multiple calls to
xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
return with the AGF locked in the current allocation transaction.
If we then need to make another allocation, and all the allocation
worker contexts are exhausted because they are blocked waiting for
the AGF lock, the holder of the AGF cannot get its xfs_alloc_vextent
work completed to release the AGF. Hence allocation effectively
deadlocks.
To avoid this, move the stack switch one layer up to
xfs_bmapi_allocate() so that all of the allocation attempts in a
single switched stack transaction occur in a single worker context.
This avoids the problem of an allocation being blocked waiting for
a worker thread whilst holding the AGF.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Certain allocation paths through xfs_bmapi_write() are in situations
where we have limited stack available. These are almost always in
the buffered IO writeback path when converting delayed allocation
extents to real extents.
The current stack switch occurs for userdata allocations, which
means we also do stack switches for preallocation, direct IO and
unwritten extent conversion, even though these call chains have never
been implicated in a stack overrun.
Hence, let's target just the single stack overrun offender for stack
switches. To do that, introduce an XFS_BMAPI_STACK_SWITCH flag that
the caller can pass to xfs_bmapi_write() to indicate it should switch
stacks if it needs to do allocation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Zero the kernel stack space that makes up the xfs_alloc_arg structures.
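In each affected caller this is simply (a sketch; the args.tp/args.mp
assignments are shown only to anchor the placement):

	struct xfs_alloc_arg	args;

	memset(&args, 0, sizeof(struct xfs_alloc_arg));
	args.tp = tp;
	args.mp = mp;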
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The log write code stamps each iclog with the current tail LSN in
the iclog header so that recovery knows where to find the tail of
the log once it has found the head. Normally this is taken from the
first item on the AIL - the log item that corresponds to the oldest
active item in the log.
The problem is that when the AIL is empty, the tail lsn is derived
from the l_last_sync_lsn, which is the LSN of the last iclog to
be written to the log. In most cases this doesn't happen, because
the AIL is rarely empty on an active filesystem. However, when it
does, it opens up an interesting case when the transaction being
committed to the iclog spans multiple iclogs.
That is, the first iclog is stamped with the l_last_sync_lsn, and IO
is issued. Then the next iclog is setup, the changes copied into the
iclog (takes some time), and then the l_last_sync_lsn is stamped
into the header and IO is issued. This is still the same
transaction, so the tail lsn of both iclogs must be the same for log
recovery to find the entire transaction to be able to replay it.
The problem arises in that the iclog buffer IO completion updates
the l_last_sync_lsn with its own LSN. Therefore, if the first iclog
completes its IO before the second iclog is filled and has the tail
lsn stamped in it, it will stamp the LSN of the first iclog into
its tail lsn field. If the system fails at this point, log recovery
will not see a complete transaction, so the transaction will not be
replayed.
The fix is simple - currently the l_last_sync_lsn is updated when an
iclog buffer IO completes, and this is incorrect. The l_last_sync_lsn
should be updated when a transaction is completed by an iclog buffer
IO. That is, only iclog buffers that have transaction commit
callbacks attached to them should update the l_last_sync_lsn. This
means that the last_sync_lsn will only move forward when a commit
record is written, not in the middle of a large transaction that is
rolling through multiple iclog buffers.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Pull gfs2 fixes from Steven Whitehouse:
"Here are a number of GFS2 bug fixes. There are three from Andy Price
which fix various issues spotted by automated code analysis. There
are two from Lukas Czerner fixing my mistaken assumptions as to how
FITRIM should work. Finally Ben Marzinski has fixed a bug relating to
mmap and atime and also a bug relating to a locking issue in the
transaction code."
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Test bufdata with buffer locked and gfs2_log_lock held
GFS2: Don't call file_accessed() with a shared glock
GFS2: Fix FITRIM argument handling
GFS2: Require user to provide argument for FITRIM
GFS2: Clean up some unused assignments
GFS2: Fix possible null pointer deref in gfs2_rs_alloc
GFS2: Fix an unchecked error from gfs2_rs_alloc
In gfs2_trans_add_bh(), gfs2 was testing if there was a bd attached to the
buffer without having the gfs2_log_lock held. It was then assuming it would
stay attached for the rest of the function. However, without either the log
lock being held or the buffer locked, __gfs2_ail_flush() could detach bd at any
time. This patch moves the locking before the test. If there isn't a bd
already attached, gfs2 can safely allocate one and attach it before locking.
There is no way that the newly allocated bd could be on the ail list,
and thus no way for __gfs2_ail_flush() to detach it.
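The reordering boils down to something like this (a simplified sketch of
gfs2_trans_add_bh(); the attach helper name is an assumption on my part):

	lock_buffer(bh);
	gfs2_log_lock(sdp);
	bd = bh->b_private;
	if (bd == NULL) {
		gfs2_log_unlock(sdp);
		unlock_buffer(bh);
		/* A fresh bd cannot be on the ail list yet, so
		 * __gfs2_ail_flush() cannot detach it under us. */
		gfs2_attach_bufdata(gl, bh, meta);
		bd = bh->b_private;
		lock_buffer(bh);
		gfs2_log_lock(sdp);
	}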
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
file_accessed() was being called by gfs2_mmap() with a shared glock. If it
needed to update the atime, it was crashing because it dirtied the inode in
gfs2_dirty_inode() without holding an exclusive lock. gfs2_dirty_inode()
checked if the caller was already holding a glock, but it didn't make sure that
the glock was in the exclusive state. Now, instead of calling file_accessed()
while holding the shared lock in gfs2_mmap(), file_accessed() is called after
grabbing and releasing the glock to update the inode. If file_accessed() needs
to update the atime, it will grab an exclusive lock in gfs2_dirty_inode().
gfs2_dirty_inode() now also checks to make sure that if the calling process has
already locked the glock, it has an exclusive lock.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The current implementation in gfs2 uses the FITRIM arguments as if they
were in file system block units, which is wrong. The FITRIM arguments
(fstrim_range.start, fstrim_range.len and fstrim_range.minlen) are
actually in bytes.
Moreover, checks for the start argument being beyond the end of the file
system, the len argument being smaller than the file system block size,
and the minlen argument being bigger than the biggest resource group were
missing.
This commit converts the FITRIM arguments to file system blocks and also
adds the appropriate checks mentioned above.
All the problems were recognised by xfstests 251 and 260.
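The conversion amounts to something like the following (a sketch; the gfs2
superblock field names are assumptions on my part, and the fstrim_range
fields are as in the uapi header):

	/* FITRIM arguments are in bytes; convert them to fs blocks and
	 * sanity check them before walking the resource groups. */
	if (r.len < sdp->sd_sb.sb_bsize)
		return -EINVAL;
	start = r.start >> sdp->sd_sb.sb_bsize_shift;
	end = start + (r.len >> sdp->sd_sb.sb_bsize_shift);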
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the fstrim_range argument is not provided by the user in the FITRIM
ioctl we should just return EFAULT and not promote bad behaviour by filling
in the structure in the kernel. Let the user deal with it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cleans up two cases where variables were assigned values but then never
used again.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Despite the return value from kmem_cache_zalloc() being checked, the
error wasn't being returned until after a possible null pointer
dereference. This patch returns the error immediately, allowing the
removal of the error variable.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Check the return value of gfs2_rs_alloc(ip) and avoid a possible null
pointer dereference.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We do not need to lookup a hashed negative dentry since we have
already revalidated it before and have found it to be fine.
This also prevents a crash in cifs_lookup() when it attempts to rehash
the already hashed negative lookup dentry.
The patch has been tested using the reproducer at
https://bugzilla.redhat.com/show_bug.cgi?id=867344#c28
Cc: <stable@kernel.org> # 3.6.x
Reported-by: Vit Zahradka <vit.zahradka@tiscali.cz>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
The userspace cifs.idmap program generally works with the wbclient libs
to generate binary SIDs in userspace. That program defines the struct
that holds these values as having a max of 15 subauthorities. The kernel
idmapping code however limits that value to 5.
When the kernel copies those values around though, it doesn't sanity
check the num_subauths value handed back from userspace or from the
server. It's possible therefore for userspace to hand us back a bogus
num_subauths value (or one that's valid, but greater than 5) that could
cause the kernel to walk off the end of the cifs_sid->sub_auths array.
Fix this by defining a new routine for copying sids and using that in
all of the places that copy it. If we end up with a sid that's longer
than expected then this approach will just lop off the "extra" subauths,
but that's basically what the code does today already. Better approaches
might be to fix this code to reject SIDs with >5 subauths, or fix it
to handle the subauths array dynamically.
At the same time, change the kernel to check the length of the data
returned by userspace. If it's shorter than struct cifs_sid, reject it
and return -EIO. If that happens we'll end up with fields that are
basically uninitialized.
Long term, it might make sense to redefine cifs_sid using a flexarray at
the end, to allow for variable-length subauth lists, and teach the code
to handle the case where the subauths array being passed in from
userspace is shorter than 5 elements.
Note too, that I don't consider this a security issue since you'd need
a compromised cifs.idmap program. If you have that, you can do all sorts
of nefarious stuff. Still, this is probably reasonable for stable.
Cc: stable@kernel.org
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Return errno - not an NFS4ERR_. This worked because NFS4ERR_ACCESS == EACCES.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_sb_deactive_async instead of nfs_sb_deactive when in a workqueue
context. This avoids a deadlock where rpc_shutdown_client loops forever
in a workqueue kworker context, trying to kill all RPC tasks associated with
the client, while one or more of these tasks have already been assigned to the
same kworker (and will never run rpc_exit_task).
This approach is needed because RPC tasks that have already been assigned
to a kworker by queue_work cannot be canceled, as explained in the comment
for workqueue.c:insert_wq_barrier.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
[Trond: add module_get/put.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since commit c7f404b ('vfs: new superblock methods to override
/proc/*/mount{s,info}'), nfs_path() is used to generate the mounted
device name reported back to userland.
nfs_path() always generates a trailing slash when the given dentry is
the root of an NFS mount, but userland may expect the original device
name to be returned verbatim (as it used to be). Make this
canonicalisation optional and change the callers accordingly.
[jrnieder@gmail.com: use flag instead of bool argument]
Reported-and-tested-by: Chris Hiestand <chiestand@salk.edu>
Reference: http://bugs.debian.org/669314
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: <stable@vger.kernel.org> # v2.6.39+
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In a very busy v3 environment, rpc.mountd can respond to the NULL
procedure but not the MNT procedure in a timely manner, causing
the MNT procedure to time out. The problem is the mount system
call returns EIO, which causes the mount to fail, instead of
ETIMEDOUT, which would cause the mount to be retried.
This patch passes the RPC_TASK_SOFT|RPC_TASK_TIMEOUT flags to
the rpc_call_sync() call in nfs_mount(), which causes
ETIMEDOUT to be returned on timed out connections.
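The change is essentially one flags argument on the MNT call (a sketch;
mnt_clnt is assumed to be the MNT RPC client variable in nfs_mount()):

	status = rpc_call_sync(mnt_clnt, &msg,
			       RPC_TASK_SOFT | RPC_TASK_TIMEOUT);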
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
The new layout pointer returned by pnfs_find_alloc_layout() may be NULL
because of an out-of-memory condition. We must check for this, otherwise
pnfs_free_layout_hdr() will go wrong because it cannot deal with a NULL
pointer.
Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The DNS resolver's use of the sunrpc cache involves a 'ttl' number
(relative) rather than a timeout (absolute). This confused me when
I wrote
commit c5b29f885a
"sunrpc: use seconds since boot in expiry cache"
and I managed to break it. The effect is that any TTL is interpreted
as 0, and nothing useful gets into the cache.
This patch removes the use of get_expiry() - which really expects an
expiry time - and uses get_uint() instead, treating the int correctly
as a ttl.
This fixes a regression that has been present since 2.6.37, causing
certain NFS accesses in certain environments to incorrectly fail.
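Sketched, the parsing change looks like this (get_uint() and
seconds_since_boot() are existing sunrpc helpers; the cache entry layout is
abbreviated):

	unsigned int ttl;

	/* The token is a relative TTL, not an absolute expiry time. */
	if (get_uint(&buf, &ttl) < 0)
		goto out;
	key.h.expiry_time = ttl + seconds_since_boot();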
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the state recovery machinery is triggered by the call to
nfs4_async_handle_error() then we can deadlock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
If we do not release the sequence id in cases where we fail to get a
session slot, then we can deadlock if we hit a recovery scenario.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Currently, we will schedule session recovery and then return to the
caller of nfs4_handle_exception. This works for most cases, but causes
a hang on the following test case:
Client Server
------ ------
Open file over NFS v4.1
Write to file
Expire client
Try to lock file
The server will return NFS4ERR_BADSESSION, prompting the client to
schedule recovery. However, the client will continue placing lock
attempts and the open recovery never seems to be scheduled. The
simplest solution is to wait for session recovery to run before retrying
the lock.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Jack Lin reports that the error return from dup3() for the RLIMIT_NOFILE
case changed incorrectly after 3.6.
The culprit is commit f33ff9927f ("take rlimit check to callers of
expand_files()") which when it moved the "return -EMFILE" out to the
caller, didn't notice that the dup3() had special code to turn the
EMFILE return into EBADF.
The replace_fd() helper that got added later then inherited the bug too.
Reported-by: Jack Lin <linliangjie@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ Noted more bugs, wrote proper changelog, fixed up typos - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfix from Ted Ts'o:
"This fixes the root cause of the ext4 data corruption bug which raised
a ruckus on LWN, Phoronix, and Slashdot.
This bug only showed up when non-standard mount options
(journal_async_commit and/or journal_checksum) were enabled, and when
the file system was not cleanly unmounted, but the root cause was that
the inode bitmap modifications were not being properly journaled.
This could potentially lead to minor file system corruptions (pass 5
complaints with the inode allocation bitmap) after an unclean shutdown
under the wrong/unlucky workloads, but it turned into major failure if
the journal_checksum and/or journal_async_commit was enabled."
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix unjournaled inode bitmap modification
Pull block driver update from Jens Axboe:
"Distilled down variant, the rest will pass over to 3.8. I pulled it
into the for-linus branch I had waiting for a pull request as well, in
case you are wondering why there are new entries in here too. This
also got rid of two reverts and the ones of the mtip32xx patches that
went in later in the 3.6 cycle, so the series looks a bit cleaner."
* 'for-linus' of git://git.kernel.dk/linux-block:
loop: Make explicit loop device destruction lazy
mtip32xx:Added appropriate timeout value for secure erase
xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
cciss: select CONFIG_CHECK_SIGNATURE
cciss: remove unneeded memset()
xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
pktcdvd: update MAINTAINERS
floppy: remove dr, reuse drive on do_floppy_init
floppy: use common function to check if floppies can be registered
floppy: properly handle failure on add_disk loop
floppy: do put_disk on current dr if blk_init_queue fails
floppy: don't call alloc_ordered_workqueue inside the alloc_disk loop
xen/blkback: Fix compile warning
block: Add blk_rq_pos(rq) to sort rq when flushing
drivers/block: remove CONFIG_EXPERIMENTAL
block: remove CONFIG_EXPERIMENTAL
vfs: fix: don't increase bio_slab_max if krealloc() fails
blkcg: stop iteration early if root_rl is the only request list
blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg
Pull Ceph fixes from Sage Weil:
"There are two fixes in the messenger code, one for a bug that can trigger
a NULL dereference, and one for an error in refcounting (extra put). There
is also a trivial fix in the fs client code for a problem triggered by NFS
reexport."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: fix dentry reference leak in encode_fh()
libceph: avoid NULL kref_put when osd reset races with alloc_msg
rbd: reset BACKOFF if unable to re-queue
Call to d_find_alias() needs a corresponding dput().
This fixes http://tracker.newdream.net/issues/3271
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
commit 119c0d4460 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.
Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming. The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"
[ Changed by tytso to only call ext4_journal_get_write_access() only
when we're fairly certain that we're going to allocate the inode. ]
I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box. Without the patch I hit a journal
checksum error every time. With this fix it survives
many iterations.
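The shape of the fix is to get journal write access on the bitmap buffer
before modifying it and to dirty it in the same handle (a simplified sketch
of the ext4_new_inode() path, not the exact hunk):

	BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
	err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
	if (err)
		goto fail;
	/* Only now is the bit set, under the journal's protection. */
	ext4_lock_group(sb, group);
	ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
	ext4_unlock_group(sb, group);
	err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
	if (err)
		goto fail;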
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Functions generic_file_splice_read and generic_file_splice_write access
the pagecache directly. For block devices these functions must be locked
so that block size is not changed while they are in progress.
This patch is an additional fix for commit b87570f5d3 ("Fix a crash
when block device is read and block size is changed at the same time")
that locked aio_read, aio_write and mmap against block size change.
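A sketch of the read side (the write side is symmetric). The
bd_block_size_semaphore field and the percpu rw-semaphore usage follow the
earlier commit; treat the exact names as assumptions:

static ssize_t blkdev_splice_read(struct file *file, loff_t *ppos,
				  struct pipe_inode_info *pipe, size_t len,
				  unsigned int flags)
{
	struct block_device *bdev = I_BDEV(file->f_mapping->host);
	ssize_t ret;

	/* Block size cannot change while we touch the pagecache. */
	percpu_down_read(&bdev->bd_block_size_semaphore);
	ret = generic_file_splice_read(file, ppos, pipe, len, flags);
	percpu_up_read(&bdev->bd_block_size_semaphore);
	return ret;
}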
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael J Wysocki:
- Fix for a memory leak in acpi_bind_one() from Jesper Juhl.
- Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
from Jonghwan Choi.
- Fix for smp_processor_id() usage in preemptible code in powernow-k8
from Andreas Herrmann.
- Fix for a suspend-related memory leak in cpufreq stats from Xiaobing
Tu.
- Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD in
flush_old_exec() from Oleg Nesterov.
- acpi_processor_notify() fix from Alan Cox.
* tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: missing break
freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
Fix memory leak in cpufreq stats.
cpufreq / powernow-k8: Remove usage of smp_processor_id() in preemptible code
PM / Domains: Fix memory leak on error path in pm_genpd_attach_cpuidle
ACPI: Fix memory leak in acpi_bind_one()
Merge tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg Kroah-Hartman:
"Here are a number of firmware core fixes for 3.7, and some other minor
fixes. And some documentation updates thrown in for good measure.
All have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Documentation:Chinese translation of Documentation/arm64/memory.txt
Documentation:Chinese translation of Documentation/arm64/booting.txt
Documentation:Chinese translation of Documentation/IRQ.txt
firmware loader: document kernel direct loading
sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
dynamic_debug: Remove unnecessary __used
firmware loader: sync firmware cache by async_synchronize_full_domain
firmware loader: let direct loading back on 'firmware_buf'
firmware loader: fix one reqeust_firmware race
firmware loader: cancel uncache work before caching firmware
In commit 800179c9b8 ("This adds symlink and hardlink restrictions to
the Linux VFS"), the new link protections were enabled by default, in
the hope that no actual application would care, despite it being
technically against legacy UNIX (and documented POSIX) behavior.
However, it does turn out to break some applications. It's rare, and
it's unfortunate, but it's unacceptable to break existing systems, so
we'll have to default to legacy behavior.
In particular, it has broken the way AFD distributes files, see
http://www.dwd.de/AFD/
along with some legacy scripts.
Distributions can end up setting this at initrd time or in system
scripts: if you have security problems due to link attacks during your
early boot sequence, you have bigger problems than some kernel sysctl
setting. Do:
echo 1 > /proc/sys/fs/protected_symlinks
echo 1 > /proc/sys/fs/protected_hardlinks
to re-enable the link protections.
Alternatively, we may at some point introduce a kernel config option
that sets these kinds of "more secure but not traditional" behavioural
options automatically.
Reported-by: Nick Bowler <nbowler@elliptictech.com>
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # v3.6
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"This has our series of fixes for the next rc. The biggest batch is
from Jan Schmidt, fixing up some problems in our subvolume quota code
and fixing btrfs send/receive to work with the new extended inode
refs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: do not bug when we fail to commit the transaction
Btrfs: fix memory leak when cloning root's node
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
Btrfs: fix deadlock caused by the nested chunk allocation
btrfs: Return EINVAL when length to trim is less than FSB
Btrfs: fix memory leak in btrfs_quota_enable()
Btrfs: send correct rdev and mode in btrfs-send
Btrfs: extended inode refs support for send mechanism
Btrfs: Fix wrong error handling code
Fix a sign bug causing invalid memory access in the ino_paths ioctl.
Btrfs: comment for loop in tree_mod_log_insert_move
Btrfs: fix extent buffer reference for tree mod log roots
Btrfs: determine level of old roots
Btrfs: tree mod log's old roots could still be part of the tree
Btrfs: fix a tree mod logging issue for root replacement operations
Btrfs: don't put removals from push_node_left into tree mod log twice
This is a bugfix for a problem with the following symptoms:
1. A power cut happens
2. After reboot, we try to mount UBIFS
3. Mount fails with "No space left on device" error message
UBIFS complains like this:
UBIFS error (pid 28225): grab_empty_leb: could not find an empty LEB
The root cause of this problem is that when we mount, not all LEBs are
categorized. Only those which were read are. However, the
'ubifs_find_free_leb_for_idx()' function assumes that all LEBs were
categorized and 'c->freeable_cnt' is valid, which is a false assumption.
This patch fixes the problem by teaching 'ubifs_find_free_leb_for_idx()'
to always fall back to LPT scanning if no freeable LEBs were found.
This problem was reported by a few people in the past, but Brent Taylor
was able to reproduce it and send me a flash image which could not be mounted,
which made it easy to hunt down the bug. Kudos to Brent.
Reported-by: Brent Taylor <motobud@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org
This commit is a preparation for a subsequent bugfix. We introduce a
counter for categorized lprops.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org
Merge tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS bugfixes from Trond Myklebust:
- Fix the NFSv2/v3 kernel statd protocol, which broke due to net
namespace related changes.
- Fix a number of races in the SUNRPC TCP disconnect/reconnect code.
* tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
LOCKD: fix races in nsm_client_get
SUNRPC: Get rid of the xs_error_report socket callback
SUNRPC: Prevent races in xs_abort_connection()
Revert "SUNRPC: Ensure we close the socket on EPIPE errors too..."
SUNRPC: Clear the connect flag when socket state is TCP_CLOSE_WAIT
The compat ioctl for VIDEO_SET_SPU_PALETTE was missing an error check
while converting ioctl arguments. This could lead to leaking kernel
stack contents into userspace.
Patch extracted from existing fix in grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
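A hedged sketch of the shape of the fix (simplified; the variable and field
names are assumptions, not a quote of the actual fs/compat_ioctl.c change):
check the result of copying the ioctl arguments before acting on them, so
uninitialized on-stack values are never propagated back to userspace.

err  = get_user(palp, &up->palette);
err |= get_user(length, &up->length);
if (err)
	return -EFAULT;	/* previously missing: without this check the
			 * uninitialized stack values could be copied
			 * back out, leaking kernel stack contents */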
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious. Remove the BUG_ON(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
After cloning root's node, we forgot to decrement the source buffer's
reference count, which can lead to a memory leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot. Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.
Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
The deadlock is caused by nested chunk allocation. When we wrote data into
the filesystem, we had to allocate a data chunk because there was no data
chunk in the filesystem yet. At the end of the data chunk allocation we have
to insert the metadata of the data chunk into the extent tree, but there was
no raid1 chunk for it, so we tried to take the chunk allocation mutex to
allocate a new chunk while we were already holding it, and deadlocked.
By rights, we would have allocated the raid1 chunk when we added the second
device, because the profile of the seed filesystem is raid1 and we had two
devices. But we did not, because the last step of the first device insertion
did not commit the transaction. So when we added the second device, we did
not COW the tree and just inserted the relevant metadata into the leaves
generated by the first device insertion, whose profile was dup.
Fix this problem by committing the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Currently, if the len argument in btrfs_ioctl_fitrim() is smaller than
one FSB, we continue and finally return 0 bytes discarded.
However, if the length to discard is smaller than the file system block
size, we should really return EINVAL.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
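A hedged sketch of the added check in btrfs_ioctl_fitrim() (the local
variable names are illustrative, not a quote of the diff):

/* a trim range shorter than one file system block cannot discard anything */
if (range.len < fs_info->sb->s_blocksize)
	return -EINVAL;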
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
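For illustration only (a standalone snippet, not the Btrfs loop in
question): with an unsigned counter, a countdown loop written with 'i >= 0'
can never terminate through its condition, which is exactly what gcc is
warning about. One common rewrite is shown below.

#include <stdio.h>

int main(void)
{
	unsigned long i;

	/*
	 * Buggy form (triggers "comparison of unsigned expression >= 0 is
	 * always true" and wraps around instead of stopping):
	 *
	 *	for (i = 3; i >= 0; i--)
	 *		...
	 *
	 * A common fix is to test before decrementing:
	 */
	for (i = 3; i-- > 0; )
		printf("%lu\n", i);	/* prints 2, 1, 0 and stops */

	return 0;
}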
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:
ls -i
btrfs inspect inode-resolve -v $ino /mnt/btrfs
I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.
The warning check for duplicate sysfs entries can cause a buffer overflow
when printing the warning, as strcat() doesn't check buffer sizes.
Use strlcat() instead.
Since strlcat() doesn't return a pointer to the passed buffer, unlike
strcat(), I had to convert the nested concatenation in sysfs_add_one() to
an admittedly more obscure comma operator construct, to avoid emitting code
for the concatenation if CONFIG_BUG is disabled.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
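A hedged, simplified sketch of the pattern described above (parent_path and
name are illustrative placeholders; this is not the literal sysfs diff):
strlcat() is bounded by the buffer size, and because it does not return the
buffer pointer, nesting it in the warning's format argument needs the comma
operator.

char path[PATH_MAX];

path[0] = '\0';
WARN(1, KERN_WARNING "sysfs: cannot create duplicate filename '%s'\n",
     (strlcat(path, parent_path, sizeof(path)),
      strlcat(path, "/", sizeof(path)),
      strlcat(path, name, sizeof(path)),
      path));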
The current code clears ln->nsm_clnt in all cases _except_ when
ln->nsm_users is zero.
Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
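A hedged sketch of the intended teardown logic (the lock name and locals are
assumptions, not the literal fs/lockd/mon.c change): the client pointer is
only cleared, and the client shut down, once the last user is gone.

struct rpc_clnt *clnt = NULL;

spin_lock(&ln->nsm_clnt_lock);		/* assumed per-net lock name */
if (ln->nsm_users > 0 && --ln->nsm_users == 0) {
	clnt = ln->nsm_clnt;
	ln->nsm_clnt = NULL;
}
spin_unlock(&ln->nsm_clnt_lock);

if (clnt != NULL)
	rpc_shutdown_client(clnt);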
Commit e9406db20f (lockd: per-net
NSM client creation and destruction helpers introduced) contains
a nasty race on initialisation of the per-net NSM client because
it doesn't check whether or not the client is set after grabbing
the nsm_create_mutex.
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
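A hedged sketch of the usual fix for this kind of race (make_nsm_client is
an illustrative placeholder, not the real helper): re-check the cached
pointer after taking nsm_create_mutex, so two tasks racing past an unlocked
check cannot both create a client.

struct rpc_clnt *clnt;

mutex_lock(&nsm_create_mutex);
clnt = ln->nsm_clnt;
if (clnt == NULL) {
	/* nobody beat us to it while we waited for the mutex */
	clnt = make_nsm_client(net);	/* illustrative placeholder */
	if (!IS_ERR(clnt))
		ln->nsm_clnt = clnt;
}
mutex_unlock(&nsm_create_mutex);
return clnt;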
Emphasize the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is now reversed: take the reference
first, then the lock.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewound root node.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull core kernel fixes from Ingo Molnar:
"Two small fixes"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: Reflect the new location of the NMI watchdog info
nohz: Fix idle ticks in cpu summary line of /proc/stat
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in
two places. Where needed, we call tree_mod_log_free_eb explicitly now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Independent of the check (push_items < src_items), tree_mod_log_eb_copy
logged the removal of the old data entries from the source buffer. Therefore,
we must not call tree_mod_log_eb_move if the check evaluates to true, as
that would log the removal twice, finally resulting in (rewound) buffers
with wrong values for header_nritems.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Various bug fixes for ext4. The most serious of them fixes a security
bug (CVE-2012-4508) which leads to stale data exposure when we have
fallocate racing against writes to files undergoing delayed
allocation. We also have two fixes for the metadata checksum feature,
the most serious of which can cause the superblock to have a invalid
checksum after a power failure."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Avoid underflow in ext4_trim_fs()
ext4: Checksum the block bitmap properly with bigalloc enabled
ext4: fix undefined bit shift result in ext4_fill_flex_info
ext4: fix metadata checksum calculation for the superblock
ext4: race-condition protection for ext4_convert_unwritten_extents_endio
ext4: serialize fallocate with ext4_convert_unwritten_extents
Merge tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Do not call pnfs_return_layout() from an rpciod context
- nfs4_ds_disconnect can cause Oopses. Kill it...
- Fix the return value for nfs_callback_start_svc
- Fix a number of compile warnings
* tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix the return value for nfs_callback_start_svc
NFSv4.1: Declare osd_pri_2_pnfs_err(), objio_init_read/write to be static
NFSv4: fs/nfs/nfs4getroot.c needs to include "internal.h"
NFSv4.1: Use kcalloc() to allocate zeroed arrays instead of kzalloc()
NFSv4.1: Do not call pnfs_return_layout() from an rpciod context
NFSv4.1: Kill nfs4_ds_disconnect()
Currently, if the len argument in ext4_trim_fs() is smaller than one block,
the 'end' variable underflows. Avoid that by returning EINVAL if len is
smaller than the file system block size.
Also remove useless unlikely().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Without the patch, bio_slab_max, representing bio_slabs capacity, is
increased before krealloc() of bio_slabs. If krealloc() fails, bio_slab_max
is too high. Fix that by only updating bio_slab_max if krealloc() is
successful.
Signed-off-by: Anna Leuschner <anna.m.leuschner@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
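A hedged sketch of the ordering fix (struct bio_slab and the temporary
names are assumptions, not a quote of the fs/bio.c diff):

unsigned int new_bio_slab_max = bio_slab_max << 1;
struct bio_slab *new_bio_slabs;

new_bio_slabs = krealloc(bio_slabs,
			 new_bio_slab_max * sizeof(struct bio_slab),
			 GFP_KERNEL);
if (!new_bio_slabs)
	goto out_unlock;	/* error path of the surrounding function */
/* only commit the new capacity once the reallocation succeeded */
bio_slab_max = new_bio_slab_max;
bio_slabs = new_bio_slabs;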
In certain cases (for example when a cdev structure is embedded into
another object whose lifetime is controlled by a separate kobject) it is
beneficial to tie the lifetime of another object to the lifetime of the
character device, so that the related object is not freed until after the
char_dev object is freed.
To achieve this, let's pin the kobject's parent when doing cdev_add() and
unpin it when the last reference to the cdev structure is released.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
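A hedged sketch of the idea (add_to_char_dev_map and cdev_release are
illustrative names; this is not the literal fs/char_dev.c diff): pin the
cdev's parent kobject on cdev_add() and drop the pin from the cdev
kobject's release path.

int cdev_add(struct cdev *p, dev_t dev, unsigned count)
{
	int error;

	error = add_to_char_dev_map(p, dev, count);	/* existing setup */
	if (error)
		return error;

	/* pin the object embedding this cdev until the cdev itself is gone */
	kobject_get(p->kobj.parent);
	return 0;
}

static void cdev_release(struct kobject *kobj)
{
	struct kobject *parent = kobj->parent;

	/* existing cdev teardown (container_of(kobj, struct cdev, kobj)) runs here */
	kobject_put(parent);	/* unpin once the last cdev reference drops */
}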
In mke2fs, we only checksum the whole bitmap block, and that is correct.
In the kernel, however, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksummed bitmap, which is wrong when bigalloc is enabled.
The right size is EXT4_CLUSTERS_PER_GROUP, and this patch fixes it.
Also, since every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify passes in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better remove this parameter and set it in the functions themselves.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
/proc/<pid>/numa_maps scans VMAs and shows the mempolicy under
mmap_sem. It sometimes accesses task->mempolicy, which can
be freed without mmap_sem held, so numa_maps can show
garbage while scanning.
This patch takes a reference count on task->mempolicy when reading
numa_maps, before calling get_vma_policy(). This way, task->mempolicy
will not be freed until numa_maps reaches its end.
V2->V3
- updated comments to be more verbose.
- removed task_lock() in numa_maps code.
V1->V2
- access task->mempolicy only once and remember it, because kernel/exit.c
can overwrite it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
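A hedged sketch of the refcounting pattern described above (the helper
names and the task_mempolicy field are illustrative, not a quote of the
fs/proc change):

static void hold_task_mempolicy(struct proc_maps_private *priv)
{
	struct task_struct *task = priv->task;

	task_lock(task);
	/* read task->mempolicy only once; exit can overwrite it */
	priv->task_mempolicy = task->mempolicy;
	mpol_get(priv->task_mempolicy);
	task_unlock(task);
}

static void release_task_mempolicy(struct proc_maps_private *priv)
{
	mpol_put(priv->task_mempolicy);
}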
Pull nfsd bugfixes from J Bruce Fields.
* 'for-3.7' of git://linux-nfs.org/~bfields/linux:
SUNRPC: Prevent kernel stack corruption on long values of flush
NLM: nlm_lookup_file() may return NLMv4-specific error codes