There was a report of a data corruption
http://lkml.org/lkml/2008/11/14/121. There is a script included to
reproduce the problem.
During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce complexity of the problem. I found that
fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode. This points to memory scribble, or
synchronisation problme.
i_state of I_NEW inodes is not protected by inode_lock because other
processes are not supposed to touch them until I_LOCK (and I_NEW) is
cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
i_state revealed that generic_sync_sb_inodes is picking up new inodes from
the inode lists and passing them to __writeback_single_inode without
waiting for I_NEW. Subsequently modifying i_state causes corruption. In
my case it would look like this:
CPU0 CPU1
unlock_new_inode() __sync_single_inode()
reg <- inode->i_state
reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state
reg -> inode->i_state reg -> reg | I_SYNC
reg -> inode->i_state
Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
Fix for this is rather than wait for I_NEW inodes, just skip over them:
inodes concurrently being created are not subject to data integrity
operations, and should not significantly contribute to dirty memory
either.
After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1hour of running. Previously, the new warnings would start
immediately and hang would happen in under 5 minutes.
I'm also testing on ext3 now, and so far no problems there either. I
don't know whether this fixes the problem reported above, but it fixes a
real problem for me.
Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In sget(), destroy_super(s) is called with s->s_umount held, which makes
lockdep unhappy.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the second fasync_helper() fails, pipe_rdwr_fasync() returns the error
but leaves the file on ->fasync_readers.
This was always wrong, but since 233e70f422
"saner FASYNC handling on file close" we have the new problem. Because in
this case setfl() doesn't set FASYNC bit, __fput() will not do
->fasync(0), and we leak fasync_struct with ->fa_file pointing to the
freed file.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The corrupted filesystem patch added a check against zlib trying to
output too much data in the presence of data corruption. This check
triggered if zlib_inflate asked to be called again (Z_OK) with
avail_out == 0 and no more output buffers available. This check proves
to be rather dumb, as it incorrectly catches the case where zlib has
generated all the output, but there are still input bytes to be processed.
This patch does a number of things. It removes the original check and
replaces it with code to not move to the next output buffer if there
are no more output buffers available, relying on zlib to error if it
wants an extra output buffer in the case of data corruption. It
also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH
flag, and makes the error messages more understandable to
non-technical users.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Stefan Lippers-Hollmann <s.L-H@gmx.de>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: only issues a cache flush on unmount if barriers are enabled
xfs: prevent lockdep false positive in xfs_iget_cache_miss
xfs: prevent kernel crash due to corrupted inode log format
On swapon() path, it has already i_mutex. So, this uses i_alloc_sem
instead of it.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Laurent GUERBY <laurent@guerby.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kpf_copy_bit(src,dst) to be kpf_copy_bit(dst,src) to match the
actual call patterns, e.g. kpf_copy_bit(kflags, KPF_LOCKED, PG_locked).
This misplacement of src/dst only affected reporting of PG_writeback,
PG_reclaim and PG_buddy. For others kflags==uflags so not affected.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Very annoying when working with containters.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places. But, it wasn't correct for UP compiled
kernels.
This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we unconditionally issue a flush from xfs_free_buftarg, but
since 2.6.29-rc1 this gives a warning in the style of
end_request: I/O error, dev vdb, sector 0
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The inode can't be locked by anyone else as we just created it a few
lines above and it's not been added to any lookup data structure yet.
So use a trylock that must succeed to get around the lockdep warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Andras Korn reported an oops on log replay causes by a corrupted
xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles
to small or too large numbers of log regions gracefully by rejecting the
log replay with a useful error message.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
This fixes a code regression caused by the recent mainlining changes.
The recent code changes call zlib_inflate repeatedly, decompressing into
separate 4K buffers, this code didn't check for the possibility that
zlib_inflate might ask for too many buffers when decompressing corrupted
data.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
I was seeing fsck errors on inode bitmaps after a 4 thread
dbench run on a 4 cpu machine:
Inode bitmap differences: -50736 -(50752--50753) etc...
I believe that this is because ext4_free_inode() uses atomic
bitops, and although ext4_new_inode() *used* to also use atomic
bitops for synchronization, commit
393418676a changed this to use
the sb_bgl_lock, so that we could also synchronize against
read_inode_bitmap and initialization of uninit inode tables.
However, that change left ext4_free_inode using atomic bitops,
which I think leaves no synchronization between setting &
unsetting bits in the inode table.
The below patch fixes it for me, although I wonder if we're
getting at all heavy-handed with this spinlock...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: don't call jbd2_journal_force_commit_nested without journal
ext4: Reorder fs/Makefile so that ext2 root fs's are mounted using ext2
ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()
Commit 8e961870bb removed the FREEZE/THAW
handling in xfs_compat_ioctl but never added any compat handler back, so
now any freeze/thaw request from a 32-bit binary ond 64-bit userspace
will fail.
As these ioctls are 32/64-bit compatible two simple COMPATIBLE_IOCTL
entries in fs/compat_ioctl.c will do the job.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4ea3ada295 declares d_obtain_alias()
as EXPORT_SYMBOL_GPL where it's supposed to replace d_alloc_anon which was
previously declared as EXPORT_SYMBOL and thus available to any loadable
module.
This patch reverts that.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch set a gap (4 bytes) between xattr entry and
name/value when xattr in bucket. This gap use to seperate
entry and name/value when a bucket is full. It had already
been set when xattr in inode/block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. While for xattr bucket, it is calculated by
the whole buckets. So we have to add a spin_lock to prevent multiple
processes calculating metaecc.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ctime updating of xattr, it use the wrong type of access for
inode, so use ocfs2_journal_access_di instead.
Reported-and-Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In dlm_assert_master_handler(), if we get an incorrect assert master from a node
that, we reply with EINVAL asking the asserter to die. The problem is that an
assert is sent after so many hoops, it is invariably the node that thinks the
asserter is wrong, is actually wrong. So instead of killing the asserter, this
patch kills the assertee.
This patch papers over a race that is still being addressed.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The code was using dlm->spinlock instead of dlm->ast_lock to protect the
ast_list. This patch fixes the issue.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The dentry lock has a different format than other locks. This patch fixes
ocfs2_log_dlm_error() macro to make it print the dentry lock correctly.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Mainline commit d4f7e650e5 attempts to delay
the dlm_thread from sending the drop ref message if the lockres is being
migrated. The problem is that we make the dlm_thread wait for the migration
to complete. This causes a deadlock as dlm_thread also participates in the
lockres migration process.
A better fix for the original oss bugzilla#1012 is in testing.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In __ocfs2_mark_extent_written, when we meet with the situation
of c_split_covers_rec, the old solution just replace the extent
record and forget to access and dirty the buffer_head. This will
cause a problem when the unwritten extent is in an extent block.
So access and dirty it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Newer gcc throw this warning:
fs/bio.c: In function ?bio_alloc_bioset?:
fs/bio.c:305: warning: ?p? may be used uninitialized in this function
since it cannot figure out that 'p' is only ever used if 'bs' is non-NULL.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Running without a journal, I oopsed when I ran out of space,
because we called jbd2_journal_force_commit_nested() from
ext4_should_retry_alloc() without a journal.
This should take care of it, I think.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In fs/Makefile, ext3 was placed before ext2 so that a root filesystem
that possessed a journal, it would be mounted as ext3 instead of ext2.
This was necessary because a cleanly unmounted ext3 filesystem was
fully backwards compatible with ext2, and could be mounted by ext2 ---
but it was desirable that it be mounted with ext3 so that the
journaling would be enabled.
The ext4 filesystem supports new incompatible features, so there is no
danger of an ext4 filesystem being mistaken for an ext2 filesystem.
At that point, the relative ordering of ext4 with respect to ext2
didn't matter until ext4 gained the ability to mount filesystems
without a journal starting in 2.6.29-rc1. Now that this is the case,
given that ext4 is before ext2, it means that root filesystems that
were using the plain-jane ext2 format are getting mounted using the
ext4 filesystem driver, which is a change in behavior which could be
surprising to users.
It's doubtful that there are that many ext2-only root filesystem users
that would also have ext4 compiled into the kernel, but to adhere to
the principle of least surprise, the correct ordering in fs/Makefile
is ext3, followed by ext2, and finally ext4.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Commit c4be0c1d added error checking to ext4_freeze() when calling
ext4_commit_super(). Unfortunately the patch failed to remove the
original call to ext4_commit_super(), with the net result that when
freezing the filesystem, the superblock gets written twice, the first
time without error checking.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix deadlock in ext4_write_begin() and ext4_da_write_begin()
ext4: Add fallback for find_group_flex
de_get is called before every proc_get_inode, but corresponding de_put is
called only when dropping last reference to an inode. This might cause
something like
remove_proc_entry: /proc/stats busy, count=14496
to be printed to the syslog.
The fix is to call de_put in case of an already initialized inode in
proc_get_inode.
Signed-off-by: Krzysztof Sachanowicz <analyzer1@gmail.com>
Tested-by: Marcin Pilipczuk <marcin.pilipczuk@gmail.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Functions ext4_write_begin() and ext4_da_write_begin() call
grab_cache_page_write_begin() without AOP_FLAG_NOFS. Thus it
can happen that page reclaim is triggered in that function
and it recurses back into the filesystem (or some other filesystem).
But this can lead to various problems as a transaction is already
started at that point. Add the necessary flag.
http://bugzilla.kernel.org/show_bug.cgi?id=11688
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This is a workaround for find_group_flex() which badly needs to be
replaced. One of its problems (besides ignoring the Orlov algorithm)
is that it is a bit hyperactive about returning failure under
suspicious circumstances. This can lead to spurious ENOSPC failures
even when there are inodes still available.
Work around this for now by retrying the search using
find_group_other() if find_group_flex() returns -1. If
find_group_other() succeeds when find_group_flex() has failed, log a
warning message.
A better block/inode allocator that will fix this problem for real has
been queued up for the next merge window.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix multiuser mounts so server does not invalidate earlier security contexts
[CIFS] improve posix semantics of file create
[CIFS] Fix oops in cifs_strfromUCS_le mounting to servers which do not specify their OS
cifs: posix fill in inode needed by posix open
cifs: properly handle case where CIFSGetSrvInodeNumber fails
cifs: refactor new_inode() calls and inode initialization
[CIFS] Prevent OOPs when mounting with remote prefixpath.
[CIFS] ipv6_addr_equal for address comparison
At scan time we observed following scenario:
node A inserted
node B inserted
node C inserted -> sets overlapped flag on node B
node A is removed due to CRC failure -> overlapped flag on node B remains
while (tn->overlapped)
tn = tn_prev(tn);
==> crash, when tn_prev(B) is referenced.
When the ultimate node is removed at scan time and the overlapped flag
is set on the penultimate node, then nothing updates the overlapped
flag of that node. The overlapped iterators blindly expect that the
ultimate node does not have the overlapped flag set, which causes the
scan code to crash.
It would be a huge overhead to go through the node chain on node
removal and fix up the overlapped flags, so detecting such a case on
the fly in the overlapped iterators is a simpler and reliable
solution.
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
When two different users mount the same Windows 2003 Server share using CIFS,
the first session mounted can be invalidated. Some servers invalidate the first
smb session when a second similar user (e.g. two users who get mapped by server to "guest")
authenticates an smb session from the same client.
By making sure that we set the 2nd and subsequent vc numbers to nonzero values,
this ensures that we will not have this problem.
Fixes Samba bug 6004, problem description follows:
How to reproduce:
- configure an "open share" (full permissions to Guest user) on Windows 2003
Server (I couldn't reproduce the problem with Samba server or Windows older
than 2003)
- mount the share twice with different users who will be authenticated as guest.
noacl,noperm,user=john,dir_mode=0700,domain=DOMAIN,rw
noacl,noperm,user=jeff,dir_mode=0700,domain=DOMAIN,rw
Result:
- just the mount point mounted last is accessible:
Signed-off-by: Steve French <sfrench@us.ibm.com>
Samba server added support for a new posix open/create/mkdir operation
a year or so ago, and we added support to cifs for mkdir to use it,
but had not added the corresponding code to file create.
The following patch helps improve the performance of the cifs create
path (to Samba and servers which support the cifs posix protocol
extensions). Using Connectathon basic test1, with 2000 files, the
performance improved about 15%, and also helped reduce network traffic
(17% fewer SMBs sent over the wire) due to saving a network round trip
for the SetPathInfo on every file create.
It should also help the semantics (and probably the performance) of
write (e.g. when posix byte range locks are on the file) on file
handles opened with posix create, and adds support for a few flags
which would have to be ignored otherwise.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fixes kernel bug #10451http://bugzilla.kernel.org/show_bug.cgi?id=10451
Certain NAS appliances do not set the operating system or network operating system
fields in the session setup response on the wire. cifs was oopsing on the unexpected
zero length response fields (when trying to null terminate a zero length field).
This fixes the oops.
Acked-by: Jeff Layton <jlayton@redhat.com>
CC: stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...if it does then we pass a pointer to an unintialized variable for
the inode number to cifs_new_inode. Have it pass a NULL pointer instead.
Also tweak the function prototypes to reduce the amount of casting.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Move new inode creation into a separate routine and refactor the
callers to take advantage of it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>