Current hugetlb strict accounting for shared mappings always assumes that the
mapping starts at file offset zero and reserves pages between zero and the size
of the file. This assumption often reserves (or locks down) a lot more pages than
necessary if the application maps at a non-zero file offset. libhugetlbfs is
one example that requires proper reservation on shared mappings that start at
a non-zero offset.
This patch extends the reservation and hugetlb strict accounting to support
any arbitrary pair of (offset, len), resulting in a much more robust and
accurate scheme. More importantly, it won't lock down any hugetlb pages
outside the file mapping.
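As a rough illustration of the difference (a user-space sketch, not the kernel
code), the two schemes can be compared as below. The 2 MB huge page size, the
helper names and the "reserve up to the end of the mapping" model of the old
behaviour are all assumptions made for the example.

```c
#include <assert.h>

/* Assumed huge page size for illustration (2 MB); real systems vary. */
#define HPAGE_SHIFT 21
#define HPAGE_SIZE  (1ULL << HPAGE_SHIFT)

/* Old scheme (hypothetical model): reserve from file offset 0 to the end. */
static unsigned long long reserve_from_zero(unsigned long long offset,
					    unsigned long long len)
{
	assert(len > 0);
	return (offset + len + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
}

/* New scheme: reserve only the huge pages covering [offset, offset + len). */
static unsigned long long reserve_range(unsigned long long offset,
					unsigned long long len)
{
	unsigned long long first, last;

	assert(len > 0);
	first = offset >> HPAGE_SHIFT;
	last = (offset + len + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
	return last - first;
}
```

Mapping 4 huge pages at an offset of 100 huge pages would reserve 104 pages
under the first model and 4 under the second.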
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu in xfs.
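The distinction between the CPU sets matters because iterating over them can
give different answers. The following is a user-space sketch only: the bitmask
representation, NR_CPUS_SKETCH and sum_counters() are hypothetical and stand in
for the kernel's cpumask machinery.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 8

/* Sum a per-CPU counter over every CPU whose bit is set in mask. */
static long sum_counters(const long *percpu, unsigned int mask)
{
	long sum = 0;
	int cpu;

	assert(percpu != 0);
	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (mask & (1u << cpu))
			sum += percpu[cpu];
	return sum;
}
```

With four possible CPUs (mask 0x0f) of which two are online (mask 0x05), the
two iterations visit different sets; which one is correct depends on the
caller, which is why the explicit name helps.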
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Enable XFS to limit the statfs() results to the project quota covering the
dentry used as a base for the call.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.
This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.
linux/mount.h has been added where necessary to make allyesconfig build
successfully.
Interest has also been expressed for use with the FUSE and XFS filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience functions is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees, which results in
dentries being left unculled.
However, with the way NFS superblock sharing is currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Add description of d_lock handling to comments over prune_one_dentry().
- It has three callsites - uninline it, saving 200 bytes of text.
Cc: Jan Blunck <jblunck@suse.de>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Olaf Hering <olh@suse.de>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The race is that the shrink_dcache_memory shrinker could get called while a
filesystem is being unmounted, and could try to prune a dentry belonging to
that filesystem.
If it does, then it will call in to iput on the inode while the dentry is
no longer able to be found by the umounting process. If iput takes a
while, generic_shutdown_super could get all the way through
shrink_dcache_parent and shrink_dcache_anon and invalidate_inodes without
ever waiting on this particular inode.
Eventually the superblock gets freed anyway and if the iput tried to touch
it (which some filesystems certainly do), it will lose. The promised
"Self-destruct in 5 seconds" doesn't lead to a nice day.
The race is closed by holding s_umount while calling prune_one_dentry on
someone else's dentry. As a down_read_trylock is used,
shrink_dcache_memory will no longer try to prune the dentry of a filesystem
that is being unmounted, and unmount will not be able to start until any
such active prune_one_dentry completes.
This requires that prune_dcache *knows* which filesystem (if any) it is
doing the prune on behalf of so that it can be careful of other
filesystems. shrink_dcache_memory isn't called on behalf of any
filesystem, and so is careful of everything.
shrink_dcache_anon is now passed a super_block rather than the s_anon list
out of the superblock, so it can get the s_anon list itself, and can pass
the superblock down to prune_dcache.
If prune_dcache finds a dentry that it cannot free, it leaves it where it
is (at the tail of the list) and exits, on the assumption that some other
thread will be removing that dentry soon. To try to make sure that some
work gets done, a limited number of dentries which are untouchable are
skipped over while choosing the dentry to work on.
I believe this race was first found by Kirill Korotaev.
Cc: Jan Blunck <jblunck@suse.de>
Acked-by: Kirill Korotaev <dev@openvz.org>
Cc: Olaf Hering <olh@suse.de>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the steal_locks() function.
steal_locks() doesn't work correctly with any filesystem that does its own
lock management, including NFS, CIFS, etc.
In addition it has weird semantics on local filesystems in case tasks
sharing file-descriptor tables are doing POSIX locking operations in
parallel to execve().
The steal_locks() function has an effect on applications doing:
    clone(CLONE_FILES)
      /* in child */
      lock
      execve
      lock
POSIX locks acquired before execve (by "child", "parent" or any further
task sharing files_struct) will after the execve be owned exclusively by
"child".
According to Chris Wright some LSB/LTP kind of suite triggers without the
stealing behavior, but there's no known real-world application that would
also fail.
Apps using NPTL are not affected, since all other threads are killed before
execve.
Apps using LinuxThreads are only affected if they
- have multiple threads during exec (LinuxThreads doesn't kill other
threads, the app may do it with pthread_kill_other_threads_np())
- rely on POSIX locks being inherited across exec
Both conditions are documented, but not their interaction.
Apps using clone() natively are affected if they
- use clone(CLONE_FILES)
- rely on POSIX locks being inherited across exec
The above scenarios are unlikely, but possible.
If the patch is vetoed, there's a plan B, that involves mostly keeping the
weird stealing semantics, but changing the way lock ownership is handled so
that network and local filesystems work consistently.
That would add more complexity though, so this solution seems to be
preferred by most people.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This race caused an oops, and can be reproduced with the following:
    while true; do
        dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
    done
This race condition was between __sync_single_inode() and iput().
    cpu0 (fs's inode)                cpu1 (bdev's inode)
    -----------------                -------------------
                                     close("/dev/hda2")
                                     [...]
    __sync_single_inode()
       /* copy the bdev's ->i_mapping */
       mapping = inode->i_mapping;
                                     generic_forget_inode()
                                     bdev_clear_inode()
                                        /* restore the fs's ->i_mapping */
                                        inode->i_mapping = &inode->i_data;
                                     /* bdev's inode was freed */
                                     destroy_inode(inode);
       if (wait) {
          /* dereference a freed bdev's mapping->host */
          filemap_fdatawait(mapping);  /* Oops */
Since __sync_single_inode() only takes a ref-count on the fs's inode,
another process can close() and free the bdev's inode while the fs's inode
is being written. So __sync_single_inode() accesses the freed ->i_mapping:
oops.
This patch takes a ref-count on the bdev's inode for the fs's inode before
setting a ->i_mapping, and the clear_inode() of the fs's inode does iput() on
the bdev's inode. So if the fs's inode is still living, bdev's inode
shouldn't be freed.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many thanks to Pauline Ng for the detailed bug report and analysis!
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://oss.sgi.com:8090/xfs-2.6: (43 commits)
[XFS] Remove files from the build that are now unused.
[XFS] Fix a Makefile issue related to exports.o handling.
[XFS] Remove version 1 directory code. Never functioned on Linux, just
[XFS] Map EFSCORRUPTED to an actual error code, not just a made up one
[XFS] Kill direct access to ->count in valusema(); all we ever use it for
[XFS] Remove unneeded conditional code on NFS export interface related
[XFS] Remove an incorrect use of unlikely() on a relatively likely code
[XFS] Push some common code out of write path into core XFS code for
[XFS] Remove unnecessary local from open_exec dmapi path.
[XFS] Minor XFS documentation updates.
[XFS] Fix broken const use inside local suffix_strtoul routine.
[XFS] Fix nused counter. It's currently getting set to -1 rather than
[XFS] Fix mismerge of the fs_writable cleanup patch causing a freeze/thaw
[XFS] Fix up debug code so that bulkstat wont generate thousands of
[XFS] Remove unused parameter from di2xflags routine.
[XFS] Cleanup a missed porting conversion, and freezing.
[XFS] Resolve a namespace collision on remaining vtypes for FreeBSD
[XFS] Resolve a namespace collision on vnode/vnodeops for FreeBSD porters.
[XFS] Resolve a namespace collision on vfs/vfsops for FreeBSD porters.
[XFS] statvfs component of directory/project quota support, code
...
Like the SUBSYSTEM= key we find in the environment of the uevent, this
creates a generic "subsystem" link in sysfs for every device. Userspace
usually doesn't care at all if it's a "class" or a "bus" device. This
provides a unified way to determine the subsystem of a device, regardless
of the way the driver core has created it.
Signed-off-by: Kay Sievers <kay.sievers@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* git://git.infradead.org/~dwmw2/rbtree-2.6:
[RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
Update UML kernel/physmem.c to use rb_parent() accessor macro
[RBTREE] Update hrtimers to use rb_parent() accessor macro.
[RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
[RBTREE] Merge colour and parent fields of struct rb_node.
[RBTREE] Remove dead code in rb_erase()
[RBTREE] Update JFFS2 to use rb_parent() accessor macro.
[RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
[RBTREE] Update key.c to use rb_parent() accessor macro.
[RBTREE] Update ext3 to use rb_parent() accessor macro.
[RBTREE] Change rbtree off-tree marking in I/O schedulers.
[RBTREE] Add accessor macros for colour and parent fields of rb_node
* git://git.infradead.org/mtd-2.6: (199 commits)
[MTD] NAND: Fix breakage all over the place
[PATCH] NAND: fix remaining OOB length calculation
[MTD] NAND Fixup NDFC merge brokeness
[MTD NAND] S3C2410 driver cleanup
[MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation.
[JFFS2] Check CRC32 on dirent and data nodes each time they're read
[JFFS2] When retiring nextblock, allocate a node_ref for the wasted space
[JFFS2] Mark XATTR support as experimental, for now
[JFFS2] Don't trust node headers before the CRC is checked.
[MTD] Restore MTD_ROM and MTD_RAM types
[MTD] assume mtd->writesize is 1 for NOR flashes
[MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles
[MTD] Prepare physmap for 64-bit-resources
[JFFS2] Fix more breakage caused by janitorial meddling.
[JFFS2] Remove stray __exit from jffs2_compressors_exit()
[MTD] Allow alternate JFFS2 mount variant for root filesystem.
[MTD] Disconnect struct mtd_info from ABI
[MTD] replace MTD_RAM with MTD_GENERIC_TYPE
[MTD] replace MTD_ROM with MTD_GENERIC_TYPE
[MTD] remove a forgotten MTD_XIP
...
Allow callers to remove watches from their event handler via
inotify_remove_watch_locked(). This functionality can be used to
achieve IN_ONESHOT-like functionality for a subset of events in the
mask.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add inotify_init_watch() so caller can use inotify_watch refcounts
before calling inotify_add_watch().
Add inotify_find_watch() to find an existing watch for an (ih,inode)
pair. This is similar to inotify_find_update_watch(), but does not
update the watch's mask if one is found.
Add inotify_rm_watch() to remove a watch via the watch pointer instead
of the watch descriptor.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When an inotify event includes a dentry name, also include the inode
associated with that name.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following series of patches introduces a kernel API for inotify,
making it possible for kernel modules to benefit from inotify's
mechanism for watching inodes. With these patches, inotify will
maintain for each caller a list of watches (via an embedded struct
inotify_watch), where each inotify_watch is associated with a
corresponding struct inode. The caller registers an event handler and
specifies for which filesystem events their event handler should be
called per inotify_watch.
Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(990). Turns out some ye-olde unices used EUCLEAN as
Filesystem-needs-cleaning, so now we use that too.
SGI-PV: 953954
SGI-Modid: xfs-linux-melb:xfs-kern:26286a
Signed-off-by: Nathan Scott <nathans@sgi.com>
is check if semaphore is actually locked, which can be trivially done in
portable way. Code gets more readable, while we are at it...
SGI-PV: 953915
SGI-Modid: xfs-linux-melb:xfs-kern:26274a
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Also, make sure dirents are marked REF_UNCHECKED when we 'discover' them
through eraseblock summary.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Failing to do so makes the calculated length of the last node incorrect,
when we're not using eraseblock summaries.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Especially when summary code is used, we can have in-memory data
structures referencing certain nodes without them actually being readable
on the flash. Discard the nodes gracefully in that case, rather than
triggering a BUG().
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
If get_user_pages() returns fewer pages than what we asked for, we jump
to out_unmap which will return ERR_PTR(ret). But ret can contain a
positive number just smaller than local_nr_pages, so be sure to set it
to -EFAULT always.
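The shape of the fix can be sketched in user space as follows. This is an
illustration only: fake_pin_pages() is a hypothetical stand-in for
get_user_pages(), and map_user_pages() returns an int rather than an ERR_PTR()
to keep the example self-contained.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for get_user_pages(): returns the number of pages
 * actually pinned (possibly fewer than requested) or a negative errno.
 */
static int fake_pin_pages(int requested, int available)
{
	if (available < 0)
		return available;
	return requested < available ? requested : available;
}

/* Returns 0 on success or a negative errno, never a positive page count. */
static int map_user_pages(int nr_pages, int available)
{
	int ret;

	assert(nr_pages > 0);
	ret = fake_pin_pages(nr_pages, available);
	if (ret < nr_pages) {
		/* A short pin leaves ret positive, so force a real error. */
		ret = -EFAULT;
		goto out_unmap;
	}
	return 0;

out_unmap:
	/* Releasing any partially pinned pages would go here. */
	return ret;
}
```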
Problem found and diagnosed by Damien Le Moal <damien@sdl.hitachi.co.jp>
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If flock_lock_file() failed to allocate flock with locks_alloc_lock()
then "error = 0" is returned. It needs to return a non-zero error instead.
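A minimal user-space sketch of the corrected error path is shown below. The
struct, alloc_lock() and set_lock() are hypothetical names, and the choice of
-ENOMEM is an assumption; the message above only requires a non-zero value.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct lock_sketch {
	int type;
};

/* Hypothetical allocator; fail_alloc simulates memory exhaustion. */
static struct lock_sketch *alloc_lock(int fail_alloc)
{
	return fail_alloc ? NULL : malloc(sizeof(struct lock_sketch));
}

static int set_lock(int type, int fail_alloc)
{
	struct lock_sketch *new_fl;
	int error = 0;

	new_fl = alloc_lock(fail_alloc);
	if (new_fl == NULL) {
		/* Without this assignment the caller would see success. */
		error = -ENOMEM;
		goto out;
	}
	new_fl->type = type;
	assert(new_fl->type == type);
	free(new_fl);
out:
	return error;
}
```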
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
jffs2_zlib_exit() and free_workspaces() shouldn't be marked __exit because
they get called in the error case from the init functions.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
getting decremented by 1. Since nused never reaches 0, the "if
(!free->hdr.nused)" check in xfs_dir2_leafn_remove() fails every time and
xfs_dir2_shrink_inode() doesn't get called when it should. This causes
extra blocks to be left on an empty directory and the directory is unable
to be converted back to inline extent mode.
SGI-PV: 951958
SGI-Modid: xfs-linux-melb:xfs-kern:211382a
Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
truncate down followed by delayed allocation (buffered writes) - worst
case scenario for the notorious NULL files problem. This reduces the
window where we are exposed to that problem significantly.
SGI-PV: 917976
SGI-Modid: xfs-linux-melb:xfs-kern:26100a
Signed-off-by: Nathan Scott <nathans@sgi.com>
init_rwsem() has no return value. This is not a problem if init_rwsem()
is a function, but it's a problem if it's a do { ... } while (0) macro.
(which lockdep introduces)
SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:26082a
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nathan Scott <nathans@sgi.com>
logged version of di_next_unlinked which is actually always stored in the
correct ondisk format. This was pointed out to us by Shailendra Tripathi
and is evident in XFS QA test 121.
SGI-PV: 953263
SGI-Modid: xfs-linux-melb:xfs-kern:26044a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
transaction completion from marking the inode dirty while it is being
cleaned up on its way out of the system.
SGI-PV: 952967
SGI-Modid: xfs-linux-melb:xfs-kern:26040a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
64bit kernels allow recovery to handle both versions and do the necessary
decoding
SGI-PV: 952214
SGI-Modid: xfs-linux-melb:xfs-kern:26011a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
transaction within each such operation may involve multiple locking of AGF
buffer. Although the extent-freeing function sorts the extents by AGF number
before entering the transaction, when file system space is very limited the
allocator will try every AGF to get space allocated. This can cause
out-of-order locking, and thus deadlock. This fix mitigates the scarce space
for allocation by setting aside a few blocks without reservation, and avoids
deadlock by maintaining ascending order of AGF locking.
SGI-PV: 947395
SGI-Modid: xfs-linux-melb:xfs-kern:210801a
Signed-off-by: Yingping Lu <yingping@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
ATTR_NOLOCK flag, but this was split off some time ago, as ATTR_DMI needed
to be used separately. Two asserts were added to guard correctness of the
code during the transition. These are no longer required.
SGI-PV: 952145
SGI-Modid: xfs-linux-melb:xfs-kern:209633a
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
the range spanned by modifications to the in-core extent map. Add
XFS_BUNMAPI() and XFS_SWAP_EXTENTS() macros that call xfs_bunmapi() and
xfs_swap_extents() via the ioops vector. Change all calls that may modify
the in-core extent map for the data fork to go through the ioops vector.
This allows a cache of extent map data to be kept in sync.
SGI-PV: 947615
SGI-Modid: xfs-linux-melb:xfs-kern:209226a
Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Looking at the reiser4 crash, I found a leak in debugfs. In
debugfs_mknod(), we create the inode before checking if the dentry
already has one attached. We don't free it if that is the case.
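The ordering problem can be sketched in user space as follows: check the
dentry before allocating, so there is nothing to leak on the early return. All
names here (dentry_sketch, mknod_sketch(), live_inodes) are hypothetical, and
the error codes are assumptions rather than what debugfs returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct inode_sketch {
	int mode;
};

struct dentry_sketch {
	struct inode_sketch *inode;
};

/* Counts allocations so a leak would be visible to the caller. */
static int live_inodes;

static struct inode_sketch *new_inode_sketch(int mode)
{
	struct inode_sketch *inode = malloc(sizeof(*inode));

	if (inode) {
		inode->mode = mode;
		live_inodes++;
	}
	return inode;
}

static int mknod_sketch(struct dentry_sketch *dentry, int mode)
{
	struct inode_sketch *inode;

	assert(dentry != NULL);
	/* Check first, so no inode is allocated only to be dropped. */
	if (dentry->inode)
		return -EEXIST;

	inode = new_inode_sketch(mode);
	if (!inode)
		return -ENOMEM;
	dentry->inode = inode;
	return 0;
}
```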
These bugs happen quite often, I'm starting to think we should disallow
such coding in CodingStyle.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Trond Myklebust <Trond.Myklebust@netapp.com>
We're presently running lock_kernel() under fs_lock via nfs's ->permission
handler. That's a ranking bug and sometimes a sleep-in-spinlock bug. This
problem was introduced in the openat() patchset.
We should not need to hold the current->fs->lock for a codepath that doesn't
use current->fs.
[vsu@altlinux.ru: fix error path]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's used from the initfunc in case of failure too. We could actually do
with an '__initexit' for this kind of thing -- when built in to the
kernel, it could do with being dropped with the init text. We _could_
actually just use __init for it, but that would break if/when we start
dropping init text from modules. So let's just leave it as it was for now,
and mutter a little more about random 'janitorial' fixes from people who
aren't paying attention to what they're doing.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
From: Andrew Morton <akpm@osdl.org>
Spotted by Jan Capek <jca@sysgo.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Jan Capek <jca@sysgo.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I wasn't able to reproduce a hard hang, but I was able to get an oops if I
suspended the machine during a copy to the cifs mount. This led to some
things hanging, including a "sync". I also got I/O errors when trying to
access the mount afterwards (even when I didn't see the oops), and had
to unmount and remount in order to access the filesystem.
This patch fixed the oops.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Return -EUCLEAN on read when a bitflip was detected and corrected, so the
clients can react and eventually copy the affected block to a spare one.
Make all in-kernel users aware of the change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hopefully the last iteration on this!
The handling of out of band data on NAND was accompanied by tons of fruitless
discussions and halfarsed patches to make it work for a particular
problem. Sufficiently annoyed by all those "I know it better" mails and the
reasonable amount of discarded "it solves my problem" patches, I finally decided
to go for the big rework. After removing the _ecc variants of mtd read/write
functions the solution to satisfy the various requirements was to refactor the
read/write _oob functions in mtd.
The major change is that read/write_oob now takes a pointer to an operation
descriptor structure "struct mtd_oob_ops" instead of having a function with at
least seven arguments.
read/write_oob, which should probably be renamed to a more descriptive name, can do
the following tasks:
- read/write out of band data
- read/write data content and out of band data
- read/write raw data content and out of band data (ecc disabled)
struct mtd_oob_ops has a mode field, which determines the oob handling mode.
Aside from the MTD_OOB_RAW mode, which is intended to be especially for
diagnostic purposes and some internal functions e.g. bad block table creation,
the other two modes are for mtd clients:
MTD_OOB_PLACE puts/gets the given oob data exactly to/from the place which is
described by the ooboffs and ooblen fields of the mtd_oob_ops structure. It's
up to the caller to make sure that the byte positions are not used by the ECC
placement algorithms.
MTD_OOB_AUTO puts/gets the given oob data automatically to/from the places in
the out of band area which are described by the oobfree tuples in the ecclayout
data structure which is associated to the device.
The decision whether data plus oob or oob only handling is done depends on the
setting of the datbuf member of the data structure. When datbuf == NULL then
the internal read/write_oob functions are selected, otherwise the read/write
data routines are invoked.
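The datbuf-based selection described above can be sketched as below. This is a
simplified illustration, not the real struct mtd_oob_ops: the _sketch types
carry only the fields named in this message, plus an assumed oobbuf pointer,
and select_path() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the three modes named above; the values are illustrative. */
enum oob_mode_sketch { OOB_PLACE_SKETCH, OOB_AUTO_SKETCH, OOB_RAW_SKETCH };

struct oob_ops_sketch {
	enum oob_mode_sketch mode;
	size_t ooboffs;		/* offset into the oob area (PLACE mode) */
	size_t ooblen;		/* number of oob bytes to transfer */
	unsigned char *datbuf;	/* NULL selects oob-only handling */
	unsigned char *oobbuf;	/* assumed field: buffer for the oob bytes */
};

enum path_sketch { PATH_OOB_ONLY, PATH_DATA_AND_OOB };

/* Pick the internal routine the way the text describes: by datbuf. */
static enum path_sketch select_path(const struct oob_ops_sketch *ops)
{
	assert(ops != NULL);
	return ops->datbuf == NULL ? PATH_OOB_ONLY : PATH_DATA_AND_OOB;
}
```

Passing one descriptor instead of seven or more arguments also means a new
field can be added later without touching every caller's argument list.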
Tested on a few platforms with all variants. Please be aware of possible
regressions for your particular device / application scenario.
Disclaimer: Any whining will be ignored from those who just contributed "hot
air blurb" and never sat down to tackle the underlying problem of the mess in
the NAND driver grown over time and the big chunk of work to fix up the
existing users. The problem was not the holiness of the existing MTD
interfaces. The problem was the lack of time to go for the big overhaul. It's
easy to add more mess to the existing one, but it takes a lot of effort to go
for a real solution.
Improvements and bugfixes are welcome!
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Most of those macros are unused and the used ones just obfuscate
the code. Remove them and fixup all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The nand_oobinfo structure is not fitting the newer error correction
demands anymore. Replace it by struct nand_ecclayout and fixup the users
all over the place. Keep the nand_oobinfo based ioctl for user space
compatibility reasons.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The info structure for out of band data was copied into
the mtd structure. Make it a pointer and remove the ability
to set it from userspace. The position of ecc bytes is
defined by the hardware and should not be changed by software.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This allows us to drop another pointer from the struct jffs2_raw_node_ref,
shrinking it to 8 bytes on 32-bit machines (if the TEST_TOTLEN paranoia
check is turned off, which will be committed soon).
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Print wasted_size in scanned eraseblocks, print range correctly for
summary dirent and inode entries.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Preallocation of refs is shortly going to be a per-eraseblock thing,
rather than per-filesystem. Add the required argument to the function.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>