linux_old1

Commit Graph

Author	SHA1	Message	Date
Christoph Hellwig	0634857488	Btrfs: enable discard support The discard support code in btrfs currently is guarded by ifdefs for BIO_RW_DISCARD, which is never defined as it's the name of an enum member. Just remove the useless ifdefs to actually enable the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-14 10:32:49 -04:00
Christoph Hellwig	e244a0aeb6	Btrfs: add -o discard option Enabling discard by default is not a good idea given the trim speed of SSD prototypes we've seen, and the characteristics of many high-end arrays. Turn off discards by default and require the -o discard option to enable them. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-14 10:32:49 -04:00
Yan, Zheng	86df7eb921	Btrfs: properly wait log writers during log sync A recent fsync optimization makes btrfs_sync_log skip calling wait_for_writer in the single log writer case. This is incorrect since the writer count can also be increased by btrfs_pin_log. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-14 10:32:48 -04:00
Josef Bacik	5d5e103a70	Btrfs: fix possible ENOSPC problems with truncate There's a problem where we don't do any space reservation for truncates, which can cause you to OOPs because you will be allowed to go off in the weeds a bit since we don't account for the delalloc bytes that are created as a result of the truncate. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-14 10:32:47 -04:00
Alex Elder	ba313e68fa	Merge branch 'master' of ssh://oss.sgi.com/oss/git/xfs/xfs into for-linus	2009-10-13 15:47:22 -05:00
Christoph Hellwig	05277c75f6	xfs: fix double IRELE in xfs_dqrele_inode xfs_dqrele_inode calls xfs_iput to release the ilock and a reference and then also calls IRELE which does a second decrement of the reference count. This leads to a premature freeing of inodes when quotas were turned off while the filesystem was mounted. Thanks to Utako Kusaka for reporting the bug and providing a good testcase. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Utako Kusaka <u-kusaka@wm.jp.nec.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-13 13:16:36 -05:00
Chris Mason	0eda294dfc	Btrfs: fix btrfs acl #ifdef checks The btrfs acl code was #ifdefing for a define that didn't exist. This correctly matches it to the values used by the Kconfig file. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-13 13:51:39 -04:00
Chris Mason	690587d109	Btrfs: streamline tree-log btree block writeout Syncing the tree log is a 3 phase operation. 1) write and wait for all the tree log blocks for a given root. 2) write and wait for all the tree log blocks for the tree of tree log roots. 3) write and wait for the super blocks (barriers here) This isn't as efficient as it could be because there is no requirement to wait for the blocks from step one to hit the disk before we start writing the blocks from step two. This commit changes the sequence so that we don't start waiting until all the tree blocks from both steps one and two have been sent to disk. We do this by breaking up btrfs_write_wait_marked_extents into two functions, which is trivial because it was already broken up into two parts. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-13 13:35:12 -04:00
Chris Mason	257c62e1bc	Btrfs: avoid tree log commit when there are no changes rpm has a habit of running fdatasync when the file hasn't changed. We already detect if a file hasn't been changed in the current transaction but it might have been sent to the tree-log in this transaction and not changed since the last call to fsync. In this case, we want to avoid a tree log sync, which includes a number of synchronous writes and barriers. This commit extends the existing tracking of the last transaction to change a file to also track the last sub-transaction. The end result is that rpm -ivh and -Uvh are roughly twice as fast, and on par with ext3. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-13 13:35:12 -04:00
Chris Mason	4722607db6	Btrfs: only write one super copy during fsync During a tree-log commit for fsync, we've been writing at least two copies of the super block and forcing them to disk. The other filesystems write only one, and this change brings us on par with them. A full transaction commit will write all the super copies, so we still have redundant info written on a regular basis. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-13 13:35:11 -04:00
Linus Torvalds	80f506918f	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cciss: Add cciss_allow_hpsa module parameter cciss: Fix multiple calls to pci_release_regions blk-settings: fix function parameter kernel-doc notation writeback: kill space in debugfs item name writeback: account IO throttling wait as iowait elv_iosched_store(): fix strstrip() misuse cfq-iosched: avoid probable slice overrun when idling cfq-iosched: apply bool value where we return 0/1 cfq-iosched: fix think time allowed for seekers cfq-iosched: fix the slice residual sign cfq-iosched: abstract out the 'may this cfqq dispatch' logic block: use proper BLK_RW_ASYNC in blk_queue_start_tag() block: Seperate read and write statistics of in_flight requests v2 block: get rid of kblock_schedule_delayed_work() cfq-iosched: fix possible problem with jiffies wraparound cfq-iosched: fix issue with rq-rq merging and fifo list ordering	2009-10-13 10:21:33 -07:00
Theodore Ts'o	96ec2e0a71	ext3: Don't update superblock write time when filesystem is read-only This avoids updating the superblock write time when we are mounting the root file system read/only but we need to replay the journal. For people who are east of GMT and who make their clock tick in localtime for Windows bug-for-bug compatibility, updating the write time at that point will cause e2fsck to complain and force a full file system check. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>	2009-10-13 00:06:43 +02:00
Stefan Richter	a1be9eee29	NFS: suppress a build warning struct sockaddr_storage * can safely be used as struct sockaddr *. Suppress an "incompatible pointer type" warning. Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-12 10:25:12 -07:00
Bernd Schmidt	ef1f7a7e87	ROMFS: fix length used with romfs_dev_strnlen() function An interestingly corrupted romfs file system exposed a problem with the romfs_dev_strnlen function: it's passing the wrong value to its helpers. Rather than limit the string to the length passed in by the callers, it uses the size of the device as the limit. Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-11 11:33:56 -07:00
Linus Torvalds	474a503d4b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix file clone ioctl for bookend extents Btrfs: fix uninit compiler warning in cow_file_range_nocow Btrfs: constify dentry_operations Btrfs: optimize back reference update during btrfs_drop_snapshot Btrfs: remove negative dentry when deleting subvolumne Btrfs: optimize fsync for the single writer case Btrfs: async delalloc flushing under space pressure Btrfs: release delalloc reservations on extent item insertion Btrfs: delay clearing EXTENT_DELALLOC for compressed extents Btrfs: cleanup extent_clear_unlock_delalloc flags Btrfs: fix possible softlockup in the allocator Btrfs: fix deadlock on async thread startup	2009-10-11 11:23:13 -07:00
Alexey Dobriyan	d43c36dc6b	headers: remove sched.h from interrupt.h After m68k's task_thread_info() doesn't refer to current, it's possible to remove sched.h from interrupt.h and not break m68k! Many thanks to Heiko Carstens for allowing this. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2009-10-11 11:20:58 -07:00
Linus Torvalds	4047df09a1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: ima: ecryptfs fix imbalance message eCryptfs: Remove Kconfig NET dependency and select MD5 ecryptfs: depends on CRYPTO	2009-10-09 13:30:14 -07:00
Linus Torvalds	a372bf8b6a	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop calling filemap_fdatawait inside ->fsync fix readahead calculations in xfs_dir2_leaf_getdents() xfs: make sure xfs_sync_fsdata covers the log xfs: mark inodes dirty before issuing I/O xfs: cleanup ->sync_fs xfs: fix xfs_quiesce_data xfs: implement ->dirty_inode to fix timestamp handling	2009-10-09 13:29:42 -07:00
Chris Mason	ac6889cbb2	Btrfs: fix file clone ioctl for bookend extents The file clone ioctl was incorrectly taking the offset into the extent on disk into account when calculating the length of the cloned extent. The length never changes based on the offset into the physical extent. Test case: fallocate -l 1g image mke2fs image bcp image image2 e2fsck -f image2 (errors on image2) The math bug ends up wrapping the length of the extent, and things go wrong from there. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-09 11:29:53 -04:00
Chris Mason	e9061e2148	Btrfs: fix uninit compiler warning in cow_file_range_nocow The extent_type variable was exposed uninit via a goto. It should be impossible to trigger because it is protected by a check on another variable, but this makes sure. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-09 09:57:45 -04:00
Alexey Dobriyan	82d339d9b3	Btrfs: constify dentry_operations Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-09 09:54:36 -04:00
Yan, Zheng	94fcca9f89	Btrfs: optimize back reference update during btrfs_drop_snapshot This patch avoids reading level 0 tree blocks that already use full backrefs. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-09 09:25:16 -04:00
Yan, Zheng	efefb1438b	Btrfs: remove negative dentry when deleting subvolumne btrfs_dentry_delete is used to remove dentries from the dcache when deleting a subvolume. btrfs_dentry_delete ignores negative dentries. This is incorrect since if we don't remove the negative dentry, its parent dentry can't be removed. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-09 09:25:16 -04:00
Linus Torvalds	32b7a567c8	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFSv4: Kill nfs4_renewd_prepare_shutdown() NFSv4: Fix the referral mount code nfs: Avoid overrun when copying client IP address string NFS: Fix port initialisation in nfs_remount() NFS: Fix port and mountport display in /proc/self/mountinfo NFS: Fix a default mount regression...	2009-10-08 14:15:19 -07:00
Josef Bacik	ff782e0a13	Btrfs: optimize fsync for the single writer case This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie for new people to join the logging transaction if there is only ever 1 writer. This helps a little bit with latency where we have something like RPM where it will fdatasync every file it writes, and so waiting the 1 jiffie for every fdatasync really starts to add up. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-08 15:30:04 -04:00
Josef Bacik	e3ccfa9897	Btrfs: async delalloc flushing under space pressure This patch moves the delalloc flushing that occurs when we are under space pressure off to a async thread pool. This helps since we only free up metadata space when we actually insert the extent item, which means it takes quite a while for space to be free'ed up if we wait on all ordered extents. However, if space is freed up due to inline extents being inserted, we can wake people who are waiting up early, and they can finish their work. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-08 15:21:23 -04:00
Josef Bacik	32c00aff71	Btrfs: release delalloc reservations on extent item insertion This patch fixes an issue with the delalloc metadata space reservation code. The problem is we used to free the reservation as soon as we allocated the delalloc region. The problem with this is if we are not inserting an inline extent, we don't actually insert the extent item until after the ordered extent is written out. This patch does 3 things, 1) It moves the reservation clearing stuff into the ordered code, so when we remove the ordered extent we remove the reservation. 2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear delalloc bits in the cases where we want to clear the metadata reservation when we clear the delalloc extent, in the case that we do an inline extent or we invalidate the page. 3) It adds another waitqueue to the space info so that when we start a fs wide delalloc flush, anybody else who also hits that area will simply wait for the flush to finish and then try to make their allocation. This has been tested thoroughly to make sure we did not regress on performance. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-08 15:21:10 -04:00
Chris Mason	a3429ab70b	Btrfs: delay clearing EXTENT_DELALLOC for compressed extents When compression is on, the cow_file_range code is farmed off to worker threads. This allows us to do significant CPU work in parallel on SMP machines. But it is a delicate balance around when we clear flags and how. In the past we cleared the delalloc flag immediately, which was safe because the pages stayed locked. But this is causing problems with the newest ENOSPC code, and with the recent extent state cleanups we can now clear the delalloc bit at the same time the uncompressed code does. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-08 15:11:50 -04:00
Chris Mason	a791e35e12	Btrfs: cleanup extent_clear_unlock_delalloc flags extent_clear_unlock_delalloc has a growing set of ugly parameters that is very difficult to read and maintain. This switches to a flag field and well named flag defines. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-08 15:11:49 -04:00
Alex Elder	e09d39968b	Merge branch 'master' into for-linus	2009-10-08 13:53:44 -05:00
Christoph Hellwig	d0800703fe	xfs: stop calling filemap_fdatawait inside ->fsync Now that the VFS actually waits for the data I/O to complete before calling into ->fsync we can stop doing it ourselves. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:02:48 -05:00
Eric Sandeen	8e69ce1471	fix readahead calculations in xfs_dir2_leaf_getdents() This is for bug #850, http://oss.sgi.com/bugzilla/show_bug.cgi?id=850 XFS file system segfaults , repeatedly and 100% reproducable in 2.6.30 , 2.6.31 The above only showed up on a CONFIG_XFS_DEBUG=y kernel, because xfs_bmapi() ASSERTs that it has been asked for at least one map, and it was getting 0. The root cause is that our guesstimated "bufsize" from xfs_file_readdir was fairly small, and the bufsize -= length; in the loop was going negative - except bufsize is a size_t, so it was wrapping to a very large number. Then when we did ra_want = howmany(bufsize + mp->m_dirblksize, mp->m_sb.sb_blocksize) - 1; with that very large number, the (int) ra_want was coming out negative, and a subsequent compare: if (1 + ra_want > map_blocks ... was coming out -true- (negative int compare w/ uint) and we went back to xfs_bmapi() for more, even though we did not need more, and asked for 0 maps, and hit the ASSERT. We have kind of a type mess here, but just keeping bufsize from going negative is probably sufficient to avoid the problem. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:02:12 -05:00
Dave Chinner	dce5065a57	xfs: make sure xfs_sync_fsdata covers the log We want to always cover the log after writing out the superblock, and in case of a synchronous writeout make sure we actually wait for the log to be covered. That way a filesystem that has been sync()ed can be considered clean by log recovery. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:49 -05:00
Dave Chinner	932640e8ad	xfs: mark inodes dirty before issuing I/O To make sure they get properly waited on in sync when I/O is in flight and we later need to update the inode size. Requires a new helper to check if an ioend structure is beyond the current EOF. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:26 -05:00
Christoph Hellwig	69961a26b8	xfs: cleanup ->sync_fs Sort out ->sync_fs to not perform a superblock writeback for the wait = 0 case as that is just an optional first pass and the superblock will be written back properly in the next call with wait = 1. Instead perform an opportunistic quota writeback to have less work later. Also remove the freeze special case as we do a proper wait = 1 call in the freeze code anyway. Also rename the function to xfs_fs_sync_fs to match the normal naming convention, update comments and avoid calling into the laptop_mode logic on an error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:01:03 -05:00
Dave Chinner	c90b07e8dd	xfs: fix xfs_quiesce_data We need to do a synchronous xfs_sync_fsdata to make sure the superblock actually is on disk when we return. Also remove SYNC_BDFLUSH flag to xfs_sync_inodes because that particular flag is never checked. Move xfs_filestream_flush call later to only release inodes after they have been written out. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:00:36 -05:00
Christoph Hellwig	f9581b1443	xfs: implement ->dirty_inode to fix timestamp handling This is picking up on Felix's repost of Dave's patch to implement a .dirty_inode method. We really need this notification because the VFS keeps writing directly into the inode structure instead of going through methods to update this state. In addition to the long-known atime issue we now also have a caller in VM code that updates c/mtime that way for shared writeable mmaps. And I found another one that no one has noticed in practice in the FIFO code. So implement ->dirty_inode to set i_update_core whenever the inode gets externally dirtied, and switch the c/mtime handling to the same scheme we already use for atime (always picking up the value from the Linux inode). Note that this patch also removes the xfs_synchronize_atime call in xfs_reclaim; it was superfluous as we already synchronize the time when writing the inode via the log (xfs_inode_item_format) or the normal buffers (xfs_iflush_int). In addition also remove the I_CLEAR check before copying the Linux timestamps - now that we always have the Linux inode available we can always use the timestamps in it. Also switch to just using file_update_time for regular reads/writes - that will get us all optimization done to it for free and make sure we notice early when it breaks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2009-10-08 12:00:03 -05:00
Mimi Zohar	36520be8e3	ima: ecryptfs fix imbalance message The unencrypted files are being measured. Update the counters to get rid of the ecryptfs imbalance message. (http://bugzilla.redhat.com/519737) Reported-by: Sachin Garg Cc: Eric Paris <eparis@redhat.com> Cc: Dustin Kirkland <kirkland@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: David Safford <safford@watson.ibm.com> Cc: stable@kernel.org Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-10-08 11:31:38 -05:00
Tyler Hicks	ed1f21857e	eCryptfs: Remove Kconfig NET dependency and select MD5 eCryptfs no longer uses a netlink interface to communicate with ecryptfsd, so NET is not a valid dependency anymore. MD5 is required and must be built for eCryptfs to be of any use. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-10-08 11:31:36 -05:00
Randy Dunlap	664fc5a4e7	ecryptfs: depends on CRYPTO ecryptfs uses crypto APIs so it should depend on CRYPTO. Otherwise many build errors occur. [63 lines not pasted] Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: ecryptfs-devel@lists.launchpad.net Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>	2009-10-08 11:21:12 -05:00
Trond Myklebust	3050141bae	NFSv4: Kill nfs4_renewd_prepare_shutdown() The NFSv4 renew daemon is shared between all active super blocks that refer to a particular NFS server, so it is wrong to be shutting it down in nfs4_kill_super every time a super block is destroyed. This patch therefore kills nfs4_renewd_prepare_shutdown altogether, and leaves it up to nfs4_shutdown_client() to also shut down the renew daemon by means of the existing call to nfs4_kill_renewd(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-08 11:50:55 -04:00
Wu Fengguang	253fb02d62	pagemap: export KPF_HWPOISON This flag indicates a hardware detected memory corruption on the page. Any future access of the page data may bring down the machine. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-08 07:36:39 -07:00
Jaswinder Singh Rajput	4055e97318	fs: includecheck fix: proc, kcore.c fix the following 'make includecheck' warning: fs/proc/kcore.c: linux/mm.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-08 07:36:38 -07:00
Trond Myklebust	517be09def	NFSv4: Fix the referral mount code Fix a typo which causes try_location() to use the wrong length argument when calling nfs_parse_server_name(). This, in turn, causes the initialisation of the mount's sockaddr structure to fail. Also ensure that if nfs4_pathname_string() returns an error, then we pass that error back up the stack instead of ENOENT. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-06 15:42:20 -04:00
Ben Hutchings	f4373bf9e6	nfs: Avoid overrun when copying client IP address string As seen in <http://bugs.debian.org/549002>, nfs4_init_client() can overrun the source string when copying the client IP address from nfs_parsed_mount_data::client_address to nfs_client::cl_ipaddr. Since these are both treated as null-terminated strings elsewhere, the copy should be done with strlcpy() not memcpy(). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-06 15:42:18 -04:00
Trond Myklebust	bcd2ea17da	NFS: Fix port initialisation in nfs_remount() The recent changeset `53a0b9c4c9` (NFS: Replace nfs_parse_ip_address() with rpc_pton()) broke nfs_remount, since the call to rpc_pton() will zero out the port number in data->nfs_server.address. This is actually due to a bug in nfs_remount: it should be looking at the port number in nfs_server.port instead... This fixes bug http://bugzilla.kernel.org/show_bug.cgi?id=14276 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-06 15:41:22 -04:00
Trond Myklebust	f5855fecda	NFS: Fix port and mountport display in /proc/self/mountinfo Currently, the port and mount port will both display as 65535 if you do not specify a port number. That would be wrong... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-06 15:40:37 -04:00
Trond Myklebust	c5811dbdd2	NFS: Fix a default mount regression... With the recent spate of changes, the nfs protocol version will now default to 2 instead of 3, while the mount protocol version defaults to 3. The following patch should ensure the defaults are consistent with the previous defaults of vers=3,proto=tcp,mountvers=3,mountproto=tcp. This fixes the bug http://bugzilla.kernel.org/show_bug.cgi?id=14259 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-10-06 15:40:15 -04:00
Steve French	8347a5cdd1	[CIFS] Fixing to avoid invalid kfree() in cifs_get_tcp_session() trivial bug in fs/cifs/connect.c . The bug is caused by fail of extract_hostname() when mounting cifs file system. This is the situation when I noticed this bug. % sudo mount -t cifs //192.168.10.208 mountpoint -o options... Then my kernel says, [ 1461.807776] ------------[ cut here ]------------ [ 1461.807781] kernel BUG at mm/slab.c:521! [ 1461.807784] invalid opcode: 0000 [#2] PREEMPT SMP [ 1461.807790] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:09:02.0/resource [ 1461.807793] CPU 0 [ 1461.807796] Modules linked in: nls_iso8859_1 usbhid sbp2 uhci_hcd ehci_hcd i2c_i801 ohci1394 ieee1394 psmouse serio_raw pcspkr sky2 usbcore evdev [ 1461.807816] Pid: 3446, comm: mount Tainted: G D 2.6.32-rc2-vanilla [ 1461.807820] RIP: 0010:[<ffffffff810b888e>] [<ffffffff810b888e>] kfree+0x63/0x156 [ 1461.807829] RSP: 0018:ffff8800b4f7fbb8 EFLAGS: 00010046 [ 1461.807832] RAX: ffffea00033fff98 RBX: ffff8800afbae7e2 RCX: 0000000000000000 [ 1461.807836] RDX: ffffea0000000000 RSI: 000000000000005c RDI: ffffffffffffffea [ 1461.807839] RBP: ffff8800b4f7fbf8 R08: 0000000000000001 R09: 0000000000000000 [ 1461.807842] R10: 0000000000000000 R11: ffff8800b4f7fbf8 R12: 00000000ffffffea [ 1461.807845] R13: ffff8800afb23000 R14: ffff8800b4f87bc0 R15: ffffffffffffffea [ 1461.807849] FS: 00007f52b6f187c0(0000) GS:ffff880007600000(0000) knlGS:0000000000000000 [ 1461.807852] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1461.807855] CR2: 0000000000613000 CR3: 00000000af8f9000 CR4: 00000000000006f0 [ 1461.807858] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1461.807861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1461.807865] Process mount (pid: 3446, threadinfo ffff8800b4f7e000, task ffff8800950e4380) [ 1461.807867] Stack: [ 1461.807869] 0000000000000202 0000000000000282 ffff8800b4f7fbf8 ffff8800afbae7e2 [ 1461.807876] <0> 00000000ffffffea ffff8800afb23000 ffff8800b4f87bc0 ffff8800b4f7fc28 [ 1461.807884] <0> ffff8800b4f7fcd8 ffffffff81159f6d ffffffff81147bc2 ffffffff816bfb48 [ 1461.807892] Call Trace: [ 1461.807899] [<ffffffff81159f6d>] cifs_get_tcp_session+0x440/0x44b [ 1461.807904] [<ffffffff81147bc2>] ? find_nls+0x1c/0xe9 [ 1461.807909] [<ffffffff8115b889>] cifs_mount+0x16bc/0x2167 [ 1461.807917] [<ffffffff814455bd>] ? _spin_unlock+0x30/0x4b [ 1461.807923] [<ffffffff81150da9>] cifs_get_sb+0xa5/0x1a8 [ 1461.807928] [<ffffffff810c1b94>] vfs_kern_mount+0x56/0xc9 [ 1461.807933] [<ffffffff810c1c64>] do_kern_mount+0x47/0xe7 [ 1461.807938] [<ffffffff810d8632>] do_mount+0x712/0x775 [ 1461.807943] [<ffffffff810d671f>] ? copy_mount_options+0xcf/0x132 [ 1461.807948] [<ffffffff810d8714>] sys_mount+0x7f/0xbf [ 1461.807953] [<ffffffff8144509a>] ? lockdep_sys_exit_thunk+0x35/0x67 [ 1461.807960] [<ffffffff81011cc2>] system_call_fastpath+0x16/0x1b [ 1461.807963] Code: 00 00 00 00 ea ff ff 48 c1 e8 0c 48 6b c0 68 48 01 d0 66 83 38 00 79 04 48 8b 40 10 66 83 38 00 79 04 48 8b 40 10 80 38 00 78 04 <0f> 0b eb fe 4c 8b 70 58 4c 89 ff 41 8b 76 4c e8 b8 49 fb ff e8 [ 1461.808022] RIP [<ffffffff810b888e>] kfree+0x63/0x156 [ 1461.808027] RSP <ffff8800b4f7fbb8> [ 1461.808031] ---[ end trace ffe26fcdc72c0ce4 ]--- The reason of this bug is that the error handling code of cifs_get_tcp_session() calls kfree() when corresponding kmalloc() failed. (The kmalloc() is called by extract_hostname().) Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> CC: Stable <stable@kernel.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-10-06 18:31:29 +00:00
Nikanth Karthikesan	316d315bff	block: Seperate read and write statistics of in_flight requests v2 Commit `a9327cac44` added separate read and write statistics of in_flight requests, and exported the number of read and write requests in progress separately through sysfs. But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange output from "iostat -kx 2". Global values for service time and utilization were garbage. For interval values, utilization was always 100%, and service time was higher than normal. So this was reverted by commit `0f78ab9899`. The problem was in part_round_stats_single(), I missed the following: if (now == part->stamp) return; - if (part->in_flight) { + if (part_in_flight(part)) { __part_stat_add(cpu, part, time_in_queue, part_in_flight(part) * (now - part->stamp)); __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); With this chunk included, the reported regression gets fixed. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> -- Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 20:16:55 +02:00
Josef Bacik	1cdda9b81a	Btrfs: fix possible softlockup in the allocator Like the cluster allocating stuff, we can lockup the box with the normal allocation path. This happens when we 1) Start to cache a block group that is severely fragmented, but has a decent amount of free space. 2) Start to commit a transaction 3) Have the commit try and empty out some of the delalloc inodes with extents that are relatively large. The inodes will not be able to make the allocations because they will ask for allocations larger than a contiguous area in the free space cache. So we will wait for more progress to be made on the block group, but since we're in a commit the caching kthread won't make any more progress and it already has enough free space that wait_block_group_cache_progress will just return. So, if we wait and fail to make the allocation the next time around, just loop and go to the next block group. This keeps us from getting stuck in a softlockup. Thanks, Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-06 10:04:28 -04:00
Chris Mason	61d92c328c	Btrfs: fix deadlock on async thread startup The btrfs async worker threads are used for a wide variety of things, including processing bio end_io functions. This means that when the endio threads aren't running, the rest of the FS isn't able to do the final processing required to clear PageWriteback. The endio threads also try to exit as they become idle and start more as the work piles up. The problem is that starting more threads means kthreadd may need to allocate ram, and that allocation may wait until the global number of writeback pages on the system is below a certain limit. The result of that throttling is that end IO threads wait on kthreadd, who is waiting on IO to end, which will never happen. This commit fixes the deadlock by handing off thread startup to a dedicated thread. It also fixes a bug where the on-demand thread creation was creating far too many threads because it didn't take into account threads being started by other procs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-05 09:44:45 -04:00
Alexey Dobriyan	a99bbaf5ee	headers: remove sched.h from poll.h Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-04 15:05:10 -07:00
Linus Torvalds	58e57fbd1c	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits) Revert "Seperate read and write statistics of in_flight requests" cfq-iosched: don't delay async queue if it hasn't dispatched at all block: Topology ioctls cfq-iosched: use assigned slice sync value, not default cfq-iosched: rename 'desktop' sysfs entry to 'low_latency' cfq-iosched: implement slower async initiate and queue ramp up cfq-iosched: delay async IO dispatch, if sync IO was just done cfq-iosched: add a knob for desktop interactiveness Add a tracepoint for block request remapping block: allow large discard requests block: use normal I/O path for discard requests swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL fs/bio.c: move EXPORT* macros to line after function Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs cciss: fix build when !PROC_FS block: Do not clamp max_hw_sectors for stacking devices block: Set max_sectors correctly for stacking devices cciss: cciss_host_attr_groups should be const cciss: Dynamically allocate the drive_info_struct for each logical drive. cciss: Add usage_count attribute to each logical drive in /sys ...	2009-10-04 12:39:14 -07:00
Jens Axboe	0f78ab9899	Revert "Seperate read and write statistics of in_flight requests" This reverts commit `a9327cac44`. Corrado Zoccolo <czoccolo@gmail.com> reports: "with 2.6.32-rc1 I started getting the following strange output from "iostat -kx 2": Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 10,70 0,00 3,16 15,75 0,00 70,38 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 18,22 0,00 0,67 0,01 14,77 0,02 43,94 0,01 10,53 39043915,03 2629219,87 sdb 60,89 9,68 50,79 3,04 1724,43 50,52 65,95 0,70 13,06 488437,47 2629219,87 avg-cpu: %user %nice %system %iowait %steal %idle 2,72 0,00 0,74 0,00 0,00 96,53 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 6,68 0,00 0,99 0,00 0,00 92,33 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 4,40 0,00 0,73 1,47 0,00 93,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 4,00 0,00 3,00 0,00 28,00 18,67 0,06 19,50 333,33 100,00 Global values for service time and utilization are garbage. For interval values, utilization is always 100%, and service time is higher than normal. I bisected it down to: [`a9327cac44`] Seperate read and write statistics of in_flight requests and verified that reverting just that commit indeed solves the issue on 2.6.32-rc1." So until this is debugged, revert the bad commit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-04 21:04:38 +02:00
Linus Torvalds	9117703fab	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: [PATCH] ext4: retry failed direct IO allocations ext4: Fix build warning in ext4_dirty_inode() ext4: drop ext4dev compat ext4: fix a BUG_ON crash by checking that page has buffers attached to it	2009-10-03 11:24:19 -07:00
Eric Sandeen	fbbf694566	[PATCH] ext4: retry failed direct IO allocations On a 256M filesystem, doing this in a loop: xfs_io -F -f -d -c 'pwrite 0 64m' test rm -f test eventually leads to ENOSPC. (the xfs_io command does a 64m direct IO write to the file "test") As with other block allocation callers, it looks like we need to potentially retry the allocations on the initial ENOSPC. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-10-02 21:20:55 -04:00
Curt Wohlgemuth	74072d0a63	ext4: Fix build warning in ext4_dirty_inode() This fixes the following warning: fs/ext4/inode.c: In function 'ext4_dirty_inode': fs/ext4/inode.c:5615: warning: unused variable 'current_handle' We remove the jbd_debug() statement which does use current_handle, as it's not terribly important in the grand scheme of things. Thanks to Stephen Rothwell for pointing this out. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-10-02 21:08:32 -04:00
Linus Torvalds	0efe5e32c8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix data space leak fix Btrfs: remove duplicates of filemap_ helpers Btrfs: take i_mutex before generic_write_checks Btrfs: fix arguments to btrfs_wait_on_page_writeback_range Btrfs: fix deadlock with free space handling and user transactions Btrfs: fix error cases for ioctl transactions Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code Btrfs: introduce missing kfree Btrfs: Fix setting umask when POSIX ACLs are not enabled Btrfs: proper -ENOSPC handling	2009-10-01 20:23:15 -07:00
Christoph Hellwig	80e50be422	afs: remove cache.h It's just a wrapper for <linux/fscache.h>, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:16 -07:00
Alexey Dobriyan	828c09509b	const: constify remaining file_operations [akpm@linux-foundation.org: fix KVM] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-10-01 16:11:11 -07:00
Chris Mason	9c2693c924	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus	2009-10-01 17:24:44 -04:00
Josef Bacik	fbf1908744	Btrfs: fix data space leak fix There is a problem where page_mkwrite can be called on a dirtied page that already has a delalloc range associated with it. The fix is to clear any delalloc bits for the range we are dirtying so the space accounting gets handled properly. This is the same thing we do in the normal write case, so we are consistent across the board. With this patch we no longer leak reserved space. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 17:10:23 -04:00
H Hartley Sweeten	a112a71d45	fs/bio.c: move EXPORT* macros to line after function As mentioned in Documentation/CodingStyle, move EXPORT* macros to the line immediately after the closing function brace line. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:46 +02:00
Christoph Hellwig	8aa38c31b7	Btrfs: remove duplicates of filemap_ helpers Use filemap_fdatawrite_range and filemap_fdatawait_range instead of local copies of the functions. For filemap_fdatawait_range that also means replacing the awkward old wait_on_page_writeback_range calling convention with the regular filemap byte offsets. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 12:58:30 -04:00
Chris Mason	25472b880c	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus	2009-10-01 12:58:13 -04:00
Chris Mason	ab93dbecfb	Btrfs: take i_mutex before generic_write_checks btrfs_file_write was incorrectly calling generic_write_checks without taking i_mutex. This led to problems with racing around i_size when doing O_APPEND writes. The fix here is to move i_mutex higher. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 12:29:10 -04:00
Christoph Hellwig	35d62a942d	Btrfs: fix arguments to btrfs_wait_on_page_writeback_range wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes a pagecache offset, not a byte offset into the file. Shift the arguments around to wait for the correct range Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-10-01 10:27:01 -04:00
Eric Sandeen	f0e2dfa7f3	ext4: drop ext4dev compat Kconfig & super.c promised it'd be gone by 2.6.31, so it's about time to drop it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-10-01 02:21:07 -04:00
Theodore Ts'o	1f94533d9c	ext4: fix a BUG_ON crash by checking that page has buffers attached to it In ext4_num_dirty_pages() we were calling page_buffers() before checking to see if the page actually had buffers attached to it; this would cause a BUG check crash in the inline function page_buffers(). Thanks to Markus Trippelsdorf for reporting this bug. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 22:57:41 -04:00
David Teigland	6861f35078	dlm: fix socket fd translation The code to set up sctp sockets was not using the sockfd_lookup() and sockfd_put() routines to translate an fd to a socket. The direct fget and fput calls were resulting in error messages from alloc_fd(). Also clean up two log messages and remove a third, related to setting up sctp associations. Signed-off-by: David Teigland <teigland@redhat.com>	2009-09-30 12:19:44 -05:00
David Teigland	04bedd79a7	dlm: fix lowcomms_connect_node for sctp The recently added dlm_lowcomms_connect_node() from `391fbdc5d5` does not work when using SCTP instead of TCP. The sctp connection code has nothing to do without data to send. Check for no data in the sctp connection code and do nothing instead of triggering a BUG. Also have connect_node() do nothing when the protocol is sctp. Signed-off-by: David Teigland <teigland@redhat.com>	2009-09-30 12:19:44 -05:00
Linus Torvalds	9abf47f11b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: fix missing initialization of i_dir_start_lookup member nilfs2: fix missing zero-fill initialization of btree node cache	2009-09-30 09:42:24 -07:00
Linus Torvalds	9f44fdc518	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix time encoding with extra epoch bits ext4: Add a stub for mpage_da_data in the trace header jbd2: Use tracepoints for history file ext4: Use tracepoints for mb_history trace file ext4, jbd2: Drop unneeded printks at mount and unmount time ext4: Handle nested ext4_journal_start/stop calls without a journal ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode ext4: Avoid updating the inode table bh twice in no journal mode ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first ext4: async direct IO for holes and fallocate support ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O ext4: Split uninitialized extents for direct I/O ext4: release reserved quota when block reservation for delalloc retry ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks ext4: Fix hueristic which avoids group preallocation for closed files ext4: Use ext4_msg() for ext4_da_writepage() errors ext4: Update documentation about quota mount options	2009-09-30 09:32:30 -07:00
Linus Torvalds	4c8f1cb266	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: Check s_dirt in fat_sync_fs() vfat: change the default from shortname=lower to shortname=mixed fat/nls: Fix handling of utf8 invalid char	2009-09-30 09:31:14 -07:00
Theodore Ts'o	c1fccc0696	ext4: Fix time encoding with extra epoch bits "Looking at ext4.h, I think the setting of extra time fields forgets to mask the epoch bits so the epoch part overwrites nsec part. The second change is only for coherency (2 -> EXT4_EPOCH_BITS)." Thanks to Damien Guibouret for pointing out this problem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 01:13:55 -04:00
Theodore Ts'o	bf6993276f	jbd2: Use tracepoints for history file The /proc/fs/jbd2/<dev>/history was maintained manually; by using tracepoints, we can get all of the existing functionality of the /proc file plus extra capabilities thanks to the ftrace infrastructure. We save memory as a bonus. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 00:32:06 -04:00
Theodore Ts'o	296c355cd6	ext4: Use tracepoints for mb_history trace file The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a number of problems: it required a largish amount of memory to be allocated for each ext4 filesystem, and the s_mb_history_lock introduced a CPU contention problem. By ripping out the mb_history code and replacing it with ftrace tracepoints, we get more functionality: timestamps, event filtering, the ability to correlate mballoc history with other ext4 tracepoints, etc. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-30 00:32:42 -04:00
Sage Weil	dd7e0b7b02	Btrfs: fix deadlock with free space handling and user transactions If an ioctl-initiated transaction is open, we can't force a commit during the free space checks in order to free up pinned extents or else we deadlock. Just ENOSPC instead. A more satisfying solution that reserves space for the entire user transaction up front is forthcoming... Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 19:50:07 -04:00
Sage Weil	1ab86aedbc	Btrfs: fix error cases for ioctl transactions Fix leak of vfsmount write reference and open_ioctl_trans reference on ENOMEM. Clean up the error paths while we're at it. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 18:38:44 -04:00
Theodore Ts'o	90576c0b9a	ext4, jbd2: Drop unneeded printks at mount and unmount time There are a number of kernel printk's which are printed when an ext4 filesystem is mounted and unmounted. Disable them to economize space in the system logs. In addition, disabling the mballoc stats by default saves a number of unneeded atomic operations for every block allocation or deallocation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 15:51:30 -04:00
Chris Ball	3baf0bed0a	Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're currently not using it and are testing CONFIG_FS_POSIX_ACL instead. CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs". Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:05 -04:00
Julia Lawall	fd2696f399	Btrfs: introduce missing kfree Error handling code following a kzalloc should free the allocated data. The semantic match that finds the problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; statement S; expression E; identifier f,f1,l; position p1,p2; expression ptr != NULL; @@ x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...); ... if (x == NULL) S <... when != x when != if (...) { <+...x...+> } ( x->f1 = E | (x->f1 == NULL || ...) | f(...,x->f1,...) ) ...> ( return \(0\|<+...x...+>\|ptr\); | return@p2 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; @@ print " file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:04 -04:00
Chris Ball	49cf6f4529	Btrfs: Fix setting umask when POSIX ACLs are not enabled We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is incorrect -- it tells the VFS that it shouldn't set umask because we will, yet we don't set it ourselves if we aren't using POSIX ACLs, so the umask ends up ignored. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-29 13:51:04 -04:00
Curt Wohlgemuth	d3d1faf6a7	ext4: Handle nested ext4_journal_start/stop calls without a journal This patch fixes a problem with handling nested calls to ext4_journal_start/ext4_journal_stop, when there is no journal present. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 11:01:03 -04:00
Curt Wohlgemuth	f3dc272fd5	ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode This patch fixes a problem where ext4_dirty_inode() was not calling ext4_mark_inode_dirty() if the current_handle is not valid, which is the case in no journal mode. It also removes a test for non-matching transaction which can never happen. Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 16:06:01 -04:00
Frank Mayhar	830156c79b	ext4: Avoid updating the inode table bh twice in no journal mode This is a cleanup of commit `91ac6f4`. Since ext4_mark_inode_dirty() has already called ext4_mark_iloc_dirty(), which in turn calls ext4_do_update_inode(), it's not necessary to have ext4_write_inode() call ext4_do_update_inode() in no journal mode. Indeed, it would be duplicated work. Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 10:07:47 -04:00
Ryusuke Konishi	3cc811bffd	nilfs2: fix missing initialization of i_dir_start_lookup member The i_dir_start_lookup field in nilfs_inode_info objects should be cleared when the objects are allocated, but the initialization was missing in case of reading from disk. This adds the initialization. Since the variable just gives a start page on directory lookups, the bug was nonfatal until now. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2009-09-29 20:32:13 +09:00
Ryusuke Konishi	1f28fcd925	nilfs2: fix missing zero-fill initialization of btree node cache This will fix file system corruption which infrequently happens after mount. The problem was reported from users with the title "[NILFS users] Fail to mount NILFS." (Message-ID: <200908211918.34720.yuri@itinteg.net>), and so forth. I've also experienced the corruption multiple times on kernel 2.6.30 and 2.6.31. The problem turned out to be caused by a discordance between mapping->nrpages of a btree node cache and the actual number of pages hung on the cache; if the mapping->nrpages becomes zero even as it has pages, truncate_inode_pages() returns without doing anything. Usually this is harmless except it may cause a page leak, but garbage collection fairly infrequently sees a stale page remaining in the btree node cache of DAT (i.e. disk address translation file of nilfs), and induces the corruption. I identified that a missing initialization in btree node caches was the root cause. This corrects the bug. I've tested this for kernel 2.6.30 and 2.6.31. Reported-by: Yuri Chislov <yuri@itinteg.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: stable <stable@kernel.org>	2009-09-29 20:12:56 +09:00
Josef Bacik	9ed74f2dba	Btrfs: proper -ENOSPC handling At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in an ENOSPC-related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io trees, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled before any actual dirtying occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of the actual dirtying lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-28 16:29:42 -04:00
Theodore Ts'o	f3ce8064b3	ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first Move the check to make sure the original and donor inodes are different earlier, to avoid a potential deadlock by trying to lock the same inode twice. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:58:29 -04:00
Mingming Cao	8d5d02e6b1	ext4: async direct IO for holes and fallocate support For async direct IO that covers holes or fallocate, the end_io callback function now queues the conversion work on a workqueue but doesn't flush the work right away, as that might take too long. But when fsync is called after all the data is completed, the user expects the metadata to be updated as well before fsync returns. Thus we need to flush the conversion work when fsync() is called. This patch keeps track of a list of completed async direct IO requests that have work queued on the workqueue. When fsync() is called, it will go through the list and do the conversion. Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2009-09-28 15:48:29 -04:00
Mingming Cao	4c0425ff68	ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O Currently the DIO VFS code passes create = 0 when writing to the middle of file. It does this to avoid block allocation for holes, so as not to expose stale data when there is a parallel buffered read (which does not hold the i_mutex lock). Direct I/O writes into holes fall back to buffered IO for this reason. Since preallocated extents are treated as holes when doing a get_block() look up (buffer is not mapped), direct IO over fallocate also falls back to buffered IO. Thus ext4 actually silently falls back to buffered IO in the above two cases, which is undesirable. To fix this, this patch creates uninitialized extents when a direct I/O write goes into holes in sparse files, and registers an end_io callback which converts the uninitialized extent to an initialized extent after the I/O is completed. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:48:41 -04:00
Mingming Cao	0031462b5b	ext4: Split uninitialized extents for direct I/O When writing into an uninitialized extent via direct I/O, and the direct I/O doesn't exactly cover the uninitialized extent, split the extent into uninitialized and initialized extents before submitting the I/O. This avoids needing to deal with an ENOSPC error in the end_io callback that gets used for direct I/O. When the IO is complete, the written extent will be marked as initialized. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:49:08 -04:00
Mingming Cao	9f0ccfd8e0	ext4: release reserved quota when block reservation for delalloc retry ext4_da_reserve_space() can reserve quota blocks multiple times if ext4_claim_free_blocks() fails and we retry the allocation. We should release the quota reservation before restarting. Bug found by Jan Kara. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 15:49:52 -04:00
Theodore Ts'o	55138e0bc2	ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks Work around problems in the writeback code to force out writebacks in larger chunks than just 4mb, which is just too small. This also works around limitations in the ext4 block allocator, which can't allocate more than 2048 blocks at a time. So we need to defeat the round-robin characteristics of the writeback code and try to write out as many blocks in one inode before allowing the writeback code to move on to another inode. We add a new per-filesystem tunable, max_writeback_mb_bump, which caps this to a default of 128mb per inode. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-29 13:31:31 -04:00
Theodore Ts'o	7178057730	ext4: Fix hueristic which avoids group preallocation for closed files The heuristic was designed to avoid using locality group preallocation when writing the last segment of a closed file. Fix it by moving the setting of size to the maximum of size and isize until after we check whether size == isize. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-28 00:06:20 -04:00
Alexey Dobriyan	f0f37e2f77	const: mark struct vm_struct_operations * mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-27 11:39:25 -07:00
Theodore Ts'o	1693918e0b	ext4: Use ext4_msg() for ext4_da_writepage() errors This allows the user to see what filesystem was involved with a particular ext4_da_writepage() error. Also, use KERN_CRIT which is more appropriate than KERN_EMERG. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-09-26 17:43:59 -04:00
Linus Torvalds	bfebb14063	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: pass in super_block to bdi_start_writeback()	2009-09-26 10:11:13 -07:00
Linus Torvalds	07e2e6ba27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix locking and list handling code in cifs_open and its helper [CIFS] Remove build warning cifs: fix problems with last two commits [CIFS] Fix build break when keys support turned off cifs: eliminate cifs_init_private cifs: convert oplock breaks to use slow_work facility (try #4) cifs: have cifsFileInfo hold an extra inode reference cifs: take read lock on GlobalSMBSes_lock in is_valid_oplock_break cifs: remove cifsInodeInfo.oplockPending flag cifs: fix oplock request handling in posix codepath [CIFS] Re-enable Lanman security	2009-09-26 10:10:35 -07:00
Jens Axboe	a72bfd4dea	writeback: pass in super_block to bdi_start_writeback() Sometimes we only want to write pages from a specific super_block, so allow that to be passed in. This fixes a problem with commit `56a131dcf7` causing writeback on all super_blocks on a bdi, where we only really want to sync a specific sb from writeback_inodes_sb(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-26 00:10:40 +02:00
Jeff Layton	3321b791b2	cifs: fix locking and list handling code in cifs_open and its helper The patch to remove cifs_init_private introduced a locking imbalance. It didn't remove the leftover list addition code and the unlocking in that function. cifs_new_fileinfo does the list addition now, so there should be no need to do it outside of that function. pCifsInode will never be NULL, so we don't need to check for that. This patch also gets rid of the ugly locking and unlocking across function calls. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 17:59:31 +00:00
Linus Torvalds	6d7f18f6ea	Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block * 'writeback' of git://git.kernel.dk/linux-2.6-block: writeback: writeback_inodes_sb() should use bdi_start_writeback() writeback: don't delay inodes redirtied by a fast dirtier writeback: make the super_block pinning more efficient writeback: don't resort for a single super_block in move_expired_inodes() writeback: move inodes from one super_block together writeback: get rid to incorrect references to pdflush in comments writeback: improve readability of the wb_writeback() continue/break logic writeback: cleanup writeback_single_inode() writeback: kupdate writeback shall not stop when more io is possible writeback: stop background writeback when below background threshold writeback: balance_dirty_pages() shall write more than dirtied pages fs: Fix busyloop in wb_writeback()	2009-09-25 09:27:30 -07:00
Jens Axboe	56a131dcf7	writeback: writeback_inodes_sb() should use bdi_start_writeback() Pointless to iterate other devices looking for a super, when we have a bdi mapping. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Wu Fengguang	b3af9468ae	writeback: don't delay inodes redirtied by a fast dirtier Debug traces show that in per-bdi writeback, the inode under writeback almost always gets redirtied by a busy dirtier. We used to call redirty_tail() in this case, which could delay the inode for up to 30s. This is unacceptable because it now happens so frequently for plain cp/dd that the accumulated delays could make writeback of big files very slow. So let's distinguish between data redirty and metadata only redirty. The first one is caused by a busy dirtier, while the latter one could happen in XFS, NFS, etc. when they are doing delalloc or updating isize. The inode being busy dirtied will now be requeued for next io, while the inode being redirtied by fs will continue to be delayed to avoid repeated IO. CC: Jan Kara <jack@suse.cz> CC: Theodore Ts'o <tytso@mit.edu> CC: Dave Chinner <david@fromorbit.com> CC: Chris Mason <chris.mason@oracle.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Jens Axboe	9ecc2738ac	writeback: make the super_block pinning more efficient Currently we pin the inode->i_sb for every single inode. This increases cache traffic on sb->s_umount sem. Let's instead cache the inode sb pin state and keep the super_block pinned for as long as we keep writing out inodes from the same super_block. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Jens Axboe	cf137307cd	writeback: don't resort for a single super_block in move_expired_inodes() If we only moved inodes from a single super_block to the temporary list, there's no point in doing a resort for multiple super_blocks. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:26 +02:00
Shaohua Li	5c03449d34	writeback: move inodes from one super_block together __mark_inode_dirty adds inodes to the wb dirty list in random order. If a disk has several partitions, writeback might keep the spindle moving between partitions. To reduce the movement, it is better to write a big chunk of one partition and then move to another. Inodes from one fs are usually in one partition, so ideally moving inodes from one fs together should reduce spindle movement. This patch tries to address this. Before per-bdi writeback was added, the behavior was to write inodes from one fs first and then another, so the patch restores the previous behavior. The loop in the patch is a bit ugly; should we add a dirty list for each superblock in bdi_writeback? A test on a two-partition disk with the attached fio script shows about 3% ~ 6% improvement. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Jens Axboe	5b0830cb90	writeback: get rid to incorrect references to pdflush in comments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Jens Axboe	71fd05a887	writeback: improve readability of the wb_writeback() continue/break logic And throw some comments in there, too. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	ae1b7f7d4b	writeback: cleanup writeback_single_inode() Make the if-else straight in writeback_single_inode(). No behavior change. Cc: Jan Kara <jack@suse.cz> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	7fbdea3232	writeback: kupdate writeback shall not stop when more io is possible Fix the kupdate case, which disregards wbc.more_io and stop writeback prematurely even when there are more inodes to be synced. wbc.more_io should always be respected. Also remove the pages_skipped check. It will set when some page(s) of some inode(s) cannot be written for now. Such inodes will be delayed for a while. This variable has nothing to do with whether there are other writeable inodes. CC: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:25 +02:00
Wu Fengguang	d3ddec7635	writeback: stop background writeback when below background threshold Treat bdi_start_writeback(0) as a special request to do background write, and stop such work when we are below the background dirty threshold. Also simplify the (nr_pages <= 0) checks. Since we already pass in nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't need to worry about it being decreased to zero. Reported-by: Richard Kennedy <richard@rsk.demon.co.uk> CC: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:24 +02:00
Jan Kara	a5989bdc98	fs: Fix busyloop in wb_writeback() If all inodes are under writeback (e.g. in case when there's only one inode with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by waiting on I_SYNC flags of an inode on b_more_io list in case we failed to write anything. Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-25 18:08:24 +02:00
Steve French	15dd478107	[CIFS] Remove build warning Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 02:24:45 +00:00
Jeff Layton	5d2c0e2259	cifs: fix problems with last two commits Fix problems with commits: `086f68bd97` `3bc303c254` Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 02:12:33 +00:00
Steve French	0f59e61c1f	[CIFS] Fix build break when keys support turned off Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-25 00:33:37 +00:00
Andrew Morton	c44972f178	procfs: disable per-task stack usage on NOMMU It needs walk_page_range(). Reported-by: Michal Simek <monstr@monstr.eu> Tested-by: Michal Simek <monstr@monstr.eu> Cc: Stefani Seibold <stefani@seibold.net> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 17:11:24 -07:00
Linus Torvalds	b9b9df62e7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6: eCryptfs: Prevent lower dentry from going negative during unlink eCryptfs: Propagate vfs_read and vfs_write return codes eCryptfs: Validate global auth tok keys eCryptfs: Filename encryption only supports password auth tokens eCryptfs: Check for O_RDONLY lower inodes when opening lower files eCryptfs: Handle unrecognized tag 3 cipher codes ecryptfs: improved dependency checking and reporting eCryptfs: Fix lockdep-reported AB-BA mutex issue ecryptfs: Remove unneeded locking that triggers lockdep false positives	2009-09-24 17:10:17 -07:00
Jeff Layton	086f68bd97	cifs: eliminate cifs_init_private ...it does the same thing as cifs_fill_fileinfo, but doesn't handle the flist ordering correctly. Also rename cifs_fill_fileinfo to a more descriptive name and have it take an open flags arg instead of just a write_only flag. That makes the logic in the callers a little simpler. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-24 19:35:18 +00:00
Al Viro	36dd2fdb37	nfs[23] tcp breakage in mount with binary options We forget to set nfs_server.protocol in tcp case when old-style binary options are passed to mount. The thing remains zero and never validated afterwards. As the result, we hit BUG in fs/nfs/client.c:588. Breakage has been introduced in NFS: Add nfs_alloc_parsed_mount_data merged yesterday... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-09-24 14:58:42 -04:00
Jeff Layton	3bc303c254	cifs: convert oplock breaks to use slow_work facility (try #4 ) This is the fourth respin of the patch to convert oplock breaks to use the slow_work facility. A customer of ours was testing a backport of one of the earlier patchsets, and hit a "Busy inodes after umount..." problem. An oplock break job had raced with a umount, and the superblock got torn down and its memory reused. When the oplock break job tried to dereference the inode->i_sb, the kernel oopsed. This patchset has the oplock break job hold an inode and vfsmount reference until the oplock break completes. With this, there should be no need to take a tcon reference (the vfsmount implicitly holds one already). Currently, when an oplock break comes in there's a chance that the oplock break job won't occur if the allocation of the oplock_q_entry fails. There are also some rather nasty races in the allocation and handling these structs. Rather than allocating oplock queue entries when an oplock break comes in, add a few extra fields to the cifsFileInfo struct. Get rid of the dedicated cifs_oplock_thread as well and queue the oplock break job to the slow_work thread pool. This approach also has the advantage that the oplock break jobs can potentially run in parallel rather than be serialized like they are today. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-09-24 18:33:18 +00:00
Linus Torvalds	7ca263cdf8	Merge branch 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6 * 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6: [PATCH] Fix idle time field in /proc/uptime	2009-09-24 09:04:24 -07:00
Linus Torvalds	dc2af6a6bc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (42 commits) Btrfs: hash the btree inode during fill_super Btrfs: relocate file extents in clusters Btrfs: don't rename file into dummy directory Btrfs: check size of inode backref before adding hardlink Btrfs: fix releasepage to avoid unlocking extents we haven't locked Btrfs: Fix test_range_bit for whole file extents Btrfs: fix errors handling cached state in set/clear_extent_bit Btrfs: fix early enospc during balancing Btrfs: deal with NULL space info Btrfs: account for space used by the super mirrors Btrfs: fix extent entry threshold calculation Btrfs: remove dead code Btrfs: fix bitmap size tracking Btrfs: don't keep retrying a block group if we fail to allocate a cluster Btrfs: make balance code choose more wisely when relocating Btrfs: fix arithmetic error in clone ioctl Btrfs: add snapshot/subvolume destroy ioctl Btrfs: change how subvolumes are organized Btrfs: do not reuse objectid of deleted snapshot/subvol Btrfs: speed up snapshot dropping ...	2009-09-24 08:57:29 -07:00
Linus Torvalds	6c5daf012c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: truncate: use new helpers truncate: new helpers fs: fix overflow in sys_mount() for in-kernel calls fs: Make unload_nls() NULL pointer safe freeze_bdev: grab active reference to frozen superblocks freeze_bdev: kill bd_mount_sem exofs: remove BKL from super operations fs/romfs: correct error-handling code vfs: seq_file: add helpers for data filling vfs: remove redundant position check in do_sendfile vfs: change sb->s_maxbytes to a loff_t vfs: explicitly cast s_maxbytes in fiemap_check_ranges libfs: return error code on failed attr set seq_file: return a negative error code when seq_path_root() fails. vfs: optimize touch_time() too vfs: optimization for touch_atime() vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it fs/inode.c: add dev-id and inode number for debugging in init_special_inode() libfs: make simple_read_from_buffer conventional	2009-09-24 08:32:11 -07:00
Linus Torvalds	db16826367	Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits) HWPOISON: Enable error_remove_page on btrfs HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs HWPOISON: Add madvise() based injector for hardware poisoned pages v4 HWPOISON: Enable error_remove_page for NFS HWPOISON: Enable .remove_error_page for migration aware file systems HWPOISON: The high level memory error handler in the VM v7 HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process HWPOISON: shmem: call set_page_dirty() with locked page HWPOISON: Define a new error_remove_page address space op for async truncation HWPOISON: Add invalidate_inode_page HWPOISON: Refactor truncate to allow direct truncating of page v2 HWPOISON: check and isolate corrupted free pages v2 HWPOISON: Handle hardware poisoned pages in try_to_unmap HWPOISON: Use bitmask/action code for try_to_unmap behaviour HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 HWPOISON: Add poison check to page fault handling HWPOISON: Add basic support for poisoned pages in fault handler v3 HWPOISON: Add new SIGBUS error codes for hardware poison signals HWPOISON: Add support for poison swap entries v2 HWPOISON: Export some rmap vma locking to outside world ...	2009-09-24 07:53:22 -07:00
Hiroshi Shimamoto	801460d0cf	task_struct cleanup: move binfmt field to mm_struct Because the binfmt is not different between threads in the same process, it can be moved from task_struct to mm_struct. And binfmt module is handled per mm_struct instead of task_struct. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Julia Lawall	a21f3c2a04	fs/romfs: correct error-handling code romfs_iget returns an ERR_PTR value in an error case instead of NULL. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @match exists@ expression x, E; statement S1, S2; @@ x = romfs_iget(...) ... when != x = E ( * if (x == NULL || ...) S1 else S2 | * if (x == NULL && ...) S1 else S2 ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Roel Kluin	3886de938c	adfs: remove redundant test on unsigned unsigned block cannot be less than 0. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:05 -07:00
Alexey Dobriyan	8d65af789f	sysctl: remove "struct file *" argument of ->proc_handler It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:04 -07:00
Renzo Davoli	dd5d81f326	fs/char_dev.c: remove useless loop There are two useless lines in fs/char_dev.c. In register_chrdev there is a loop to change all '/' into '!' in the kernel object name. This code is useless as the same substitution is in kobject_set_name_vargs in lib/kobject.c: 228 /* ewww... some of these buggers have '/' in the name ... */ 229 while ((s = strchr(kobj->name, '/'))) 230 s[0] = '!'; kobject_set_name_vargs is called by kobject_set_name. kobject_set_name is called just above the useless loop. [hidave.darkstar@gmail.com: fix warning, remove the unused char s] Signed-off-by: Renzo Davoli <renzo@cs.unibo.it> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:03 -07:00
Mike Frysinger	0b8c78f2bf	flat: use IS_ERR_VALUE() helper macro There is a common macro now for testing mixed pointer/errno values, so use that rather than handling the casts ourself. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David McCullough <david_mccullough@securecomputing.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:03 -07:00
David Howells	8e8b63a68c	fdpic: ignore the loader's PT_GNU_STACK when calculating the stack size Ignore the loader's PT_GNU_STACK when calculating the stack size, and only consider the executable's PT_GNU_STACK, assuming the executable has one. Currently the behaviour is to take the largest stack size and use that, but that means you can't reduce the stack size in the executable. The loader's stack size should probably only be used when executing the loader directly. WARNING: This patch is slightly dangerous - it may render a system inoperable if the loader's stack size is larger than that of important executables, and the system relies unknowingly on this increasing the size of the stack. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:02 -07:00
Amerigo Wang	0cf062d0ff	elf: clean up fill_note_info() Introduce a helper function elf_note_info_init() to help fill_note_info() to do initializations, also fix the potential memory leaks. [akpm@linux-foundation.org: remove NUM_NOTES] Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Peter Zijlstra	ba0a6c9f6f	fcntl: add F_[SG]ETOWN_EX In order to direct the SIGIO signal to a particular thread of a multi-threaded application we cannot, like suggested by the manpage, put a TID into the regular fcntl(F_SETOWN) call. It will still be send to the whole process of which that thread is part. Since people do want to properly direct SIGIO we introduce F_SETOWN_EX. The need to direct SIGIO comes from self-monitoring profiling such as with perf-counters. Perf-counters uses SIGIO to notify that new sample data is available. If the signal is delivered to the same task that generated the new sample it can augment that data by inspecting the task's user-space state right after it returns from the kernel. This is esp. convenient for interpreted or virtual machine driven environments. Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex as argument: struct f_owner_ex { int type; pid_t pid; }; Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Tested-by: stephane eranian <eranian@googlemail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Oleg Nesterov	06f1631a16	signals: send_sigio: use do_send_sig_info() to avoid check_kill_permission() group_send_sig_info()->check_kill_permission() assumes that current is the sender and uses current_cred(). This is not true in send_sigio_to_task() case. From the security pov the sender is not current, but the task which did fcntl(F_SETOWN), that is why we have sigio_perm() which uses the right creds to check. Fortunately, send_sigio() always sends either SEND_SIG_PRIV or SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But still it would be tidier to avoid this bogus security check and save a couple of cycles. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Oleg Nesterov	964ee7df90	exec: fix set_binfmt() vs sys_delete_module() race sys_delete_module() can set MODULE_STATE_GOING after search_binary_handler() does try_module_get(). In this case set_binfmt()->try_module_get() fails but since none of the callers check the returned error, the task will run with the wrong old ->binfmt. The proper fix should change all ->load_binary() methods, but we can rely on fact that the caller must hold a reference to binfmt->module and use __module_get() which never fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:01 -07:00
Neil Horman	61be228a06	exec: allow do_coredump() to wait for user space pipe readers to complete Allow core_pattern pipes to wait for user space to complete One of the things that user space processes like to do is look at metadata for a crashing process in their /proc/<pid> directory. this is racy however, since do_coredump in the kernel doesn't wait for the user space process to complete before it reaps the crashing process. This patch corrects that. Allowing the kernel to wait for the user space process to complete before cleaning up the crashing process. This is a bit tricky to do for a few reasons: 1) The user space process isn't our child, so we can't sys_wait4 on it 2) We need to close the pipe before waiting for the user process to complete, since the user process may rely on an EOF condition I've discussed several solutions with Oleg Nesterov off-list about this, and this is the one we've come up with. We add ourselves as a pipe reader (to prevent premature cleanup of the pipe_inode_info), and remove ourselves as a writer (to provide an EOF condition to the writer in user space), then we iterate until the user space process exits (which we detect by pipe->readers == 1, hence the > 1 check in the loop). When we exit the loop, we restore the proper reader/writer values, then we return and let filp_close in do_coredump clean up the pipe data properly. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Neil Horman	a293980c2e	exec: let do_coredump() limit the number of concurrent dumps to pipes Introduce core pipe limiting sysctl. Since we can dump cores to pipe, rather than directly to the filesystem, we create a condition in which a user can create a very high load on the system simply by running bad applications. If the pipe reader specified in core_pattern is poorly written, we can have lots of outstanding resources and processes in the system. This sysctl introduces an ability to limit that resource consumption. core_pipe_limit defines how many in-flight dumps may be run in parallel, dumps beyond this value are skipped and a note is made in the kernel log. A special value of 0 in core_pipe_limit denotes unlimited core dumps may be handled (this is the default value). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
Neil Horman	725eae32df	exec: make do_coredump() more resilient to recursive crashes Change how we detect recursive dumps. Currently we have a mechanism by which we try to compare pathnames of the crashing process to the core_pattern path. This is broken for a dozen reasons, and just doesn't work in any sort of robust way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps set RLIMIT_CORE to zero, we don't write out core files for any process with that particular limit set. If the core_pattern is a pipe, any non-zero limit is translated to RLIM_INFINITY. This allows complete dumps to be captured, but prevents infinite recursion in the event that the core_pattern process itself crashes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:21:00 -07:00
From: Mel Gorman	ef1ff6b8c0	hugetlbfs: do not call user_shm_lock() for MAP_HUGETLB fix Commit `6bfde05bf5` ("hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount") altered can_do_hugetlb_shm() to check if a file is being created for shared memory or mmap(). If this returns false, we then unconditionally call user_shm_lock() triggering a warning. This block should never be entered for MAP_HUGETLB. This patch partially reverts the problem and fixes the check. Signed-off-by: Eric B Munson <ebmunson@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-09-24 07:20:56 -07:00
Chris Mason	54bcf382da	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable into for-linus Conflicts: fs/btrfs/super.c	2009-09-24 10:00:58 -04:00
Yan Zheng	c65ddb52dc	Btrfs: hash the btree inode during fill_super The snapshot deletion patches dropped this line, but the inode needs to be hashed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:24:43 -04:00
Yan, Zheng	0257bb82d2	Btrfs: relocate file extents in clusters The extent relocation code copy file extents one by one when relocating data block group. This is inefficient if file extents are small. This patch makes the relocation code copy file extents in clusters. So we can make better use of read-ahead. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
Yan, Zheng	f679a84034	Btrfs: don't rename file into dummy directory A recent change enforces only one access point to each subvolume. The first directory entry (the one added when the subvolume/snapshot was created) is treated as valid access point, all other subvolume links are linked to dummy empty directories. The dummy directories are temporary inodes that only in memory, so we can not rename file into them. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
Yan, Zheng	a571952143	Btrfs: check size of inode backref before adding hardlink For every hardlink in btrfs, there is a corresponding inode back reference. All inode back references for hardlinks in a given directory are stored in single b-tree item. The size of b-tree item is limited by the size of b-tree leaf, so we can only create limited number of hardlinks to a given file in a directory. The original code lacks of the check, it oops if the number of hardlinks goes over the limit. This patch fixes the issue by adding check to btrfs_link and btrfs_rename. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-09-24 09:17:31 -04:00
npiggin@suse.de	c08d3b0e33	truncate: use new helpers Update some fs code to make use of new helper functions introduced in the previous patch. Should be no significant change in behaviour (except CIFS now calls send_sig under i_lock, via inode_newsize_ok). Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Cc: linux-nfs@vger.kernel.org Cc: Trond.Myklebust@netapp.com Cc: linux-cifs-client@lists.samba.org Cc: sfrench@samba.org Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 08:41:47 -04:00
npiggin@suse.de	25d9e2d152	truncate: new helpers Introduce new truncate helpers truncate_pagecache and inode_newsize_ok. vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and into mm/truncate.c. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-09-24 08:41:47 -04:00
Vegard Nossum	eca6f534e6	fs: fix overflow in sys_mount() for in-kernel calls sys_mount() reads/copies a whole page for its "type" parameter. When do_mount_root() passes a kernel address that points to an object which is smaller than a whole page, copy_mount_options() will happily go past this memory object, possibly dereferencing "wild" pointers that could be in any state (hence the kmemcheck warning, which shows that parts of the next page are not even allocated). (The likelihood of something going wrong here is pretty low -- first of all this only applies to kernel calls to sys_mount(), which are mostly found in the boot code. Secondly, I guess if the page was not mapped, exact_copy_from_user() _would_ in fact handle it correctly because of its access_ok(), etc. checks.) But it is much nicer to avoid the dubious reads altogether, by stopping as soon as we find a NUL byte. Is there a good reason why we can't do something like this, using the already existing strndup_from_user()? [akpm@linux-foundation.org: make copy_mount_string() static] [AV: fix compat mount breakage, which involves undoing akpm's change above] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: al <al@dizzy.pdmi.ras.ru>	2009-09-24 08:40:15 -04:00

15753 Commits