Commit Graph

23079 Commits

Author SHA1 Message Date
Sage Weil 48293699a0 vfs: dentry_unhash immediately prior to rmdir
This presumes that there is no reason to unhash a dentry if we fail because
it is a mountpoint or the LSM check fails, and that the LSM checks do not
depend on the dentry being unhashed.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:46 -04:00
Jan Kara ea13a86463 vfs: Block mmapped writes while the fs is frozen
We should not allow file modification via mmap while the filesystem is
frozen. So block in block_page_mkwrite() while the filesystem is frozen.
We cannot do the blocking wait in __block_page_mkwrite() since e.g. ext4
will want to call that function with transaction started in some cases
and that would deadlock. But we can at least do the non-blocking reliable
check in __block_page_mkwrite() which is the hardest part anyway.

We have to check for frozen filesystem with the page marked dirty and under
page lock with which we then return from ->page_mkwrite(). Only that way we
cannot race with writeback done by freezing code - either we mark the page
dirty after the writeback has started, see freezing in progress and block, or
writeback will wait for our page lock which is released only when the fault is
done and then writeback will writeout and writeprotect the page again.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:45 -04:00
Jan Kara 24da4fab5a vfs: Create __block_page_mkwrite() helper passing error values back
Create __block_page_mkwrite() helper which does all what block_page_mkwrite()
does except that it passes back errors from __block_write_begin /
block_commit_write calls.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:44 -04:00
Roman Borisov 7c6e984dfc fs/namespace.c: bound mount propagation fix
This issue was discovered by users of busybox.  And the bug is actual for
busybox users, I don't know how it affects others.  Apparently, mount is
called with and without MS_SILENT, and this affects mount() behaviour.
But MS_SILENT is only supposed to affect kernel logging verbosity.

The following script was run in an empty test directory:

mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind         mount.shared1 mount.shared1
mount -vv --make-rshared mount.shared1
mount -vv --bind         mount.shared2 mount.shared2
mount -vv --make-rshared mount.shared2
mount -vv --bind mount.shared2 mount.shared1
mount -vv --bind mount.dir     mount.shared2
ls -R mount.dir mount.shared1 mount.shared2
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2

mount -vv was used to show the mount() call arguments and result.
Output shows that flag argument has 0x00008000 = MS_SILENT bit:

mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b

mount.shared1:

mount.shared2:
a
b

After adding --loud option to remove MS_SILENT bit from just one mount cmd:

mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind         mount.shared1 mount.shared1 2>&1
mount -vv --make-rshared mount.shared1               2>&1
mount -vv --bind         mount.shared2 mount.shared2 2>&1
mount -vv --loud --make-rshared mount.shared2               2>&1  # <-HERE
mount -vv --bind mount.shared2 mount.shared1         2>&1
mount -vv --bind mount.dir     mount.shared2         2>&1
ls -R mount.dir mount.shared1 mount.shared2      2>&1
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2

The result is different now - look closely at mount.shared1 directory listing.
Now it does show files 'a' and 'b':

mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x00104000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0

mount.dir:
a
b

mount.shared1:
a
b

mount.shared2:
a
b

The analysis shows that MS_SILENT flag which is ON by default in any
busybox-> mount operations cames to flags_to_propagation_type function and
causes the error return while is_power_of_2 checking because the function
expects only one bit set.  This doesn't allow to do busybox->mount with
any --make-[r]shared, --make-[r]private etc options.

Moreover, the recently added flags_to_propagation_type() function doesn't
allow us to do such operations as --make-[r]private --make-[r]shared etc.
when MS_SILENT is on.  The idea or clearing the MS_SILENT flag came from
to Denys Vlasenko.

Signed-off-by: Roman Borisov <ext-roman.borisov@nokia.com>
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:44 -04:00
Jonas Gorski 79fead47c5 exportfs: reallow building as a module
Commit 990d6c2d7a ("vfs: Add name to file
handle conversion support") changed EXPORTFS to be a bool.
This was needed for earlier revisions of the original patch, but the actual
commit put the code needing it into its own file that only gets compiled
when FHANDLE is selected which in turn selects EXPORTFS.
So EXPORTFS can be safely compiled as a module when not selecting FHANDLE.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:43 -04:00
Al Viro 9f1fafee9e merge handle_reval_dot and nameidata_drop_rcu_last
new helper: complete_walk().  Done on successful completion
of walk, drops out of RCU mode, does d_revalidate of final
result if that hadn't been done already.

handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
that one; callers converted to use of complete_walk().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:32 -04:00
Al Viro 19660af736 consolidate nameidata_..._drop_rcu()
Merge these into a single function (unlazy_walk(nd, dentry)),
kill ..._maybe variants

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26 07:26:02 -04:00
Phillip Lougher d7f2ff6718 Squashfs: update email address
My existing email address may stop working in a month or two, so update
email to one that will continue working.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-26 10:49:11 +01:00
Michal Marek 8d2c50e3b6 gfs2: Drop __TIME__ usage
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.

Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
2011-05-26 10:54:37 +02:00
Michal Marek 75ce481e15 dlm: Drop __TIME__ usage
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.

Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
2011-05-26 09:46:17 +02:00
Joel Becker ece928df16 Merge branch 'move_extents' of git://oss.oracle.com/git/tye/linux-2.6 into ocfs2-merge-window
Conflicts:
	fs/ocfs2/ioctl.c
2011-05-25 21:51:55 -07:00
Tristan Ye 3d1c1829eb Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
Oops, local-mounted of 'ocfs2_fops_no_plocks' is just missing the support
of unwritten_extents/punching-hole due to no func pointer was given correctly
to '.follocate' field.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 21:06:28 -07:00
Sunil Mushran 66effd3c68 ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
During dlm domain shutdown, o2dlm has to free all the lock resources. Ones that
have no locks and references are freed. Ones that have locks and/or references
are migrated to another node.

The first task in migration is finding a target. Currently we scan the lock
resource and find one node that either has a lock or a reference. This is not
very efficient in a parallel umount case as we might end up migrating the
lock resource to a node which itself may have to migrate it to a third node.

The patch scans the dlm->exit_domain_map to ensure the target node is not
leaving the domain. If no valid target node is found, o2dlm does not migrate
the resource but instead waits for the unlock and deref messages that will
allow it to free the resource.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-25 21:05:22 -07:00
Sunil Mushran bddefdeec5 ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
This patch adds a new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG and ups the dlm
protocol to 1.2.

o2dlm sends this new message in dlm_unregister_domain() to mark the beginning
of the exit domain. This message is sent to all nodes in the domain.

Currently o2dlm has no way of informing other nodes of its impending exit.
This information is useful as the other nodes could disregard the exiting
node in certain operations. For example, in resource migration. If two or
more nodes were umounting in parallel, it would be more efficient if o2dlm
were to choose a non-exiting node to be the new master node rather than an
exiting one.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-25 21:05:15 -07:00
Eric Paris 4db70f73e5 tmpfs: fix XATTR N overriding POSIX_ACL Y
Choosing TMPFS_XATTR default N was switching off TMPFS_POSIX_ACL,
even if it had been Y in oldconfig; and Linus reports that PulseAudio
goes subtly wrong unless it can use ACLs on /dev/shm.

Make TMPFS_POSIX_ACL select TMPFS_XATTR (and depend upon TMPFS),
and move the TMPFS_POSIX_ACL entry before the TMPFS_XATTR entry,
to avoid asking unnecessary questions then ignoring their answers.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 19:53:02 -07:00
Linus Torvalds 14d74e0cab Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
  net: fix get_net_ns_by_fd for !CONFIG_NET_NS
  ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
  ns: Declare sys_setns in syscalls.h
  net: Allow setting the network namespace by fd
  ns proc: Add support for the ipc namespace
  ns proc: Add support for the uts namespace
  ns proc: Add support for the network namespace.
  ns: Introduce the setns syscall
  ns: proc files for namespace naming policy.
2011-05-25 18:10:16 -07:00
Linus Torvalds 3f5785ec31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
  bonding: documentation and code cleanup for resend_igmp
  bonding: prevent deadlock on slave store with alb mode (v3)
  net: hold rtnl again in dump callbacks
  Add Fujitsu 1000base-SX PCI ID to tg3
  bnx2x: protect sequence increment with mutex
  sch_sfq: fix peek() implementation
  isdn: netjet - blacklist Digium TDM400P
  via-velocity: don't annotate MAC registers as packed
  xen: netfront: hold RTNL when updating features.
  sctp: fix memory leak of the ASCONF queue when free asoc
  net: make dev_disable_lro use physical device if passed a vlan dev (v2)
  net: move is_vlan_dev into public header file (v2)
  bug.h: Fix build with CONFIG_PRINTK disabled.
  wireless: fix fatal kernel-doc error + warning in mac80211.h
  wireless: fix cfg80211.h new kernel-doc warnings
  iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
  dst: catch uninitialized metrics
  be2net: hash key for rss-config cmd not set
  bridge: initialize fake_rtable metrics
  net: fix __dst_destroy_metrics_generic()
  ...

Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c
2011-05-25 17:00:17 -07:00
Ding Dinghua 3991b4008c jbd2: fix a potential leak of a journal_head on an error path
drop jh->b_jcount in error path

Signed-off-by: Ding Dinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 17:43:48 -04:00
Yongqiang Yang 1b16da77f9 ext4: teach ext4_ext_split to calculate extents efficiently
Make ext4_ext_split() get extents to be moved by calculating in a statement
instead of counting in a loop.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 17:41:48 -04:00
Jan Kara ae24f28d39 ext4: Convert ext4 to new truncate calling convention
Trivial conversion.  Fixup one error handling case calling vmtruncate()
and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and
IS_APPEND files could not be truncated during failed writes. In fact, the
test can be completely removed as upper layers do necessary permission
checks for truncate in do_sys_[f]truncate() and may_open() anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 17:39:48 -04:00
Jeff Layton c28c89fc43 cifs: add cifs_async_writev
Add the ability for CIFS to do an asynchronous write. The kernel will
set the frame up as it would for a "normal" SMBWrite2 request, and use
cifs_call_async to send it. The mid callback will then be configured to
handle the result.

Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25 20:38:33 +00:00
Jeff Layton f7910cbd9f cifs: clean up wsize negotiation and allow for larger wsize
Now that we can handle larger wsizes in writepages, fix up the
negotiation of the wsize to allow for that. find_get_pages only seems to
give out a max of 256 pages at a time, so that gives us a reasonable
default of 1M for the wsize.

If the server however does not support large writes via POSIX
extensions, then we cap the wsize to (128k - PAGE_CACHE_SIZE). That
gives us a size that goes up to the max frame size specified in RFC1001.

Finally, if CAP_LARGE_WRITE_AND_X isn't set, then further cap it to the
largest size allowed by the protocol (USHRT_MAX).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25 20:12:16 +00:00
Jeff Layton c3d17b63e5 cifs: convert cifs_writepages to use async writes
Have cifs_writepages issue asynchronous writes instead of waiting on
each write call to complete before issuing another. This also allows us
to return more quickly from writepages. It can just send out all of the
I/Os and not wait around for the replies.

In the WB_SYNC_ALL case, if the write completes with a retryable error,
then the completion workqueue job will resend the write.

This also changes the page locking semantics a little bit. Instead of
holding the page lock until the response is received, release it after
doing the send. This will reduce contention for the page lock and should
prevent processes that have the file mmap'ed from being blocked
unnecessarily.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25 20:05:03 +00:00
Pavel Shilovsky b2e5cd33b5 CIFS: Fix undefined behavior when mount fails
Fix double kfree() calls on the same pointers and cleanup mount code.

Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-25 20:02:31 +00:00
Linus Torvalds 57bb559574 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: fix cap flush race reentrancy
  libceph: subscribe to osdmap when cluster is full
  libceph: handle new osdmap down/state change encoding
  rbd: handle online resize of underlying rbd image
  ceph: avoid inode lookup on nfs fh reconnect
  ceph: use LOOKUPINO to make unconnected nfs fh more reliable
  rbd: use snprintf for disk->disk_name
  rbd: cleanup: make kfree match kmalloc
  rbd: warn on update_snaps failure on notify
  ceph: check return value for start_request in writepages
  ceph: remove useless check
  libceph: add missing breaks in addr_set_port
  libceph: fix TAG_WAIT case
  ceph: fix broken comparison in readdir loop
  libceph: fix osdmap timestamp assignment
  ceph: fix rare potential cap leak
  libceph: use snprintf for unknown addrs
  libceph: use snprintf for formatting object name
  ceph: use snprintf for dirstat content
  libceph: fix uninitialized value when no get_authorizer method is set
  ...
2011-05-25 11:46:31 -07:00
David S. Miller 22e95ac87d Merge branch 'for-davem' of ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 2011-05-25 13:28:55 -04:00
Phillip Lougher 1094a4a611 Squashfs: add extra sanity checks at mount time
Add some extra sanity checks of the inode and directory structures.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:33 +01:00
Phillip Lougher 1cac63cc9b Squashfs: add sanity checks to fragment reading at mount time
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc.  One of these is due to a corrupted superblock fragments field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.

Also add a couple of other sanity checks of the mount-time fragment table
structures.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:33 +01:00
Phillip Lougher ac51a0a713 Squashfs: add sanity checks to lookup table reading at mount time
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc.  One of these is due to a corrupted superblock inodes field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.

Also add a couple of other sanity checks of the mount-time lookup table
structures.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:32 +01:00
Phillip Lougher 37986f63c8 Squashfs: add sanity checks to id reading at mount time
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc.  One of these is due to a corrupted superblock no_ids field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.

Also add a couple of other sanity checks of the mount-time id table
structures.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:32 +01:00
Phillip Lougher 6f04864515 Squashfs: add sanity checks to xattr reading at mount time
These checks add sanity checking of the mount-time xattr structures.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:31 +01:00
Phillip Lougher 76e002f755 Squashfs: reverse order of filesystem table reading
Reverse order of table reading from mostly first to last in placement
order, to last to first.  This is to enable extra superblock sanity
checks to be added in later patches.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:31 +01:00
Phillip Lougher 82de647e1f Squashfs: move table allocation into squashfs_read_table()
This eliminates a lot of duplicate code.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2011-05-25 18:21:31 +01:00
Linus Torvalds 2a651c7f8d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: update Documentation pointers
  net/9p: enable 9p to work in non-default network namespace
  net/9p: p9_idpool_get return -1 on error
  fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
  9p: Small cleanup in <net/9p/9p.h>
  9p: remove experimental tag from tested configurations
  9p: typo fixes and minor cleanups
  net/9p: Change linuxdoc names to match functions.
2011-05-25 09:21:56 -07:00
Linus Torvalds c88bc60a3b Merge branch 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block:
  splice: add wakeup_pipe_readers()
2011-05-25 09:20:20 -07:00
Linus Torvalds 798ce8f1cc Merge branch 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
  cfq-iosched: free cic_index if cfqd allocation fails
  cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
  cfq-iosched: reduce bit operations in cfq_choose_req()
  cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
  blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
  block: move bd_set_size() above rescan_partitions() in __blkdev_get()
  block: call elv_bio_merged() when merged
  cfq-iosched: Make IO merge related stats per cpu
  cfq-iosched: Fix a memory leak of per cpu stats for root group
  backing-dev: Kill set but not used var in  bdi_debug_stats_show()
  block: get rid of on-stack plugging debug checks
  blk-throttle: Make no throttling rule group processing lockless
  blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
  blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
  blk-throttle: Make dispatch stats per cpu
  blk-throttle: Free up a group only after one rcu grace period
  blk-throttle: Use helper function to add root throtl group to lists
  blk-throttle: Introduce a helper function to fill in device details
  blk-throttle: Dynamically allocate root group
  blk-cgroup: Allow sleeping while dynamically allocating a group
  ...
2011-05-25 09:14:07 -07:00
Christoph Hellwig 233eebb9a9 xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent
The code in xfs_bmap_del_extent does not correctly decrement the
extent buffer index when deleting a whole extent.  Most of the time
this gets caught by checks in xfs_bmapi that work around it and
decrement it manually and thus wasn't noticed so far.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:38 -05:00
Christoph Hellwig 87bef1812d xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec
Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig ab1908a5bb xfs: fix up asserts in xfs_iflush_fork
Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext
with a potentially invalid extent buffer index.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig f1c63b73cf xfs: do not do pointer arithmetic on extent records
We need to call xfs_iext_get_ext for the previous extent to get a
valid pointer, and can't just do pointer arithmetics as they might
be in different pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig 00239acf36 xfs: do not use unchecked extent indices in xfs_bunmapi
Make sure to only call xfs_iext_get_ext after we've validate the
extent index when moving on to the next index in xfs_bunmapi.  Also
remove the old workaround for too large indices that has been
superceeded by the proper fix in xfs_bmap_del_extent.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig 5690f92199 xfs: do not use unchecked extent indices in xfs_bmapi
Make sure to only call xfs_iext_get_ext after we've validate the
extent index when moving on to the next index in xfs_bmapi.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig 2f2b3220b0 xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*
Make sure to only call xfs_iext_get_ext after we've validate the
extent index in the various xfs_bmap_add_extent_* helpers.

Based on an earlier patch from Lachlan McIlroy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig ec90c55634 xfs: remove if_lastex
The if_lastex field in struct xfs_ifork is only used as a temporary
index during xfs_bmapi and xfs_bunmapi.  Instead of using the inode
fork to store it keep it local in the callchain.  Fortunately this
is very easy as we already pass a stack copy of it down the whole
chain which can simplify be changed to be passed by reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:37 -05:00
Christoph Hellwig 548932739b xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag
The XFS_BMAPI_RSVBLOCKS is unused, and as far as I can see has
always been.  Remove it to simplify the bmapi implementation and
conserve stack space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-25 10:48:36 -05:00
Andrew Morton 2a5cac17c0 fs/ncpfs/inode.c: suppress used-uninitialised warning
We get this spurious warning:

  fs/ncpfs/inode.c: In function 'ncp_fill_super':
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
  fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
  ...

It's notabug, but we can easily fix it with a memset().

Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:56 -07:00
Amerigo Wang e50c1f609c fscache: remove dead code under CONFIG_WORKQUEUE_DEBUGFS
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:44 -07:00
Stephen Wilson 5b52fc890b proc: allocate storage for numa_maps statistics once
In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not a
candidate for stack allocation.

Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.

Performing the allocation when numa_maps is opened, and thus before a
reference to the target tasks mm is taken, eliminates a potential
stalemate condition in the oom-killer as originally described by Hugh
Dickins:

  ... imagine what happens if the system is out of memory, and the mm
  we're looking at is selected for killing by the OOM killer: while
  we wait in __get_free_page for more memory, no memory is freed
  from the selected mm because it cannot reach exit_mmap while we hold
  that reference.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:35 -07:00
Stephen Wilson f2beb79836 proc: make struct proc_maps_private truly private
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
there is no need to export struct proc_maps_private to the world.  Move it
to fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:35 -07:00
Stephen Wilson f69ff943df mm: proc: move show_numa_map() to fs/proc/task_mmu.c
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.

  - Having the show() operation "miles away" from the corresponding
    seq_file iteration operations is a maintenance burden.

  - The need to export ad hoc info like struct proc_maps_private is
    eliminated.

  - The implementation of show_numa_map() can be improved in a simple
    manner by cooperating with the other seq_file operations (start,
    stop, etc) -- something that would be messy to do without this
    change.

Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:34 -07:00
Eric Paris b09e0fa4b4 tmpfs: implement generic xattr support
Implement generic xattrs for tmpfs filesystems.  The Feodra project, while
trying to replace suid apps with file capabilities, realized that tmpfs,
which is used on the build systems, does not support file capabilities and
thus cannot be used to build packages which use file capabilities.  Xattrs
are also needed for overlayfs.

The xattr interface is a bit odd.  If a filesystem does not implement any
{get,set,list}xattr functions the VFS will call into some random LSM hooks
and the running LSM can then implement some method for handling xattrs.
SELinux for example provides a method to support security.selinux but no
other security.* xattrs.

As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have
xattr handler routines specifically to handle acls.  Because of this tmpfs
would loose the VFS/LSM helpers to support the running LSM.  To make up
for that tmpfs had stub functions that did nothing but call into the LSM
hooks which implement the helpers.

This new patch does not use the LSM fallback functions and instead just
implements a native get/set/list xattr feature for the full security.* and
trusted.* namespace like a normal filesystem.  This means that tmpfs can
now support both security.selinux and security.capability, which was not
previously possible.

The basic implementation is that I attach a:

struct shmem_xattr {
	struct list_head list; /* anchored by shmem_inode_info->xattr_list */
	char *name;
	size_t size;
	char value[0];
};

Into the struct shmem_inode_info for each xattr that is set.  This
implementation could easily support the user.* namespace as well, except
some care needs to be taken to prevent large amounts of unswappable memory
being allocated for unprivileged users.

[mszeredi@suse.cz: new config option, suport trusted.*, support symlinks]
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:31 -07:00
Ying Han 1495f230fa vmscan: change shrinker API by passing shrink_control struct
Change each shrinker's API by consolidating the existing parameters into
shrink_control struct.  This will simplify any further features added w/o
touching each file of shrinker.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:26 -07:00
Ying Han a09ed5e000 vmscan: change shrink_slab() interfaces by passing shrink_control
Consolidate the existing parameters to shrink_slab() into a new
shrink_control struct.  This is needed later to pass the same struct to
shrinkers.

Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:25 -07:00
Peter Zijlstra 3d48ae45e7 mm: Convert i_mmap_lock to a mutex
Straightforward conversion of i_mmap_lock to a mutex.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:18 -07:00
Peter Zijlstra 97a894136f mm: Remove i_mmap_lock lockbreak
Hugh says:
 "The only significant loser, I think, would be page reclaim (when
  concurrent with truncation): could spin for a long time waiting for
  the i_mmap_mutex it expects would soon be dropped? "

Counter points:
 - cpu contention makes the spin stop (need_resched())
 - zap pages should be freeing pages at a higher rate than reclaim
   ever can

I think the simplification of the truncate code is definitely worth it.

Effectively reverts: 2aa15890f3 ("mm: prevent concurrent
unmap_mapping_range() on the same inode") and takes out the code that
caused its problem.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:17 -07:00
Peter Zijlstra d16dfc550f mm: mmu_gather rework
Rework the existing mmu_gather infrastructure.

The direct purpose of these patches was to allow preemptible mmu_gather,
but even without that I think these patches provide an improvement to the
status quo.

The first 9 patches rework the mmu_gather infrastructure.  For review
purpose I've split them into generic and per-arch patches with the last of
those a generic cleanup.

The next patch provides generic RCU page-table freeing, and the followup
is a patch converting s390 to use this.  I've also got 4 patches from
DaveM lined up (not included in this series) that uses this to implement
gup_fast() for sparc64.

Then there is one patch that extends the generic mmu_gather batching.

After that follow the mm preemptibility patches, these make part of the mm
a lot more preemptible.  It converts i_mmap_lock and anon_vma->lock to
mutexes which together with the mmu_gather rework makes mmu_gather
preemptible as well.

Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

This also allows for preemptible mmu_notifiers, something that XPMEM I
think wants.

Furthermore, it removes the new and universially detested unmap_mutex.

This patch:

Remove the first obstacle towards a fully preemptible mmu_gather.

The current scheme assumes mmu_gather is always done with preemption
disabled and uses per-cpu storage for the page batches.  Change this to
try and allocate a page for batching and in case of failure, use a small
on-stack array to make some progress.

Preemptible mmu_gather is desired in general and usable once i_mmap_lock
becomes a mutex.  Doing it before the mutex conversion saves us from
having to rework the code by moving the mmu_gather bits inside the
pte_lock.

Also avoid flushing the tlb batches from under the pte lock, this is
useful even without the i_mmap_lock conversion as it significantly reduces
pte lock hold times.

[akpm@linux-foundation.org: fix comment tpyo]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:12 -07:00
Michal Hocko d05f3169c0 mm: make expand_downwards() symmetrical with expand_upwards()
Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.

check_stack_guard_page is a nice example of the asymmetry.  It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.

Let's clean this up by exporting both functions and make those names
consistent.  Let's use expand_{upwards,downwards} because expanding
doesn't always involve stack manipulation (an example is
ia64_do_page_fault which uses expand_upwards for registers backing store
expansion).  expand_downwards has to be defined for both
CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
version in the early process initialization phase for growsup
configuration.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:12 -07:00
Aneesh Kumar K.V 398c4f0efb fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
The dentry fid get clunked via the dput.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25 08:46:38 -05:00
Eric Van Hensbergen 87211cd8db 9p: remove experimental tag from tested configurations
The 9p client is currently undergoing regular regresssion and
stress testing as a by-product of the virtfs work.  I think its
finally time to take off the experimental tags from the well-tested
code paths.

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2011-05-25 08:46:38 -05:00
Vivek Haldar 556b27abf7 ext4: do not normalize block requests from fallocate()
Currently, an fallocate request of size slightly larger than a power of
2 is turned into two block requests, each a power of 2, with the extra
blocks pre-allocated for future use. When an application calls
fallocate, it already has an idea about how large the file may grow so
there is usually little benefit to reserve extra blocks on the
preallocation list. This reduces disk fragmentation.

Tested: fsstress. Also verified manually that fallocat'ed files are
contiguously laid out with this change (whereas without it they begin at
power-of-2 boundaries, leaving blocks in between). CPU usage of
fallocate is not appreciably higher.  In a tight fallocate loop, CPU
usage hovers between 5%-8% with this change, and 5%-7% without it.

Using a simulated file system aging program which the file system to
70%, the percentage of free extents larger than 8MB (as measured by
e2freefrag) increased from 38.8% without this change, to 69.4% with
this change.

Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 07:41:54 -04:00
Allison Henderson a4bb6b64e3 ext4: enable "punch hole" functionality
This patch adds new routines: "ext4_punch_hole" "ext4_ext_punch_hole"
and "ext4_ext_check_cache"

fallocate has been modified to call ext4_punch_hole when the punch hole
flag is passed.  At the moment, we only support punching holes in
extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole
routine.

The ext4_ext_punch_hole routine first completes all outstanding writes
with the associated pages, and then releases them.  The unblock
aligned data is zeroed, and all blocks in between are punched out.

The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
except it accepts a ext4_ext_cache parameter instead of a ext4_extent
parameter.  This routine is used by ext4_ext_punch_hole to check and
see if a block in a hole that has been cached.  The ext4_ext_cache
parameter is necessary because the members ext4_extent structure are
not large enough to hold a 32 bit value.  The existing
ext4_ext_in_cache routine has become a wrapper to this new function.

[ext4 punch hole patch series 5/5 v7] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:50 -04:00
Allison Henderson e861304b8e ext4: add "punch hole" flag to ext4_map_blocks()
This patch adds a new flag to ext4_map_blocks() that specifies the
given range of blocks should be punched out.  Extents are first
converted to uninitialized extents before they are punched
out. Because punching a hole may require that the extent be split, it
is possible that the splitting may need more blocks than are
available.  To deal with this, use of reserved blocks are enabled to
allow the split to proceed.

The routine then returns the number of blocks successfully
punched out.

[ext4 punch hole patch series 4/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:46 -04:00
Allison Henderson d583fb87a3 ext4: punch out extents
This patch modifies the truncate routines to support hole punching
Below is a brief summary of the patches changes:

- Added end param to ext_ext4_rm_leaf
        This function has been modified to accept an end parameter
        which enables it to punch holes in leafs instead of just
        truncating them.

- Implemented the "remove head" case in the ext_remove_blocks routine
        This routine is used by ext_ext4_rm_leaf to remove the tail
        of an extent during a truncate.  The new ext_ext4_rm_leaf
        routine will now also use it to remove the head of an extent in the
        case that the hole covers a region of blocks at the beginning
        of an extent.

- Added "end" param to ext4_ext_remove_space routine
        This function has been modified to accept a stop parameter, which
        is passed through to ext4_ext_rm_leaf.

[ext4 punch hole patch series 3/5 v6] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-25 07:41:43 -04:00
Allison Henderson 308488518d ext4: add new function ext4_block_zero_page_range()
This patch modifies the existing ext4_block_truncate_page() function
which was used by the truncate code path, and which zeroes out block
unaligned data, by adding a new length parameter, and renames it to
ext4_block_zero_page_rage().  This function can now be used to zero out the
head of a block, the tail of a block, or the middle
of a block.

The ext4_block_truncate_page() function is now a wrapper to
ext4_block_zero_page_range().

[ext4 punch hole patch series 2/5 v7] 

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:32 -04:00
Allison Henderson 55f020db66 ext4: add flag to ext4_has_free_blocks
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks.  This will allow a
punch hole to proceed even if the disk is full.  Punching a hole may
require additional blocks to first split the extents.

Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:

ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:26 -04:00
Tristan Ye dda54e76d7 Ocfs2/move_extents: Set several trivial constraints for threshold.
The threshold should be greater than clustersize and less than i_size.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:13 +08:00
Tristan Ye 4dfa66bd59 Ocfs2/move_extents: Let defrag handle partial extent moving.
We're going to support partial extent moving, which may split entire extent
movement into pieces to compromise the insuffice allocations, it eases the
'ENSPC' pain and makes the whole moving much less likely to fail, the downside
is it may make the fs even more fragmented before moving, just let the userspace
make a trade-off here.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye 53069d4e76 Ocfs2/move_extents: move/defrag extents within a certain range.
the basic logic of moving extents for a file is pretty like punching-hole
sequence, walk the extents within the range as user specified, calculating
an appropriate len to defrag/move, then let ocfs2_defrag/move_extent() to
do the actual moving.

This func ends up setting 'OCFS2_MOVE_EXT_FL_COMPLETE' to userpace if operation
gets done successfully.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye ee16cc037e Ocfs2/move_extents: helper to calculate the defraging length in one run.
The helper is to calculate the defrag length in one run according to a threshold,
it will proceed doing defragmentation until the threshold was meet, and skip a
LARGE extent if any.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:12 +08:00
Tristan Ye e08477176d Ocfs2/move_extents: move entire/partial extent.
ocfs2_move_extent() logic will validate the goal_offset_in_block,
where extents to be moved, what's more, it also compromises a bit
to probe the appropriate region around given goal_offset when the
original goal is not able to fit the movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye 8473aa8a2b Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
These helpers were actually borrowed from alloc.c, which may be publicized
later.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye e6b5859ccc Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
Before doing the movement of extents, we'd better probe the alloc group from
'goal_blk' for searching a contiguous region to fit the wanted movement, we
even will have a best-effort try by compromising to a threshold around the
given goal.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:11 +08:00
Tristan Ye 99e4c75041 Ocfs2/move_extents: helper to validate and adjust moving goal.
First best-effort attempt to validate and adjust the goal (physical address in
block), while it can't guarantee later operation can succeed all the time since
global_bitmap may change a bit over time.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye 1c06b91261 Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
This function tries locate the right alloc group, where a given physical block
resides, it returns the caller a buffer_head of victim group descriptor, and also
the offset of block in this group, by passing the block number.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:10 +08:00
Tristan Ye 202ee5facb Ocfs2/move_extents: defrag a range of extent.
It's a relatively complete function to accomplish defragmentation for entire
or partial extent, one journal handle was kept during the operation, it was
logically doing one more thing than ocfs2_move_extent() acutally, yes, it's
claiming the new clusters itself;-)

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye 8f603e567a Ocfs2/move_extents: move a range of extent.
The moving range of __ocfs2_move_extent() was within one extent always, it
consists following parts:

1. Duplicates the clusters in pages to new_blkoffset, where extent to be moved.

2. Split the original extent with new extent, coalecse the nearby extents if possible.

3. Append old clusters to truncate log, or decrease_refcount if the extent was refcounted.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye de474ee8bb Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
ocfs2_lock_allocators_move_extents() was like the common ocfs2_lock_allocators(),
to lock metadata and data alloctors during extents moving, reserve appropriate
metadata blocks and data clusters, also performa a best- effort to calculate the
credits for journal transaction in one run of movement.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:09 +08:00
Tristan Ye 028ba5df63 Ocfs2/move_extents: Add basic framework and source files for extent moving.
Adding new files move_extents.[c|h] and fill it with nothing but
only a context structure.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye 220ebc4334 Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
Patch also manages to add a manipulative struture for this ioctl.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye 3e19a25e05 Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
The original goal of commonizing these funcs is to benefit defraging/extent_moving
codes in the future,  based on the fact that reflink and defragmentation having
the same Copy-On-Wrtie mechanism.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 15:17:08 +08:00
Tristan Ye d24a10b9f8 Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
This new code is a bit more complicated than former ones, the goal is to
show user all statistics required to take a deep insight into filesystem
on how the disk is being fragmentaed.

The goal is achieved by scaning global bitmap from (cluster)group to group
to figure out following factors in the filesystem:

        - How many free chunks in a fixed size as user requested.
        - How many real free chunks in all size.
        - Min/Max/Avg size(in) clusters of free chunks.
        - How do free chunks distribute(in size) in terms of a histogram,
          just like following:
          ---------------------------------------------------------
          Extent Size Range :  Free extents  Free Clusters  Percent
             32K...   64K-  :             1             1    0.00%
              1M...    2M-  :             9           288    0.03%
              8M...   16M-  :             2           831    0.09%
             32M...   64M-  :             1          2047    0.23%
            128M...  256M-  :             1          8191    0.92%
            256M...  512M-  :             2         21706    2.43%
            512M... 1024M-  :            27        858623   96.29%
          ---------------------------------------------------------

Userspace ioctl() call eventually gets the above info returned by passing
a 'struct ocfs2_info_freefrag' with the chunk_size being specified first.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:07 +08:00
Tristan Ye 3e5db17d4d Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
The new code is dedicated to calculate free inodes number of all inode_allocs,
then return the info to userpace in terms of an array.

Specially, flag 'OCFS2_INFO_FL_NON_COHERENT', manipulated by '--cluster-coherent'
from userspace, is now going to be involved. setting the flag on means no cluster
coherency considered, usually, userspace tools choose none-coherency strategy by
default for the sake of performace.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:18:02 +08:00
Tristan Ye 8aa1fa360d Ocfs2: Using inline funcs to set/clear *FILLED* flags in info handler.
It just removes some macros for the sake of typechecking gains.

Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
2011-05-25 12:17:18 +08:00
Grant Erickson 1ddd0d9a31 JFFS2: retry large buffer allocations
Replace direct call to kmalloc for a potentially large, contiguous
buffer allocation with one to mtd_kmalloc_up_to which helps ensure the
operation can succeed under low-memory, highly- fragmented situations
albeit somewhat more slowly.

Signed-off-by: Grant Erickson <marathon96@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-05-25 02:00:50 +01:00
Sergey Senozhatsky 8cb2a180ab jffs2: remove unused variables
Remove unused 'jffs2_sb_info *c' variable from 'jffs2_lookup()'
and 'jffs2_readdir()'.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-05-25 01:47:39 +01:00
Aditya Kali ae81230686 ext4: reserve inodes and feature code for 'quota' feature
I am working on patch to add quota as a built-in feature for ext4
filesystem. The implementation is based on the design given at
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 19:00:39 -04:00
Johann Lombardi c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Eric W. Biederman 62ca24baf1 ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
Spotted-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2011-05-24 15:30:33 -07:00
Kazuya Mio d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Lukas Czerner 28739eea9c ext4: protect bb_first_free in ext4_trim_all_free() with group lock
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do right thing is really simple, so lets do
that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:28:07 -04:00
Lukas Czerner 7894408666 ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
Currently we are loading buddy ext4_mb_load_buddy() for every block
group we are going through in ext4_trim_fs() in many cases just to find
out that there is not enough space to be bothered with. As Amir Goldstein
suggested we can use bb_free information directly from ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather
get the ext4_group_info via ext4_get_group_info() and use the bb_free
information directly from that. This avoids unnecessary call to load
buddy in the case the group does not have enough free space to trim.
Loading buddy is now moved to ext4_trim_all_free().

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:16:27 -04:00
Linus Torvalds dc522adbee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: Fix comment to match the code in journal_start()
  jbd/jbd2: remove obsolete summarise_journal_usage.
  jbd: Fix forever sleeping process in do_get_write_access()
  ext2: fix error msg when mounting fs with too-large blocksize
  jbd: fix fsync() tid wraparound bug
  ext3: Fix fs corruption when make_indexed_dir() fails
  ext3: Fix lock inversion in ext3_symlink()
2011-05-24 15:11:46 -07:00
Linus Torvalds df3256f9ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: make plock operation killable
  dlm: remove shared message stub for recovery
  dlm: delayed reply message warning
  dlm: Remove superfluous call to recalc_sigpending()
2011-05-24 15:04:00 -07:00
Eryu Guan c867516de5 jbd2: Fix comment to match the code in jbd2__journal_start()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 17:09:58 -04:00
John W. Linville 31ec97d9ce Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem 2011-05-24 16:47:54 -04:00
Linus Torvalds b0ca118dba Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits)
  TOMOYO: Fix wrong domainname validation.
  SELINUX: add /sys/fs/selinux mount point to put selinuxfs
  CRED: Fix load_flat_shared_library() to initialise bprm correctly
  SELinux: introduce path_has_perm
  flex_array: allow 0 length elements
  flex_arrays: allow zero length flex arrays
  flex_array: flex_array_prealloc takes a number of elements, not an end
  SELinux: pass last path component in may_create
  SELinux: put name based create rules in a hashtable
  SELinux: generic hashtab entry counter
  SELinux: calculate and print hashtab stats with a generic function
  SELinux: skip filename trans rules if ttype does not match parent dir
  SELinux: rename filename_compute_type argument to *type instead of *con
  SELinux: fix comment to state filename_compute_type takes an objname not a qstr
  SMACK: smack_file_lock can use the struct path
  LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH
  LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE
  SELINUX: Make selinux cache VFS RCU walks safe
  SECURITY: Move exec_permission RCU checks into security modules
  SELinux: security_read_policy should take a size_t not ssize_t
  ...
2011-05-24 13:38:19 -07:00
Sage Weil db3540522e ceph: fix cap flush race reentrancy
In e9964c10 we change cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure.  It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.

Instead, move inodes with dirty to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant).  This is cleaner and more
robust.

Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:12 -07:00
Sage Weil 45e3d3eeb6 ceph: avoid inode lookup on nfs fh reconnect
If we get the inode from the MDS, we have a reference in req; don't do a
fresh lookup.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:06 -07:00
Sage Weil 3c454cf216 ceph: use LOOKUPINO to make unconnected nfs fh more reliable
If we are unable to locate an inode by ino, ask the MDS using the new
LOOKUPINO command.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:05 -07:00
Linus Torvalds eb08d8ff47 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
  UBIFS: switch to dynamic printks
  UBIFS: fix kernel-doc comments
  UBIFS: fix extremely rare mount failure
  UBIFS: simplify LEB recovery function further
  UBIFS: always cleanup the recovered LEB
  UBIFS: clean up LEB recovery function
  UBIFS: fix-up free space on mount if flag is set
  UBIFS: add the fixup function
  UBIFS: add a superblock flag for free space fix-up
  UBIFS: share the next_log_lnum helper
  UBIFS: expect corruption only in last journal head LEBs
  UBIFS: synchronize write-buffer before switching to the next bud
  UBIFS: remove BUG statement
  UBIFS: change bud replay function conventions
  UBIFS: substitute the replay tree with a replay list
  UBIFS: simplify replay
  UBIFS: store free and dirty space in the bud replay entry
  UBIFS: remove unnecessary stack variable
  UBIFS: double check that buds are replied in order
  UBIFS: make 2 functions static
  ...
2011-05-24 11:51:07 -07:00
Christoph Hellwig 55a7bc5a30 xfs: do not discard alloc btree blocks
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused.  Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.

Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks.  In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard.  We could
fix this by adding another flag and keeping these block in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL.  Given that this would cause
significant overhead I don't think it's worthwile for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:22 -05:00
Christoph Hellwig e84661aa84 xfs: add online discard support
Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard.  Note that we don't bother
supporting discard for the non-delaylog mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-24 11:17:13 -05:00
Jan Kara 93628ffb9b ext4: fix waiting and sending of a barrier in ext4_sync_file()
jbd2_log_start_commit() returns 1 only when we really start a
transaction.  But we also need to wait for a transaction when the
commit is already running.  Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).

Also we have to be more careful with sending of a barrier because when
transaction is being committed in parallel to ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for fsync (note that not every
data writeout needs to trigger metadata changes thus commit of some
metadata changes can be running while other data is still written
out). So use jbd2_will_send_data_barrier() helper to detect the common
cases when we can be sure barrier will be issued by the commit code
and issue the barrier ourselves in the remaining cases.

Reported-by: Edward Goggin <egoggin@vmware.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 12:00:54 -04:00
Jan Kara bbd2be3691 jbd2: Add function jbd2_trans_will_send_data_barrier()
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device.  The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:59:18 -04:00
Jan Kara 81be12c817 jbd2: fix sending of data flush on journal commit
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit.  In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).

Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:52:40 -04:00
Yongqiang Yang b221349fa8 ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
To get delayed-extent information, ext4_ext_fiemap_cb() looks up
pagecache, it thus collects information starting from a page's
head block.

If blocksize < pagesize, the beginning blocks of a page may lies
before the request range. So ext4_ext_fiemap_cb() should proceed
ignoring them, because they has been handled before. If no mapped
buffer in the range is found in the 1st page, we need to look up
the 2nd page, otherwise delayed-extents after a hole will be ignored.

Without this patch, xfstests 225 will hung on ext4 with 1K block.

Reported-by: Amir Goldstein <amir73il@users.sourceforge.net>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 11:36:58 -04:00
James Morris 434d42cfd0 Merge branch 'next' into for-linus 2011-05-24 22:55:24 +10:00
Robin Dong 98ba073c60 ocfs2: change incorrect 'extern' keyword to 'static' in dlmfs
Change function param_set_dlmfs_capabilities from 'extern' to 'static' since
function param_get_dlmfs_capabilities is also 'static'.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:59:40 -07:00
Sunil Mushran 9f62e96084 ocfs2/dlm: dlm_is_lockres_migrateable() returns boolean
Patch cleans up the gunk added by commit 388c4bcb4e.
dlm_is_lockres_migrateable() now returns 1 if lockresource is deemed
migrateable and 0 if not.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:39 -07:00
Tao Ma 10fca35ff1 ocfs2: Add trace event for trim.
Add the corresponding trace event for trim.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:20 -07:00
Tao Ma 55e67872b6 ocfs2: Add FITRIM ioctl.
Add the corresponding ioctl function for FITRIM.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:19 -07:00
Tao Ma e80de36d8d ocfs2: Add ocfs2_trim_fs for SSD trim support.
Add ocfs2_trim_fs to support trimming freed clusters in the
volume. A range will be given and all the freed clusters greater
than minlen will be discarded to the block layer.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:18 -07:00
Amerigo Wang 69a60c4d17 ocfs2: remove the /sys/o2cb symlink
It is obsoleted since Dec 2005.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:14 -07:00
Tiger Yang e2b0c215c2 ocfs2: clean up mount option about atime in ocfs2.txt
As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
2011-05-23 23:37:12 -07:00
Linus Torvalds 343800e7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: Fix statfs->f_namelen
  fat: Replace all printk with fat_msg()
  fat: Add fat_msg() function for preformated FAT messages
  fat: Convert fat_fs_error to use %pV
  fat: Fix possible null deref in fat_cache_add()
  fat: use new setup() for ->dir_ops too
2011-05-23 21:11:38 -07:00
Jeff Layton 3c1105df69 cifs: don't call mid_q_entry->callback under the Global_MidLock (try #5)
Minor revision to the last version of this patch -- the only difference
is the fix to the cFYI statement in cifs_reconnect.

Holding the spinlock while we call this function means that it can't
sleep, which really limits what it can do. Taking it out from under
the spinlock also means less contention for this global lock.

Change the semantics such that the Global_MidLock is not held when
the callback is called. To do this requires that we take extra care
not to have sync_mid_result remove the mid from the list when the
mid is in a state where that has already happened. This prevents
list corruption when the mid is sitting on a private list for
reconnect or when cifsd is coming down.

Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-24 03:11:33 +00:00
Pavel Shilovsky 724d9f1cfb CIFS: Simplify mount code for further shared sb capability
Reorganize code to get mount option at first and when get a superblock.
This lets us use shared superblock model further for equal mounts.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-24 03:07:42 +00:00
Eryu Guan c2b67735e5 jbd: Fix comment to match the code in journal_start()
journal_start returns an ERR_PTR() value rather than NULL on failure.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-05-24 00:27:53 +02:00
Linus Torvalds a77febbef1 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: obey minleft values during extent allocation correctly
  xfs: reset buffer pointers before freeing them
  xfs: avoid getting stuck during async inode flushes
  xfs: fix xfs_itruncate_start tracing
  xfs: fix duplicate workqueue initialisation
  xfs: kill off xfs_printk()
  xfs: fix race condition in AIL push trigger
  xfs: make AIL target updates and compares 32bit safe.
  xfs: always push the AIL to the target
  xfs: exit AIL push work correctly when AIL is empty
  xfs: ensure reclaim cursor is reset correctly at end of AG
  xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE
  xfs: fix compiler warning in xfs_trace.h
  xfs: cleanup duplicate initializations
  xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy
  xfs: exact busy extent tracking
  xfs: do not immediately reuse busy extent ranges
  xfs: optimize AGFL refills
2011-05-23 15:19:16 -07:00
Linus Torvalds 99dff58562 Merge branch 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
  serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
  tty/serial: Fix break handling for PORT_TEGRA
  tty/serial: Add explicit PORT_TEGRA type
  n_tracerouter and n_tracesink ldisc additions.
  Intel PTI implementaiton of MIPI 1149.7.
  Kernel documentation for the PTI feature.
  export kernel call get_task_comm().
  tty: Remove to support serial for S5P6442
  pch_phub: Support new device ML7223
  8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
  ASoC: Update cx20442 for TTY API change
  pch_uart: Support new device ML7223 IOH
  parport: Use request_muxed_region for IT87 probe and lock
  tty/serial: add support for Xilinx PS UART
  n_gsm: Use print_hex_dump_bytes
  drivers/tty/moxa.c: Put correct tty value
  TTY: tty_io, annotate locking functions
  TTY: serial_core, remove superfluous set_task_state
  TTY: serial_core, remove invalid test
  Char: moxa, fix locking in moxa_write
  ...

Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
drivers/tty/serial/Makefile.

I did the hci_ldisc thing as an evil merge, cleaning things up.
2011-05-23 12:23:20 -07:00
Theodore Ts'o 072bd7ea74 ext4: use truncate_setsize() unconditionally
In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
before calling vmtruncate().  This caused any allocated but unwritten
blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
flag to be dropped.  This was done to make to make sure that
EOFBLOCKS_FL would not be cleared while still leaving blocks past
i_size allocated.  This was not necessary, since ext4_truncate()
guarantees that blocks past i_size will be dropped, even in the case
where truncate() has increased i_size before calling ext4_truncate().

So fix this by removing the EOFBLOCKS_FL special case treatment in
ext4_setattr().  In addition, use truncate_setsize() followed by a
call to ext4_truncate() instead of using vmtruncate().  This is more
efficient since it skips the call to inode_newsize_ok(), which has
been checked already by inode_change_ok().  This is also in a win in
the case where EOFBLOCKS_FL is set since it avoids calling
ext4_truncate() twice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-23 15:13:02 -04:00
Pavel Shilovsky 37bb04e5a0 CIFS: Simplify connection structure search calls
Use separate functions for comparison between existing structure
and what we are requesting for to make server, session and tcon
search code easier to use on next superblock match call.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-23 19:05:09 +00:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Linus Torvalds 30cb6d5f2e Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimers: Reorder clock bases
  hrtimers: Avoid touching inactive timer bases
  hrtimers: Make struct hrtimer_cpu_base layout less stupid
  timerfd: Manage cancelable timers in timerfd
  clockevents: Move C3 stop test outside lock
  alarmtimer: Drop device refcount after rtc_open()
  alarmtimer: Check return value of class_find_device()
  timerfd: Allow timers to be cancelled when clock was set
  hrtimers: Prepare for cancel on clock was set timers
2011-05-23 11:30:28 -07:00
Christoph Hellwig c02324a6ae cifs: remove unused SMB2 config and mount options
There's no SMB2 support in the CIFS filesystem driver, so there's no need to
have a config and mount option for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-23 18:08:05 +00:00
Namhyung Kim 825cdcb1a5 splice: add wakeup_pipe_readers()
Add and use wakeup_pipe_readers() to consolidate duplicated codes.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 19:58:53 +02:00
Xiao Guangrong 1f78160ce1 Btrfs: using rcu lock in the reader side of devices list
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong 4622470565 Btrfs: drop unnecessary device lock
Drop device_list_mutex for the reader side  on clone_fs_devices and
btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device
list is not updated

btrfs_close_extra_devices is the initialized path, we can not add or remove
device at this time, so we can simply drop the mutex safely, like other
initialized function does(add_missing_dev, __find_device, __btrfs_open_devices
...).

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong 0c1daee085 Btrfs: fix the race between remove dev and alloc chunk
On remove device path, it updates device->dev_alloc_list but does not hold
chunk lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong c9513edb00 Btrfs: fix the race between reading and updating devices
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices

On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:42 -04:00
Xiao Guangrong 4f6c9328c6 Btrfs: fix bh leak on __btrfs_open_devices path
'bh' is forgot to release if no error is detected

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:42 -04:00
Xiao Guangrong c7f895a2b2 Btrfs: fix unsafe usage of merge_state
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Xiao Guangrong 8233767a22 Btrfs: allocate extent state and check the result properly
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
  allowed wait

- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
  we trigger BUG_ON() if it's fail

- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
  the return value of clear_extent_bit() is ignored by many callers

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Julia Lawall b083916638 fs/btrfs: Add missing btrfs_free_path
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@

x = btrfs_alloc_path@p1(...)
...  when != btrfs_free_path(x,...)
     when != if (...) { ... btrfs_free_path(x,...) ...}
     when != x = ra
if(...) { ... when != x = rb
     when forall
     when != btrfs_free_path(x,...)
 \(return <+...x...+>; \| return@p2...; \) }

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Tsutomu Itoh 37daa4f968 Btrfs: check return value of btrfs_inc_extent_ref()
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:40 -04:00
Tsutomu Itoh c00e9493f1 Btrfs: return error to caller if read_one_inode() fails
When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:40 -04:00
Tsutomu Itoh 1cd307990d Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Tsutomu Itoh 65a246c5ff Btrfs: return error code to caller when btrfs_del_item fails
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Tsutomu Itoh b0b802d7e3 Btrfs: return error code to caller when btrfs_previous_item fails
The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Sergei Trofimovich 27160b6b5a btrfs: fix typo 'testeing' -> 'testing'
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:32 -04:00
Sergei Trofimovich 9694b3fcbb btrfs: typo: 'btrfS' -> 'btrfs'
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:25 -04:00
Sergei Trofimovich c4f675cd40 btrfs: don't spin in shrink_delalloc if there is nothing to free
Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
   $ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
   $ mkfs.btrfs --mixed 2G.img
   $ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
   $ dd if=/dev/urandom of=10M.file bs=10240 count=1024
   $ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done

Up to '200.copy' it goes fast, but when disk fills-up each -ENOSPC
message takes 3 seconds to pop-up _every_ ENOSPC (and in usermode linux
it's even more: 30-60 seconds!). (Maybe, time depends on kernel's timer resolution).

No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:14 -04:00
Jamey Sharp 0f3b708c11 btrfs: Delete unused version.sh script.
In 2008, commit b4f6c45dfb dropped the use
of fs/btrfs/version.sh, but left the script behind. Kill it.

Commit by Jamey Sharp and Josh Triplett.

Signed-off-by: Jamey Sharp <jamey@minilop.net>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:05:39 -04:00
Hugo Mills e215686715 btrfs: Ensure the tree search ioctl returns the right number of records
Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.

Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:05:39 -04:00
Andi Kleen 0956c798ef BTRFS: Remove unused node_lock
240f62c875 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:05:39 -04:00
Josef Bacik d90c732122 Btrfs: leave spinning on lookup and map the leaf
On lookup we only want to read the inode item, so leave the path spinning.  Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:17 -04:00
Josef Bacik 207dde8289 Btrfs: check for duplicate entries in the free space cache
If there are duplicate entries in the free space cache, discard the entire cache
and load it the old fashioned way.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:16 -04:00
Josef Bacik cca1c81f43 Btrfs: don't try to allocate from a block group that doesn't have enough space
If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups.  So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:15 -04:00
Josef Bacik 026fd31782 Btrfs: don't always do readahead
Our readahead is sort of sloppy, and really isn't always needed.  For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat.  Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes.  This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:14 -04:00
Josef Bacik 589d8ade83 Btrfs: try not to sleep as much when doing slow caching
When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit.  So
instead just let the caching kthread keep going and only yeild if
need_resched().  This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:13 -04:00
Josef Bacik d82a6f1d7e Btrfs: kill BTRFS_I(inode)->block_group
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull.  In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable.  So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:12 -04:00
Josef Bacik 7e2355ba1a Btrfs: don't look at the extent buffer level 3 times in a row
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing.  I don't think I've
ever seen this tripped, so kill it.  This saves us 2 kmap's per level in our
search.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:11 -04:00
Josef Bacik cb25c2ea6a Btrfs: map the node block when looking for readahead targets
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_.  So go ahead and map the
extent buffer while we look for readahead targets.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:10 -04:00
Josef Bacik af60bed24e Btrfs: set range_start to the right start in count_range_bits
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results.  For example, if the range

[0-8192]

is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096].  So instead set range_start = max(cur_start, state->start).
This makes everything come out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:09 -04:00
Josef Bacik fcb80c2aff Btrfs: fix how we do space reservation for truncate
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up.  This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen.  This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen.  So in order
to fix this we need to split all of the reservation stuf up.  So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:08 -04:00
Josef Bacik a4abeea41a Btrfs: kill trans_mutex
We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction.  So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction.  All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list.  This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list.  This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff.  This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join.  Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work.  So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik 2a1eb4614d Btrfs: if we've already started a trans handle, use that one
We currently track trans handles in current->journal_info, but we don't actually
use it.  This patch fixes it.  This will cover the case where we have multiple
people starting transactions down the call chain.  This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return.  I tested this with xfstests
and it worked out fine.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:57 -04:00
Josef Bacik 7a7eaa40a3 Btrfs: take away the num_items argument from btrfs_join_transaction
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :).  So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Josef Bacik 74b2107543 Btrfs: make sure to use the delalloc reserve when filling delalloc
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Linus Torvalds 57d19e80f4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  b43: fix comment typo reqest -> request
  Haavard Skinnemoen has left Atmel
  cris: typo in mach-fs Makefile
  Kconfig: fix copy/paste-ism for dell-wmi-aio driver
  doc: timers-howto: fix a typo ("unsgined")
  perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
  md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
  treewide: fix a few typos in comments
  regulator: change debug statement be consistent with the style of the rest
  Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
  audit: acquire creds selectively to reduce atomic op overhead
  rtlwifi: don't touch with treewide double semicolon removal
  treewide: cleanup continuations and remove logging message whitespace
  ath9k_hw: don't touch with treewide double semicolon removal
  include/linux/leds-regulator.h: fix syntax in example code
  tty: fix typo in descripton of tty_termios_encode_baud_rate
  xtensa: remove obsolete BKL kernel option from defconfig
  m68k: fix comment typo 'occcured'
  arch:Kconfig.locks Remove unused config option.
  treewide: remove extra semicolons
  ...
2011-05-23 09:12:26 -07:00
Tejun Heo ff2a9941ca block: move bd_set_size() above rescan_partitions() in __blkdev_get()
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check.  As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

  sda: detected capacity change from 0 to 146815737856
  sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and remove the warning
messages.

stable: Please apply together with 02e352287a (block: rescan
        partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-23 08:50:48 -07:00
David Teigland 901025d2f3 dlm: make plock operation killable
Allow processes blocked on plock requests to be interrupted
when they are killed.  This leaves the problem of cleaning
up the lock state in userspace.  This has three parts:

1. Add a flag to unlock operations sent to userspace
indicating the file is being closed.  Userspace will
then look for and clear any waiting plock operations that
were abandoned by an interrupted process.

2. Queue an unlock-close operation (like in 1) to clean up
userspace from an interrupted plock request.  This is needed
because the vfs will not send a cleanup-unlock if it sees no
locks on the file, which it won't if the interrupted operation
was the only one.

3. Do not use replies from userspace for unlock-close operations
because they are unnecessary (they are just cleaning up for the
process which did not make an unlock call).  This also simplifies
the new unlock-close generated from point 2.

Signed-off-by: David Teigland <teigland@redhat.com>
2011-05-23 10:47:06 -05:00
Linus Torvalds 4d9dec4db2 Merge branch 'exec_rm_compat' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc
* 'exec_rm_compat' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc:
  exec: document acct_arg_size()
  exec: unify do_execve/compat_do_execve code
  exec: introduce struct user_arg_ptr
  exec: introduce get_user_arg_ptr() helper
2011-05-23 08:28:34 -07:00
Linus Torvalds 34b064569e Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Wait properly when flushing the ail list
  GFS2: Wipe directory hash table metadata when deallocating a directory
2011-05-23 08:24:09 -07:00
liubo 8e531cdfeb Btrfs: do not flush csum items of unchanged file data during treelog
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.

During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data.  Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.

Here I have some test results to show, I use sysbench to do "random write + fsync".

===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags=  [prepare, run]
===

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 10000
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch:     (*SPEED* : 4.7533Mb/sec)
   1216.84 Requests/sec executed

PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 10:13:16 -04:00
Thomas Gleixner 9ec2690758 timerfd: Manage cancelable timers in timerfd
Peter is concerned about the extra scan of CLOCK_REALTIME_COS in the
timer interrupt. Yes, I did not think about it, because the solution
was so elegant. I didn't like the extra list in timerfd when it was
proposed some time ago, but with a rcu based list the list walk it's
less horrible than the original global lock, which was held over the
list iteration.

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2011-05-23 13:59:53 +02:00
Tejun Heo 7e69723fef block: move bd_set_size() above rescan_partitions() in __blkdev_get()
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check.  As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

  sda: detected capacity change from 0 to 146815737856
  sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and remove the warning
messages.

stable: Please apply together with 02e352287a (block: rescan
        partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org

Stable note: 2.6.39 only.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 13:26:07 +02:00
Chris Mason 712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Linus Torvalds caebc160ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: use mark_buffer_dirty to mark btnode or meta data dirty
  nilfs2: always set back pointer to host inode in mapping->host
  nilfs2: get rid of NILFS_I_NILFS
  nilfs2: use list_first_entry
  nilfs2: use empty_aops for gc-inodes
  nilfs2: implement resize ioctl
  nilfs2: add truncation routine of segment usage file
  nilfs2: add routine to move secondary super block
  nilfs2: add ioctl which limits range of segment to be allocated
  nilfs2: zero fill unused portion of super root block
  nilfs2: super root size should change depending on inode size
  nilfs2: get rid of private page allocator
  nilfs2: merge list_del()/list_add_tail() to list_move_tail()
2011-05-22 22:43:01 -07:00
Artem Bityutskiy 56e46742e8 UBIFS: switch to dynamic printks
Switch to debugging using dynamic printk (pr_debug()). There is no good reason
to carry custom debugging prints if there is so cool and powerful generic
dynamic printk infrastructure, see Documentation/dynamic-debug-howto.txt. With
dynamic printks we can switch on/of individual prints, per-file, per-function
and per format messages. This means that instead of doing old-fashioned

echo 1 > /sys/module/ubifs/parameters/debug_msgs

to enable general messages, we can do:

echo 'format "UBIFS DBG gen" +ptlf' > control

to enable general messages and additionally ask the dynamic printk
infrastructure to print process ID, line number and function name. So there is
no reason to keep UBIFS-specific crud if there is more powerful generic thing.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-23 08:22:20 +03:00
Jeff Layton 59ffd84141 cifs: add ignore_pend flag to cifs_call_async
The current code always ignores the max_pending limit. Have it instead
only optionally ignore the pending limit. For CIFSSMBEcho, we need to
ignore it to make sure they always can go out. For async reads, writes
and potentially other calls, we need to respect it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-23 02:59:16 +00:00
Jeff Layton fcc31cb6f1 cifs: make cifs_send_async take a kvec array
We'll need this for async writes, so convert the call to take a kvec
array. CIFSSMBEcho is changed to put a kvec on the stack and pass
in the SMB buffer using that.

Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-23 02:58:26 +00:00
Jeff Layton 2c8f981d93 cifs: consolidate SendReceive response checks
Further consolidate the SendReceive code by moving the checks run over
the packet into a separate function that all the SendReceive variants
can call.

We can also eliminate the check for a receive_len that's too big or too
small. cifs_demultiplex_thread already checks that and disconnects the
socket if that occurs, while setting the midStatus to MALFORMED. It'll
never call this code if that's the case.

Finally do a little cleanup. Use "goto out" on errors so that the flow
of code in the normal case is more evident. Also switch the logErr
variable in map_smb_to_linux_error to a bool.

Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-23 02:58:24 +00:00
Tao Ma 28e35e42fb jbd2: Fix the wrong calculation of t_max_wait in update_t_max_wait
t_max_wait is added in commit 8e85fb3f to indicate how long we
were waiting for new transaction to start. In commit 6d0bf005,
it is moved to another function named update_t_max_wait to
avoid a build warning. But the wrong thing is that the original
'ts' is initialized in the start of function start_this_handle
and we can calculate t_max_wait in the right way. while with
this change, ts is initialized within the function and t_max_wait
can never be calculated right.

This patch moves the initialization of ts to the original beginning
of start_this_handle and pass it to function update_t_max_wait so
that it can be calculated right and the build warning is avoided also.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-22 21:45:26 -04:00
Eric Gouriou f6d2f6b327 ext4: fix unbalanced up_write() in ext4_ext_truncate() error path
ext4_ext_truncate() should not invoke up_write(&EXT4_I(inode)->i_data_sem)
when ext4_orphan_add() returns an error, as it hasn't performed a
down_write() yet. This trivial patch fixes this by moving the up_write()
invocation above the out_stop label.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:33:00 -04:00
Vivek Haldar 77f4135f2a ext4: count hits/misses of extent cache and expose in sysfs
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:24:16 -04:00
Yongqiang Yang 93917411be ext4: make ext4_split_extent() handle error correctly
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-22 20:49:12 -04:00
Theodore Ts'o 373cd5c53d ext4: don't show mount options in /proc/mounts if there is no journal
After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 16:12:35 -04:00
Heiko Carstens 9ce6e0be06 fs: add missing prefetch.h include
Fixes this build error on s390 and probably other archs as well:

  fs/inode.c: In function 'new_inode':
  fs/inode.c:894:2: error: implicit declaration of function 'spin_lock_prefetch'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[ Happens on architectures that don't define their own prefetch
  functions in <asm/processor.h>, and instead rely on the default
  ones in <linux/prefetch.h>   - Linus]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-22 11:26:02 -07:00
Chris Mason aa2dfb372a Merge branch 'allocator' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:36:34 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason 0d0ca30f18 Btrfs: update the delayed inode code to use the btrfs_ino helper.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:11:22 -04:00
Chris Mason dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Steven Whitehouse 26b06a6958 GFS2: Wait properly when flushing the ail list
The ail flush code has always relied upon log flushing to prevent
it from spinning needlessly. This fixes it to wait on the last
I/O request submitted (we don't need to wait for all of it)
instead of either spinning with io_schedule or sleeping.

As a result cpu usage of gfs2_logd is much reduced with certain
workloads.

Reported-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-05-21 19:21:07 +01:00
Miao Xie 16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason 0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Steven Whitehouse 6d3117b412 GFS2: Wipe directory hash table metadata when deallocating a directory
The deallocation code for directories in GFS2 is largely divided into
two parts. The first part deallocates any directory leaf blocks and
marks the directory as being a regular file when that is complete. The
second stage was identical to deallocating regular files.

Regular files have their data blocks in a different
address space to directories, and thus what would have been normal data
blocks in a regular file (the hash table in a GFS2 directory) were
deallocated correctly. However, a reference to these blocks was left in the
journal (assuming of course that some previous activity had resulted in
those blocks being in the journal or ail list).

This patch uses the i_depth as a test of whether the inode is an
exhash directory (we cannot test the inode type as that has already
been changed to a regular file at this stage in deallocation)

The original issue was reported by Chris Hertel as an issue he encountered
running bonnie++

Reported-by: Christopher R. Hertel <crh@samba.org>
Cc: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2011-05-21 14:05:58 +01:00
Erez Zadok 1a4022f88d VFS: move BUG_ON test for symlink nd->depth after current->link_count test
This solves a serious VFS-level bug in nested_symlink (which was
rewritten from do_follow_link), and follows the order of depth tests
that existed before.

The bug triggers a BUG_ON in fs/namei.c:1381, when running racer with
symlink and rename ops.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-21 00:12:16 -07:00
Timo Warns cae13fe4cc Fix for buffer overflow in ldm_frag_add not sufficient
As Ben Hutchings discovered [1], the patch for CVE-2011-1017 (buffer
overflow in ldm_frag_add) is not sufficient.  The original patch in
commit c340b1d640 ("fs/partitions/ldm.c: fix oops caused by corrupted
partition table") does not consider that, for subsequent fragments,
previously allocated memory is used.

[1] http://lkml.org/lkml/2011/5/6/407

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Timo Warns <warns@pre-sense.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 16:40:36 -07:00
Linus Torvalds 8e7bfcbab3 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] define "_sdata" symbol
  pstore: Fix Kconfig dependencies for apei->pstore
  pstore: fix potential logic issue in pstore read interface
  pstore: fix pstore filesystem mount/remount issue
  pstore: fix one type of return value in pstore
  [IA64] fix build warning in arch/ia64/oprofile/backtrace.c
2011-05-20 13:39:00 -07:00
Linus Torvalds 91444f47b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
  [CIFS] Fix to problem with getattr caused by invalidate simplification patch
  [CIFS] Remove sparse warning
  [CIFS] Update cifs to version 1.72
  cifs: Change key name to cifs.idmap, misc. clean-up
  cifs: Unconditionally copy mount options to superblock info
  cifs: Use kstrndup for cifs_sb->mountdata
  cifs: Simplify handling of submount options in cifs_mount.
  cifs: cifs_parse_mount_options: do not tokenize mount options in-place
  cifs: Add support for mounting Windows 2008 DFS shares
  cifs: Extract DFS referral expansion logic to separate function
  cifs: turn BCC into a static inlined function
  cifs: keep BCC in little-endian format
  cifs: fix some unused variable warnings in id_rb_search
  CIFS: Simplify invalidate part (try #5)
  CIFS: directio read/write cleanups
  consistently use smb_buf_length as be32 for cifs (try 3)
  cifs: Invoke id mapping functions (try #17 repost)
  cifs: Add idmap key and related data structures and functions (try #17 repost)
  CIFS: Add launder_page operation (try #3)
  Introduce smb2 mounts as vers=2
  ...
2011-05-20 13:37:49 -07:00
Linus Torvalds 3ed4c0583d Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits)
  signal: trivial, fix the "timespec declared inside parameter list" warning
  job control: reorganize wait_task_stopped()
  ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping()
  signal: sys_sigprocmask() needs retarget_shared_pending()
  signal: cleanup sys_sigprocmask()
  signal: rename signandsets() to sigandnsets()
  signal: do_sigtimedwait() needs retarget_shared_pending()
  signal: introduce do_sigtimedwait() to factor out compat/native code
  signal: sys_rt_sigtimedwait: simplify the timeout logic
  signal: cleanup sys_rt_sigprocmask()
  x86: signal: sys_rt_sigreturn() should use set_current_blocked()
  x86: signal: handle_signal() should use set_current_blocked()
  signal: sigprocmask() should do retarget_shared_pending()
  signal: sigprocmask: narrow the scope of ->siglock
  signal: retarget_shared_pending: optimize while_each_thread() loop
  signal: retarget_shared_pending: consider shared/unblocked signals only
  signal: introduce retarget_shared_pending()
  ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/
  signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED
  signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending()
  ...
2011-05-20 13:33:21 -07:00
Linus Torvalds 6c1b8d94bc Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (32 commits)
  GFS2: Move all locking inside the inode creation function
  GFS2: Clean up symlink creation
  GFS2: Clean up mkdir
  GFS2: Use UUID field in generic superblock
  GFS2: Rename ops_inode.c to inode.c
  GFS2: Inode.c is empty now, remove it
  GFS2: Move final part of inode.c into super.c
  GFS2: Move most of the remaining inode.c into ops_inode.c
  GFS2: Move gfs2_refresh_inode() and friends into glops.c
  GFS2: Remove gfs2_dinode_print() function
  GFS2: When adding a new dir entry, inc link count if it is a subdir
  GFS2: Make gfs2_dir_del update link count when required
  GFS2: Don't use gfs2_change_nlink in link syscall
  GFS2: Don't use a try lock when promoting to a higher mode
  GFS2: Double check link count under glock
  GFS2: Improve bug trap code in ->releasepage()
  GFS2: Fix ail list traversal
  GFS2: make sure fallocate bytes is a multiple of blksize
  GFS2: Add an AIL writeback tracepoint
  GFS2: Make writeback more responsive to system conditions
  ...
2011-05-20 13:28:45 -07:00
Linus Torvalds 268bb0ce3e sanitize <linux/prefetch.h> usage
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

   grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
   grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-20 12:50:29 -07:00
Jens Axboe 698567f3fa Merge commit 'v2.6.39' into for-2.6.40/core
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.

Also fixes up conflicts in the below files.

Conflicts:
	drivers/block/paride/pcd.c
	drivers/cdrom/viocd.c
	drivers/ide/ide-cd.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:33:15 +02:00
Thomas Gleixner 250f972d85 Merge branch 'timers/urgent' into timers/core
Reason: Get upstream fixes and kfree_rcu which is necessary for a
follow up patch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-05-20 20:08:05 +02:00
Lukas Czerner 1bb933fb1f ext4: fix possible use-after-free in ext4_remove_li_request()
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:29 -04:00
Lukas Czerner 51ce651156 ext4: fix the mount option "init_itable=n" to work as expected for n=0
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition.  The 'n'
should be also properly used on remount.

When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:16 -04:00
Lukas Czerner e1290b3e62 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:51 -04:00
Lukas Czerner 4ed5c033c1 ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.

This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:04 -04:00
Tony Luck 3935bb949f Pull pstore into release branch 2011-05-20 10:34:50 -07:00
Steve French 156ecb2d8b [CIFS] Fix to problem with getattr caused by invalidate simplification patch
Fix to earlier "Simplify invalidate part (try #6)" patch
That patch caused problems with connectathon test 5.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-20 17:00:01 +00:00
Artem Bityutskiy bdc1a1b610 UBIFS: fix kernel-doc comments
This is a minor fix for UBIFS kernel-doc comments - we forgot the "@" symbol
for several 'struct ubifs_debug_info'.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2011-05-20 08:30:13 +03:00
Linus Torvalds 39ab05c8e0 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
  debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
  sysfs: remove "last sysfs file:" line from the oops messages
  drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
  memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
  SYSFS: Fix erroneous comments for sysfs_update_group().
  driver core: remove the driver-model structures from the documentation
  driver core: Add the device driver-model structures to kerneldoc
  Translated Documentation/email-clients.txt
  RAW driver: Remove call to kobject_put().
  reboot: disable usermodehelper to prevent fs access
  efivars: prevent oops on unload when efi is not enabled
  Allow setting of number of raw devices as a module parameter
  Introduce CONFIG_GOOGLE_FIRMWARE
  driver: Google Memory Console
  driver: Google EFI SMI
  x86: Better comments for get_bios_ebda()
  x86: get_bios_ebda_length()
  misc: fix ti-st build issues
  params.c: Use new strtobool function to process boolean inputs
  debugfs: move to new strtobool
  ...

Fix up trivial conflicts in fs/debugfs/file.c due to the same patch
being applied twice, and an unrelated cleanup nearby.
2011-05-19 18:24:11 -07:00
Sage Weil 9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil 6b4a3b517a ceph: remove useless check
rc is only ever 0 or negative in this method.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil da39822c65 ceph: fix broken comparison in readdir loop
Both off and fi->offset are unsigned, so the difference is always >= 0.
Compare them directly instead of the sign of the difference.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
Sage Weil 3540303f87 ceph: fix rare potential cap leak
If we grab new_cap, retake the lock, and find we already have a cap now
for the given mds, release new_cap.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil ae59808301 ceph: use snprintf for dirstat content
We allocate a buffer for rstats if the dirstat option is enabled.  Use
snprintf.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil 1b36698577 libceph: remove unused variable
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:24:17 -07:00
Sage Weil 3b66378034 ceph: take reference on mds request r_unsafe_dir
We put ourselves on an inode list for the parent directory of metadata
operations so that an fsync on the directory will wait for metadata updates
to commit to disk.  We weren't holding a reference to that directory,
however, and under certain workloads (fsstress in this case) the directory
can go away.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:20:07 -07:00
Dave Chinner bf59170a66 xfs: obey minleft values during extent allocation correctly
When allocating an extent that is long enough to consume the
remaining free space in an AG, we need to ensure that the allocation
leaves enough space in the AG for any subsequent bmap btree blocks
that are needed to track the new extent. These have to be allocated
in the same AG as we only reserve enough blocks in an allocation
transaction for modification of the freespace trees in a single AG.

xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
free blocks available for extent and bmbt block allocation, which is
not correct - blocks on the AGFL are there exclusively for the use
of the free space btrees. As a result, when minleft is less than the
number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
the given extent to leave minleft blocks available for bmbt
allocation, and hence we can fail allocation during bmbt record
insertion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:48 -05:00
Dave Chinner 44396476a0 xfs: reset buffer pointers before freeing them
When we free a vmapped buffer, we need to ensure the vmap address
and length we free is the same as when it was allocated. In various
places in the log code we change the memory the buffer is pointing
to before issuing IO, but we never reset the buffer to point back to
it's original memory (or no memory, if that is the case for the
buffer).

As a result, when we free the buffer it points to memory that is
owned by something else and attempts to unmap and free it. Because
the range does not match any known mapped range, it can trigger
BUG_ON() traps in the vmap code, and potentially corrupt the vmap
area tracking.

Fix this by always resetting these buffers to their original state
before freeing them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:45 -05:00
Dave Chinner ee58abdfcc xfs: avoid getting stuck during async inode flushes
When the underlying inode buffer is locked and xfs_sync_inode_attr()
is doing a non-blocking flush, xfs_iflush() can return EAGAIN.  When
this happens, clear the error rather than returning it to
xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk
delaying for a short while and trying again. This can result in
background walks getting stuck on the one AG until inode buffer is
unlocked by some other means.

This behaviour was noticed when analysing event traces followed by
code inspection and verification of the fix via further traces.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:42 -05:00
Dave Chinner e57375153d xfs: fix xfs_itruncate_start tracing
Variables are ordered incorrectly in trace call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:36 -05:00
Dave Chinner 1beb65ad45 xfs: fix duplicate workqueue initialisation
The workqueue initialisation function is called twice when
initialising the XFS subsystem. Remove the second initialisation
call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 12:03:24 -05:00
Joe Perches e69522a8cc xfs: kill off xfs_printk()
xfs_alert_tag() can be defined using xfs_alert(), and thereby avoid
using xfs_printk() altogether.  This is the only remaining use of
xfs_printk(), so changing it this way means xfs_printk() can simply
be eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.can simply be eliminated.can simply be
eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.

Also add format checking to the non-debug inline function xfs_debug.
Miscellaneous function prototype argument alignment.

(Updated to delete the definition of xfs_printk(), which is
no longer used or needed.)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2011-05-19 11:38:09 -05:00
Steve French ceec1e0fae [CIFS] Remove sparse warning
Move extern for cifsConvertToUCS to different header to prevent following warning:

CHECK   fs/cifs/cifs_unicode.c
fs/cifs/cifs_unicode.c:267:1: warning: symbol 'cifsConvertToUCS' was not declared. Should it be static?

Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Steve French 4e64fb33de [CIFS] Update cifs to version 1.72
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Shirish Pargaonkar c4aca0c09f cifs: Change key name to cifs.idmap, misc. clean-up
Change idmap key name from cifs.cifs_idmap to cifs.idmap.
Removed unused structure wksidarr and function match_sid().
Handle errors correctly in function init_cifs().

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:56 +00:00
Sean Finney f14bcf71d1 cifs: Unconditionally copy mount options to superblock info
Previously mount options were copied and updated in the cifs_sb_info
struct only when CONFIG_CIFS_DFS_UPCALL was enabled.  Making this
information generally available allows us to remove a number of ifdefs,
extra function params, and temporary variables.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney 5167f11ec9 cifs: Use kstrndup for cifs_sb->mountdata
A relatively minor nit, but also clarified the "consensus" from the
preceding comments that it is in fact better to try for the kstrdup
early and cleanup while cleaning up is still a simple thing to do.

Reviewed-By: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney 046462abca cifs: Simplify handling of submount options in cifs_mount.
With CONFIG_DFS_UPCALL enabled, maintain the submount options in
cifs_sb->mountdata, simplifying the code just a bit as well as making
corner-case allocation problems less likely.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:55 +00:00
Sean Finney b946845a9d cifs: cifs_parse_mount_options: do not tokenize mount options in-place
To keep strings passed to cifs_parse_mount_options re-usable (which is
needed to clean up the DFS referral handling), tokenize a copy of the
mount options instead.  If values are needed from this tokenized string,
they too must be duplicated (previously, some options were copied and
others duplicated).

Since we are not on the critical path and any cleanup is relatively easy,
the extra memory usage shouldn't be a problem (and it is a bit simpler
than trying to implement something smarter).

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Sean Finney c1508ca236 cifs: Add support for mounting Windows 2008 DFS shares
Windows 2008 CIFS servers do not always return PATH_NOT_COVERED when
attempting to access a DFS share.  Therefore, when checking for remote
shares, unconditionally ask for a DFS referral for the UNC (w/out prepath)
before continuing with previous behavior of attempting to access the UNC +
prepath and checking for PATH_NOT_COVERED.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31092

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Sean Finney dd61394586 cifs: Extract DFS referral expansion logic to separate function
The logic behind the expansion of DFS referrals is now extracted from
cifs_mount into a new static function, expand_dfs_referral.  This will
reduce duplicate code in upcoming commits.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:54 +00:00
Jeff Layton 460458ce8e cifs: turn BCC into a static inlined function
It's a bad idea to have macro functions that reference variables more
than once, as the arguments could have side effects. Turn BCC() into
a static inlined function instead.

While we're at it, make it return a void * to discourage anyone from
dereferencing it as-is.

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:53 +00:00
Jeff Layton 820a803ffa cifs: keep BCC in little-endian format
This is the same patch as originally posted, just with some merge
conflicts fixed up...

Currently, the ByteCount is usually converted to host-endian on receive.
This is confusing however, as we need to keep two sets of routines for
accessing it, and keep track of when to use each routine. Munging
received packets like this also limits when the signature can be
calulated.

Simplify the code by keeping the received ByteCount in little-endian
format. This allows us to eliminate a set of routines for accessing it
and we can now drop the *_le suffixes from the accessor functions since
that's now implied.

While we're at it, switch all of the places that read the ByteCount
directly to use the get_bcc inline which should also clean up some
unaligned accesses.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:53 +00:00
Jeff Layton 0e6e37a7a8 cifs: fix some unused variable warnings in id_rb_search
fs/cifs/cifsacl.c: In function ‘id_rb_search’:
fs/cifs/cifsacl.c:215:19: warning: variable ‘linkto’ set but not used
[-Wunused-but-set-variable]
fs/cifs/cifsacl.c:214:18: warning: variable ‘parent’ set but not used
[-Wunused-but-set-variable]

Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:52 +00:00
Pavel Shilovsky 6feb9891da CIFS: Simplify invalidate part (try #5)
Simplify many places when we call cifs_revalidate/invalidate to make
it do what it exactly needs.

Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:52 +00:00
Pavel Shilovsky 0b81c1c405 CIFS: directio read/write cleanups
Recently introduced strictcache mode brought a new code that can be
efficiently used by directio part. That's let us add vectored operations
and break unnecessary cifs_user_read and cifs_user_write.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Steve French be8e3b0044 consistently use smb_buf_length as be32 for cifs (try 3)
There is one big endian field in the cifs protocol, the RFC1001
       length, which cifs code (unlike in the smb2 code) had been handling as
       u32 until the last possible moment, when it was converted to be32 (its
       native form) before sending on the wire.   To remove the last sparse
       endian warning, and to make this consistent with the smb2
       implementation  (which always treats the fields in their
       native size and endianness), convert all uses of smb_buf_length to
       be32.

       This version incorporates Christoph's comment about
       using be32_add_cpu, and fixes a typo in the second
       version of the patch.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Shirish Pargaonkar 9409ae58e0 cifs: Invoke id mapping functions (try #17 repost)
rb tree search and insertion routines.

A SID which needs to be mapped, is looked up in one of the rb trees
depending on whether SID is either owner or group SID.
If found in the tree, a (mapped) id from that node is assigned to
uid or gid as appropriate.  If unmapped, an upcall is attempted to
map the SID to an id.  If upcall is successful, node is marked as
mapped.  If upcall fails, node stays marked as unmapped and a mapping
is attempted again only after an arbitrary time period has passed.

To map a SID, which can be either a Owner SID or a Group SID, key
description starts with the string "os" or "gs" followed by SID converted
to a string. Without "os" or "gs", cifs.upcall does not know whether
SID needs to be mapped to either an uid or a gid.

Nodes in rb tree have fields to prevent multiple upcalls for
a SID.  Searching, adding, and removing nodes is done within global locks.
Whenever a node is either found or inserted in a tree, a reference
is taken on that node.
Shrinker routine prunes a node if it has expired but does not prune
an expired node if its refcount is not zero (i.e. sid/id of that node
is_being/will_be accessed).
Thus a node, if its SID needs to be mapped by making an upcall,
can safely stay and its fields accessed without shrinker pruning it.
A reference (refcount) is put on the node without holding the spinlock
but a reference is get on the node by holding the spinlock.

Every time an existing mapped node is accessed or mapping is attempted,
its timestamp is updated to prevent it from getting erased or a
to prevent multiple unnecessary repeat mapping retries respectively.

For now, cifs.upcall is only used to map a SID to an id (uid or gid) but
it would be used to obtain an SID for an id.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Shirish Pargaonkar 4d79dba0e0 cifs: Add idmap key and related data structures and functions (try #17 repost)
Define (global) data structures to store ids, uids and gids, to which a
SID maps.  There are two separate trees, one for SID/uid and another one
for SID/gid.

A new type of key, cifs_idmap_key_type, is used.

Keys are instantiated and searched using credential of the root by
overriding and restoring the credentials of the caller requesting the key.

Id mapping functions are invoked under config option of cifs acl.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:51 +00:00
Pavel Shilovsky 9ad1506b42 CIFS: Add launder_page operation (try #3)
Add this let us drop filemap_write_and_wait from cifs_invalidate_mapping
and simplify the code to properly process invalidate logic.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Steve French 1cb06d0b50 Introduce smb2 mounts as vers=2
As with Linux nfs client, which uses "nfsvers=" or "vers=" to
indicate which protocol to use for mount, specifying

"vers=smb2" or "vers=2"

will force an smb2 mount. When vers is not specified cifs is used

ie "vers=cifs" or "vers=1"

We can eventually autonegotiate down from smb2 to cifs
when smb2 is stable enough to make it the default, but this
is for the future.  At that time we could also implement a
"maxprotocol" mount option as smbclient and Samba have today,
but that would be premature until smb2 is stable.

Intially the smb2 Kconfig option will depend on "BROKEN"
until the merge is complete, and then be "EXPERIMENTAL"
When it is no longer experimental we can consider changing
the default protocol to attempt first.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Pavel Shilovsky 257fb1f15d CIFS: Use invalidate_inode_pages2 instead of invalidate_remote_inode (try #4)
Use invalidate_inode_pages2 that don't leave pages even if shrink_page_list()
has a temp ref on them. It prevents a data coherency problem when
cifs_invalidate_mapping didn't invalidate pages but the client thinks that a data
from the cache is uptodate according to an oplock level (exclusive or II).

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Jeff Layton fd5707e1b4 cifs: fix comment in validate_t2
The comment about checking the bcc is in the wrong place. Also make it
match kernel coding style.

Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:50 +00:00
Jeff Layton 4358b5678b VFS: trivial: fix comment on s_maxbytes value warning check
I originally intended to remove this warning in 2.6.34, but it's not in
a high performance codepath and might help us to catch bugs later. Let's
keep it, but fix the comment to allay confusion about its removal.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French b73b9a4ba7 [CIFS] Allow to set extended attribute cifs_acl (try #2)
Allow setting cifs_acl on the server.
Pass on to the server the ACL blob generated by an application.
cifs is just a pass-through, it does not monitor or inspect the contents
of the blob, server decides whether to enforce/apply the ACL blob composed
by an application.
If setting of ACL is succeessful, mark the inode for revalidation.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French 43988d7685 [CIFS] Use ecb des kernel crypto APIs instead of
local cifs functions (repost)

Using kernel crypto APIs for DES encryption during LM and NT hash generation
instead of local functions within cifs.
Source file smbdes.c is deleted sans four functions, one of which
uses ecb des functionality provided by kernel crypto APIs.

Remove function SMBOWFencrypt.

Add return codes to various functions such as calc_lanman_hash,
SMBencrypt, and SMBNTencrypt.  Includes fix noticed by Dan Carpenter.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Shirish Pargaonkar 257208736a cifs: cleanup: Rename and remove config flags
Remove config flag CIFS_EXPERIMENTAL.
Do export operations under new config flag CIFS_NFSD_EXPORT

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:49 +00:00
Steve French b34cb85cc2 Introduce SMB2 Kconfig option
SMB2 is the followon to the CIFS (and SMB) protocols
and the default for Windows since Windows Vista, and also
now implemented by various non-Windows servers.  SMB2
is more secure, has various performance advantages, including
larger i/o sizes, flow control, better caching model and more.
SMB2 also resolves some scalability limits in the cifs
protocol and adds many new features while being much
simpler (only a few dozen commands instead of hundreds)
and since the protocol is clearer it is
also more consistently implemented across servers
and thus easier to optimize.

After much discussion with Jeff Layton, Jeremy Allison
and others at Connectathon, we decided to move the smb2
code from a distinct .ko and fstype into distinct
C files that optionally build in cifs.ko.  As a result
the Kconfig gets simpler.

To avoid destabilizing cifs, the smb2 code is going
to be moved into its own experimental CONFIG_CIFS_SMB2 ifdef
as it is merged and rereviewed.  The changes to stable
cifs (builds with the smb2 ifdef off) are expected to be
fairly small.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French 34c87901e1 Shrink stack space usage in cifs_construct_tcon
We were reserving MAX_USERNAME (now 256) on stack for
something which only needs to fit about 24 bytes ie
string krb50x +  printf version of uid

Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Justin P. Mattock fd62cb7e74 fs:cifs:connect.c remove one to many l's in the word.
The patch below removes an extra "l" in the word.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French c52a95545c Don't compile in unused reparse point symlink code
Recent Windows versions now create symlinks more frequently
and they do use this "reparse point" symlink mechanism.  We can of course
do symlinks nicely to Samba and other servers which support the
CIFS Unix Extensions and we can also do SFU symlinks and "client only"
"MF" symlinks optionally, but for recent Windows we currently can not
handle the common "reparse point" symlinks fully, removing the caller
for this. We will need to extend and reenable this "reparse point" worker
code in cifs and fix cifs_symlink to call this.  In the interim this code
has been moved to its own config option so it is not compiled in by default
until cifs_symlink fixed up (and tested) to use this.

CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:48 +00:00
Steve French 0eff0e2677 Remove unused CIFSSMBNotify worker function
The CIFSSMBNotify worker is unused, pending changes to allow it to be called
via inotify, so move it into its own experimental config option so it does
not get built in, until the necessary VFS support is fixed.  It used to
be used in dnotify, but according to Jeff, inotify needs minor changes
before we can reenable this.

CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:47 +00:00
Shirish Pargaonkar 9b6763e0aa cifs: Remove unused inode number while fetching root inode
ino is unused in function cifs_root_iget().

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2011-05-19 14:10:47 +00:00
James Morris 12a5a2621b Merge branch 'master' into next
Conflicts:
	include/linux/capability.h

Manually resolve merge conflict w/ thanks to Stephen Rothwell.

Signed-off-by: James Morris <jmorris@namei.org>
2011-05-19 18:51:57 +10:00
Jonathan Cameron a037439637 debugfs: move to new strtobool
No functional changes requires that we eat errors from strtobool.
If people want to not do this, then it should be fixed at a later date.

V2: Simplification suggested by Rusty Russell removes the need for
additional variable ret.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-05-19 16:55:28 +09:30