This presumes that there is no reason to unhash a dentry if we fail because
it is a mountpoint or the LSM check fails, and that the LSM checks do not
depend on the dentry being unhashed.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should not allow file modification via mmap while the filesystem is
frozen. So block in block_page_mkwrite() while the filesystem is frozen.
We cannot do the blocking wait in __block_page_mkwrite() since e.g. ext4
will want to call that function with transaction started in some cases
and that would deadlock. But we can at least do the non-blocking reliable
check in __block_page_mkwrite() which is the hardest part anyway.
We have to check for frozen filesystem with the page marked dirty and under
page lock with which we then return from ->page_mkwrite(). Only that way we
cannot race with writeback done by freezing code - either we mark the page
dirty after the writeback has started, see freezing in progress and block, or
writeback will wait for our page lock which is released only when the fault is
done and then writeback will writeout and writeprotect the page again.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create __block_page_mkwrite() helper which does all what block_page_mkwrite()
does except that it passes back errors from __block_write_begin /
block_commit_write calls.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This issue was discovered by users of busybox. And the bug is actual for
busybox users, I don't know how it affects others. Apparently, mount is
called with and without MS_SILENT, and this affects mount() behaviour.
But MS_SILENT is only supposed to affect kernel logging verbosity.
The following script was run in an empty test directory:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1
mount -vv --make-rshared mount.shared1
mount -vv --bind mount.shared2 mount.shared2
mount -vv --make-rshared mount.shared2
mount -vv --bind mount.shared2 mount.shared1
mount -vv --bind mount.dir mount.shared2
ls -R mount.dir mount.shared1 mount.shared2
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
mount -vv was used to show the mount() call arguments and result.
Output shows that flag argument has 0x00008000 = MS_SILENT bit:
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
mount.shared2:
a
b
After adding --loud option to remove MS_SILENT bit from just one mount cmd:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1 2>&1
mount -vv --make-rshared mount.shared1 2>&1
mount -vv --bind mount.shared2 mount.shared2 2>&1
mount -vv --loud --make-rshared mount.shared2 2>&1 # <-HERE
mount -vv --bind mount.shared2 mount.shared1 2>&1
mount -vv --bind mount.dir mount.shared2 2>&1
ls -R mount.dir mount.shared1 mount.shared2 2>&1
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
The result is different now - look closely at mount.shared1 directory listing.
Now it does show files 'a' and 'b':
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x00104000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
a
b
mount.shared2:
a
b
The analysis shows that MS_SILENT flag which is ON by default in any
busybox-> mount operations cames to flags_to_propagation_type function and
causes the error return while is_power_of_2 checking because the function
expects only one bit set. This doesn't allow to do busybox->mount with
any --make-[r]shared, --make-[r]private etc options.
Moreover, the recently added flags_to_propagation_type() function doesn't
allow us to do such operations as --make-[r]private --make-[r]shared etc.
when MS_SILENT is on. The idea or clearing the MS_SILENT flag came from
to Denys Vlasenko.
Signed-off-by: Roman Borisov <ext-roman.borisov@nokia.com>
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 990d6c2d7a ("vfs: Add name to file
handle conversion support") changed EXPORTFS to be a bool.
This was needed for earlier revisions of the original patch, but the actual
commit put the code needing it into its own file that only gets compiled
when FHANDLE is selected which in turn selects EXPORTFS.
So EXPORTFS can be safely compiled as a module when not selecting FHANDLE.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: complete_walk(). Done on successful completion
of walk, drops out of RCU mode, does d_revalidate of final
result if that hadn't been done already.
handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
that one; callers converted to use of complete_walk().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My existing email address may stop working in a month or two, so update
email to one that will continue working.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
Oops, local-mounted of 'ocfs2_fops_no_plocks' is just missing the support
of unwritten_extents/punching-hole due to no func pointer was given correctly
to '.follocate' field.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
During dlm domain shutdown, o2dlm has to free all the lock resources. Ones that
have no locks and references are freed. Ones that have locks and/or references
are migrated to another node.
The first task in migration is finding a target. Currently we scan the lock
resource and find one node that either has a lock or a reference. This is not
very efficient in a parallel umount case as we might end up migrating the
lock resource to a node which itself may have to migrate it to a third node.
The patch scans the dlm->exit_domain_map to ensure the target node is not
leaving the domain. If no valid target node is found, o2dlm does not migrate
the resource but instead waits for the unlock and deref messages that will
allow it to free the resource.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
This patch adds a new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG and ups the dlm
protocol to 1.2.
o2dlm sends this new message in dlm_unregister_domain() to mark the beginning
of the exit domain. This message is sent to all nodes in the domain.
Currently o2dlm has no way of informing other nodes of its impending exit.
This information is useful as the other nodes could disregard the exiting
node in certain operations. For example, in resource migration. If two or
more nodes were umounting in parallel, it would be more efficient if o2dlm
were to choose a non-exiting node to be the new master node rather than an
exiting one.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Choosing TMPFS_XATTR default N was switching off TMPFS_POSIX_ACL,
even if it had been Y in oldconfig; and Linus reports that PulseAudio
goes subtly wrong unless it can use ACLs on /dev/shm.
Make TMPFS_POSIX_ACL select TMPFS_XATTR (and depend upon TMPFS),
and move the TMPFS_POSIX_ACL entry before the TMPFS_XATTR entry,
to avoid asking unnecessary questions then ignoring their answers.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
net: fix get_net_ns_by_fd for !CONFIG_NET_NS
ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
ns: Declare sys_setns in syscalls.h
net: Allow setting the network namespace by fd
ns proc: Add support for the ipc namespace
ns proc: Add support for the uts namespace
ns proc: Add support for the network namespace.
ns: Introduce the setns syscall
ns: proc files for namespace naming policy.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
bonding: documentation and code cleanup for resend_igmp
bonding: prevent deadlock on slave store with alb mode (v3)
net: hold rtnl again in dump callbacks
Add Fujitsu 1000base-SX PCI ID to tg3
bnx2x: protect sequence increment with mutex
sch_sfq: fix peek() implementation
isdn: netjet - blacklist Digium TDM400P
via-velocity: don't annotate MAC registers as packed
xen: netfront: hold RTNL when updating features.
sctp: fix memory leak of the ASCONF queue when free asoc
net: make dev_disable_lro use physical device if passed a vlan dev (v2)
net: move is_vlan_dev into public header file (v2)
bug.h: Fix build with CONFIG_PRINTK disabled.
wireless: fix fatal kernel-doc error + warning in mac80211.h
wireless: fix cfg80211.h new kernel-doc warnings
iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
dst: catch uninitialized metrics
be2net: hash key for rss-config cmd not set
bridge: initialize fake_rtable metrics
net: fix __dst_destroy_metrics_generic()
...
Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c
Make ext4_ext_split() get extents to be moved by calculating in a statement
instead of counting in a loop.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Trivial conversion. Fixup one error handling case calling vmtruncate()
and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and
IS_APPEND files could not be truncated during failed writes. In fact, the
test can be completely removed as upper layers do necessary permission
checks for truncate in do_sys_[f]truncate() and may_open() anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add the ability for CIFS to do an asynchronous write. The kernel will
set the frame up as it would for a "normal" SMBWrite2 request, and use
cifs_call_async to send it. The mid callback will then be configured to
handle the result.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Now that we can handle larger wsizes in writepages, fix up the
negotiation of the wsize to allow for that. find_get_pages only seems to
give out a max of 256 pages at a time, so that gives us a reasonable
default of 1M for the wsize.
If the server however does not support large writes via POSIX
extensions, then we cap the wsize to (128k - PAGE_CACHE_SIZE). That
gives us a size that goes up to the max frame size specified in RFC1001.
Finally, if CAP_LARGE_WRITE_AND_X isn't set, then further cap it to the
largest size allowed by the protocol (USHRT_MAX).
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Have cifs_writepages issue asynchronous writes instead of waiting on
each write call to complete before issuing another. This also allows us
to return more quickly from writepages. It can just send out all of the
I/Os and not wait around for the replies.
In the WB_SYNC_ALL case, if the write completes with a retryable error,
then the completion workqueue job will resend the write.
This also changes the page locking semantics a little bit. Instead of
holding the page lock until the response is received, release it after
doing the send. This will reduce contention for the page lock and should
prevent processes that have the file mmap'ed from being blocked
unnecessarily.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix double kfree() calls on the same pointers and cleanup mount code.
Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: fix cap flush race reentrancy
libceph: subscribe to osdmap when cluster is full
libceph: handle new osdmap down/state change encoding
rbd: handle online resize of underlying rbd image
ceph: avoid inode lookup on nfs fh reconnect
ceph: use LOOKUPINO to make unconnected nfs fh more reliable
rbd: use snprintf for disk->disk_name
rbd: cleanup: make kfree match kmalloc
rbd: warn on update_snaps failure on notify
ceph: check return value for start_request in writepages
ceph: remove useless check
libceph: add missing breaks in addr_set_port
libceph: fix TAG_WAIT case
ceph: fix broken comparison in readdir loop
libceph: fix osdmap timestamp assignment
ceph: fix rare potential cap leak
libceph: use snprintf for unknown addrs
libceph: use snprintf for formatting object name
ceph: use snprintf for dirstat content
libceph: fix uninitialized value when no get_authorizer method is set
...
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock fragments field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time fragment table
structures.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock inodes field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time lookup table
structures.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock no_ids field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time id table
structures.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reverse order of table reading from mostly first to last in placement
order, to last to first. This is to enable extra superblock sanity
checks to be added in later patches.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: update Documentation pointers
net/9p: enable 9p to work in non-default network namespace
net/9p: p9_idpool_get return -1 on error
fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
9p: Small cleanup in <net/9p/9p.h>
9p: remove experimental tag from tested configurations
9p: typo fixes and minor cleanups
net/9p: Change linuxdoc names to match functions.
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
cfq-iosched: free cic_index if cfqd allocation fails
cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
cfq-iosched: reduce bit operations in cfq_choose_req()
cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
block: move bd_set_size() above rescan_partitions() in __blkdev_get()
block: call elv_bio_merged() when merged
cfq-iosched: Make IO merge related stats per cpu
cfq-iosched: Fix a memory leak of per cpu stats for root group
backing-dev: Kill set but not used var in bdi_debug_stats_show()
block: get rid of on-stack plugging debug checks
blk-throttle: Make no throttling rule group processing lockless
blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
blk-throttle: Make dispatch stats per cpu
blk-throttle: Free up a group only after one rcu grace period
blk-throttle: Use helper function to add root throtl group to lists
blk-throttle: Introduce a helper function to fill in device details
blk-throttle: Dynamically allocate root group
blk-cgroup: Allow sleeping while dynamically allocating a group
...
The code in xfs_bmap_del_extent does not correctly decrement the
extent buffer index when deleting a whole extent. Most of the time
this gets caught by checks in xfs_bmapi that work around it and
decrement it manually and thus wasn't noticed so far.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext
with a potentially invalid extent buffer index.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to call xfs_iext_get_ext for the previous extent to get a
valid pointer, and can't just do pointer arithmetics as they might
be in different pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validate the
extent index when moving on to the next index in xfs_bunmapi. Also
remove the old workaround for too large indices that has been
superceeded by the proper fix in xfs_bmap_del_extent.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validate the
extent index when moving on to the next index in xfs_bmapi.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validate the
extent index in the various xfs_bmap_add_extent_* helpers.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The if_lastex field in struct xfs_ifork is only used as a temporary
index during xfs_bmapi and xfs_bunmapi. Instead of using the inode
fork to store it keep it local in the callchain. Fortunately this
is very easy as we already pass a stack copy of it down the whole
chain which can simplify be changed to be passed by reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The XFS_BMAPI_RSVBLOCKS is unused, and as far as I can see has
always been. Remove it to simplify the bmapi implementation and
conserve stack space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We get this spurious warning:
fs/ncpfs/inode.c: In function 'ncp_fill_super':
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
...
It's notabug, but we can easily fix it with a memset().
Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not a
candidate for stack allocation.
Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.
Performing the allocation when numa_maps is opened, and thus before a
reference to the target tasks mm is taken, eliminates a potential
stalemate condition in the oom-killer as originally described by Hugh
Dickins:
... imagine what happens if the system is out of memory, and the mm
we're looking at is selected for killing by the OOM killer: while
we wait in __get_free_page for more memory, no memory is freed
from the selected mm because it cannot reach exit_mmap while we hold
that reference.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
there is no need to export struct proc_maps_private to the world. Move it
to fs/proc/internal.h instead.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.
- Having the show() operation "miles away" from the corresponding
seq_file iteration operations is a maintenance burden.
- The need to export ad hoc info like struct proc_maps_private is
eliminated.
- The implementation of show_numa_map() can be improved in a simple
manner by cooperating with the other seq_file operations (start,
stop, etc) -- something that would be messy to do without this
change.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement generic xattrs for tmpfs filesystems. The Feodra project, while
trying to replace suid apps with file capabilities, realized that tmpfs,
which is used on the build systems, does not support file capabilities and
thus cannot be used to build packages which use file capabilities. Xattrs
are also needed for overlayfs.
The xattr interface is a bit odd. If a filesystem does not implement any
{get,set,list}xattr functions the VFS will call into some random LSM hooks
and the running LSM can then implement some method for handling xattrs.
SELinux for example provides a method to support security.selinux but no
other security.* xattrs.
As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have
xattr handler routines specifically to handle acls. Because of this tmpfs
would loose the VFS/LSM helpers to support the running LSM. To make up
for that tmpfs had stub functions that did nothing but call into the LSM
hooks which implement the helpers.
This new patch does not use the LSM fallback functions and instead just
implements a native get/set/list xattr feature for the full security.* and
trusted.* namespace like a normal filesystem. This means that tmpfs can
now support both security.selinux and security.capability, which was not
previously possible.
The basic implementation is that I attach a:
struct shmem_xattr {
struct list_head list; /* anchored by shmem_inode_info->xattr_list */
char *name;
size_t size;
char value[0];
};
Into the struct shmem_inode_info for each xattr that is set. This
implementation could easily support the user.* namespace as well, except
some care needs to be taken to prevent large amounts of unswappable memory
being allocated for unprivileged users.
[mszeredi@suse.cz: new config option, suport trusted.*, support symlinks]
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change each shrinker's API by consolidating the existing parameters into
shrink_control struct. This will simplify any further features added w/o
touching each file of shrinker.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate the existing parameters to shrink_slab() into a new
shrink_control struct. This is needed later to pass the same struct to
shrinkers.
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Straightforward conversion of i_mmap_lock to a mutex.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh says:
"The only significant loser, I think, would be page reclaim (when
concurrent with truncation): could spin for a long time waiting for
the i_mmap_mutex it expects would soon be dropped? "
Counter points:
- cpu contention makes the spin stop (need_resched())
- zap pages should be freeing pages at a higher rate than reclaim
ever can
I think the simplification of the truncate code is definitely worth it.
Effectively reverts: 2aa15890f3 ("mm: prevent concurrent
unmap_mapping_range() on the same inode") and takes out the code that
caused its problem.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the existing mmu_gather infrastructure.
The direct purpose of these patches was to allow preemptible mmu_gather,
but even without that I think these patches provide an improvement to the
status quo.
The first 9 patches rework the mmu_gather infrastructure. For review
purpose I've split them into generic and per-arch patches with the last of
those a generic cleanup.
The next patch provides generic RCU page-table freeing, and the followup
is a patch converting s390 to use this. I've also got 4 patches from
DaveM lined up (not included in this series) that uses this to implement
gup_fast() for sparc64.
Then there is one patch that extends the generic mmu_gather batching.
After that follow the mm preemptibility patches, these make part of the mm
a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to
mutexes which together with the mmu_gather rework makes mmu_gather
preemptible as well.
Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
This also allows for preemptible mmu_notifiers, something that XPMEM I
think wants.
Furthermore, it removes the new and universially detested unmap_mutex.
This patch:
Remove the first obstacle towards a fully preemptible mmu_gather.
The current scheme assumes mmu_gather is always done with preemption
disabled and uses per-cpu storage for the page batches. Change this to
try and allocate a page for batching and in case of failure, use a small
on-stack array to make some progress.
Preemptible mmu_gather is desired in general and usable once i_mmap_lock
becomes a mutex. Doing it before the mutex conversion saves us from
having to rework the code by moving the mmu_gather bits inside the
pte_lock.
Also avoid flushing the tlb batches from under the pte lock, this is
useful even without the i_mmap_lock conversion as it significantly reduces
pte lock hold times.
[akpm@linux-foundation.org: fix comment tpyo]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.
check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.
Let's clean this up by exporting both functions and make those names
consistent. Let's use expand_{upwards,downwards} because expanding
doesn't always involve stack manipulation (an example is
ia64_do_page_fault which uses expand_upwards for registers backing store
expansion). expand_downwards has to be defined for both
CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
version in the early process initialization phase for growsup
configuration.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dentry fid get clunked via the dput.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The 9p client is currently undergoing regular regresssion and
stress testing as a by-product of the virtfs work. I think its
finally time to take off the experimental tags from the well-tested
code paths.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Currently, an fallocate request of size slightly larger than a power of
2 is turned into two block requests, each a power of 2, with the extra
blocks pre-allocated for future use. When an application calls
fallocate, it already has an idea about how large the file may grow so
there is usually little benefit to reserve extra blocks on the
preallocation list. This reduces disk fragmentation.
Tested: fsstress. Also verified manually that fallocat'ed files are
contiguously laid out with this change (whereas without it they begin at
power-of-2 boundaries, leaving blocks in between). CPU usage of
fallocate is not appreciably higher. In a tight fallocate loop, CPU
usage hovers between 5%-8% with this change, and 5%-7% without it.
Using a simulated file system aging program which the file system to
70%, the percentage of free extents larger than 8MB (as measured by
e2freefrag) increased from 38.8% without this change, to 69.4% with
this change.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds new routines: "ext4_punch_hole" "ext4_ext_punch_hole"
and "ext4_ext_check_cache"
fallocate has been modified to call ext4_punch_hole when the punch hole
flag is passed. At the moment, we only support punching holes in
extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole
routine.
The ext4_ext_punch_hole routine first completes all outstanding writes
with the associated pages, and then releases them. The unblock
aligned data is zeroed, and all blocks in between are punched out.
The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
except it accepts a ext4_ext_cache parameter instead of a ext4_extent
parameter. This routine is used by ext4_ext_punch_hole to check and
see if a block in a hole that has been cached. The ext4_ext_cache
parameter is necessary because the members ext4_extent structure are
not large enough to hold a 32 bit value. The existing
ext4_ext_in_cache routine has become a wrapper to this new function.
[ext4 punch hole patch series 5/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch adds a new flag to ext4_map_blocks() that specifies the
given range of blocks should be punched out. Extents are first
converted to uninitialized extents before they are punched
out. Because punching a hole may require that the extent be split, it
is possible that the splitting may need more blocks than are
available. To deal with this, use of reserved blocks are enabled to
allow the split to proceed.
The routine then returns the number of blocks successfully
punched out.
[ext4 punch hole patch series 4/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch modifies the truncate routines to support hole punching
Below is a brief summary of the patches changes:
- Added end param to ext_ext4_rm_leaf
This function has been modified to accept an end parameter
which enables it to punch holes in leafs instead of just
truncating them.
- Implemented the "remove head" case in the ext_remove_blocks routine
This routine is used by ext_ext4_rm_leaf to remove the tail
of an extent during a truncate. The new ext_ext4_rm_leaf
routine will now also use it to remove the head of an extent in the
case that the hole covers a region of blocks at the beginning
of an extent.
- Added "end" param to ext4_ext_remove_space routine
This function has been modified to accept a stop parameter, which
is passed through to ext4_ext_rm_leaf.
[ext4 punch hole patch series 3/5 v6]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch modifies the existing ext4_block_truncate_page() function
which was used by the truncate code path, and which zeroes out block
unaligned data, by adding a new length parameter, and renames it to
ext4_block_zero_page_rage(). This function can now be used to zero out the
head of a block, the tail of a block, or the middle
of a block.
The ext4_block_truncate_page() function is now a wrapper to
ext4_block_zero_page_range().
[ext4 punch hole patch series 2/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks. This will allow a
punch hole to proceed even if the disk is full. Punching a hole may
require additional blocks to first split the extents.
Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:
ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks
[ext4 punch hole patch series 1/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
We're going to support partial extent moving, which may split entire extent
movement into pieces to compromise the insuffice allocations, it eases the
'ENSPC' pain and makes the whole moving much less likely to fail, the downside
is it may make the fs even more fragmented before moving, just let the userspace
make a trade-off here.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
the basic logic of moving extents for a file is pretty like punching-hole
sequence, walk the extents within the range as user specified, calculating
an appropriate len to defrag/move, then let ocfs2_defrag/move_extent() to
do the actual moving.
This func ends up setting 'OCFS2_MOVE_EXT_FL_COMPLETE' to userpace if operation
gets done successfully.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The helper is to calculate the defrag length in one run according to a threshold,
it will proceed doing defragmentation until the threshold was meet, and skip a
LARGE extent if any.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
ocfs2_move_extent() logic will validate the goal_offset_in_block,
where extents to be moved, what's more, it also compromises a bit
to probe the appropriate region around given goal_offset when the
original goal is not able to fit the movement.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Before doing the movement of extents, we'd better probe the alloc group from
'goal_blk' for searching a contiguous region to fit the wanted movement, we
even will have a best-effort try by compromising to a threshold around the
given goal.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
First best-effort attempt to validate and adjust the goal (physical address in
block), while it can't guarantee later operation can succeed all the time since
global_bitmap may change a bit over time.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
This function tries locate the right alloc group, where a given physical block
resides, it returns the caller a buffer_head of victim group descriptor, and also
the offset of block in this group, by passing the block number.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
It's a relatively complete function to accomplish defragmentation for entire
or partial extent, one journal handle was kept during the operation, it was
logically doing one more thing than ocfs2_move_extent() acutally, yes, it's
claiming the new clusters itself;-)
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The moving range of __ocfs2_move_extent() was within one extent always, it
consists following parts:
1. Duplicates the clusters in pages to new_blkoffset, where extent to be moved.
2. Split the original extent with new extent, coalecse the nearby extents if possible.
3. Append old clusters to truncate log, or decrease_refcount if the extent was refcounted.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
ocfs2_lock_allocators_move_extents() was like the common ocfs2_lock_allocators(),
to lock metadata and data alloctors during extents moving, reserve appropriate
metadata blocks and data clusters, also performa a best- effort to calculate the
credits for journal transaction in one run of movement.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The original goal of commonizing these funcs is to benefit defraging/extent_moving
codes in the future, based on the fact that reflink and defragmentation having
the same Copy-On-Wrtie mechanism.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
This new code is a bit more complicated than former ones, the goal is to
show user all statistics required to take a deep insight into filesystem
on how the disk is being fragmentaed.
The goal is achieved by scaning global bitmap from (cluster)group to group
to figure out following factors in the filesystem:
- How many free chunks in a fixed size as user requested.
- How many real free chunks in all size.
- Min/Max/Avg size(in) clusters of free chunks.
- How do free chunks distribute(in size) in terms of a histogram,
just like following:
---------------------------------------------------------
Extent Size Range : Free extents Free Clusters Percent
32K... 64K- : 1 1 0.00%
1M... 2M- : 9 288 0.03%
8M... 16M- : 2 831 0.09%
32M... 64M- : 1 2047 0.23%
128M... 256M- : 1 8191 0.92%
256M... 512M- : 2 21706 2.43%
512M... 1024M- : 27 858623 96.29%
---------------------------------------------------------
Userspace ioctl() call eventually gets the above info returned by passing
a 'struct ocfs2_info_freefrag' with the chunk_size being specified first.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The new code is dedicated to calculate free inodes number of all inode_allocs,
then return the info to userpace in terms of an array.
Specially, flag 'OCFS2_INFO_FL_NON_COHERENT', manipulated by '--cluster-coherent'
from userspace, is now going to be involved. setting the flag on means no cluster
coherency considered, usually, userspace tools choose none-coherency strategy by
default for the sake of performace.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Replace direct call to kmalloc for a potentially large, contiguous
buffer allocation with one to mtd_kmalloc_up_to which helps ensure the
operation can succeed under low-memory, highly- fragmented situations
albeit somewhat more slowly.
Signed-off-by: Grant Erickson <marathon96@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
I am working on patch to add quota as a built-in feature for ext4
filesystem. The implementation is based on the design given at
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
File: "/mnt/mp1/"
ID: e175ccb83a872efe Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 258022 Free: -15 Available: -13122
Inodes: Total: 65536 Free: 63029
f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.
CASE 1:
ext4_da_writepages
mpage_da_map_and_submit
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_diskspace_used
percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
<--- interrupt statfs systemcall --->
ext4_da_update_reserve_space
percpu_counter_sub(&sbi->s_dirtyblocks_counter,
used + ei->i_allocated_meta_blocks);
CASE 2:
ext4_write_begin
__block_write_begin
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_diskspace_used
percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
<--- interrupt statfs systemcall --->
percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);
To avoid the issue, this patch ensures that f_bfree is non-negative.
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do right thing is really simple, so lets do
that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently we are loading buddy ext4_mb_load_buddy() for every block
group we are going through in ext4_trim_fs() in many cases just to find
out that there is not enough space to be bothered with. As Amir Goldstein
suggested we can use bb_free information directly from ext4_group_info.
This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather
get the ext4_group_info via ext4_get_group_info() and use the bb_free
information directly from that. This avoids unnecessary call to load
buddy in the case the group does not have enough free space to trim.
Loading buddy is now moved to ext4_trim_all_free().
Tested by me with xfstests 251.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: Fix comment to match the code in journal_start()
jbd/jbd2: remove obsolete summarise_journal_usage.
jbd: Fix forever sleeping process in do_get_write_access()
ext2: fix error msg when mounting fs with too-large blocksize
jbd: fix fsync() tid wraparound bug
ext3: Fix fs corruption when make_indexed_dir() fails
ext3: Fix lock inversion in ext3_symlink()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits)
TOMOYO: Fix wrong domainname validation.
SELINUX: add /sys/fs/selinux mount point to put selinuxfs
CRED: Fix load_flat_shared_library() to initialise bprm correctly
SELinux: introduce path_has_perm
flex_array: allow 0 length elements
flex_arrays: allow zero length flex arrays
flex_array: flex_array_prealloc takes a number of elements, not an end
SELinux: pass last path component in may_create
SELinux: put name based create rules in a hashtable
SELinux: generic hashtab entry counter
SELinux: calculate and print hashtab stats with a generic function
SELinux: skip filename trans rules if ttype does not match parent dir
SELinux: rename filename_compute_type argument to *type instead of *con
SELinux: fix comment to state filename_compute_type takes an objname not a qstr
SMACK: smack_file_lock can use the struct path
LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH
LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE
SELINUX: Make selinux cache VFS RCU walks safe
SECURITY: Move exec_permission RCU checks into security modules
SELinux: security_read_policy should take a size_t not ssize_t
...
In e9964c10 we change cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure. It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.
Instead, move inodes with dirty to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant). This is cleaner and more
robust.
Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
UBIFS: switch to dynamic printks
UBIFS: fix kernel-doc comments
UBIFS: fix extremely rare mount failure
UBIFS: simplify LEB recovery function further
UBIFS: always cleanup the recovered LEB
UBIFS: clean up LEB recovery function
UBIFS: fix-up free space on mount if flag is set
UBIFS: add the fixup function
UBIFS: add a superblock flag for free space fix-up
UBIFS: share the next_log_lnum helper
UBIFS: expect corruption only in last journal head LEBs
UBIFS: synchronize write-buffer before switching to the next bud
UBIFS: remove BUG statement
UBIFS: change bud replay function conventions
UBIFS: substitute the replay tree with a replay list
UBIFS: simplify replay
UBIFS: store free and dirty space in the bud replay entry
UBIFS: remove unnecessary stack variable
UBIFS: double check that buds are replied in order
UBIFS: make 2 functions static
...
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused. Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.
Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks. In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard. We could
fix this by adding another flag and keeping these block in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL. Given that this would cause
significant overhead I don't think it's worthwile for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.
The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard. Note that we don't bother
supporting discard for the non-delaylog mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
jbd2_log_start_commit() returns 1 only when we really start a
transaction. But we also need to wait for a transaction when the
commit is already running. Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).
Also we have to be more careful with sending of a barrier because when
transaction is being committed in parallel to ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for fsync (note that not every
data writeout needs to trigger metadata changes thus commit of some
metadata changes can be running while other data is still written
out). So use jbd2_will_send_data_barrier() helper to detect the common
cases when we can be sure barrier will be issued by the commit code
and issue the barrier ourselves in the remaining cases.
Reported-by: Edward Goggin <egoggin@vmware.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device. The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit. In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).
Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
To get delayed-extent information, ext4_ext_fiemap_cb() looks up
pagecache, it thus collects information starting from a page's
head block.
If blocksize < pagesize, the beginning blocks of a page may lies
before the request range. So ext4_ext_fiemap_cb() should proceed
ignoring them, because they has been handled before. If no mapped
buffer in the range is found in the 1st page, we need to look up
the 2nd page, otherwise delayed-extents after a hole will be ignored.
Without this patch, xfstests 225 will hung on ext4 with 1K block.
Reported-by: Amir Goldstein <amir73il@users.sourceforge.net>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change function param_set_dlmfs_capabilities from 'extern' to 'static' since
function param_get_dlmfs_capabilities is also 'static'.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch cleans up the gunk added by commit 388c4bcb4e.
dlm_is_lockres_migrateable() now returns 1 if lockresource is deemed
migrateable and 0 if not.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Add ocfs2_trim_fs to support trimming freed clusters in the
volume. A range will be given and all the freed clusters greater
than minlen will be discarded to the block layer.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
fat: Fix statfs->f_namelen
fat: Replace all printk with fat_msg()
fat: Add fat_msg() function for preformated FAT messages
fat: Convert fat_fs_error to use %pV
fat: Fix possible null deref in fat_cache_add()
fat: use new setup() for ->dir_ops too
Minor revision to the last version of this patch -- the only difference
is the fix to the cFYI statement in cifs_reconnect.
Holding the spinlock while we call this function means that it can't
sleep, which really limits what it can do. Taking it out from under
the spinlock also means less contention for this global lock.
Change the semantics such that the Global_MidLock is not held when
the callback is called. To do this requires that we take extra care
not to have sync_mid_result remove the mid from the list when the
mid is in a state where that has already happened. This prevents
list corruption when the mid is sitting on a private list for
reconnect or when cifsd is coming down.
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reorganize code to get mount option at first and when get a superblock.
This lets us use shared superblock model further for equal mounts.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
journal_start returns an ERR_PTR() value rather than NULL on failure.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: obey minleft values during extent allocation correctly
xfs: reset buffer pointers before freeing them
xfs: avoid getting stuck during async inode flushes
xfs: fix xfs_itruncate_start tracing
xfs: fix duplicate workqueue initialisation
xfs: kill off xfs_printk()
xfs: fix race condition in AIL push trigger
xfs: make AIL target updates and compares 32bit safe.
xfs: always push the AIL to the target
xfs: exit AIL push work correctly when AIL is empty
xfs: ensure reclaim cursor is reset correctly at end of AG
xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE
xfs: fix compiler warning in xfs_trace.h
xfs: cleanup duplicate initializations
xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy
xfs: exact busy extent tracking
xfs: do not immediately reuse busy extent ranges
xfs: optimize AGFL refills
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
tty/serial: Fix break handling for PORT_TEGRA
tty/serial: Add explicit PORT_TEGRA type
n_tracerouter and n_tracesink ldisc additions.
Intel PTI implementaiton of MIPI 1149.7.
Kernel documentation for the PTI feature.
export kernel call get_task_comm().
tty: Remove to support serial for S5P6442
pch_phub: Support new device ML7223
8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
ASoC: Update cx20442 for TTY API change
pch_uart: Support new device ML7223 IOH
parport: Use request_muxed_region for IT87 probe and lock
tty/serial: add support for Xilinx PS UART
n_gsm: Use print_hex_dump_bytes
drivers/tty/moxa.c: Put correct tty value
TTY: tty_io, annotate locking functions
TTY: serial_core, remove superfluous set_task_state
TTY: serial_core, remove invalid test
Char: moxa, fix locking in moxa_write
...
Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
drivers/tty/serial/Makefile.
I did the hci_ldisc thing as an evil merge, cleaning things up.
In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
before calling vmtruncate(). This caused any allocated but unwritten
blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
flag to be dropped. This was done to make to make sure that
EOFBLOCKS_FL would not be cleared while still leaving blocks past
i_size allocated. This was not necessary, since ext4_truncate()
guarantees that blocks past i_size will be dropped, even in the case
where truncate() has increased i_size before calling ext4_truncate().
So fix this by removing the EOFBLOCKS_FL special case treatment in
ext4_setattr(). In addition, use truncate_setsize() followed by a
call to ext4_truncate() instead of using vmtruncate(). This is more
efficient since it skips the call to inode_newsize_ok(), which has
been checked already by inode_change_ok(). This is also in a win in
the case where EOFBLOCKS_FL is set since it avoids calling
ext4_truncate() twice.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use separate functions for comparison between existing structure
and what we are requesting for to make server, session and tcon
search code easier to use on next superblock match call.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: Reorder clock bases
hrtimers: Avoid touching inactive timer bases
hrtimers: Make struct hrtimer_cpu_base layout less stupid
timerfd: Manage cancelable timers in timerfd
clockevents: Move C3 stop test outside lock
alarmtimer: Drop device refcount after rtc_open()
alarmtimer: Check return value of class_find_device()
timerfd: Allow timers to be cancelled when clock was set
hrtimers: Prepare for cancel on clock was set timers
There's no SMB2 support in the CIFS filesystem driver, so there's no need to
have a config and mount option for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add and use wakeup_pipe_readers() to consolidate duplicated codes.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Drop device_list_mutex for the reader side on clone_fs_devices and
btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device
list is not updated
btrfs_close_extra_devices is the initialized path, we can not add or remove
device at this time, so we can simply drop the mutex safely, like other
initialized function does(add_missing_dev, __find_device, __btrfs_open_devices
...).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On remove device path, it updates device->dev_alloc_list but does not hold
chunk lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices
On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'bh' is forgot to release if no error is detected
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
allowed wait
- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
we trigger BUG_ON() if it's fail
- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
the return value of clear_extent_bit() is ignored by many callers
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@
x = btrfs_alloc_path@p1(...)
... when != btrfs_free_path(x,...)
when != if (...) { ... btrfs_free_path(x,...) ...}
when != x = ra
if(...) { ... when != x = rb
when forall
when != btrfs_free_path(x,...)
\(return <+...x...+>; \| return@p2...; \) }
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
$ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
$ mkfs.btrfs --mixed 2G.img
$ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
$ dd if=/dev/urandom of=10M.file bs=10240 count=1024
$ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done
Up to '200.copy' it goes fast, but when disk fills-up each -ENOSPC
message takes 3 seconds to pop-up _every_ ENOSPC (and in usermode linux
it's even more: 30-60 seconds!). (Maybe, time depends on kernel's timer resolution).
No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In 2008, commit b4f6c45dfb dropped the use
of fs/btrfs/version.sh, but left the script behind. Kill it.
Commit by Jamey Sharp and Josh Triplett.
Signed-off-by: Jamey Sharp <jamey@minilop.net>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
240f62c875 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On lookup we only want to read the inode item, so leave the path spinning. Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If there are duplicate entries in the free space cache, discard the entire cache
and load it the old fashioned way. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups. So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.
Signed-off-by: Josef Bacik <josef@redhat.com>
Our readahead is sort of sloppy, and really isn't always needed. For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat. Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes. This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit. So
instead just let the caching kthread keep going and only yeild if
need_resched(). This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull. In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing. I don't think I've
ever seen this tripped, so kill it. This saves us 2 kmap's per level in our
search. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the
extent buffer while we look for readahead targets. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results. For example, if the range
[0-8192]
is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096]. So instead set range_start = max(cur_start, state->start).
This makes everything come out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up. This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen. This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen. So in order
to fix this we need to split all of the reservation stuf up. So with this patch
we have
1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.
2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.
3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.
Hopefully this will make the ceph problem go away. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We use trans_mutex for lots of things, here's a basic list
1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists
Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following
1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.
I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We currently track trans handles in current->journal_info, but we don't actually
use it. This patch fixes it. This will cover the case where we have multiple
people starting transactions down the call chain. This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return. I tested this with xfstests
and it worked out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
b43: fix comment typo reqest -> request
Haavard Skinnemoen has left Atmel
cris: typo in mach-fs Makefile
Kconfig: fix copy/paste-ism for dell-wmi-aio driver
doc: timers-howto: fix a typo ("unsgined")
perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
treewide: fix a few typos in comments
regulator: change debug statement be consistent with the style of the rest
Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
audit: acquire creds selectively to reduce atomic op overhead
rtlwifi: don't touch with treewide double semicolon removal
treewide: cleanup continuations and remove logging message whitespace
ath9k_hw: don't touch with treewide double semicolon removal
include/linux/leds-regulator.h: fix syntax in example code
tty: fix typo in descripton of tty_termios_encode_baud_rate
xtensa: remove obsolete BKL kernel option from defconfig
m68k: fix comment typo 'occcured'
arch:Kconfig.locks Remove unused config option.
treewide: remove extra semicolons
...
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check. As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.
sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856
This patch restores the original order and remove the warning
messages.
stable: Please apply together with 02e352287a (block: rescan
partitions on invalidated devices on -ENOMEDIA too).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow processes blocked on plock requests to be interrupted
when they are killed. This leaves the problem of cleaning
up the lock state in userspace. This has three parts:
1. Add a flag to unlock operations sent to userspace
indicating the file is being closed. Userspace will
then look for and clear any waiting plock operations that
were abandoned by an interrupted process.
2. Queue an unlock-close operation (like in 1) to clean up
userspace from an interrupted plock request. This is needed
because the vfs will not send a cleanup-unlock if it sees no
locks on the file, which it won't if the interrupted operation
was the only one.
3. Do not use replies from userspace for unlock-close operations
because they are unnecessary (they are just cleaning up for the
process which did not make an unlock call). This also simplifies
the new unlock-close generated from point 2.
Signed-off-by: David Teigland <teigland@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Wait properly when flushing the ail list
GFS2: Wipe directory hash table metadata when deallocating a directory
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.
During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data. Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.
Here I have some test results to show, I use sysbench to do "random write + fsync".
===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run]
===
Sysbench args:
- Number of threads: 1
- Extra file open flags: 0
- 2 files, 4Gb each
- Block size 4Kb
- Number of random requests for random IO: 10000
- Read/Write ratio for combined random IO test: 1.50
- Periodic FSYNC enabled, calling fsync() each 100 requests.
- Calling fsync() at the end of test, Enabled.
- Using synchronous I/O mode
- Doing random write test
Sysbench results:
===
Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total
Read 0b Written 39.062Mb Total transferred 39.062Mb
===
a) without patch: (*SPEED* : 451.01Kb/sec)
112.75 Requests/sec executed
b) with patch: (*SPEED* : 4.7533Mb/sec)
1216.84 Requests/sec executed
PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Peter is concerned about the extra scan of CLOCK_REALTIME_COS in the
timer interrupt. Yes, I did not think about it, because the solution
was so elegant. I didn't like the extra list in timerfd when it was
proposed some time ago, but with a rcu based list the list walk it's
less horrible than the original global lock, which was held over the
list iteration.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
02e352287a (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check. As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.
sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856
This patch restores the original order and remove the warning
messages.
stable: Please apply together with 02e352287a (block: rescan
partitions on invalidated devices on -ENOMEDIA too).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tony Luck <tony.luck@gmail.com>
Tested-by: Tony Luck <tony.luck@gmail.com>
Cc: stable@kernel.org
Stable note: 2.6.39 only.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: use mark_buffer_dirty to mark btnode or meta data dirty
nilfs2: always set back pointer to host inode in mapping->host
nilfs2: get rid of NILFS_I_NILFS
nilfs2: use list_first_entry
nilfs2: use empty_aops for gc-inodes
nilfs2: implement resize ioctl
nilfs2: add truncation routine of segment usage file
nilfs2: add routine to move secondary super block
nilfs2: add ioctl which limits range of segment to be allocated
nilfs2: zero fill unused portion of super root block
nilfs2: super root size should change depending on inode size
nilfs2: get rid of private page allocator
nilfs2: merge list_del()/list_add_tail() to list_move_tail()
Switch to debugging using dynamic printk (pr_debug()). There is no good reason
to carry custom debugging prints if there is so cool and powerful generic
dynamic printk infrastructure, see Documentation/dynamic-debug-howto.txt. With
dynamic printks we can switch on/of individual prints, per-file, per-function
and per format messages. This means that instead of doing old-fashioned
echo 1 > /sys/module/ubifs/parameters/debug_msgs
to enable general messages, we can do:
echo 'format "UBIFS DBG gen" +ptlf' > control
to enable general messages and additionally ask the dynamic printk
infrastructure to print process ID, line number and function name. So there is
no reason to keep UBIFS-specific crud if there is more powerful generic thing.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current code always ignores the max_pending limit. Have it instead
only optionally ignore the pending limit. For CIFSSMBEcho, we need to
ignore it to make sure they always can go out. For async reads, writes
and potentially other calls, we need to respect it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We'll need this for async writes, so convert the call to take a kvec
array. CIFSSMBEcho is changed to put a kvec on the stack and pass
in the SMB buffer using that.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Further consolidate the SendReceive code by moving the checks run over
the packet into a separate function that all the SendReceive variants
can call.
We can also eliminate the check for a receive_len that's too big or too
small. cifs_demultiplex_thread already checks that and disconnects the
socket if that occurs, while setting the midStatus to MALFORMED. It'll
never call this code if that's the case.
Finally do a little cleanup. Use "goto out" on errors so that the flow
of code in the normal case is more evident. Also switch the logErr
variable in map_smb_to_linux_error to a bool.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
t_max_wait is added in commit 8e85fb3f to indicate how long we
were waiting for new transaction to start. In commit 6d0bf005,
it is moved to another function named update_t_max_wait to
avoid a build warning. But the wrong thing is that the original
'ts' is initialized in the start of function start_this_handle
and we can calculate t_max_wait in the right way. while with
this change, ts is initialized within the function and t_max_wait
can never be calculated right.
This patch moves the initialization of ts to the original beginning
of start_this_handle and pass it to function update_t_max_wait so
that it can be calculated right and the build warning is avoided also.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
ext4_ext_truncate() should not invoke up_write(&EXT4_I(inode)->i_data_sem)
when ext4_orphan_add() returns an error, as it hasn't performed a
down_write() yet. This trivial patch fixes this by moving the up_write()
invocation above the out_stop label.
Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.
Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After creating an ext4 file system without a journal:
# mke2fs -t ext4 -O ^has_journal /dev/sda
# mount -t ext4 /dev/sda /test
the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.
So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.
Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fixes this build error on s390 and probably other archs as well:
fs/inode.c: In function 'new_inode':
fs/inode.c:894:2: error: implicit declaration of function 'spin_lock_prefetch'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[ Happens on architectures that don't define their own prefetch
functions in <asm/processor.h>, and instead rely on the default
ones in <linux/prefetch.h> - Linus]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ail flush code has always relied upon log flushing to prevent
it from spinning needlessly. This fixes it to wait on the last
I/O request submitted (we don't need to wait for all of it)
instead of either spinning with io_schedule or sleeping.
As a result cpu usage of gfs2_logd is much reduced with certain
workloads.
Reported-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The deallocation code for directories in GFS2 is largely divided into
two parts. The first part deallocates any directory leaf blocks and
marks the directory as being a regular file when that is complete. The
second stage was identical to deallocating regular files.
Regular files have their data blocks in a different
address space to directories, and thus what would have been normal data
blocks in a regular file (the hash table in a GFS2 directory) were
deallocated correctly. However, a reference to these blocks was left in the
journal (assuming of course that some previous activity had resulted in
those blocks being in the journal or ail list).
This patch uses the i_depth as a test of whether the inode is an
exhash directory (we cannot test the inode type as that has already
been changed to a regular file at this stage in deallocation)
The original issue was reported by Chris Hertel as an issue he encountered
running bonnie++
Reported-by: Christopher R. Hertel <crh@samba.org>
Cc: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This solves a serious VFS-level bug in nested_symlink (which was
rewritten from do_follow_link), and follows the order of depth tests
that existed before.
The bug triggers a BUG_ON in fs/namei.c:1381, when running racer with
symlink and rename ops.
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Ben Hutchings discovered [1], the patch for CVE-2011-1017 (buffer
overflow in ldm_frag_add) is not sufficient. The original patch in
commit c340b1d640 ("fs/partitions/ldm.c: fix oops caused by corrupted
partition table") does not consider that, for subsequent fragments,
previously allocated memory is used.
[1] http://lkml.org/lkml/2011/5/6/407
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Timo Warns <warns@pre-sense.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] define "_sdata" symbol
pstore: Fix Kconfig dependencies for apei->pstore
pstore: fix potential logic issue in pstore read interface
pstore: fix pstore filesystem mount/remount issue
pstore: fix one type of return value in pstore
[IA64] fix build warning in arch/ia64/oprofile/backtrace.c
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
[CIFS] Fix to problem with getattr caused by invalidate simplification patch
[CIFS] Remove sparse warning
[CIFS] Update cifs to version 1.72
cifs: Change key name to cifs.idmap, misc. clean-up
cifs: Unconditionally copy mount options to superblock info
cifs: Use kstrndup for cifs_sb->mountdata
cifs: Simplify handling of submount options in cifs_mount.
cifs: cifs_parse_mount_options: do not tokenize mount options in-place
cifs: Add support for mounting Windows 2008 DFS shares
cifs: Extract DFS referral expansion logic to separate function
cifs: turn BCC into a static inlined function
cifs: keep BCC in little-endian format
cifs: fix some unused variable warnings in id_rb_search
CIFS: Simplify invalidate part (try #5)
CIFS: directio read/write cleanups
consistently use smb_buf_length as be32 for cifs (try 3)
cifs: Invoke id mapping functions (try #17 repost)
cifs: Add idmap key and related data structures and functions (try #17 repost)
CIFS: Add launder_page operation (try #3)
Introduce smb2 mounts as vers=2
...
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (32 commits)
GFS2: Move all locking inside the inode creation function
GFS2: Clean up symlink creation
GFS2: Clean up mkdir
GFS2: Use UUID field in generic superblock
GFS2: Rename ops_inode.c to inode.c
GFS2: Inode.c is empty now, remove it
GFS2: Move final part of inode.c into super.c
GFS2: Move most of the remaining inode.c into ops_inode.c
GFS2: Move gfs2_refresh_inode() and friends into glops.c
GFS2: Remove gfs2_dinode_print() function
GFS2: When adding a new dir entry, inc link count if it is a subdir
GFS2: Make gfs2_dir_del update link count when required
GFS2: Don't use gfs2_change_nlink in link syscall
GFS2: Don't use a try lock when promoting to a higher mode
GFS2: Double check link count under glock
GFS2: Improve bug trap code in ->releasepage()
GFS2: Fix ail list traversal
GFS2: make sure fallocate bytes is a multiple of blksize
GFS2: Add an AIL writeback tracepoint
GFS2: Make writeback more responsive to system conditions
...
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.
Also fixes up conflicts in the below files.
Conflicts:
drivers/block/paride/pcd.c
drivers/cdrom/viocd.c
drivers/ide/ide-cd.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition. The 'n'
should be also properly used on remount.
When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.
This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.
Addresses-Red-Hat-Bugzilla: #699708
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Fix to earlier "Simplify invalidate part (try #6)" patch
That patch caused problems with connectathon test 5.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is a minor fix for UBIFS kernel-doc comments - we forgot the "@" symbol
for several 'struct ubifs_debug_info'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
sysfs: remove "last sysfs file:" line from the oops messages
drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
SYSFS: Fix erroneous comments for sysfs_update_group().
driver core: remove the driver-model structures from the documentation
driver core: Add the device driver-model structures to kerneldoc
Translated Documentation/email-clients.txt
RAW driver: Remove call to kobject_put().
reboot: disable usermodehelper to prevent fs access
efivars: prevent oops on unload when efi is not enabled
Allow setting of number of raw devices as a module parameter
Introduce CONFIG_GOOGLE_FIRMWARE
driver: Google Memory Console
driver: Google EFI SMI
x86: Better comments for get_bios_ebda()
x86: get_bios_ebda_length()
misc: fix ti-st build issues
params.c: Use new strtobool function to process boolean inputs
debugfs: move to new strtobool
...
Fix up trivial conflicts in fs/debugfs/file.c due to the same patch
being applied twice, and an unrelated cleanup nearby.
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)
Signed-off-by: Sage Weil <sage@newdream.net>
Both off and fi->offset are unsigned, so the difference is always >= 0.
Compare them directly instead of the sign of the difference.
Signed-off-by: Sage Weil <sage@newdream.net>
If we grab new_cap, retake the lock, and find we already have a cap now
for the given mds, release new_cap.
Signed-off-by: Sage Weil <sage@newdream.net>
We put ourselves on an inode list for the parent directory of metadata
operations so that an fsync on the directory will wait for metadata updates
to commit to disk. We weren't holding a reference to that directory,
however, and under certain workloads (fsstress in this case) the directory
can go away.
Signed-off-by: Sage Weil <sage@newdream.net>
When allocating an extent that is long enough to consume the
remaining free space in an AG, we need to ensure that the allocation
leaves enough space in the AG for any subsequent bmap btree blocks
that are needed to track the new extent. These have to be allocated
in the same AG as we only reserve enough blocks in an allocation
transaction for modification of the freespace trees in a single AG.
xfs_alloc_fix_minleft() has been considering blocks on the AGFL as
free blocks available for extent and bmbt block allocation, which is
not correct - blocks on the AGFL are there exclusively for the use
of the free space btrees. As a result, when minleft is less than the
number of blocks on the AGFL, xfs_alloc_fix_minleft() does not trim
the given extent to leave minleft blocks available for bmbt
allocation, and hence we can fail allocation during bmbt record
insertion.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When we free a vmapped buffer, we need to ensure the vmap address
and length we free is the same as when it was allocated. In various
places in the log code we change the memory the buffer is pointing
to before issuing IO, but we never reset the buffer to point back to
it's original memory (or no memory, if that is the case for the
buffer).
As a result, when we free the buffer it points to memory that is
owned by something else and attempts to unmap and free it. Because
the range does not match any known mapped range, it can trigger
BUG_ON() traps in the vmap code, and potentially corrupt the vmap
area tracking.
Fix this by always resetting these buffers to their original state
before freeing them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
When the underlying inode buffer is locked and xfs_sync_inode_attr()
is doing a non-blocking flush, xfs_iflush() can return EAGAIN. When
this happens, clear the error rather than returning it to
xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk
delaying for a short while and trying again. This can result in
background walks getting stuck on the one AG until inode buffer is
unlocked by some other means.
This behaviour was noticed when analysing event traces followed by
code inspection and verification of the fix via further traces.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Variables are ordered incorrectly in trace call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The workqueue initialisation function is called twice when
initialising the XFS subsystem. Remove the second initialisation
call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_alert_tag() can be defined using xfs_alert(), and thereby avoid
using xfs_printk() altogether. This is the only remaining use of
xfs_printk(), so changing it this way means xfs_printk() can simply
be eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.can simply be eliminated.can simply be
eliminated.can simply be eliminated.can simply be eliminated.can
simply be eliminated.
Also add format checking to the non-debug inline function xfs_debug.
Miscellaneous function prototype argument alignment.
(Updated to delete the definition of xfs_printk(), which is
no longer used or needed.)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Move extern for cifsConvertToUCS to different header to prevent following warning:
CHECK fs/cifs/cifs_unicode.c
fs/cifs/cifs_unicode.c:267:1: warning: symbol 'cifsConvertToUCS' was not declared. Should it be static?
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Change idmap key name from cifs.cifs_idmap to cifs.idmap.
Removed unused structure wksidarr and function match_sid().
Handle errors correctly in function init_cifs().
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Previously mount options were copied and updated in the cifs_sb_info
struct only when CONFIG_CIFS_DFS_UPCALL was enabled. Making this
information generally available allows us to remove a number of ifdefs,
extra function params, and temporary variables.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Steve French <sfrench@us.ibm.com>
A relatively minor nit, but also clarified the "consensus" from the
preceding comments that it is in fact better to try for the kstrdup
early and cleanup while cleaning up is still a simple thing to do.
Reviewed-By: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
With CONFIG_DFS_UPCALL enabled, maintain the submount options in
cifs_sb->mountdata, simplifying the code just a bit as well as making
corner-case allocation problems less likely.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
To keep strings passed to cifs_parse_mount_options re-usable (which is
needed to clean up the DFS referral handling), tokenize a copy of the
mount options instead. If values are needed from this tokenized string,
they too must be duplicated (previously, some options were copied and
others duplicated).
Since we are not on the critical path and any cleanup is relatively easy,
the extra memory usage shouldn't be a problem (and it is a bit simpler
than trying to implement something smarter).
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Windows 2008 CIFS servers do not always return PATH_NOT_COVERED when
attempting to access a DFS share. Therefore, when checking for remote
shares, unconditionally ask for a DFS referral for the UNC (w/out prepath)
before continuing with previous behavior of attempting to access the UNC +
prepath and checking for PATH_NOT_COVERED.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31092
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The logic behind the expansion of DFS referrals is now extracted from
cifs_mount into a new static function, expand_dfs_referral. This will
reduce duplicate code in upcoming commits.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Sean Finney <seanius@seanius.net>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's a bad idea to have macro functions that reference variables more
than once, as the arguments could have side effects. Turn BCC() into
a static inlined function instead.
While we're at it, make it return a void * to discourage anyone from
dereferencing it as-is.
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This is the same patch as originally posted, just with some merge
conflicts fixed up...
Currently, the ByteCount is usually converted to host-endian on receive.
This is confusing however, as we need to keep two sets of routines for
accessing it, and keep track of when to use each routine. Munging
received packets like this also limits when the signature can be
calulated.
Simplify the code by keeping the received ByteCount in little-endian
format. This allows us to eliminate a set of routines for accessing it
and we can now drop the *_le suffixes from the accessor functions since
that's now implied.
While we're at it, switch all of the places that read the ByteCount
directly to use the get_bcc inline which should also clean up some
unaligned accesses.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
fs/cifs/cifsacl.c: In function ‘id_rb_search’:
fs/cifs/cifsacl.c:215:19: warning: variable ‘linkto’ set but not used
[-Wunused-but-set-variable]
fs/cifs/cifsacl.c:214:18: warning: variable ‘parent’ set but not used
[-Wunused-but-set-variable]
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Simplify many places when we call cifs_revalidate/invalidate to make
it do what it exactly needs.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recently introduced strictcache mode brought a new code that can be
efficiently used by directio part. That's let us add vectored operations
and break unnecessary cifs_user_read and cifs_user_write.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
There is one big endian field in the cifs protocol, the RFC1001
length, which cifs code (unlike in the smb2 code) had been handling as
u32 until the last possible moment, when it was converted to be32 (its
native form) before sending on the wire. To remove the last sparse
endian warning, and to make this consistent with the smb2
implementation (which always treats the fields in their
native size and endianness), convert all uses of smb_buf_length to
be32.
This version incorporates Christoph's comment about
using be32_add_cpu, and fixes a typo in the second
version of the patch.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
rb tree search and insertion routines.
A SID which needs to be mapped, is looked up in one of the rb trees
depending on whether SID is either owner or group SID.
If found in the tree, a (mapped) id from that node is assigned to
uid or gid as appropriate. If unmapped, an upcall is attempted to
map the SID to an id. If upcall is successful, node is marked as
mapped. If upcall fails, node stays marked as unmapped and a mapping
is attempted again only after an arbitrary time period has passed.
To map a SID, which can be either a Owner SID or a Group SID, key
description starts with the string "os" or "gs" followed by SID converted
to a string. Without "os" or "gs", cifs.upcall does not know whether
SID needs to be mapped to either an uid or a gid.
Nodes in rb tree have fields to prevent multiple upcalls for
a SID. Searching, adding, and removing nodes is done within global locks.
Whenever a node is either found or inserted in a tree, a reference
is taken on that node.
Shrinker routine prunes a node if it has expired but does not prune
an expired node if its refcount is not zero (i.e. sid/id of that node
is_being/will_be accessed).
Thus a node, if its SID needs to be mapped by making an upcall,
can safely stay and its fields accessed without shrinker pruning it.
A reference (refcount) is put on the node without holding the spinlock
but a reference is get on the node by holding the spinlock.
Every time an existing mapped node is accessed or mapping is attempted,
its timestamp is updated to prevent it from getting erased or a
to prevent multiple unnecessary repeat mapping retries respectively.
For now, cifs.upcall is only used to map a SID to an id (uid or gid) but
it would be used to obtain an SID for an id.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Define (global) data structures to store ids, uids and gids, to which a
SID maps. There are two separate trees, one for SID/uid and another one
for SID/gid.
A new type of key, cifs_idmap_key_type, is used.
Keys are instantiated and searched using credential of the root by
overriding and restoring the credentials of the caller requesting the key.
Id mapping functions are invoked under config option of cifs acl.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add this let us drop filemap_write_and_wait from cifs_invalidate_mapping
and simplify the code to properly process invalidate logic.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
As with Linux nfs client, which uses "nfsvers=" or "vers=" to
indicate which protocol to use for mount, specifying
"vers=smb2" or "vers=2"
will force an smb2 mount. When vers is not specified cifs is used
ie "vers=cifs" or "vers=1"
We can eventually autonegotiate down from smb2 to cifs
when smb2 is stable enough to make it the default, but this
is for the future. At that time we could also implement a
"maxprotocol" mount option as smbclient and Samba have today,
but that would be premature until smb2 is stable.
Intially the smb2 Kconfig option will depend on "BROKEN"
until the merge is complete, and then be "EXPERIMENTAL"
When it is no longer experimental we can consider changing
the default protocol to attempt first.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Use invalidate_inode_pages2 that don't leave pages even if shrink_page_list()
has a temp ref on them. It prevents a data coherency problem when
cifs_invalidate_mapping didn't invalidate pages but the client thinks that a data
from the cache is uptodate according to an oplock level (exclusive or II).
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The comment about checking the bcc is in the wrong place. Also make it
match kernel coding style.
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
I originally intended to remove this warning in 2.6.34, but it's not in
a high performance codepath and might help us to catch bugs later. Let's
keep it, but fix the comment to allay confusion about its removal.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Allow setting cifs_acl on the server.
Pass on to the server the ACL blob generated by an application.
cifs is just a pass-through, it does not monitor or inspect the contents
of the blob, server decides whether to enforce/apply the ACL blob composed
by an application.
If setting of ACL is succeessful, mark the inode for revalidation.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
local cifs functions (repost)
Using kernel crypto APIs for DES encryption during LM and NT hash generation
instead of local functions within cifs.
Source file smbdes.c is deleted sans four functions, one of which
uses ecb des functionality provided by kernel crypto APIs.
Remove function SMBOWFencrypt.
Add return codes to various functions such as calc_lanman_hash,
SMBencrypt, and SMBNTencrypt. Includes fix noticed by Dan Carpenter.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Dan Carpenter <error27@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Remove config flag CIFS_EXPERIMENTAL.
Do export operations under new config flag CIFS_NFSD_EXPORT
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
SMB2 is the followon to the CIFS (and SMB) protocols
and the default for Windows since Windows Vista, and also
now implemented by various non-Windows servers. SMB2
is more secure, has various performance advantages, including
larger i/o sizes, flow control, better caching model and more.
SMB2 also resolves some scalability limits in the cifs
protocol and adds many new features while being much
simpler (only a few dozen commands instead of hundreds)
and since the protocol is clearer it is
also more consistently implemented across servers
and thus easier to optimize.
After much discussion with Jeff Layton, Jeremy Allison
and others at Connectathon, we decided to move the smb2
code from a distinct .ko and fstype into distinct
C files that optionally build in cifs.ko. As a result
the Kconfig gets simpler.
To avoid destabilizing cifs, the smb2 code is going
to be moved into its own experimental CONFIG_CIFS_SMB2 ifdef
as it is merged and rereviewed. The changes to stable
cifs (builds with the smb2 ifdef off) are expected to be
fairly small.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We were reserving MAX_USERNAME (now 256) on stack for
something which only needs to fit about 24 bytes ie
string krb50x + printf version of uid
Signed-off-by: Steve French <sfrench@us.ibm.com>
The patch below removes an extra "l" in the word.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Recent Windows versions now create symlinks more frequently
and they do use this "reparse point" symlink mechanism. We can of course
do symlinks nicely to Samba and other servers which support the
CIFS Unix Extensions and we can also do SFU symlinks and "client only"
"MF" symlinks optionally, but for recent Windows we currently can not
handle the common "reparse point" symlinks fully, removing the caller
for this. We will need to extend and reenable this "reparse point" worker
code in cifs and fix cifs_symlink to call this. In the interim this code
has been moved to its own config option so it is not compiled in by default
until cifs_symlink fixed up (and tested) to use this.
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The CIFSSMBNotify worker is unused, pending changes to allow it to be called
via inotify, so move it into its own experimental config option so it does
not get built in, until the necessary VFS support is fixed. It used to
be used in dnotify, but according to Jeff, inotify needs minor changes
before we can reenable this.
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
ino is unused in function cifs_root_iget().
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
No functional changes requires that we eat errors from strtobool.
If people want to not do this, then it should be fixed at a later date.
V2: Simplification suggested by Rusty Russell removes the need for
additional variable ret.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>