Commit 28331a46d8 "Ensure we request the ordinary fileid when doing
readdirplus" changed the meaning of NFS_ATTR_FATTR_FILEID, which used to
be set when FATTR4_WORD1_MOUNTED_ON_FILEID was requested.
Allow nfs_fhget to succeed with only a mounted on fileid when crossing
a mountpoint or a referral.
Ask for the fileid of the absent file system if mounted_on_fileid is not
supported.
Signed-off-by: Andy Adamson <andros@netapp.com>
cc:stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we add something to the global device id cache, we need to bump the
reference count, so that the cache itself holds a reference.
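As a rough userspace illustration of the rule above (not the pNFS code itself), the sketch below uses a one-slot cache and a plain integer refcount. The names devid_node, cache_slot, node_get, node_put and cache_add are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a device id node. */
struct devid_node {
    int refcount;
};

/* One-slot stand-in for the global cache. */
static struct devid_node *cache_slot;

static void node_get(struct devid_node *n) { n->refcount++; }
static void node_put(struct devid_node *n) { n->refcount--; }

/* Publishing the node in the cache takes a reference on the cache's behalf. */
static void cache_add(struct devid_node *n)
{
    node_get(n);
    cache_slot = n;
}
```

With this shape, the creator can drop its own reference and the cached node still has a non-zero count.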
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't support header padding yet, so we are better off ditching it.
Reported-by: Sid Moore <learnmost@gmail.com>
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_update_inode will update isize if there are no queued pages. For pNFS,
layoutcommit is supposed to change the file size on the server, with the same
effect as queued pages. nfs_update_inode may be called when dirty pages have
been written back (nfsi->npages==0) but layoutcommit has not been sent, and it
will then change the client file size to match the server file size. The client
ends up losing what it just wrote back through the pNFS path.
So we should skip updating the client file size if the file needs layoutcommit.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the NLM daemon is killed on the NFS server, we can currently end up
hanging forever on an 'unlock' request, instead of aborting. Basically,
if the rpcbind request fails, or the server keeps returning garbage, we
really want to quit instead of retrying.
Tested-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Unmounting a pnfs filesystem hangs using filelayout and possibly others.
This fixes the use of the rcu protected node by making use of a new 'tmpnode'
for the temporary purge list. Also, the spinlock shouldn't be held when calling
synchronize_rcu().
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
->mknod() should return negative on errors and PTR_ERR() gives
already negative value...
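A small userspace model of the sign problem, assuming stand-in versions of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers. create_node, mknod_buggy and mknod_fixed are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical helper that fails and reports -ENOMEM through the pointer. */
static void *create_node(void) { return ERR_PTR(-ENOMEM); }

/* Buggy pattern: negating PTR_ERR() turns the error into a positive value. */
static int mknod_buggy(void)
{
    void *node = create_node();

    if (IS_ERR(node))
        return -PTR_ERR(node);
    return 0;
}

/* Fixed pattern: PTR_ERR() is already negative, so return it unchanged. */
static int mknod_fixed(void)
{
    void *node = create_node();

    if (IS_ERR(node))
        return PTR_ERR(node);
    return 0;
}
```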
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently processes waiting with poll on cancelable timerfd timers are
not woken up when the timers are canceled. When the system time is set
the clock_was_set() function calls timerfd_clock_was_set() to cancel
and wake up processes waiting on potential cancelable timerfd
timers. However the wake up currently has no effect because in the
case of timerfd_read it is dependent on ctx->ticks not being
0. timerfd_poll also requires ctx->ticks being non zero. As a
consequence processes waiting on cancelable timers only get woken up
when the timers expire. This patch fixes this by incrementing
ctx->ticks before calling wake_up.
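The effect can be sketched with a toy model in which readiness depends only on a ticks counter. The struct and the model_* and clock_was_set_* functions below are assumptions made for illustration, not the timerfd implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of the timerfd context; only the field the text mentions. */
struct timerfd_model {
    uint64_t ticks;   /* pending expirations or cancel events */
    int wakeups;      /* counts wake-up calls, for illustration only */
};

/* poll() reports readable only when ticks is non-zero. */
static bool model_poll_readable(const struct timerfd_model *ctx)
{
    return ctx->ticks != 0;
}

static void model_wake_up(struct timerfd_model *ctx) { ctx->wakeups++; }

/* Before the fix: waiters are woken, re-check ticks, and go back to sleep. */
static void clock_was_set_before(struct timerfd_model *ctx)
{
    model_wake_up(ctx);
}

/* After the fix: bump ticks first so the woken waiter sees a pending event. */
static void clock_was_set_after(struct timerfd_model *ctx)
{
    ctx->ticks++;
    model_wake_up(ctx);
}
```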
Signed-off-by: Max Asbock <masbock@linux.vnet.ibm.com>
Cc: kay.sievers@vrfy.org
Cc: virtuoso@slind.org
Cc: johnstul <johnstul@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1307985512.4710.41.camel@w-amax.beaverton.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We were iterating across stripe boundaries properly, but not moving the
write buffer pointer forward. This caused us to rewrite the same data
after the break. Fix by adjusting the data pointer forward, and
recalculating the io and buffer alignment after the break.
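A minimal sketch of the pointer adjustment, written as a userspace copy loop rather than the ceph write path. striped_write and its stripe_size parameter are hypothetical, and the alignment recalculation mentioned above is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes from data into dst starting at offset pos, never crossing a
 * stripe boundary within one chunk. stripe_size must be non-zero.
 */
static size_t striped_write(char *dst, size_t pos, const char *data,
                            size_t len, size_t stripe_size)
{
    size_t written = 0;

    while (written < len) {
        /* Bytes left before the next stripe boundary. */
        size_t room = stripe_size - (pos % stripe_size);
        size_t chunk = len - written < room ? len - written : room;

        /* Using data + written is the step a buggy loop would miss. */
        memcpy(dst + pos, data + written, chunk);
        pos += chunk;
        written += chunk;
    }
    return written;
}
```

If the source pointer were left at data on every pass, each chunk after the first boundary would repeat the start of the buffer.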
Signed-off-by: Sage Weil <sage@newdream.net>
Long ago (in commit 00e485b0), I added some code to handle share-level
passwords in CIFSTCon. That code ignored the fact that it's legit to
pass in a NULL tcon pointer when connecting to the IPC$ share on the
server.
This wasn't really a problem until recently as we only called CIFSTCon
this way when the server returned -EREMOTE. With the introduction of
commit c1508ca2 however, it gets called this way on every mount, causing
an oops when share-level security is in effect.
Fix this by simply treating a NULL tcon pointer as if user-level
security were in effect. I'm not aware of any servers that protect the
IPC$ share with a specific password anyway. Also, add a comment to the
top of CIFSTCon to ensure that we don't make the same mistake again.
Cc: <stable@kernel.org>
Reported-by: Martijn Uffing <mp3project@sarijopen.student.utwente.nl>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's possible for the following set of events to happen:
cifsd calls cifs_reconnect which reconnects the socket. A userspace
process then calls cifs_negotiate_protocol to handle the NEGOTIATE and
gets a reply. But, while processing the reply, cifsd calls
cifs_reconnect again. Eventually the GlobalMid_Lock is dropped and the
reply from the earlier NEGOTIATE completes and the tcpStatus is set to
CifsGood. cifs_reconnect then goes through and closes the socket and sets the
pointer to zero, but because the status is now CifsGood, the new socket
is not created and cifs_reconnect exits with the socket pointer set to
NULL.
Fix this by only setting the tcpStatus to CifsGood if the tcpStatus is
CifsNeedNegotiate, and by making sure that generic_ip_connect is always
called at least once in cifs_reconnect.
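The first half of the fix can be sketched as a guarded state transition. The enum values are arbitrary, CifsNeedReconnect is assumed here as the state a concurrent reconnect would leave behind, and negotiate_done is a hypothetical name.

```c
#include <assert.h>

/* Status names follow the text where possible; numeric values are arbitrary. */
enum tcp_status { CifsNeedReconnect, CifsNeedNegotiate, CifsGood };

struct server_model {
    enum tcp_status tcpStatus;
};

/* Called when a NEGOTIATE reply has been processed successfully. */
static void negotiate_done(struct server_model *server)
{
    /*
     * Only promote to CifsGood if nothing changed the state in the
     * meantime; any other state is left untouched.
     */
    if (server->tcpStatus == CifsNeedNegotiate)
        server->tcpStatus = CifsGood;
}
```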
Note that this is not a perfect fix for this issue. It's still possible
that the NEGOTIATE reply is handled after the socket has been closed and
reconnected. In that case, the socket state will look correct even though no
NEGOTIATE was performed on it, because the reply was for the old socket. In
that situation, though, the server should just shut down the socket on the
next attempted send, rather than causing the oops that occurs today.
Cc: <stable@kernel.org> # .38.x: fd88ce9: [CIFS] cifs: clarify the meaning of tcpStatus == CifsGood
Reported-and-Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_sb_master_tlink was declared as inline, but without a definition.
Remove the declaration and move the definition up.
Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
jbd2_journal_remove_journal_head() can oops when trying to access
journal_head returned by bh2jh(). This is caused for example by the
following race:
TASK1                                   TASK2
jbd2_journal_commit_transaction()
  ...
  processing t_forget list
    __jbd2_journal_refile_buffer(jh);
    if (!jh->b_transaction) {
      jbd_unlock_bh_state(bh);
                                        jbd2_journal_try_to_free_buffers()
                                          jbd2_journal_grab_journal_head(bh)
                                          jbd_lock_bh_state(bh)
                                          __journal_try_to_free_buffer()
                                          jbd2_journal_put_journal_head(jh)
      jbd2_journal_remove_journal_head(bh);
jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
buffer is not part of any transaction and thus frees journal_head
before TASK1 gets to doing so. Note that even buffer_head can be
released by try_to_free_buffers() after
jbd2_journal_put_journal_head() which adds even larger opportunity for
oops (but I didn't see this happen in reality).
Fix the problem by making transactions hold their own journal_head
reference (in b_jcount). That way we don't have to remove journal_head
explicitly via jbd2_journal_remove_journal_head() and instead just
remove journal_head when b_jcount drops to zero. The result of this is
that [__]jbd2_journal_refile_buffer(),
[__]jbd2_journal_unfile_buffer(), and
__jbd2_journal_remove_checkpoint() can free journal_head which needs
modification of a few callers. Also we have to be careful because once
journal_head is removed, buffer_head might be freed as well. So we
have to get our own buffer_head reference where it matters.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: unwind canceled flock state
ceph: fix ENOENT logic in striped_read
ceph: fix short sync reads from the OSD
ceph: fix sync vs canceled write
ceph: use ihold when we already have an inode ref
6b4517a791 (block: implement bd_claiming and claiming block)
introduced claiming block to support O_EXCL blkdev opens properly.
bd_start_claiming() looks up the part 0 bdev and starts claiming
block. The function assumed that there is only one part 0 bdev and
always used bdget_disk(disk, 0) to look it up; unfortunately, this
isn't true for some drivers (floppy) which use multiple block devices
to denote different operating parameters for the same physical device.
There can be multiple part 0 bdev's for the same device number.
This incorrect assumption caused the wrong bdev to be used during
claiming leading to unbalanced bd_holders as reported in the following
bug.
https://bugzilla.kernel.org/show_bug.cgi?id=28522
This patch updates bd_start_claiming() such that it uses the bdev
specified as argument if its partno is zero.
Note that this means that different bdev's can be used for the same
device and O_EXCL check can be effectively bypassed. It has always
been broken that way and floppy is fortunately on its way out. Leave
that breakage alone.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Tested-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Cc: stable@kernel.org # >= v2.6.36
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
credits isn't a parameter for jbd2_journal_get_write_access and
jbd2_journal_get_undo_access. So remove the corresponding comments.
Acked-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* new refcount in struct net, controlling actual freeing of the memory
* new method in kobj_ns_type_operations (->drop_ns())
* ->current_ns() semantics change - it's supposed to be followed by
corresponding ->drop_ns(). For struct net in case of CONFIG_NET_NS it bumps
the new refcount; net_drop_ns() decrements it and calls net_free() if the
last reference has been dropped. Method renamed to ->grab_current_ns().
* old net_free() callers call net_drop_ns() instead.
* sysfs_exit_ns() is gone, along with a large part of callchain
leading to it; now that the references stored in ->ns[...] stay valid we
do not need to hunt them down and replace them with NULL. That fixes
problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
of sb->s_instances abuse.
Note that struct net *shutdown* logic has not changed - net_cleanup()
is called exactly when it used to be called. The only thing postponed by
having a sysfs instance referring to that struct net is actual freeing of
memory occupied by struct net.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* set ->s_fs_info in set() callback passed to sget()
* allocate the thing and set it up enough for afs_test_super() before
making it visible
* have it freed in ->kill_sb() (current tree simply leaks it)
* have ->put_super() leave ->s_fs_info->volume alone; it's too early for
dropping it; do that from ->kill_sb() after having called kill_anon_super().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allocate ubifs_info in ->mount(), fill it enough for sb_test() and
set ->s_fs_info to it in set() callback passed to sget().
* do *not* free it in ->put_super(); do that in ->kill_sb() after we'd
done kill_anon_super().
* don't free it in ubifs_fill_super() either - deactivate_locked_super()
done by caller when ubifs_fill_super() returns an error will take care
of that sucker.
* get rid of kludge with passing ubi to ubifs_fill_super() in ->s_fs_info;
we only need it in alloc_ubifs_info(), so ubifs_fill_super() will need
only ubifs_info. Which it will find in ->s_fs_info just fine, no need to
reassign anything...
As the result, sb_test() becomes safe to apply to all superblocks that
can be found by sget() (and a kludge with temporary use of ->s_fs_info
to store a pointer to very different structure goes away).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: use join_transaction in btrfs_evict_inode()
Btrfs - use %pU to print fsid
Btrfs: fix extent state leak on failed nodatasum reads
btrfs: fix unlocked access of delalloc_inodes
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
Btrfs: clear current->journal_info on async transaction commit
Btrfs: make sure to recheck for bitmaps in clusters
btrfs: remove unneeded includes from scrub.c
btrfs: reinitialize scrub workers
btrfs: scrub: errors in tree enumeration
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: unlock the trans lock properly
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: fix duplicate checking logic
Btrfs: fix the allocator loop logic
Btrfs: fix bitmap regression
Btrfs: don't commit the transaction if we dont have enough pinned bytes
Btrfs: noinline the cluster searching functions
Btrfs: cache bitmaps when searching for a cluster
The WARN_ON() in start_transaction() was triggered while balancing.
The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checkpoint generation interval of nilfs goes wrong after user has
changed the interval parameter with nilfs-tune tool.
segctord starting. Construction interval = 5 seconds,
CP frequency < 30 seconds
segctord starting. Construction interval = 0 seconds,
CP frequency < 30 seconds
This turned out to be caused by a trivial bug in initialization code
of log writer. This will fix it.
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The nilfs_btree_delete function does not terminate part of the virtual block
addresses when shrinking the last remaining child node into the root
node. The missing address termination causes dead btree node
blocks to persist and chip away at free disk space.
This fixes the leak bug on btree node deletion.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_delete function wrongly terminates virtual block address
of the btree node held by its parent at index 0. When concatenating
the index-0 node with its right sibling node, nilfs_btree_delete
terminates the block address of index-0 node instead of the right
sibling node which should be deleted.
This bug not only wears disk space in the long run, but also causes
file system corruption. This will fix it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Get rid of FIXME comment. Uuids from dmesg are now the same as uuids
given by btrfs-progs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about
BUG extent_state: Objects remaining on kmem_cache_close()
on rmmod.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around it, this may produce a corrupted list head, which is accessed in many
places. The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.
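A userspace sketch of the same idea, with calloc()/free() standing in for the kernel allocator. struct big_args and fill_info are hypothetical; the 1024-byte payload only mirrors the size mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a roughly 1KB argument structure; the layout is made up. */
struct big_args {
    unsigned char payload[1024];
};

/*
 * Allocate the structure on the heap instead of declaring it as a local.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int fill_info(unsigned char *out, size_t out_len)
{
    struct big_args *args = calloc(1, sizeof(*args));
    size_t n = out_len < sizeof(args->payload) ? out_len
                                               : sizeof(args->payload);

    if (!args)
        return -1;
    args->payload[0] = 0x42;
    memcpy(out, args->payload, n);
    free(args);
    return 0;
}
```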
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
builds. This shrinks its size to 128 bytes allowing it to fit into one
fewer cache lines and allows more objects per slab in its kmem_cache.
slabinfo extent_buffer reports :-
before:-
Sizes (bytes) Slabs
----------------------------------
Object : 136 Total : 123
SlabObj: 136 Full : 121
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 30
after :-
Object : 128 Total : 4
SlabObj: 128 Full : 2
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 32
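The padding effect can be reproduced with two toy structs. These are illustrative layouts only, not the real extent_buffer fields, and the padding noted in the comments assumes a typical 64-bit ABI where uint64_t is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields ordered so that the 64-bit member forces padding around it. */
struct padded {
    uint32_t a;   /* followed by 4 bytes of padding on a typical 64-bit ABI */
    uint64_t b;
    uint32_t c;   /* followed by 4 bytes of tail padding */
};

/* Same fields, reordered so the two 32-bit members share one 8-byte slot. */
struct reordered {
    uint64_t b;
    uint32_t a;
    uint32_t c;
};
```

On such an ABI sizeof(struct padded) is 24 while sizeof(struct reordered) is 16, the same kind of saving the slabinfo output above shows for 136 versus 128 bytes.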
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Normally current->journal_info is cleared by commit_transaction. For an
async snap or subvol creation, though, it runs in a work queue. Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace. When the actual commit runs in
the other thread it won't care that its current->journal_info is already
NULL.
Signed-off-by: Sage Weil <sage@newdream.net>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef recently changed the free extent cache to look in
the block group cluster for any bitmaps before trying to
add a new bitmap for the same offset. This avoids BUG_ON()s due to
covering duplicate ranges.
But it didn't go quite far enough. A given free range might span
between one or more bitmaps or free space entries. The code has
looping to cover this, but it doesn't check for clustered bitmaps
every time.
This shuffles our gotos to check for a bitmap in the cluster
for every new bitmap entry we try to add.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.
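A toy analogue of the condition, using a sorted integer array instead of a btree leaf. search_slot and slot_is_valid are hypothetical helpers; they only mimic the "0 on exact match, 1 otherwise" return convention described above.

```c
#include <assert.h>

/*
 * Returns 0 and sets *slot on an exact match, or returns 1 and sets *slot
 * to the insertion position, which may be nritems, i.e. one past the last
 * valid slot.
 */
static int search_slot(const int *keys, int nritems, int key, int *slot)
{
    int lo = 0, hi = nritems;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return (lo < nritems && keys[lo] == key) ? 0 : 1;
}

/* Callers must check the slot before dereferencing it when ret > 0. */
static int slot_is_valid(int slot, int nritems)
{
    return slot >= 0 && slot < nritems;
}
```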
Signed-off-by: Arne Jansen <sensille@gmx.net>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Unconditionally changing the address limit to USER_DS and not restoring
it to its old value in the error path is wrong because it prevents us
using kernel memory on repeated calls to this function. This, in fact,
breaks the fallback of hard coded paths to the init program from being
ever successful if the first candidate fails to load.
With this patch applied switching to USER_DS is delayed until the point
of no return is reached which makes it possible to have a multi-arch
rootfs with one arch specific init binary for each of the (hard coded)
probed paths.
Since the address limit is already set to USER_DS when start_thread()
will be invoked, this redundancy can be safely removed.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In btrfs_wait_for_commit if we came upon a transaction that had committed we
just exited, but that's bad since we are holding the trans_lock. So break
instead so that the lock is dropped. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: trivial: add space in fsc error message
cifs: silence printk when establishing first session on socket
CIFS ACL support needs CONFIG_KEYS, so depend on it
possible memory corruption in cifs_parse_mount_options()
cifs: make CIFS depend on CRYPTO_ECB
cifs: fix the kernel release version in the default security warning message
When merging my code into the integration test the second check for duplicate
entries got screwed up. This patch fixes it by dropping ret2 and just using ret
for the return value, and checking if we got an error before adding the bitmap
to the local list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics. This was because our loop logic was a little
confused. So this is what I did
1) Make the loop variable the ultimate decider on whether we should loop again
instead of checking to see if we had an uncached bg, empty size or empty
cluster.
2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.
3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk. If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.
This kept me from hitting panics while trying to reproduce the other problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In cleaning up the clustering code I accidentally introduced a regression by
adding bitmap entries to the cluster rb tree. The problem is if we've maxed out
the number of bitmaps we can have for the block group we can only add free space
to the bitmaps, but since the bitmap is on the cluster we can't find it and we
try to create another one. This would result in a panic because the total
bitmaps was bigger than the max bitmaps that were allowed. This patch fixes
this by checking to see if we have a cluster, and then looking at the cluster rb
tree to see if it has a bitmap entry and if it does and that space belongs to
that bitmap, go ahead and add it to that bitmap.
I could hit this panic every time with an fs_mark test within a couple of
minutes. With this patch I no longer hit the panic and fs_mark goes to
completion. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When profiling the find cluster code it's hard to tell where we are spending our
time because the bitmap and non-bitmap functions get inlined by the compiler, so
make that not happen. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we are looking for a cluster in a particularly sparse or fragmented block
group, we will do a lot of looping through the free space tree looking for
various things, and if we need to look at bitmaps we will end up doing the whole
dance twice. So instead add the bitmap entries to a temporary list so if we
have to do the bitmap search we can just look through the list of entries we've
found quickly instead of having to loop through the entire tree again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we request a lock and then abort (e.g., ^C), we need to send a matching
unlock request to the MDS to unwind our lock attempt to avoid indefinitely
blocking other clients.
Reported-by: Brian Chrisman <brchrisman@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Getting ENOENT is equivalent to reading 0 bytes. Make that correction
before setting up the hit_stripe and was_short flags.
Fixes the following case:
dd if=/dev/zero of=/mnt/fs_depot/dd3 bs=1 seek=1048576 count=0
dd if=/mnt/fs_depot/dd3 of=/root/ddout1 skip=8 bs=500 count=2 iflag=direct
Reported-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
If we get a short read from the OSD because the object is small, we need to
zero the remainder of the buffer. For O_DIRECT reads, the attempted range
is not trimmed to i_size by the VFS, so we were actually looping
indefinitely.
Fix by trimming by i_size, and then unconditionally zeroing the trailing
range.
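The trim-then-zero step might look roughly like this in userspace. finish_short_read and its parameters are assumptions made for the sketch, not the ceph read path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * got:    bytes the backend actually returned into buf
 * want:   bytes the caller asked for
 * pos:    file offset of the read
 * i_size: file size
 * Returns the number of bytes to report to the caller.
 */
static size_t finish_short_read(char *buf, size_t got, size_t want,
                                size_t pos, size_t i_size)
{
    /* Trim the request so it never extends past end of file. */
    if (pos >= i_size)
        return 0;
    if (want > i_size - pos)
        want = i_size - pos;

    /* Zero whatever the backend did not fill, then report the full range. */
    if (got < want)
        memset(buf + got, 0, want - got);
    return want;
}
```

Because the result is bounded by i_size, a caller that loops until it has read everything cannot keep asking for bytes that will never arrive.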
Reported-by: Jeff Wu <cpwu@tnsoft.com.cn>
Signed-off-by: Sage Weil <sage@newdream.net>
We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
If user space attempts to remove a non-existent file or directory, and
the file system is mounted read-only, return ENOENT instead of EROFS.
Either error code is arguably valid/correct, but ENOENT is a more
specific error message.
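The ordering amounts to checking for existence before checking for a read-only mount. remove_error and its two boolean inputs are hypothetical; the real VFS code works on dentries and mounts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Decide the error for removing a name. */
static int remove_error(bool exists, bool read_only)
{
    /* Check for existence first so a missing name reports ENOENT. */
    if (!exists)
        return -ENOENT;
    if (read_only)
        return -EROFS;
    return 0;
}
```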
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Callers of lmLogOpen() expect it to return -E... on failure exits, which
is what it returns, except for the case of blkdev_get_by_dev() failure.
In that case lmLogOpen() returns the error with the wrong sign...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
When signing is enabled, the first session that's established on a
socket will cause a printk like this to pop:
CIFS VFS: Unexpected SMB signature
This is because the key exchange hasn't happened yet, so the signature
field is bogus. Don't try to check the signature on the socket until the
first session has been established. Also, eliminate the specific check
for SMB_COM_NEGOTIATE since this check covers that case too.
Cc: stable@kernel.org
Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix for commit 4795bb37ef "nfsd: break lease on unlink, link, and rename".
If the LINK operation breaks a delegation, it returns NFS4ERR_NOENT
(which is not a valid error in RFC 5661) instead of NFS4ERR_DELAY.
The return value of nfsd_break_lease() in nfsd_link() must be
converted from host_err to err.
Signed-off-by: Casey Bodley <cbodley@citi.umich.edu>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd V4 support uses crypto interfaces, so select CRYPTO
to fix build errors in 2.6.39:
ERROR: "crypto_destroy_tfm" [fs/nfsd/nfsd.ko] undefined!
ERROR: "crypto_alloc_base" [fs/nfsd/nfsd.ko] undefined!
Reported-by: Wakko Warner <wakko@animx.eu.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit b0b0c0a26e "nfsd: add proc file listing kernel's gss_krb5
enctypes" added an unnecessary dependency of nfsd on the auth_rpcgss
module.
It's a little ad hoc, but since the only piece of information nfsd needs
from rpcsec_gss_krb5 is a single static string, one solution is just to
share it with an include file.
Cc: stable@kernel.org
Reported-by: Michael Guntsche <mike@it-loops.com>
Cc: Kevin Coffman <kwc@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Build fails if CONFIG_KEYS is not selected.
Signed-off-by: Darren Salt <linux@youmustbejoking.demon.co.uk>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
While creating fixed tracepoints for ext3, basically by porting them
from ext4, I found a lot of useless retyping, wrong type usage, useless
variable passing and other inconsistencies in the ext4 fixed tracepoint
code.
This patch cleans up the fixed tracepoint code for ext4 and also simplifies
some of the tracepoints.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently we are not marking the extent as the last one
(FIEMAP_EXTENT_LAST) if there is a hole at the end of the file. This is
because we just do not check for it right now and continue searching for
next extent. But at the point we hit the hole at the end of the file, it
is too late.
This commit adds a check for the allocated block in the subsequent extent and,
if there are no more extents (block = EXT_MAX_BLOCKS), just flags the
current one as the last one.
This behaviour was spotted by chance with xfstest 252, when the
test hung because of a wrong loop condition. However, on other
filesystems (like xfs) it will exit anyway, because we notice the last
extent flag and exit.
With this patch xfstest 252 does not hang anymore. The ext4 fiemap
implementation still reports a bad extent type in some cases; however,
this seems to be a different issue.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
format and filling the tail of the file up to its end. We will hit the BUG_ON
when we write the last block (2^32-1) into the sparse file.
The root cause of the problem lies in the fact that we specifically set
s_maxbytes so that the block at s_maxbytes fits into the on-disk extent format,
which is 32 bits long. However, we are not storing the start and end block
numbers, but rather the start block number and the length in blocks. It means
that in order to cover the extent from 0 to EXT_MAX_BLOCK we need
EXT_MAX_BLOCK+1 to fit into len (because we are counting block 0 as well) -
and it does not.
The only way to fix it without changing the meaning of the struct
ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
by one fs block so we can cover the whole extent we can get by the
on-disk extent format.
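The arithmetic can be checked with a few lines of userspace C. MAX_LBLK, range_len and len_fits_32 are names invented for this sketch, and the 32-bit length field is taken from the description above rather than from the on-disk structure definition.

```c
#include <assert.h>
#include <stdint.h>

/* Largest logical block number representable in 32 bits. */
#define MAX_LBLK UINT32_C(0xffffffff)

/* Number of blocks in the inclusive range [first, last], in 64 bits. */
static uint64_t range_len(uint32_t first, uint32_t last)
{
    return (uint64_t)last - first + 1;
}

/* Does the length of [first, last] fit in a 32-bit length field? */
static int len_fits_32(uint32_t first, uint32_t last)
{
    return range_len(first, last) <= UINT32_MAX;
}
```

Covering blocks 0 through 2^32-1 needs a length of 2^32, which is one more than a 32-bit field can hold; stopping one block earlier makes the length fit.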
Also, in many places EXT_MAX_BLOCK is used as a length instead of the maximum
logical block number that the name suggests; it is all a bit messy. So
this commit renames it to EXT_MAX_BLOCKS and changes its usage in some
places to actually be the maximum number of blocks in the extent.
The bug which this commit fixes can be reproduced as follows:
dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
sync
dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
metadata is not a parameter of ext4_free_blocks() any more.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
JOBCTL_TRAPPING indicates that ptracer is waiting for tracee to
(re)transit into TRACED. task_clear_jobctl_trapping() must be called
when either tracee enters TRACED or the transition is cancelled for
some reason. The former is achieved by explicitly calling
task_clear_jobctl_trapping() in ptrace_stop() and the latter by calling
it at the end of do_signal_stop().
Calling task_clear_jobctl_trapping() at the end of do_signal_stop()
limits the scope in which TRAPPING can be used and is fragile in that seemingly
unrelated changes to tracee's control flow can lead to stuck TRAPPING.
We already have task_clear_jobctl_pending() calls on those cancelling
events to clear JOBCTL_STOP_PENDING. Cancellations can be handled by
making those call sites use JOBCTL_PENDING_MASK instead and updating
task_clear_jobctl_pending() such that task_clear_jobctl_trapping() is
called automatically if no stop/trap is pending.
This patch makes the above changes and removes the fallback
task_clear_jobctl_trapping() call from do_signal_stop().
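A minimal user-space sketch of the described behaviour follows. The
sk_ prefixed names and the STOP_PENDING bit value are hypothetical, and
the real helper also has to wake up the waiting ptracer, which is left
out here.

```c
/* Hypothetical bit values, chosen for this sketch only. */
#define SK_JOBCTL_STOP_PENDING  (1UL << 17)
#define SK_JOBCTL_TRAPPING      (1UL << 22)
#define SK_JOBCTL_PENDING_MASK  SK_JOBCTL_STOP_PENDING

struct sk_task {
    unsigned long jobctl;
};

/* Stand-in for task_clear_jobctl_trapping(); the wakeup is omitted. */
static void sk_clear_trapping(struct sk_task *t)
{
    t->jobctl &= ~SK_JOBCTL_TRAPPING;
}

/*
 * Stand-in for task_clear_jobctl_pending(): clear the requested pending
 * bits, then drop TRAPPING automatically once no stop/trap is pending.
 */
static void sk_clear_pending(struct sk_task *t, unsigned long mask)
{
    mask &= SK_JOBCTL_PENDING_MASK;   /* ignore bits outside the mask */
    t->jobctl &= ~mask;

    if (!(t->jobctl & SK_JOBCTL_PENDING_MASK))
        sk_clear_trapping(t);
}
```

Cancellation sites then only need to call sk_clear_pending() with
SK_JOBCTL_PENDING_MASK; no separate fallback call is required.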
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
This patch introduces JOBCTL_PENDING_MASK and replaces
task_clear_jobctl_stop_pending() with task_clear_jobctl_pending()
which takes an extra @mask argument.
JOBCTL_PENDING_MASK is currently equal to JOBCTL_STOP_PENDING but
future patches will add more bits. recalc_sigpending_tsk() is updated
to use JOBCTL_PENDING_MASK instead.
task_clear_jobctl_pending() takes @mask which is a subset of
JOBCTL_PENDING_MASK and clears the relevant jobctl bits. If
JOBCTL_STOP_PENDING is set, other STOP bits are cleared together. All
task_clear_jobctl_stop_pending() users are updated to call
task_clear_jobctl_pending() with JOBCTL_STOP_PENDING which is
functionally identical to task_clear_jobctl_stop_pending().
This patch doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
signal->group_stop currently hosts mostly group stop related flags;
however, it's gonna be used for wider purposes and the GROUP_STOP_
flag prefix becomes confusing. Rename signal->group_stop to
signal->jobctl and rename all GROUP_STOP_* flags to JOBCTL_*.
Bit position macros JOBCTL_*_BIT are defined and JOBCTL_* flags are
defined in terms of them to allow using bitops later.
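The pattern looks roughly like the sketch below. The sk_ prefixed names
and the STOP_PENDING bit position are hypothetical; sk_test_flag_bit() is
a plain stand-in for a bitop and is not a kernel function.

```c
#include <stdbool.h>

/* Bit positions first (values here are for illustration only). */
#define SK_JOBCTL_STOP_PENDING_BIT  17
#define SK_JOBCTL_TRAPPING_BIT      22

/* Flags are then derived from the bit positions. */
#define SK_JOBCTL_STOP_PENDING  (1UL << SK_JOBCTL_STOP_PENDING_BIT)
#define SK_JOBCTL_TRAPPING      (1UL << SK_JOBCTL_TRAPPING_BIT)

/* Having the bit number available lets bit-indexed helpers be used. */
static bool sk_test_flag_bit(unsigned long word, int nr)
{
    return (word >> nr) & 1UL;
}
```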
While at it, reassign JOBCTL_TRAPPING to bit 22 to better accommodate
future additions.
This doesn't cause any functional change.
-v2: JOBCTL_*_BIT macros added as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
With Linus' tree, today's linux-next build (powerpc ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the returned value from
btrfs_delayed_update_inode is nonzero garbage, inode stat data are not
updated and several call paths may hit a BUG_ON or fail with a strange
error code.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging, scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening just return 0 if refs is 0 (and we're not the tree root since tree
root always has refs of 0). With this fix 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode the struct btrfs_device of a missing dev doesn't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.
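The distinction can be sketched in user space as follows.
sk_strdup_or_null() is a hypothetical stand-in that mimics the
NULL-in/NULL-out behaviour described above, and sk_copy_name() is an
illustrative caller, not the btrfs code.

```c
#include <stdlib.h>
#include <string.h>

/* Like the behaviour described above: duplicating NULL yields NULL. */
static char *sk_strdup_or_null(const char *s)
{
    size_t n;
    char *p;

    if (!s)
        return NULL;
    n = strlen(s) + 1;
    p = malloc(n);
    if (p)
        memcpy(p, s, n);
    return p;
}

/* Returns 0 on success, -1 only on a real allocation failure. */
static int sk_copy_name(const char *src, char **dst)
{
    *dst = sk_strdup_or_null(src);
    if (src && !*dst)
        return -1;  /* source had a name but the copy failed */
    return 0;       /* also covers the "missing device, no name" case */
}
```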
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
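A rough sketch of such a check is shown below. It assumes one 32-bit crc
per cache page and a 4096-byte page, and it ignores any other data that
may share the first page, so the real limit can be lower. All names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* How many per-page crcs fit into a single page. */
static size_t sk_max_cache_pages(void)
{
    return SK_PAGE_SIZE / sizeof(uint32_t);
}

/* A cache file is acceptable only if all of its crcs fit in one page. */
static bool sk_cache_fits(size_t num_pages)
{
    return num_pages <= sk_max_cache_pages();
}
```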
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Use hlist_entry() for io_context.cic_list.first
cfq-iosched: Remove bogus check in queue_fail path
xen/blkback: potential null dereference in error handling
xen/blkback: don't call vbd_size() if bd_disk is NULL
block: blkdev_get() should access ->bd_disk only after success
CFQ: Fix typo and remove unnecessary semicolon
block: remove unwanted semicolons
Revert "block: Remove extra discard_alignment from hd_struct."
nbd: adjust 'max_part' according to part_shift
nbd: limit module parameters to a sane value
nbd: pass MSG_* flags to kernel_recvmsg()
block: improve the bio_add_page() and bio_add_pc_page() descriptions
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When CONFIG_CRYPTO_ECB is not set, trying to mount a CIFS share with NTLM
security resulted in mount failure with the following error:
"CIFS VFS: could not allocate des crypto API"
Seems like a leftover from commit 43988d7.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
CC: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When ntlm security mechanism is used, the message that warns about the upgrade
to ntlmv2 got the kernel release version wrong (Blame it on Linus :). Fix it.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The free space fixup is currently initiated during mount after the call to
ubifs_write_master() which results in a write to PEBs; this has been observed
with the patch 'assert no fixup when writing a node' applied.
Move the free space fixup on mount to before the calls to
ubifs_recover_inl_heads() and ubifs_write_master(). This results in no
assertions with the previously mentioned patch applied.
Artem: tweaked the patch a bit
Signed-off-by: Ben Gardiner <bengardiner@nanometrics>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current 'mount_ubifs()' implementation does not initialize the LPT until the
master node is marked dirty. Move the LPT initialization to before marking
the master node dirty. This is a preparation for the next patch which will move
the free-space-fixup check to before marking the master node dirty, because we
have to fix-up the free space before doing any writes.
Artem: massaged the patch and commit message.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The current free space fixup can result in some writing to the UBI volume
when the space_fixup flag is set.
To catch instances where UBIFS is writing to the NAND while the space_fixup
flag is set, add an assert to ubifs_write_node().
Artem: tweaked the patch, added similar assertion to the write buffer
write path.
Signed-off-by: Ben Gardiner <bengardiner@nanometrics.ca>
Reviewed-by: Matthew L. Creech <mlcreech@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS maintains per-filesystem and global clean znode counters
('c->clean_zn_cnt' and 'ubifs_clean_zn_cnt'). It is important to maintain
correct values there since the shrinker relies on 'ubifs_clean_zn_cnt'.
However, in case of failures during commit the counters were corrupted. E.g.,
if a failure happens in the middle of 'write_index()', then some nodes in the
commit list ('c->cnext') are marked as clean, and some are marked as dirty. And
'ubifs_destroy_tnc_subtree()' does not return the correct count, and we
end up with non-zero 'c->clean_zn_cnt' when unmounting. This means that if we
have 2 file-systems and one of them fails, and we unmount it,
'ubifs_clean_zn_cnt' stays incorrect and confuses the shrinker.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
UBIFS leaks memory on error path in 'ubifs_jnl_update()' in case of write
failure because it forgets to free the 'struct ubifs_dent_node *dent' object.
Although the object is small, the alignment can make it large - e.g., 2KiB
if the min. I/O unit is 2KiB.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Sometimes the VM asks the shrinker to return the number of objects it can shrink,
and we return the ubifs_clean_zn_cnt in that case. However, it is possible
that this counter is negative for a short period of time, due to the way
UBIFS TNC code updates it. And I can observe the following warnings sometimes:
shrink_slab: ubifs_shrinker+0x0/0x2b7 [ubifs] negative objects to delete nr=-8541616642706119788
This patch makes sure UBIFS never returns negative count of objects.
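The idea can be sketched as a simple clamp. sk_shrinker_count() is a
hypothetical helper, and clamping to zero is an assumption of this
sketch; the actual patch may pick a different non-negative fallback.

```c
/*
 * Report the number of shrinkable objects. The counter may be
 * transiently negative, so never pass a negative value on.
 */
static long sk_shrinker_count(long clean_zn_cnt)
{
    return clean_zn_cnt >= 0 ? clean_zn_cnt : 0;
}
```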
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: stable@kernel.org
Unfortunately, the recovery fix d1606a59b6be4ea392eabd40d1250aa1eeb19efb
(UBIFS: fix extremely rare mount failure) broke recovery. That commit makes
UBIFS drop the last min. I/O unit in all journal heads, but this is needed only
for the GC head. And this does not work for non-GC heads. For example,
suppose we have min. I/O units A and B, and A contains a valid node X, which
was fsynced, and then a group of nodes Y which spans the rest of A and B. In
this case we'll drop not only Y, but also X, which is obviously incorrect.
This patch fixes the issue and additionally makes recovery drop the last min.
I/O unit only for the GC head, and leaves things as they have been for ages for
the other heads - this is safer.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Instead of passing "grouped" parameter to 'ubifs_recover_leb()' which tells
whether the nodes are grouped in the LEB to recover, pass the journal head
number and let 'ubifs_recover_leb()' look at the journal head's 'grouped' flag.
This patch is a preparation to a further fix where we'll need to know the
journal head number for other purposes.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Journal heads differ in how UBIFS writes nodes to them. All normal
journal heads receive grouped nodes, while the GC journal head receives
ungrouped nodes. This patch adds a 'grouped' flag to 'struct ubifs_jhead' which
describes this property.
This patch is a preparation to a further recovery fix.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Commit ab51afe05273741f72383529ef488aa1ea598ec6 was a good clean-up, but
it introduced a regression - now UBIFS prints scary error messages during
recovery on all corrupted nodes, even though the corruptions are expected
(due to a power cut). This patch fixes the issue.
Additionally fix a typo in a comment introduced by the same commit.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
d4dc210f69 (block: don't block events on excl write for non-optical
devices) added dereferencing of bdev->bd_disk to test
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; however, bdev->bd_disk can be
%NULL if open failed which can lead to an oops.
Test the flag after testing open was successful, not before.
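The ordering can be sketched as below. The struct layouts, the flag
value and sk_should_block_events() are hypothetical simplifications and
are not the block layer's definitions.

```c
#include <stddef.h>

#define SK_FL_BLOCK_EVENTS 0x1u  /* illustrative flag value */

struct sk_disk {
    unsigned int flags;
};

struct sk_bdev {
    struct sk_disk *bd_disk;  /* may be NULL if the open failed */
};

/*
 * Decide whether events should be blocked. The result of the open is
 * checked first, so bd_disk is only dereferenced after success.
 */
static int sk_should_block_events(int open_result, const struct sk_bdev *bdev)
{
    if (open_result != 0)
        return 0;
    return (bdev->bd_disk->flags & SK_FL_BLOCK_EVENTS) != 0;
}
```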
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: David Miller <davem@davemloft.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The lockdep warning below detects a possible A->B/B->A locking
dependency of mm->mmap_sem and dcookie_mutex. The order in
sync_buffer() is mm->mmap_sem/dcookie_mutex, while in
sys_lookup_dcookie() it is vice versa.
Fixing it in sys_lookup_dcookie() by unlocking dcookie_mutex before
copy_to_user().
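The shape of the fix can be sketched in user space: copy the result into
a local buffer while holding the lock, drop the lock, and only then copy
out to the caller. The toy lock, sk_copy_out() and sk_lookup() are
hypothetical stand-ins and are not the dcookie code.

```c
#include <stddef.h>
#include <string.h>

/* Toy lock that only records whether it is held. */
struct sk_mutex {
    int held;
};

static void sk_mutex_lock(struct sk_mutex *m)   { m->held = 1; }
static void sk_mutex_unlock(struct sk_mutex *m) { m->held = 0; }

static struct sk_mutex sk_cookie_mutex;
static const char sk_cookie_path[] = "/usr/bin/example";

/* Stand-in for copy_to_user(): notes whether the lock was held. */
static int sk_lock_held_during_copy = -1;

static void sk_copy_out(char *dst, const char *src, size_t len)
{
    sk_lock_held_during_copy = sk_cookie_mutex.held;
    memcpy(dst, src, len);
}

/* Returns the number of bytes copied, or 0 if the buffer is too small. */
static size_t sk_lookup(char *user_buf, size_t buf_len)
{
    char tmp[64];
    size_t len;

    sk_mutex_lock(&sk_cookie_mutex);
    len = sizeof(sk_cookie_path);
    if (len > buf_len || len > sizeof(tmp)) {
        sk_mutex_unlock(&sk_cookie_mutex);
        return 0;
    }
    memcpy(tmp, sk_cookie_path, len);   /* done under the lock */
    sk_mutex_unlock(&sk_cookie_mutex);

    sk_copy_out(user_buf, tmp, len);    /* may fault; lock is released */
    return len;
}
```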
oprofiled/4432 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff810b444b>] might_fault+0x53/0xa3
but task is already holding lock:
(dcookie_mutex){+.+.+.}, at: [<ffffffff81124d28>] sys_lookup_dcookie+0x45/0x149
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (dcookie_mutex){+.+.+.}:
[<ffffffff8106557f>] lock_acquire+0xf8/0x11e
[<ffffffff814634f0>] mutex_lock_nested+0x63/0x309
[<ffffffff81124e5c>] get_dcookie+0x30/0x144
[<ffffffffa0000fba>] sync_buffer+0x196/0x3ec [oprofile]
[<ffffffffa0001226>] task_exit_notify+0x16/0x1a [oprofile]
[<ffffffff81467b96>] notifier_call_chain+0x37/0x63
[<ffffffff8105803d>] __blocking_notifier_call_chain+0x50/0x67
[<ffffffff81058068>] blocking_notifier_call_chain+0x14/0x16
[<ffffffff8105a718>] profile_task_exit+0x1a/0x1c
[<ffffffff81039e8f>] do_exit+0x2a/0x6fc
[<ffffffff8103a5e4>] do_group_exit+0x83/0xae
[<ffffffff8103a626>] sys_exit_group+0x17/0x1b
[<ffffffff8146ad4b>] system_call_fastpath+0x16/0x1b
-> #0 (&mm->mmap_sem){++++++}:
[<ffffffff81064dfb>] __lock_acquire+0x1085/0x1711
[<ffffffff8106557f>] lock_acquire+0xf8/0x11e
[<ffffffff810b4478>] might_fault+0x80/0xa3
[<ffffffff81124de7>] sys_lookup_dcookie+0x104/0x149
[<ffffffff8146ad4b>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
1 lock held by oprofiled/4432:
#0: (dcookie_mutex){+.+.+.}, at: [<ffffffff81124d28>] sys_lookup_dcookie+0x45/0x149
stack backtrace:
Pid: 4432, comm: oprofiled Not tainted 2.6.39-00008-ge5a450d #9
Call Trace:
[<ffffffff81063193>] print_circular_bug+0xae/0xbc
[<ffffffff81064dfb>] __lock_acquire+0x1085/0x1711
[<ffffffff8102ef13>] ? get_parent_ip+0x11/0x42
[<ffffffff810b444b>] ? might_fault+0x53/0xa3
[<ffffffff8106557f>] lock_acquire+0xf8/0x11e
[<ffffffff810b444b>] ? might_fault+0x53/0xa3
[<ffffffff810d7d54>] ? path_put+0x22/0x27
[<ffffffff810b4478>] might_fault+0x80/0xa3
[<ffffffff810b444b>] ? might_fault+0x53/0xa3
[<ffffffff81124de7>] sys_lookup_dcookie+0x104/0x149
[<ffffffff8146ad4b>] system_call_fastpath+0x16/0x1b
References: https://bugzilla.kernel.org/show_bug.cgi?id=13809
Cc: <stable@kernel.org> # .27+
Signed-off-by: Robert Richter <robert.richter@amd.com>
The dentry_unhash push-down series missed that shrink_dcache_parent needs to
be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and
allow efficient dentry reclaim.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It was not a good idea to start dereferencing disk->queue from
the fs sysfs strategy for displaying discard alignment. We ran
into a NULL pointer deref first, and after fixing that we sometimes
see invalid disk->queue pointer values.
Since discard is the only one of the bunch actually looking into
the queue, just revert the change.
This reverts commit 23ceb5b771.
Conflicts:
fs/partitions/check.c
Now that ecryptfs_lookup_interpose() is no longer using
ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be
removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to
ecryptfs_header_cache.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
ecryptfs_lookup_interpose() has turned into spaghetti code over the
years. This is an effort to clean it up.
- Shorten overly descriptive variable names such as ecryptfs_dentry
- Simplify gotos and error paths
- Create helper function for reading plaintext i_size from metadata
It also includes an optimization when reading i_size from the metadata.
A complete page-sized kmem_cache_alloc() was being done to read in 16
bytes of metadata. The buffer for that is now statically declared.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Instead of having the calling functions translate the true/false return
code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or
-EINVAL so that the calling functions can just reuse the return code.
Also, rename the function to ecryptfs_validate_marker() to avoid callers
mistakenly thinking that it returns true/false codes.
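A sketch of the resulting calling convention is shown below. The marker
constant and the XOR comparison are assumptions made for illustration,
and sk_read_header() is a hypothetical caller.

```c
#include <errno.h>
#include <stdint.h>

#define SK_MAGIC_MARKER 0x3c81b7f5u  /* assumed value, illustration only */

/* Returns 0 if the marker pair is valid, -EINVAL otherwise. */
static int sk_validate_marker(uint32_t m_1, uint32_t m_2)
{
    if ((m_1 ^ SK_MAGIC_MARKER) == m_2)
        return 0;
    return -EINVAL;
}

/* The caller can propagate the return code without translating it. */
static int sk_read_header(uint32_t m_1, uint32_t m_2)
{
    int rc = sk_validate_marker(m_1, m_2);

    if (rc)
        return rc;
    /* ... the rest of the header would be parsed here ... */
    return 0;
}
```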
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Only unlock and d_add() new inodes after the plaintext inode size has
been read from the lower filesystem. This fixes a race condition that
was sometimes seen during a multi-job kernel build in an eCryptfs mount.
https://bugzilla.kernel.org/show_bug.cgi?id=36002
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Reported-by: David <david@unsolicited.net>
Tested-by: David <david@unsolicited.net>
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd: make local functions static
NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
NFSD: Check status from nfsd4_map_bcts_dir()
NFSD: Remove setting unused variable in nfsd_vfs_read()
nfsd41: error out on repeated RECLAIM_COMPLETE
nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
nfsd v4.1 LOCKT clientid field must be ignored
nfsd41: add flag checking for create_session
nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
nfsd4: fix wrongsec handling for PUTFH + op cases
nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
nfsd4: introduce OPDESC helper
nfsd4: allow fh_verify caller to skip pseudoflavor checks
nfsd: distinguish functions of NFSD_MAY_* flags
svcrpc: complete svsk processing on cb receive failure
svcrpc: take advantage of tcp autotuning
SUNRPC: Don't wait for full record to receive tcp data
svcrpc: copy cb reply instead of pages
svcrpc: close connection if client sends short packet
svcrpc: note network-order types in svc_process_calldir
...
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Support for RPC over AF_LOCAL transports
SUNRPC: Remove obsolete comment
SUNRPC: Use AF_LOCAL for rpcbind upcalls
SUNRPC: Clean up use of curly braces in switch cases
NFS: Revert NFSROOT default mount options
SUNRPC: Rename xs_encode_tcp_fragment_header()
nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
nfs41: Correct offset for LAYOUTCOMMIT
NFS: nfs_update_inode: print current and new inode size in debug output
NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
NFSv4: Handle expired stateids when the lease is still valid
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
Commit 1495f230fa ("vmscan: change shrinker API by passing
shrink_control struct") changed the API of ->shrink(), but missed ubifs
and cifs instances.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement pg_test vector to test for max IO sizes. We calculate
a max_io_size member only once, and cache it in lseg so as not to
do so on every page insert.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[simplify logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
By default, unless pnfs is used, coalesce pages until pg_bsize
(rsize or wsize) is reached.
pnfs layout drivers define their own pg_test methods that use
pnfs_generic_pg_test and need to define their own I/O size
limits (e.g. based on the file stripe size).
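The layering can be sketched as follows. struct sk_pageio and both test
functions are hypothetical simplifications; in particular the explicit
max_io_size argument is an assumption of this sketch, not the driver
interface.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_pageio {
    size_t pg_count;  /* bytes already coalesced */
    size_t pg_bsize;  /* rsize or wsize */
};

/* Default test: keep coalescing while the total stays within pg_bsize. */
static bool sk_generic_pg_test(const struct sk_pageio *desc, size_t req_bytes)
{
    return desc->pg_count + req_bytes <= desc->pg_bsize;
}

/* A layout driver adds its own limit on top of the generic test. */
static bool sk_layout_pg_test(const struct sk_pageio *desc, size_t req_bytes,
                              size_t max_io_size)
{
    if (!sk_generic_pg_test(desc, req_bytes))
        return false;
    return desc->pg_count + req_bytes <= max_io_size;
}
```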
[Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Use common code for pnfs_pageio_init_{read,write} and use
a common generic pg_test function.
Note that this function always assumes that the layout driver's
pg_test method is implemented.
[Fix BUG]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Define API for io-engines to report delta_space_used in IOs
* Encode the osd-layout specific information of the layoutcommit
XDR buffer.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add a layout driver method to encode the layout type specific
opaque part of layout commit in-line in the xdr stream.
Currently, the pnfs-objects layout driver uses it to encode metadata hints
to the MDS and the blocks layout driver to commit provisionally allocated
extents to the file.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
An io_state pre-allocates an error information structure for each
possible osd-device that might error during IO. When IO is done, if all
was well, the io_state is freed (as today). If the I/O has ended with an
error, the io_state is queued on a per-layout err_list. When eventually
encode_layoutreturn() is called, each error is properly encoded on the
XDR buffer and only then the io_state is removed from err_list and
de-allocated.
It is up to the io_engine to fill in the segment that faulted and the type
of osd_error that occurred, by calling objlayout_io_set_result() for
each failing device.
In objio_osd:
* Allocate io-error descriptors space as part of io_state
* Use generic objlayout error reporting at end of io.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add a layout driver method to encode the layout type specific
opaque part of layout return in-line in the xdr stream.
Currently the pnfs-objects layout driver uses it to encode i/o error
information on LAYOUTRETURN.
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for encode_layoutreturn]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
With the objects layout security model, we have object capabilities
that are associated with the layout and we anticipate that the server
will issue a cb_layoutrecall for any setattr that changes security
related attributes (user/group/mode/acl) or truncates the file.
Therefore, the layout is returned before issuing the setattr to avoid
the anticipated cb_layoutrecall.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
NFSv4.1 LAYOUTRETURN implementation
Currently, does not support layout-type payload encoding.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed reclaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
With the use of the in-kernel osd library, implement read/write
of data from/to osd-objects according to information specified
in the objects-layout.
Support for striping over mirrors with a received stripe_unit.
There are however a few constraints:
1. Stripe Unit must be a multiple of PAGE_SIZE
2. stripe length (stripe_unit * number_of_stripes) can not be
bigger than 32 bits.
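The two constraints can be sketched as a validity check. The page size
and sk_layout_supported() are hypothetical, and "32 bits" is read here
as "the product must not exceed UINT32_MAX".

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u  /* assumed page size for this sketch */

static bool sk_layout_supported(uint64_t stripe_unit,
                                uint64_t number_of_stripes)
{
    /* 1. stripe unit must be a non-zero multiple of the page size */
    if (stripe_unit == 0 || stripe_unit % SK_PAGE_SIZE != 0)
        return false;
    if (number_of_stripes == 0)
        return false;
    /*
     * 2. stripe_unit * number_of_stripes must fit in 32 bits; the
     * division form avoids overflowing the product itself.
     */
    if (stripe_unit > UINT32_MAX / number_of_stripes)
        return false;
    return true;
}
```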
Also support raid-groups and partial-layout. Partial-layout is
when not all the groups are received on the line, addressing
only a partial range of the file.
TODO:
Only raid0! raid 4/5/6 support will come at a later stage
An unsupported layout will send IO through the MDS
[Important fallout from the last rebase]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Non-rpc layout drivers such as those for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.
[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
When a new layout is received in objio_alloc_lseg all device_ids
referenced are retrieved. The device information is queried from the MDS
and then the osd_device is looked-up from the osd-initiator library. The
devices are cached in a per-mount-point list, for later use. At unmount
all devices are "put" back to the library.
objlayout_get_deviceinfo(), objlayout_put_deviceinfo() middleware
API for retrieving device information given a device_id.
TODO: The device cache can get big. Cap its size. Keep an LRU and start
to return devices which were not used, when the list gets too big, or
when allocation of new entries fails.
[pnfs-obj: Bugs in new global-device-cache code]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[use global device cache]
[use layout driver in global device cache]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
objlayout_alloc_lseg prepares an xdr_stream and calls the
raid engine's objio_alloc_lseg() to allocate a private
pnfs_layout_segment.
objio_osd.c::objio_alloc_lseg() uses the passed xdr_stream to
decode and store the layout_segment information in an
objio_segment struct, using the pnfs_osd_xdr.h API for
the actual parsing of the layout xdr.
objlayout_free_lseg calls objio_free_lseg() to free the
allocated space.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[removed "extern" from function definitions]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Add the fs/nfs/objlayout/pnfs_osd_xdr_cli.c file, which will
include the XDR encode/decode implementations for the pNFS
client objlayout driver.
[Wrong type in comments]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Define the PNFS_OBJLAYOUT Kconfig option in the nfs
master Kconfig file.
* Add the objlayout driver to the Kernel's Kbuild system.
* Add the fs/nfs/objlayout/Kbuild file for building the
objlayoutdriver.ko driver
* Define fs/nfs/objlayout/objio_osd.c, register the driver on module
initialization and unregister on exit.
[pnfs-obj: remove of CONFIG_PNFS fallout]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[added "unsure" clause]
[depend on NFS_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
A pNFS client auto-negotiates a lot of features (minorversion level,
pNFS layout type, etc.). This is convenient, but makes certain kinds of
failures hard for a user to detect.
For example, if the client falls back on 4.0, or falls back to MDS IO
because the user didn't connect to the right iscsi disks before
mounting, the only symptoms may be reduced performance, which may not be
noticed till long after the actual failure, and may be difficult for a
user to diagnose.
However, such "failures" may also be perfectly normal in some cases, so
we don't want to spam the system logs with them.
One approach would be to put some more information into
/proc/self/mountstats.
Signed-off-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: add commit client stats]
[fixup data types for "ret" variables in pnfs_try_to* inline funcs.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[fix definition of show_pnfs for !CONFIG_PNFS]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Fix show_sessions in the not CONFIG_NFS_V4_1 case]
There is a build error when CONFIG_NFS_V4 is set but
CONFIG_NFS_V4_1 is *not* set. show_sessions() prototype
was unbalanced between the two cases.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfs: super.c remove CONFIG_PNFS]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add offset and count parameters to pnfs_update_layout and use them to get
the layout in the pageio path.
Order cache layout segments in the following order:
* offset (ascending)
* length (descending)
* iomode (RW before READ)
Test byte range against the layout segment in use in pnfs_{read,write}_pg_test
so as not to coalesce pages not using the same layout segment.
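The ordering described above can be written as a three-level comparison.
struct sk_range, the iomode values and sk_cmp_layout() are hypothetical
names used only for this sketch.

```c
#include <stdint.h>

enum sk_iomode { SK_IOMODE_READ = 1, SK_IOMODE_RW = 2 };

struct sk_range {
    uint64_t offset;
    uint64_t length;
    enum sk_iomode iomode;
};

/* Returns <0 if a sorts before b, >0 if after, 0 if equivalent. */
static int sk_cmp_layout(const struct sk_range *a, const struct sk_range *b)
{
    if (a->offset != b->offset)              /* offset: ascending */
        return a->offset < b->offset ? -1 : 1;
    if (a->length != b->length)              /* length: descending */
        return a->length > b->length ? -1 : 1;
    if (a->iomode != b->iomode)              /* iomode: RW before READ */
        return a->iomode == SK_IOMODE_RW ? -1 : 1;
    return 0;
}
```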
[fix lseg ordering]
[clean up pnfs_find_lseg lseg arg]
[remove unnecessary FIXME]
[fix ordering in pnfs_insert_layout]
[clean up pnfs_insert_layout]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
pnfs deviceids are unique per server, per layout type.
struct nfs_client is currently used to distinguish deviceids from
different nfs servers, yet these may clash between different layout
types on the same server. Therefore, use the layout driver associated
with each deviceid at insertion time to look it up, unhash, or
delete it.
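In other words, the lookup key becomes (layout driver, client, deviceid).
A sketch of the match test is shown below; the struct layouts, the
16-byte id size and sk_node_matches() are assumptions for illustration.

```c
#include <stdbool.h>
#include <string.h>

struct sk_deviceid {
    unsigned char data[16];  /* assumed id size for this sketch */
};

struct sk_node {
    const void *ld;      /* layout driver that inserted the node */
    const void *client;  /* server the deviceid came from */
    struct sk_deviceid id;
};

/* A node matches only if driver, client and id all agree. */
static bool sk_node_matches(const struct sk_node *n, const void *ld,
                            const void *client, const struct sk_deviceid *id)
{
    return n->ld == ld &&
           n->client == client &&
           memcmp(n->id.data, id->data, sizeof(id->data)) == 0;
}
```

Two layout types on the same server can then hold identical deviceid
bytes without being confused with each other.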
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Note: This functionality is incomplete as all layout segments referring to
the 'to be removed device id' need to be reaped, and all in flight I/O drained.
[use be32 res in nfs4_callback_devicenotify]
[use nfs_client to qualify deviceid for cb_notify_deviceid]
[use global deviceid cache for CB_NOTIFY_DEVICEID]
[refactor device cache _lookup_deviceid]
[refactor device cache _find_get_deviceid]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Bug in new global-device-cache code]
[layout_driver MUST set free_deviceid_node if using dev-cache]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
The eCryptfs inode get, initialization, and dentry interposition code
has two separate paths. One is for when dentry interposition is needed
after doing things like a mkdir in the lower filesystem and the other
is needed after a lookup. Unlocking new inodes and doing a d_add() needs
to happen at different times, depending on which type of dentry
interposing is being done.
This patch cleans up the inode get and initialization code paths and
splits them up so that the locking and d_add() differences mentioned
above can be handled appropriately in a later patch.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Tested-by: David <david@unsolicited.net>
Use the pnfs_layoutdriver_type both as a qualifier for the deviceid,
distinguishing deviceid from different layout types on the server,
and for freeing the layout-driver allocated structure containing the
nfs4_deviceid_node.
[BUG in _deviceid_purge_client]
[layout_driver MUST set free_deviceid_node if using dev-cache]
[let ver < 4.1 compile]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[removed EXPORT_SYMBOL_GPL(nfs4_deviceid_purge_client)]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
These functions should live in inode.c since their focus is on inodes
and they're primarily used by functions in inode.c.
Also does a simple cleanup of ecryptfs_inode_test() and rolls
ecryptfs_init_inode() into ecryptfs_inode_set().
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Tested-by: David <david@unsolicited.net>
Move deviceid cache from the pnfs files layout driver to the
generic layer in preparation for the objects layout driver.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
deviceids are unique per server, per layout type.
Therefore, in the global cache in the files layout driver
deviceids from different servers may clash so we need
to qualify them with a struct nfs_client that represents
the nfs server that returned the deviceid.
Introduced in 2.6.39 commit ea8eecdd
"NFSv4.1 move deviceid cache to filelayout driver"
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rename
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.
Why xattr lookup on every write I hear you ask?
write wants to drop suid and security-related xattrs that could set
capabilities for executables. To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.
In btrfs this causes an additional tree walk, hitting some per-file-system
locks and scaling quite badly. In a simple read workload on an 8S
system I saw over 90% CPU time in spinlocks related to that.
Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.
This patch adds a simple per-inode flag to avoid this problem. We only
do the lookup once per file and then, if there is no xattr, cache
the decision. All xattr changes clear the flag.
I also used the same flag to avoid the suid check, although
that one is pretty cheap.
A file system can also set this flag when it creates the inode,
if it has a cheap way to do so. This is done for some common file systems
in follow-on patches.
With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.
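The caching scheme described above could look roughly like the following user-space sketch. All names (the flag bit, the struct, the helpers) are made up for illustration, and the "expensive lookup" is simulated with a counter.

```c
#include <stdbool.h>

#define SK_NOSEC 0x1u  /* hypothetical "no security xattr" flag bit */

struct sk_inode {
    unsigned int flags;
    bool has_security_xattr;  /* stands in for the on-disk xattr state */
    unsigned int lookups;     /* counts simulated expensive lookups */
};

/* Simulated expensive xattr lookup (a tree walk in a real filesystem). */
static bool sk_lookup_security_xattr(struct sk_inode *inode)
{
    inode->lookups++;
    return inode->has_security_xattr;
}

/* Return true if a write must drop the security xattr. */
bool sk_write_needs_xattr_drop(struct sk_inode *inode)
{
    if (inode->flags & SK_NOSEC)
        return false;              /* cached negative answer */
    if (sk_lookup_security_xattr(inode))
        return true;
    inode->flags |= SK_NOSEC;      /* nothing found: cache the decision */
    return false;
}

/* Any xattr change clears the cached flag. */
void sk_xattr_changed(struct sk_inode *inode, bool has_security_xattr)
{
    inode->has_security_xattr = has_security_xattr;
    inode->flags &= ~SK_NOSEC;
}
```

Only the negative result is cached, so a file that does carry the xattr still takes the slow path on each write until the xattr is dropped.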
v2: Rename is_sgid. add file system helper.
Cc: chris.mason@oracle.com
Cc: josef@redhat.com
Cc: viro@zeniv.linux.org.uk
Cc: agruen@linbit.com
Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The descriptions of bio_add_page() and bio_add_pc_page() are slightly
inconsistent; improve them.
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
For filesystems such as nilfs2 and xfs that use block_page_mkwrite, modify that
function to wait for pending writeback before allowing the page to become
writable. This is needed to stabilize pages during writeback for those two
filesystems.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
configfs does not have problems with references to unlinked directories.
CC: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fat does not have problems with references to unlinked directories.
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hpfs has no problems with references to unlinked directories.
We leave one dentry_unhash call in place, in hpfs_unlink's strange path
where it tries to truncate a file because the disk is full. I'm not sure
what the full story is there.
CC: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Minix has no issues with references to unlinked directories.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fuse has no problems with references to unlinked directories.
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: fuse-devel@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Coda has no problems with references to unlinked directories.
CC: Jan Harkes <jaharkes@cs.cmu.edu>
CC: coda@cs.cmu.edu
CC: codalist@coda.cs.cmu.edu
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
afs has no problems with references to unlinked directories.
CC: David Howells <dhowells@redhat.com>
CC: linux-afs@lists.infradead.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
affs has no problems with references to unlinked directories.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9p has no problems with references to unlinked directories.
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ncpfs does not handle references to unlinked directories (or so it would
seem given the ncp_rmdir check). Since it is also possible to rename over
an empty directory, perform the same check here.
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ncpfs returns EBUSY if there are any references to the directory. The
dentry_unhash call only unhashes the dentry if there are no references.
CC: Petr Vandrovec <petr@vandrovec.name>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ecryptfs does not have problems with references to unlinked directories.
CC: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
CC: Dustin Kirkland <kirkland@canonical.com>
CC: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hostfs does not have problems with references to unlinked directories.
CC: Jeff Dike <jdike@addtoit.com>
CC: Richard Weinberger <richard@nod.at>
CC: user-mode-linux-devel@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hfsplus does not have problems with references to unlinked directories.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hfs does not have problems with references to unlinked directories.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
omfs does not have problems with references to unlinked directories.
CC: Bob Copeland <me@bobcopeland.com>
CC: linux-karma-devel@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
udf does not have problems with references to unlinked directories.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reiserfs does not have problems with references to unlinked directories.
CC: reiserfs-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ufs does not have problems with references to unlinked directories.
CC: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ubifs does not have problems with references to unlinked directories.
CC: Artem Bityutskiy <dedekind1@gmail.com>
CC: Adrian Hunter <adrian.hunter@nokia.com>
CC: linux-mtd@lists.infradead.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nilfs2 does not have problems with references to unlinked directories.
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-nilfs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
logfs does not have problems with references to unlinked directories.
CC: Joern Engel <joern@logfs.org>
CC: logfs@logfs.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
jfs does not have problems with references to unlinked directories.
CC: Dave Kleikamp <shaggy@kernel.org>
CC: jfs-discussion@lists.sourceforge.net
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
jffs2 does not have problems with references to unlinked directories.
CC: David Woodhouse <dwmw2@infradead.org>
CC: linux-mtd@lists.infradead.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysv does not have problems with references to unlinked directories.
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Bfs does not have problems with references to unlinked directories.
CC: tigran@aivazian.fsnet.co.uk
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.infradead.org/mtd-2.6: (97 commits)
mtd: kill CONFIG_MTD_PARTITIONS
mtd: remove add_mtd_partitions, add_mtd_device and friends
mtd: convert remaining users to mtd_device_register()
mtd: samsung onenand: convert to mtd_device_register()
mtd: omap2 onenand: convert to mtd_device_register()
mtd: txx9ndfmc: convert to mtd_device_register()
mtd: tmio_nand: convert to mtd_device_register()
mtd: socrates_nand: convert to mtd_device_register()
mtd: sharpsl: convert to mtd_device_register()
mtd: s3c2410 nand: convert to mtd_device_register()
mtd: ppchameleonevb: convert to mtd_device_register()
mtd: orion_nand: convert to mtd_device_register()
mtd: omap2: convert to mtd_device_register()
mtd: nomadik_nand: convert to mtd_device_register()
mtd: ndfc: convert to mtd_device_register()
mtd: mxc_nand: convert to mtd_device_register()
mtd: mpc5121_nfc: convert to mtd_device_register()
mtd: jz4740_nand: convert to mtd_device_register()
mtd: h1910: convert to mtd_device_register()
mtd: fsmc_nand: convert to mtd_device_register()
...
Fixed up trivial conflicts in
- drivers/mtd/maps/integrator-flash.c: removed in ARM tree
- drivers/mtd/maps/physmap.c: addition of afs partition probe type
clashing with removal of CONFIG_MTD_PARTITIONS
Marek Belisko <marek.belisko@gmail.com> reports that recent attempts
to fix regressions in NFSROOT have broken his configuration:
> After update from 2.6.38-rc8 to 2.6.38 is mounting rootfs over nfs not possible.
> Log:
> VFS: Mounted root (nfs filesystem) on device 0:14.
> Freeing init memory: 132K
> nfs: server 10.146.1.21 not responding, still trying
> nfs: server 10.146.1.21 not responding, still trying
>
> This is never ending. I make short bisect (not too much commits
> between versions)
> and bad commit was reported: 53d4737580
>
> NFS: NFSROOT should default to "proto=udp"
>
> I've tested on mini2440 board (DM9000, static IP).
> Is there some missing option or something else to be checked?
An examination of a network trace captured during the failure shows
that the mount is actually succeeding, but that the client is not
seeing READ replies larger than 16KB. This could be a local packet
filtering issue on the client, but we didn't troubleshoot this
further because of the reported "git bisect" result.
Last fall we removed the ad hoc mount option parser in
fs/nfs/nfsroot.c in favor of using the main parser in fs/nfs/super.c
(see commit 56463e50 "NFS: Use super.c for NFSROOT mount option
parsing"). That commit changed the default NFSROOT mount options to
be the same as those employed by user space mounts.
As it turns out, these new default mount options are not tolerated by
many embedded systems. So far these problems have been due to
specific behavior of certain embedded NICs. The NFS community does
not have such hardware on hand for running tests.
Commit 53d47375 recently introduced a clean way to specify default
mount options for NFSROOT, so we can now easily restore the
traditional defaults for NFSROOT:
vers=2,udp,rsize=4096,wsize=4096
This should revert the new default NFSROOT mount options introduced
with commit 56463e50.
Tested-by: Marek Belisto <marek.belisto@open-nandra.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rcu callback nfs_free_delegation_callback() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(nfs_free_delegation_callback).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The client sends the offset to the MDS as it was seen by the DS. As a
result, the file size after a copy is only half of the original file size
in the case of 2 DSes.
Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In nfs_update_inode debug output, print the current and new inode
size when the file size changes on the NFS server.
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the call to nfs4_schedule_session_recovery() will actually just
result in a test of the lease when what we really want is to force a
session reset.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Currently, if the server returns NFS4ERR_EXPIRED in reply to a READ or
WRITE, but the RENEW test determines that the lease is still active, we
fail to recover and end up looping forever in a READ/WRITE + RENEW death
spiral.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
Btrfs: use the device_list_mutex during write_dev_supers
Btrfs: setup free ino caching in a more asynchronous way
btrfs scrub: don't coalesce pages that are logically discontiguous
Btrfs: return -ENOMEM in clear_extent_bit
Btrfs: add mount -o auto_defrag
Btrfs: using rcu lock in the reader side of devices list
Btrfs: drop unnecessary device lock
Btrfs: fix the race between remove dev and alloc chunk
Btrfs: fix the race between reading and updating devices
Btrfs: fix bh leak on __btrfs_open_devices path
Btrfs: fix unsafe usage of merge_state
Btrfs: allocate extent state and check the result properly
fs/btrfs: Add missing btrfs_free_path
Btrfs: check return value of btrfs_inc_extent_ref()
Btrfs: return error to caller if read_one_inode() fails
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Btrfs: return error code to caller when btrfs_del_item fails
Btrfs: return error code to caller when btrfs_previous_item fails
btrfs: fix typo 'testeing' -> 'testing'
btrfs: typo: 'btrfS' -> 'btrfs'
...
As Jeff just pointed out, __constant_cpu_to_le32 was required instead of
cpu_to_le32 in previous patch to cifsacl.c 383c55350f
(Fix endian error comparing authusers when cifsacl enabled)
CC: Stable <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix endian error comparing authusers when cifsacl enabled
[CIFS] Rename three structures to avoid camel case
Fix extended security auth failure
CIFS: Add rwpidforward mount option
CIFS: Migrate to shared superblock model
[CIFS] Migrate from prefixpath logic
CIFS: Fix memory leak in cifs_do_mount
[CIFS] When mandatory encryption on share, fail mount
CIFS: Use pid saved from cifsFileInfo in writepages and set_file_size
cifs: add cifs_async_writev
cifs: clean up wsize negotiation and allow for larger wsize
cifs: convert cifs_writepages to use async writes
CIFS: Fix undefined behavior when mount fails
cifs: don't call mid_q_entry->callback under the Global_MidLock (try #5)
CIFS: Simplify mount code for further shared sb capability
CIFS: Simplify connection structure search calls
cifs: remove unused SMB2 config and mount options
cifs: add ignore_pend flag to cifs_call_async
cifs: make cifs_send_async take a kvec array
cifs: consolidate SendReceive response checks
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
Ocfs2/move_extents: Validate moving goal after the adjustment.
Ocfs2/move_extents: Avoid doing division in extent moving.
The buffers allocated while encrypting and decrypting long filenames can
sometimes straddle two pages. In this situation, virt_to_scatterlist()
will return -ENOMEM, causing the operation to fail and the user will get
scary error messages in their logs:
kernel: ecryptfs_write_tag_70_packet: Internal error whilst attempting
to convert filename memory to scatterlist; expected rc = 1; got rc =
[-12]. block_aligned_filename_size = [272]
kernel: ecryptfs_encrypt_filename: Error attempting to generate tag 70
packet; rc = [-12]
kernel: ecryptfs_encrypt_and_encode_filename: Error attempting to
encrypt filename; rc = [-12]
kernel: ecryptfs_lookup: Error attempting to encrypt and encode
filename; rc = [-12]
The solution is to allow up to 2 scatterlist entries to be used.
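To see why two entries are enough for a buffer of this size, one can count the pages a buffer touches. This is a user-space sketch that assumes a 4096-byte page; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u  /* assumed page size */

/* Number of pages touched by the byte range [addr, addr + len). */
size_t sk_pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr / SK_PAGE_SIZE;
    last = (addr + len - 1) / SK_PAGE_SIZE;
    return (size_t)(last - first + 1);
}
```

A 272-byte buffer like the one in the log above is far smaller than a page, so it touches either one page or two, never more.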
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
eCryptfs wasn't clearing the eCryptfs inode's i_nlink after a successful
vfs_rmdir() on the lower directory. This caused the inode evict and
destroy paths to be missed.
https://bugs.launchpad.net/ecryptfs/+bug/723518
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Fix sparse warning:
CHECK fs/cifs/cifsacl.c
fs/cifs/cifsacl.c:41:36: warning: incorrect type in initializer
(different base types)
fs/cifs/cifsacl.c:41:36: expected restricted __le32
fs/cifs/cifsacl.c:41:36: got int
fs/cifs/cifsacl.c:461:52: warning: restricted __le32 degrades to integer
fs/cifs/cifsacl.c:461:73: warning: restricted __le32 degrades to integer
The second one looks harmless but the first one (sid_authusers)
was added in commit 2fbc2f1729
and only affects 2.6.38/2.6.39
CC: Stable <stable@kernel.org>
Reviewed-and-Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This change introduces a few of the less controversial /proc and
/proc/sys interfaces for tile, along with sysfs attributes for
various things that were originally proposed as /proc/tile files.
It also adjusts the "hardwall" proc API.
Arnd Bergmann reviewed the initial arch/tile submission, which
included a complete set of all the /proc/tile and /proc/sys/tile
knobs that we had added in a somewhat ad hoc way during initial
development, and provided feedback on where most of them should go.
One knob turned out to be similar enough to the existing
/proc/sys/debug/exception-trace that it was re-implemented to use
that model instead.
Another knob was /proc/tile/grid, which reported the "grid" dimensions
of a tile chip (e.g. 8x8 processors = 64-core chip). Arnd suggested
looking at sysfs for that, so this change moves that information
to a pair of sysfs attributes (chip_width and chip_height) in the
/sys/devices/system/cpu directory. We also put the "chip_serial"
and "chip_revision" information from our old /proc/tile/board file
as attributes in /sys/devices/system/cpu.
Other information collected via hypervisor APIs is now placed in
/sys/hypervisor. We create a /sys/hypervisor/type file (holding the
constant string "tilera") to be parallel with the Xen use of
/sys/hypervisor/type holding "xen". We create three top-level files,
"version" (the hypervisor's own version), "config_version" (the
version of the configuration file), and "hvconfig" (the contents of
the configuration file). The remaining information from our old
/proc/tile/board and /proc/tile/switch files becomes an attribute
group appearing under /sys/hypervisor/board/.
In addition, after some feedback from Arnd Bergmann on the previous
version of this patch, the /proc/tile/hardwall file is split up into
two conceptual parts. First, a directory /proc/tile/hardwall/ which
contains one file per active hardwall, each file named after the
hardwall's ID and holding a cpulist that says which cpus are enclosed by
the hardwall. Second, a /proc/PID file "hardwall" that is either
empty (for non-hardwall-using processes) or contains the hardwall ID.
Finally, this change pushes the /proc/sys/tile/unaligned_fixup/
directory, with knobs controlling the kernel code for handling the
fixup of unaligned exceptions.
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't need any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Move the lock order description after all the includes, remove several
fairly outdated and/or incorrect comments, move Andrea's
copyright/changelog to the top where it belongs, remove the pointless
filename in the top-of-file comment, and remove two useless macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The descriptions of bio_add_page() and bio_add_pc_page() are slightly
inconsistent; improve them.
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Return -ENODATA when trying to read a user.* attribute which cannot
exist: user space otherwise does not have a reasonable way to
distinguish between non-existent and inaccessible attributes.
Likewise, return -ENODATA when an unprivileged process tries to read a
trusted.* attribute: to unprivileged processes, those attributes are
invisible (listxattr() won't include them).
Related to this bug report: https://bugzilla.redhat.com/660613
Signed-off-by: Andreas Gruenbacher <agruen@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.
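A filesystem might use the new argument along these lines. This is a sketch only: the flag values are local definitions meant to mirror the kernel's I_DIRTY_SYNC and I_DIRTY_DATASYNC (an assumption), and the struct and callback are hypothetical.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel dirty flags (values are an assumption). */
#define SK_I_DIRTY_SYNC     (1 << 0)  /* timestamp-only style change */
#define SK_I_DIRTY_DATASYNC (1 << 1)  /* change that fdatasync must persist */

/* Hypothetical per-inode filesystem state. */
struct sk_fs_inode {
    bool fdatasync_needs_commit;
};

/* Hypothetical ->dirty_inode style callback taking the new flags argument. */
void sk_dirty_inode(struct sk_fs_inode *ip, int flags)
{
    /* Anything beyond a pure timestamp update must reach disk on fdatasync. */
    if (flags != SK_I_DIRTY_SYNC)
        ip->fdatasync_needs_commit = true;
}
```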
This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.
Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and kill a useless local variable in follow_dotdot_rcu(), while
we are at it - follow_mount_rcu(nd, path, inode) *always* assigned
value to *inode, and always it had been path->dentry->d_inode (aka
nd->path.dentry->d_inode, since it always got &nd->path as the second
argument).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Though goal_to_be_moved will be validated again during the subsequent move,
it is still a good idea to validate it after the adjustment at the very
beginning, instead of validating it before the adjustment.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
It is unwise to do a 64-bit division anywhere in kernel code; replace it
with a suitable helper or proper shifts.
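When the divisor is a power of two, the division can be written with a shift and the remainder with a mask. The sketch below assumes a cluster size of 1 << cluster_bits with cluster_bits below 64; the function names are hypothetical. (For arbitrary divisors, kernel code would typically use a helper such as do_div() or div_u64() instead.)

```c
#include <stdint.h>

/* bytes / (1 << cluster_bits), without a 64-bit divide instruction. */
uint64_t sk_bytes_to_clusters(uint64_t bytes, unsigned int cluster_bits)
{
    return bytes >> cluster_bits;
}

/* bytes % (1 << cluster_bits), using a mask instead of a modulo. */
uint64_t sk_offset_in_cluster(uint64_t bytes, unsigned int cluster_bits)
{
    return bytes & (((uint64_t)1 << cluster_bits) - 1);
}
```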
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Fix authentication failures using extended security mechanisms.
cifs client does not take into consideration extended security bit
in capabilities field in negotiate protocol response from the server.
Please refer to Samba bugzilla 8046.
Reported-and-tested by: Werner Maes <Werner.Maes@icts.kuleuven.be>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add a rwpidforward mount option that switches on a mode in which we forward
the pid of the process that opened a file to any read and write operation.
This can prevent applications like WINE from failing on a read or write
operation on a previously locked file region from the same netfd from
another process if we use mandatory brlock style.
This matters for WINE because, while a WINE program runs, two processes
work on the same netfd, sharing the same file struct between several VFS
fds:
1) WINE-server does open and lock;
2) WINE-application does read and write.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Add cifs_match_super to use in sget to share superblock between mounts
that have the same //server/sharename, credentials and mount options.
This should help us improve performance when working with future SMB2.1 leases.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Now we point the superblock to a server share root and set the root dentry
appropriately. This lets us share a superblock between mounts like
//server/sharename/foo/bar and //server/sharename/foo later on.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
Squashfs: update email address
Squashfs: add extra sanity checks at mount time
Squashfs: add sanity checks to fragment reading at mount time
Squashfs: add sanity checks to lookup table reading at mount time
Squashfs: add sanity checks to id reading at mount time
Squashfs: add sanity checks to xattr reading at mount time
Squashfs: reverse order of filesystem table reading
Squashfs: move table allocation into squashfs_read_table()
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating GUID partitions (in fs/partitions/efi.c) contains
a bug that causes a kernel oops on certain corrupted GUID partition
tables.
This bug has security impact because it allows an attacker, for example, to
prepare a storage device that crashes a kernel subsystem upon connecting
the device (e.g., a "USB Stick of (Partial) Death").
crc = efi_crc32((const unsigned char *) (*gpt), le32_to_cpu((*gpt)->header_size));
computes a CRC32 checksum over gpt covering (*gpt)->header_size bytes.
There is no validation of (*gpt)->header_size before the efi_crc32 call.
A corrupted partition table may have large values for (*gpt)->header_size.
In this case, the CRC32 computation accesses memory beyond the memory
allocated for gpt, which may cause a kernel heap overflow.
Validate value of GUID partition table header size.
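A bounds check of this kind might look like the sketch below. It is not the patch's actual code: the function name is hypothetical, the upper bound (the device block size) is an assumption about what buffer was read, and the 92-byte lower bound is the size of the fixed GPT header fields.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_GPT_HEADER_MIN 92u  /* bytes of fixed fields in a GPT header */

/*
 * header_size is the value already converted from little-endian.
 * block_size is the number of bytes actually read for the header
 * (assumed here to be one logical block).
 */
bool sk_gpt_header_size_ok(uint32_t header_size, uint32_t block_size)
{
    return header_size >= SK_GPT_HEADER_MIN && header_size <= block_size;
}
```

The CRC32 would then only be computed when this check passes, so the checksum never reads past the buffer that holds the header.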
[akpm@linux-foundation.org: fix layout and indenting]
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The balloon driver in a Xen guest frees guest pages and marks them as
mmio. When the kernel crashes and the crash kernel attempts to read the
oldmem via /proc/vmcore a read from ballooned pages will generate 100%
load in dom0 because Xen asks qemu-dm for the page content. Since the
reads come in as 8-byte requests, each ballooned page is tried 512 times.
With this change a hook can be registered which checks whether the given
pfn is really ram. The hook has to return a value > 0 for ram pages, a
value < 0 on error (because the hypercall is not known) and 0 for non-ram
pages.
This will reduce the time to read /proc/vmcore. Without this change a
512M guest with 128M crashkernel region needs 200 seconds to read it, with
this change it takes just 2 seconds.
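The hook contract described above can be sketched in user space as follows. Every name is hypothetical, and treating an error return as "read the page anyway" is an assumption made for this sketch, not something the message states.

```c
#include <stddef.h>

/* Hook type: >0 for RAM, 0 for non-RAM, <0 on error. */
typedef int (*sk_pfn_is_ram_fn)(unsigned long pfn);

static sk_pfn_is_ram_fn sk_hook;  /* NULL until a hook is registered */

void sk_register_pfn_is_ram(sk_pfn_is_ram_fn fn)
{
    sk_hook = fn;
}

/*
 * Return 1 if the page should really be read, 0 if the reader can
 * skip it (for example by handing back zeroes instead).
 */
int sk_should_read_pfn(unsigned long pfn)
{
    if (sk_hook == NULL)
        return 1;              /* no hook: behave as before */
    return sk_hook(pfn) != 0;  /* only a definite "non-RAM" skips the read */
}

/* Example hook: pfns 100..199 are ballooned, pfns >= 1000 cannot be queried. */
int sk_example_hook(unsigned long pfn)
{
    if (pfn >= 1000)
        return -1;
    if (pfn >= 100 && pfn < 200)
        return 0;
    return 1;
}
```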
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, pagemap_read() has three mistakes in its error and/or corner case
handling.
(1) If the ppos parameter is wrong, the mm refcount will be leaked.
(2) If the count parameter is 0, the mm refcount will be leaked too.
(3) If the current task is sleeping in kmalloc() and the system
is out of memory and the oom-killer kills the task associated with the
proc file, the mm refcount prevents the task from freeing its memory. The
system may then hang.
<Quoting Hugh's explanation of why we should call kmalloc() before get_mm()>
check_mem_permission gets a reference to the mm. If we
__get_free_page after check_mem_permission, imagine what happens if the
system is out of memory, and the mm we're looking at is selected for
killing by the OOM killer: while we wait in __get_free_page for more
memory, no memory is freed from the selected mm because it cannot reach
exit_mmap while we hold that reference.
This patch fixes the above three.
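The ordering rule (allocate first, take the reference second, drop it on every exit path) can be sketched in user space as follows. The struct and function names are hypothetical stand-ins, with malloc() playing the role of the allocation and a plain counter playing the role of the mm reference count.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an mm with a reference count. */
struct mm_ref {
	int users;
};

static void mm_get(struct mm_ref *mm) { mm->users++; }
static void mm_put(struct mm_ref *mm) { mm->users--; }

/*
 * Allocate before taking the reference, so that nothing is pinned while
 * the allocation waits for memory, and release the reference on every
 * path after it was taken. Returns 0 on success, -1 on failure.
 */
int do_access(struct mm_ref *mm, size_t bufsize, int simulate_error)
{
	char *buf = malloc(bufsize);	/* no reference is held at this point */
	int ret = 0;

	if (!buf)
		return -1;

	mm_get(mm);
	if (simulate_error)
		ret = -1;	/* the error path still reaches mm_put() below */
	mm_put(mm);

	free(buf);
	return ret;
}
```

After any call, successful or not, the counter is back at its starting value, which is the property the three fixes restore.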
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It would be better to put check_mem_permission after __get_free_page in
mem_write, the same as in mem_read.
Hugh Dickins explained the reason:
check_mem_permission gets a reference to the mm. If we __get_free_page
after check_mem_permission, imagine what happens if the system is out
of memory, and the mm we're looking at is selected for killing by the
OOM killer: while we wait in __get_free_page for more memory, no memory
is freed from the selected mm because it cannot reach exit_mmap while
we hold that reference.
Reported-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a macro for the max size kmalloc can allocate, so use it instead
of a hardcoded number.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No need for this local array to be writable, so mark it const.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that exe_file is no longer proc FS dependent, we can use it to name
the core file. So we add a %E pattern for core file name creation which
extracts the path from mm_struct->exe_file. It then converts slashes to
exclamation marks and pastes the result into the core file name itself.
This is useful for environments where binary names are longer than 16
characters (the current->comm limitation), where there are binaries
with the same name but in different paths, and where the binary itself
changes its current->comm after exec.
So by doing (s/$/#/ -- # is treated as git comment):
$ sysctl kernel.core_pattern='core.%p.%e.%E'
$ ln /bin/cat cat45678901234567890
$ ./cat45678901234567890
^Z
$ rm cat45678901234567890
$ fg
^\Quit (core dumped)
$ ls core*
we now get:
core.2434.cat456789012345.!root!cat45678901234567890 (deleted)
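The slash-to-exclamation-mark conversion can be sketched in user space as below. The function name path_to_corename and its buffer handling are assumptions for illustration; the sketch covers only the character replacement, not how the path is obtained from exe_file.

```c
#include <stddef.h>

/*
 * Copy path into out, replacing every '/' with '!'.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if out is too small to hold the result.
 */
int path_to_corename(const char *path, char *out, size_t outlen)
{
	size_t i;

	for (i = 0; path[i] != '\0'; i++) {
		if (i + 1 >= outlen)
			return -1;
		out[i] = (path[i] == '/') ? '!' : path[i];
	}
	if (outlen == 0)
		return -1;
	out[i] = '\0';
	return (int)i;
}
```

Applied to "/root/cat45678901234567890", this yields the "!root!cat45678901234567890" component seen in the core file name above.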
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
This was because exe_file was needed only for /proc/<pid>/exe. Since we
will need the exe_file functionality also for core dumps (so core name can
contain full binary path), build this functionality always into the
kernel.
To achieve that, move it out of proc FS to kernel/, where in fact it
should belong. By doing that we can make dup_mm_exe_file static. Also we
can drop linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add two new stats to the per-memcg memory.stat which track the number of
page faults and the number of major page faults.
"pgfault"
"pgmajfault"
They are different from the "pgpgin"/"pgpgout" stats, which count the
number of pages charged/discharged to the cgroup and have nothing to do
with reading/writing pages to disk.
It is valuable to track the two stats for measuring both an application's
performance and the efficiency of the kernel page reclaim path.
Counting pagefaults per process is useful, but we also need the aggregated
value since processes are monitored and controlled on a per-cgroup basis
in memcg.
Functional test: check the total number of pgfault/pgmajfault of all
memcgs and compare with global vmstat value:
$ cat /proc/vmstat | grep fault
pgfault 1070751
pgmajfault 553
$ cat /dev/cgroup/memory.stat | grep fault
pgfault 1071138
pgmajfault 553
total_pgfault 1071142
total_pgmajfault 553
$ cat /dev/cgroup/A/memory.stat | grep fault
pgfault 199
pgmajfault 0
total_pgfault 199
total_pgmajfault 0
Performance test: run the page fault test (pft) with 16 threads, faulting
in 15G of anon pages in a 16G container. No regression was noticed in
"flt/cpu/s".
Sample output from pft:
TAG pft:anon-sys-default:
  Gb  Thr  CLine   User    System    Wall   flt/cpu/s   fault/wsec
  15   16      1  0.67s   233.41s  14.76s   16798.546   266356.260
+-------------------------------------------------------------------------+
    N        Min        Max     Median        Avg     Stddev
x  10  16682.962  17344.027  16913.524  16928.812   166.5362
+  10  16695.568  16923.896  16820.604  16824.652  84.816568
No difference proven at 95.0% confidence
[akpm@linux-foundation.org: fix build]
[hughd@google.com: shmem fix]
Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally i_lastfrag was 32 bits but then we added support for handling
64 bit metadata and it became a 64 bit variable. That was during 2007, in
54fb996ac1 "[PATCH] ufs2 write: block allocation update". Unfortunately
these casts got left behind so the value got truncated to 32 bit again.
[akpm@linux-foundation.org: remove now-unneeded min_t/max_t casting]
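The effect of a leftover 32-bit cast on a 64-bit value can be shown with a stand-alone sketch. The two helper functions are hypothetical and are not the ufs code itself; they only contrast a comparison done after truncation with one done at full width.

```c
#include <stdint.h>

/* Smaller of two values after both are cast to 32 bits: the upper half is lost. */
uint64_t min_truncating(uint64_t a, uint64_t b)
{
	uint32_t x = (uint32_t)a;
	uint32_t y = (uint32_t)b;

	return x < y ? x : y;
}

/* Smaller of two values compared at their full 64-bit width. */
uint64_t min_full(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}
```

For the inputs 0x100000005 and 0x200000000 the truncating version returns 0, while the full-width version returns 0x100000005.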
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For a filesystem that has lots of files in it, the first time we mount
it with free ino caching support, it can take quite a long time to
setup the caching before we can create new files.
Here we fill the cache with [highest_ino, BTRFS_LAST_FREE_OBJECTID]
before we start the caching thread to search through the extent tree.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.
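The additional condition can be sketched as a single predicate. The struct layout, the names and the 4096-byte page size are assumptions made for this illustration and are not taken from the scrub code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ 4096ULL	/* assumed page size for this sketch */

/* Hypothetical description of the pages already collected into one bio. */
struct bio_run {
	uint64_t physical;	/* physical address of the first page */
	uint64_t logical;	/* logical address of the first page */
	unsigned int count;	/* number of pages collected so far */
};

/* A page may join the run only if it continues it both physically and logically. */
bool can_append(const struct bio_run *run, uint64_t physical, uint64_t logical)
{
	uint64_t len = (uint64_t)run->count * PAGE_SZ;

	return run->physical + len == physical &&
	       run->logical + len == logical;
}
```

Because only the logical address of the first page is remembered, the logical address of every later page has to be derivable from its offset in the run, which is what the second comparison guarantees.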
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will detect small random writes into files and
queue them up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
gfs2: Drop __TIME__ usage
isdn/diva: Drop __TIME__ usage
atm: Drop __TIME__ usage
dlm: Drop __TIME__ usage
wan/pc300: Drop __TIME__ usage
parport: Drop __TIME__ usage
hdlcdrv: Drop __TIME__ usage
baycom: Drop __TIME__ usage
pmcraid: Drop __DATE__ usage
edac: Drop __DATE__ usage
rio: Drop __DATE__ usage
scsi/wd33c93: Drop __TIME__ usage
scsi/in2000: Drop __TIME__ usage
aacraid: Drop __TIME__ usage
media/cx231xx: Drop __TIME__ usage
media/radio-maxiradio: Drop __TIME__ usage
nozomi: Drop __TIME__ usage
cyclades: Drop __TIME__ usage
Eric Dumazet reports:
----
At boot, I have a crash in part_discard_alignment_show+0x1b/0x50
CR2 : 000006ac
fault in : mov 0x2c(%rcx),%edx
I suspect commit 23ceb5b771 (block: Remove extra
discard_alignment from hd_struct) being in fault
----
It is not yet known how ->queue can be NULL while the sysfs entry
exists, but let's play it safe and check for a NULL queue.
The rest of the sysfs show strategies in check.c do not dereference
disk->queue.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When mandatory encryption is configured in the Samba server on a
share (smb.conf parameter "smb encrypt = mandatory") the
server will hang up the tcp session when we try to send
the first frame after the tree connect if it is not a
QueryFSUnixInfo. This causes the cifs mount to hang (it must
be killed with ctrl-c). Move the QueryFSUnixInfo call
earlier in the mount sequence, and check whether the SetFSUnixInfo
fails due to mandatory encryption so we can return a sensible
error (EACCES) on mount.
In a future patch (for 2.6.40) we will support mandatory
encryption.
CC: Stable <stable@kernel.org>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We need it to make them work with mandatory-style locking, because
otherwise we can fail when, for example, the kernel needs to flush dirty
pages while a lock is held by a process that has the file open.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (28 commits)
Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
Ocfs2/move_extents: Set several trivial constraints for threshold.
Ocfs2/move_extents: Let defrag handle partial extent moving.
Ocfs2/move_extents: move/defrag extents within a certain range.
Ocfs2/move_extents: helper to calculate the defraging length in one run.
Ocfs2/move_extents: move entire/partial extent.
Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
Ocfs2/move_extents: helper to validate and adjust moving goal.
Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
Ocfs2/move_extents: defrag a range of extent.
Ocfs2/move_extents: move a range of extent.
Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
Ocfs2/move_extents: Add basic framework and source files for extent moving.
Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
xen: cleancache shim to Xen Transcendent Memory
ocfs2: add cleancache support
ext4: add cleancache support
btrfs: add cleancache support
ext3: add cleancache support
mm/fs: add hooks to support cleancache
mm: cleancache core ops functions and config
fs: add field to superblock to support cleancache
mm/fs: cleancache documentation
Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent
xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec
xfs: fix up asserts in xfs_iflush_fork
xfs: do not do pointer arithmetic on extent records
xfs: do not use unchecked extent indices in xfs_bunmapi
xfs: do not use unchecked extent indices in xfs_bmapi
xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*
xfs: remove if_lastex
xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag
xfs: do not discard alloc btree blocks
xfs: add online discard support
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
jbd2: Add MAINTAINERS entry
jbd2: fix a potential leak of a journal_head on an error path
ext4: teach ext4_ext_split to calculate extents efficiently
ext4: Convert ext4 to new truncate calling convention
ext4: do not normalize block requests from fallocate()
ext4: enable "punch hole" functionality
ext4: add "punch hole" flag to ext4_map_blocks()
ext4: punch out extents
ext4: add new function ext4_block_zero_page_range()
ext4: add flag to ext4_has_free_blocks
ext4: reserve inodes and feature code for 'quota' feature
ext4: add support for multiple mount protection
ext4: ensure f_bfree returned by ext4_statfs() is non-negative
ext4: protect bb_first_free in ext4_trim_all_free() with group lock
ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
jbd2: Fix comment to match the code in jbd2__journal_start()
ext4: fix waiting and sending of a barrier in ext4_sync_file()
jbd2: Add function jbd2_trans_will_send_data_barrier()
jbd2: fix sending of data flush on journal commit
ext4: fix ext4_ext_fiemap_cb() to handle blocks before request range correctly
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits)
cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
ceph: remove unnecessary dentry_unhash calls
vfs: clean up vfs_rename_other
vfs: clean up vfs_rename_dir
vfs: clean up vfs_rmdir
vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
libfs: drop unneeded dentry_unhash
vfs: update dentry_unhash() comment
vfs: push dentry_unhash on rename_dir into file systems
vfs: push dentry_unhash on rmdir into file systems
vfs: remove dget() from dentry_unhash()
vfs: dentry_unhash immediately prior to rmdir
vfs: Block mmapped writes while the fs is frozen
...
The type of vma->vm_flags is 'unsigned long', neither 'int' nor
'unsigned int'. This patch fixes such misuse.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
[ Changed to use a typedef - we'll extend it to cover more cases
later, since there has been discussion about making it a 64-bit
type.. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This eighth patch of eight in this cleancache series "opts-in"
cleancache for ocfs2. Clustered filesystems must explicitly enable
cleancache by calling cleancache_init_shared_fs anytime an instance
of the filesystem is mounted. Ocfs2 is currently the only user of
the clustered filesystem interface but nevertheless, the cleancache
hooks in the VFS layer are sufficient for ocfs2 including the matching
cleancache_flush_fs hook which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v8: trivial merge conflict update]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Nitin Gupta <ngupta@vflare.org>
This seventh patch of eight in this cleancache series "opts-in"
cleancache for ext4. Filesystems must explicitly enable cleancache
by calling cleancache_init_fs anytime an instance of the filesystem
is mounted. For ext4, all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
This fifth patch of eight in this cleancache series "opts-in"
cleancache for ext3. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. For ext3, all other cleancache
hooks are in the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
This fourth patch of eight in this cleancache series provides the
core hooks in VFS for: initializing cleancache per filesystem;
capturing clean pages reclaimed by page cache; attempting to get
pages from cleancache before filesystem read; and ensuring coherency
between pagecache, disk, and cleancache. Note that the placement
of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
change was required due to a patchset in 2.6.39.
All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
a check of a boolean global if CONFIG_CLEANCACHE is set but no
cleancache "backend" has claimed cleancache_ops.
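The "no-op or boolean check" behaviour can be sketched with a compile-time switch plus a global flag. Everything named here (CONFIG_CACHE_SKETCH, cache_ops, register_backend, cache_get_page, toy_ops) is hypothetical and stands in for the real configuration option and hooks, whose exact signatures are not given above.

```c
#include <stdbool.h>
#include <stddef.h>

#define CONFIG_CACHE_SKETCH 1	/* remove this line to compile the hook away */

/* Hypothetical backend operations table; only one operation is shown. */
struct cache_ops {
	int (*get_page)(unsigned long index);
};

static const struct cache_ops *registered_ops;	/* NULL until a backend registers */
static bool cache_enabled;			/* cheap global tested by every hook */

void register_backend(const struct cache_ops *ops)
{
	registered_ops = ops;
	cache_enabled = (ops != NULL);
}

#ifdef CONFIG_CACHE_SKETCH
/* Hook compiled in: a single boolean test unless a backend has registered. */
int cache_get_page(unsigned long index)
{
	if (!cache_enabled)
		return -1;	/* miss: fall back to the normal read */
	return registered_ops->get_page(index);
}
#else
/* Hook compiled out: always a miss, with no runtime state consulted. */
static inline int cache_get_page(unsigned long index)
{
	(void)index;
	return -1;
}
#endif

/* Toy backend for demonstration: only page 42 is present. */
static int toy_get_page(unsigned long index)
{
	return index == 42 ? 0 : -1;
}

const struct cache_ops toy_ops = { .get_page = toy_get_page };
```

With this shape the cost for systems without a backend is one predictable branch per hook, and zero when the option is disabled at build time.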
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cifs has no problems with lingering references to unlinked directory
inodes.
CC: Steve French <sfrench@samba.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ocfs2 has no issues with lingering references to unlinked directory inodes.
CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Exofs has no problems with lingering references to unlinked directory
inodes.
CC: Benny Halevy <bhalevy@panasas.com>
CC: osd-dev@open-osd.org
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NFS has no problems with lingering references to unlinked directory
inodes.
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: linux-nfs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext2 has no problems with lingering references to unlinked directory
inodes.
CC: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext3 has no problems with lingering references to unlinked directory
inodes.
CC: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext4 has no problems with lingering references to unlinked directory
inodes.
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Andreas Dilger <adilger.kernel@dilger.ca>
CC: linux-ext4@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs has no problems with lingering references to unlinked directory
inodes.
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ceph does not need these, and they screw up our use of the dcache as a
consistent cache.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_rename_dir() doesn't properly account for filesystems with
FS_RENAME_DOES_D_MOVE. If new_dentry has a target inode attached, it
unhashes the new_dentry prior to the rename() iop and rehashes it after,
but doesn't account for the possibility that rename() may have swapped
{old,new}_dentry. For FS_RENAME_DOES_D_MOVE filesystems, it rehashes
new_dentry (now the old renamed-from name, which d_move() expected to go
away), such that a subsequent lookup will find it. Currently all
FS_RENAME_DOES_D_MOVE filesystems compensate for this by failing in
d_revalidate.
The bug was introduced by: commit 349457ccf2
"[PATCH] Allow file systems to manually d_move() inside of ->rename()"
Fix by not rehashing the new dentry. Rehashing used to be needed by
d_move() but isn't anymore.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are no libfs issues with dangling references to empty directories.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The helper is now only called by file systems, not the VFS.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only a few file systems need this. Start by pushing it down into each
rename method (except gfs2 and xfs) so that it can be dealt with on a
per-fs basis.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only a few file systems need this. Start by pushing it down into each
fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
basis.
This does not change behavior for any in-tree file systems.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This serves no useful purpose that I can discern. All callers (rename,
rmdir) hold their own reference to the dentry.
A quick audit of all file systems showed no relevant checks on the value
of d_count in vfs_rmdir/vfs_rename_dir paths.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This presumes that there is no reason to unhash a dentry if we fail because
it is a mountpoint or the LSM check fails, and that the LSM checks do not
depend on the dentry being unhashed.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should not allow file modification via mmap while the filesystem is
frozen. So block in block_page_mkwrite() while the filesystem is frozen.
We cannot do the blocking wait in __block_page_mkwrite() since e.g. ext4
will want to call that function with transaction started in some cases
and that would deadlock. But we can at least do the non-blocking reliable
check in __block_page_mkwrite() which is the hardest part anyway.
We have to check for frozen filesystem with the page marked dirty and under
page lock with which we then return from ->page_mkwrite(). Only that way we
cannot race with writeback done by freezing code - either we mark the page
dirty after the writeback has started, see freezing in progress and block, or
writeback will wait for our page lock which is released only when the fault is
done and then writeback will writeout and writeprotect the page again.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create a __block_page_mkwrite() helper which does everything that
block_page_mkwrite() does, except that it passes back errors from the
__block_write_begin / block_commit_write calls.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This issue was discovered by users of busybox. The bug is real for
busybox users; I don't know how it affects others. Apparently, mount is
called with and without MS_SILENT, and this affects mount() behaviour.
But MS_SILENT is only supposed to affect kernel logging verbosity.
The following script was run in an empty test directory:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1
mount -vv --make-rshared mount.shared1
mount -vv --bind mount.shared2 mount.shared2
mount -vv --make-rshared mount.shared2
mount -vv --bind mount.shared2 mount.shared1
mount -vv --bind mount.dir mount.shared2
ls -R mount.dir mount.shared1 mount.shared2
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
mount -vv was used to show the mount() call arguments and result.
Output shows that flag argument has 0x00008000 = MS_SILENT bit:
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
mount.shared2:
a
b
After adding --loud option to remove MS_SILENT bit from just one mount cmd:
mkdir -p mount.dir mount.shared1 mount.shared2
touch mount.dir/a mount.dir/b
mount -vv --bind mount.shared1 mount.shared1 2>&1
mount -vv --make-rshared mount.shared1 2>&1
mount -vv --bind mount.shared2 mount.shared2 2>&1
mount -vv --loud --make-rshared mount.shared2 2>&1 # <-HERE
mount -vv --bind mount.shared2 mount.shared1 2>&1
mount -vv --bind mount.dir mount.shared2 2>&1
ls -R mount.dir mount.shared1 mount.shared2 2>&1
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
umount mount.dir mount.shared1 mount.shared2 2>/dev/null
rm -f mount.dir/a mount.dir/b mount.dir/c
rmdir mount.dir mount.shared1 mount.shared2
The result is different now - look closely at mount.shared1 directory listing.
Now it does show files 'a' and 'b':
mount: mount('mount.shared1','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared1','',0x0010c000,''):0
mount: mount('mount.shared2','mount.shared2','(null)',0x00009000,'(null)'):0
mount: mount('','mount.shared2','',0x00104000,''):0
mount: mount('mount.shared2','mount.shared1','(null)',0x00009000,'(null)'):0
mount: mount('mount.dir','mount.shared2','(null)',0x00009000,'(null)'):0
mount.dir:
a
b
mount.shared1:
a
b
mount.shared2:
a
b
The analysis shows that the MS_SILENT flag, which is ON by default in any
busybox mount operation, reaches the flags_to_propagation_type function and
causes an error return from the is_power_of_2 check, because the function
expects only one bit to be set. This makes it impossible to do a busybox
mount with any of the --make-[r]shared, --make-[r]private etc options.
Moreover, the recently added flags_to_propagation_type() function doesn't
allow us to do such operations as --make-[r]private --make-[r]shared etc.
when MS_SILENT is on. The idea of clearing the MS_SILENT flag came from
Denys Vlasenko.
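The single-bit check and the effect of masking MS_SILENT can be reproduced in user space with the flag values seen in the mount output above. This is a sketch, not the kernel function: the DEMO_ prefixed constants and both function names are made up here, with the constants set to the usual values of the corresponding MS_* flags.

```c
#include <stdbool.h>

/* Usual values of the corresponding MS_* mount flags, renamed for this sketch. */
#define DEMO_MS_REC		0x00004000UL
#define DEMO_MS_SILENT		0x00008000UL
#define DEMO_MS_UNBINDABLE	(1UL << 17)
#define DEMO_MS_PRIVATE		(1UL << 18)
#define DEMO_MS_SLAVE		(1UL << 19)
#define DEMO_MS_SHARED		(1UL << 20)

static bool is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Strict variant: only the recursion flag is masked, so MS_SILENT is an extra bit. */
unsigned long propagation_type_strict(unsigned long flags)
{
	unsigned long type = flags & ~DEMO_MS_REC;

	return is_power_of_2(type) ? type : 0;
}

/* Tolerant variant: MS_SILENT is cleared as well before the single-bit check. */
unsigned long propagation_type_tolerant(unsigned long flags)
{
	unsigned long type = flags & ~(DEMO_MS_REC | DEMO_MS_SILENT);

	/* Reject anything that is not a propagation flag. */
	if (type & ~(DEMO_MS_SHARED | DEMO_MS_PRIVATE |
		     DEMO_MS_SLAVE | DEMO_MS_UNBINDABLE))
		return 0;
	return is_power_of_2(type) ? type : 0;
}
```

With 0x0010c000 (shared, recursive, silent) the strict variant returns 0, i.e. an error, while the tolerant variant returns 0x00100000; with 0x00104000 both return 0x00100000.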
Signed-off-by: Roman Borisov <ext-roman.borisov@nokia.com>
Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 990d6c2d7a ("vfs: Add name to file
handle conversion support") changed EXPORTFS to be a bool.
This was needed for earlier revisions of the original patch, but the actual
commit put the code needing it into its own file that only gets compiled
when FHANDLE is selected which in turn selects EXPORTFS.
So EXPORTFS can be safely compiled as a module when not selecting FHANDLE.
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: complete_walk(). Done on successful completion
of walk, drops out of RCU mode, does d_revalidate of final
result if that hadn't been done already.
handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
that one; callers converted to use of complete_walk().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My existing email address may stop working in a month or two, so update
email to one that will continue working.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
The kernel already prints its build timestamp during boot, no need to
repeat it in random drivers and produce different object files each
time.
Cc: Christine Caulfield <ccaulfie@redhat.com>
Cc: David Teigland <teigland@redhat.com>
Cc: cluster-devel@redhat.com
Signed-off-by: Michal Marek <mmarek@suse.cz>
Oops, the local-mount 'ocfs2_fops_no_plocks' is missing support for
unwritten extents and hole punching because no function pointer was assigned
to the '.fallocate' field.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
During dlm domain shutdown, o2dlm has to free all the lock resources. Ones that
have no locks and references are freed. Ones that have locks and/or references
are migrated to another node.
The first task in migration is finding a target. Currently we scan the lock
resource and find one node that either has a lock or a reference. This is not
very efficient in a parallel umount case as we might end up migrating the
lock resource to a node which itself may have to migrate it to a third node.
The patch scans the dlm->exit_domain_map to ensure the target node is not
leaving the domain. If no valid target node is found, o2dlm does not migrate
the resource but instead waits for the unlock and deref messages that will
allow it to free the resource.
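A minimal sketch of the target selection described above, assuming the node sets fit in a 32-bit mask. The function name, the mask representation and SK_NO_TARGET are hypothetical; o2dlm itself uses its own node maps.

```c
#include <assert.h>
#include <stdint.h>

#define SK_NO_TARGET (-1)

/*
 * candidates: bit n is set if node n holds a lock or a reference.
 * exiting:    bit n is set if node n has announced it is leaving the domain.
 * self:       the local node, which is never a valid target.
 */
static int sk_pick_migration_target(uint32_t candidates, uint32_t exiting,
				    int self)
{
	uint32_t usable;
	int node;

	assert(self >= 0 && self < 32);
	usable = candidates & ~exiting & ~(UINT32_C(1) << self);

	for (node = 0; node < 32; node++)
		if (usable & (UINT32_C(1) << node))
			return node;
	/* No valid target: wait for unlock/deref messages instead. */
	return SK_NO_TARGET;
}

static void sk_target_example(void)
{
	/* Nodes 1 and 2 hold locks; node 1 is leaving, so node 2 is chosen. */
	assert(sk_pick_migration_target(0x6, 0x2, 0) == 2);
	/* The only candidate is leaving: do not migrate. */
	assert(sk_pick_migration_target(0x2, 0x2, 0) == SK_NO_TARGET);
}
```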
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
This patch adds a new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG and ups the dlm
protocol to 1.2.
o2dlm sends this new message in dlm_unregister_domain() to mark the beginning
of the exit domain. This message is sent to all nodes in the domain.
Currently o2dlm has no way of informing other nodes of its impending exit.
This information is useful as the other nodes could disregard the exiting
node in certain operations. For example, in resource migration. If two or
more nodes were umounting in parallel, it would be more efficient if o2dlm
were to choose a non-exiting node to be the new master node rather than an
exiting one.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Choosing TMPFS_XATTR default N was switching off TMPFS_POSIX_ACL,
even if it had been Y in oldconfig; and Linus reports that PulseAudio
goes subtly wrong unless it can use ACLs on /dev/shm.
Make TMPFS_POSIX_ACL select TMPFS_XATTR (and depend upon TMPFS),
and move the TMPFS_POSIX_ACL entry before the TMPFS_XATTR entry,
to avoid asking unnecessary questions then ignoring their answers.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
net: fix get_net_ns_by_fd for !CONFIG_NET_NS
ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
ns: Declare sys_setns in syscalls.h
net: Allow setting the network namespace by fd
ns proc: Add support for the ipc namespace
ns proc: Add support for the uts namespace
ns proc: Add support for the network namespace.
ns: Introduce the setns syscall
ns: proc files for namespace naming policy.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (89 commits)
bonding: documentation and code cleanup for resend_igmp
bonding: prevent deadlock on slave store with alb mode (v3)
net: hold rtnl again in dump callbacks
Add Fujitsu 1000base-SX PCI ID to tg3
bnx2x: protect sequence increment with mutex
sch_sfq: fix peek() implementation
isdn: netjet - blacklist Digium TDM400P
via-velocity: don't annotate MAC registers as packed
xen: netfront: hold RTNL when updating features.
sctp: fix memory leak of the ASCONF queue when free asoc
net: make dev_disable_lro use physical device if passed a vlan dev (v2)
net: move is_vlan_dev into public header file (v2)
bug.h: Fix build with CONFIG_PRINTK disabled.
wireless: fix fatal kernel-doc error + warning in mac80211.h
wireless: fix cfg80211.h new kernel-doc warnings
iwlagn: dbg_fixed_rate only used when CONFIG_MAC80211_DEBUGFS enabled
dst: catch uninitialized metrics
be2net: hash key for rss-config cmd not set
bridge: initialize fake_rtable metrics
net: fix __dst_destroy_metrics_generic()
...
Fix up trivial conflicts in drivers/staging/brcm80211/brcmfmac/wl_cfg80211.c
Make ext4_ext_split() get extents to be moved by calculating in a statement
instead of counting in a loop.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Trivial conversion. Fixup one error handling case calling vmtruncate()
and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and
IS_APPEND files could not be truncated during failed writes. In fact, the
test can be completely removed as upper layers do necessary permission
checks for truncate in do_sys_[f]truncate() and may_open() anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add the ability for CIFS to do an asynchronous write. The kernel will
set the frame up as it would for a "normal" SMBWrite2 request, and use
cifs_call_async to send it. The mid callback will then be configured to
handle the result.
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Now that we can handle larger wsizes in writepages, fix up the
negotiation of the wsize to allow for that. find_get_pages only seems to
give out a max of 256 pages at a time, so that gives us a reasonable
default of 1M for the wsize.
If the server however does not support large writes via POSIX
extensions, then we cap the wsize to (128k - PAGE_CACHE_SIZE). That
gives us a size that goes up to the max frame size specified in RFC1001.
Finally, if CAP_LARGE_WRITE_AND_X isn't set, then further cap it to the
largest size allowed by the protocol (USHRT_MAX).
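The three-step cap described above could look roughly like this. It is a sketch under assumptions: a 4k page size, boolean parameters standing in for the server capability checks, and hypothetical SK_ names throughout.

```c
#include <assert.h>
#include <limits.h>

#define SK_PAGE_SIZE     4096u                          /* assumed page size */
#define SK_DEFAULT_WSIZE (256u * SK_PAGE_SIZE)          /* 256 pages = 1M */
#define SK_RFC1001_WSIZE (128u * 1024u - SK_PAGE_SIZE)  /* 128k minus one page */

/* requested == 0 means "no wsize given, use the default". */
static unsigned int sk_negotiate_wsize(unsigned int requested,
				       int posix_large_write,
				       int cap_large_write_andx)
{
	unsigned int wsize = requested ? requested : SK_DEFAULT_WSIZE;

	/* No large writes via POSIX extensions: stay within one RFC1001 frame. */
	if (!posix_large_write && wsize > SK_RFC1001_WSIZE)
		wsize = SK_RFC1001_WSIZE;
	/* No CAP_LARGE_WRITE_AND_X: the length must fit in 16 bits. */
	if (!cap_large_write_andx && wsize > (unsigned int)USHRT_MAX)
		wsize = (unsigned int)USHRT_MAX;
	return wsize;
}

static void sk_wsize_example(void)
{
	assert(sk_negotiate_wsize(0, 1, 1) == 1024u * 1024u);
	assert(sk_negotiate_wsize(0, 0, 1) == SK_RFC1001_WSIZE);
	assert(sk_negotiate_wsize(0, 0, 0) == (unsigned int)USHRT_MAX);
}
```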
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Have cifs_writepages issue asynchronous writes instead of waiting on
each write call to complete before issuing another. This also allows us
to return more quickly from writepages. It can just send out all of the
I/Os and not wait around for the replies.
In the WB_SYNC_ALL case, if the write completes with a retryable error,
then the completion workqueue job will resend the write.
This also changes the page locking semantics a little bit. Instead of
holding the page lock until the response is received, release it after
doing the send. This will reduce contention for the page lock and should
prevent processes that have the file mmap'ed from being blocked
unnecessarily.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-and-Tested-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix double kfree() calls on the same pointers and clean up mount code.
Reviewed-and-Tested-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: fix cap flush race reentrancy
libceph: subscribe to osdmap when cluster is full
libceph: handle new osdmap down/state change encoding
rbd: handle online resize of underlying rbd image
ceph: avoid inode lookup on nfs fh reconnect
ceph: use LOOKUPINO to make unconnected nfs fh more reliable
rbd: use snprintf for disk->disk_name
rbd: cleanup: make kfree match kmalloc
rbd: warn on update_snaps failure on notify
ceph: check return value for start_request in writepages
ceph: remove useless check
libceph: add missing breaks in addr_set_port
libceph: fix TAG_WAIT case
ceph: fix broken comparison in readdir loop
libceph: fix osdmap timestamp assignment
ceph: fix rare potential cap leak
libceph: use snprintf for unknown addrs
libceph: use snprintf for formatting object name
ceph: use snprintf for dirstat content
libceph: fix uninitialized value when no get_authorizer method is set
...
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock fragments field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time fragment table
structures.
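One plausible shape for such a check is sketched below. It assumes the table is reached through an index of 8-byte block pointers, one per 8k metadata block, and that the caller knows where the next on-disk structure starts; the function and macro names are hypothetical and this is not the squashfs code itself.

```c
#include <assert.h>
#include <stdint.h>

#define SK_METADATA_SIZE 8192u	/* assumed metadata block size */

/*
 * Return 1 if the index for `entries` records of `entry_size` bytes,
 * starting at table_start, ends at or before next_table; 0 otherwise.
 */
static int sk_table_fits(uint64_t table_start, uint64_t next_table,
			 uint32_t entries, uint32_t entry_size)
{
	uint64_t bytes = (uint64_t)entries * entry_size;
	uint64_t blocks = (bytes + SK_METADATA_SIZE - 1) / SK_METADATA_SIZE;
	uint64_t index_len = blocks * sizeof(uint64_t);

	if (table_start > next_table)
		return 0;
	return index_len <= next_table - table_start;
}

static void sk_table_example(void)
{
	/* One 16-byte record needs one 8-byte index slot: fits exactly. */
	assert(sk_table_fits(96, 104, 1, 16));
	/* A corrupted count would run far past the next structure. */
	assert(!sk_table_fits(96, 104, UINT32_MAX, 16));
}
```

Rejecting the superblock before sizing any allocation from the untrusted count is what avoids the kmalloc warning in the first place.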
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock inodes field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time lookup table
structures.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Fsfuzzer generates corrupted filesystems which throw a warn_on in
kmalloc. One of these is due to a corrupted superblock no_ids field.
Fix this by checking that the number of bytes to be read (and allocated)
does not extend into the next filesystem structure.
Also add a couple of other sanity checks of the mount-time id table
structures.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reverse order of table reading from mostly first to last in placement
order, to last to first. This is to enable extra superblock sanity
checks to be added in later patches.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: update Documentation pointers
net/9p: enable 9p to work in non-default network namespace
net/9p: p9_idpool_get return -1 on error
fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
9p: Small cleanup in <net/9p/9p.h>
9p: remove experimental tag from tested configurations
9p: typo fixes and minor cleanups
net/9p: Change linuxdoc names to match functions.
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
cfq-iosched: free cic_index if cfqd allocation fails
cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
cfq-iosched: reduce bit operations in cfq_choose_req()
cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
block: move bd_set_size() above rescan_partitions() in __blkdev_get()
block: call elv_bio_merged() when merged
cfq-iosched: Make IO merge related stats per cpu
cfq-iosched: Fix a memory leak of per cpu stats for root group
backing-dev: Kill set but not used var in bdi_debug_stats_show()
block: get rid of on-stack plugging debug checks
blk-throttle: Make no throttling rule group processing lockless
blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
blk-throttle: Make dispatch stats per cpu
blk-throttle: Free up a group only after one rcu grace period
blk-throttle: Use helper function to add root throtl group to lists
blk-throttle: Introduce a helper function to fill in device details
blk-throttle: Dynamically allocate root group
blk-cgroup: Allow sleeping while dynamically allocating a group
...
The code in xfs_bmap_del_extent does not correctly decrement the
extent buffer index when deleting a whole extent. Most of the time
this gets caught by checks in xfs_bmapi that work around it and
decrement the index manually, so it has gone unnoticed so far.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove asserts in xfs_iflush_fork that would call xfs_iext_get_ext
with a potentially invalid extent buffer index.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to call xfs_iext_get_ext for the previous extent to get a
valid pointer, and can't just do pointer arithmetic as they might
be in different pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validated the
extent index when moving on to the next index in xfs_bunmapi. Also
remove the old workaround for too large indices that has been
superseded by the proper fix in xfs_bmap_del_extent.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validated the
extent index when moving on to the next index in xfs_bmapi.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Make sure to only call xfs_iext_get_ext after we've validated the
extent index in the various xfs_bmap_add_extent_* helpers.
Based on an earlier patch from Lachlan McIlroy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The if_lastex field in struct xfs_ifork is only used as a temporary
index during xfs_bmapi and xfs_bunmapi. Instead of using the inode
fork to store it keep it local in the callchain. Fortunately this
is very easy as we already pass a stack copy of it down the whole
chain which can simply be changed to be passed by reference.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The XFS_BMAPI_RSVBLOCKS is unused, and as far as I can see has
always been. Remove it to simplify the bmapi implementation and
conserve stack space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We get this spurious warning:
fs/ncpfs/inode.c: In function 'ncp_fill_super':
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
...
It's not a bug, but we can easily fix it with a memset().
Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not a
candidate for stack allocation.
Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.
Performing the allocation when numa_maps is opened, and thus before a
reference to the target task's mm is taken, eliminates a potential
stalemate condition in the oom-killer as originally described by Hugh
Dickins:
... imagine what happens if the system is out of memory, and the mm
we're looking at is selected for killing by the OOM killer: while
we wait in __get_free_page for more memory, no memory is freed
from the selected mm because it cannot reach exit_mmap while we hold
that reference.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
there is no need to export struct proc_maps_private to the world. Move it
to fs/proc/internal.h instead.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.
- Having the show() operation "miles away" from the corresponding
seq_file iteration operations is a maintenance burden.
- The need to export ad hoc info like struct proc_maps_private is
eliminated.
- The implementation of show_numa_map() can be improved in a simple
manner by cooperating with the other seq_file operations (start,
stop, etc) -- something that would be messy to do without this
change.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement generic xattrs for tmpfs filesystems. The Fedora project, while
trying to replace suid apps with file capabilities, realized that tmpfs,
which is used on the build systems, does not support file capabilities and
thus cannot be used to build packages which use file capabilities. Xattrs
are also needed for overlayfs.
The xattr interface is a bit odd. If a filesystem does not implement any
{get,set,list}xattr functions the VFS will call into some random LSM hooks
and the running LSM can then implement some method for handling xattrs.
SELinux for example provides a method to support security.selinux but no
other security.* xattrs.
As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have
xattr handler routines specifically to handle acls. Because of this tmpfs
would lose the VFS/LSM helpers to support the running LSM. To make up
for that tmpfs had stub functions that did nothing but call into the LSM
hooks which implement the helpers.
This new patch does not use the LSM fallback functions and instead just
implements a native get/set/list xattr feature for the full security.* and
trusted.* namespace like a normal filesystem. This means that tmpfs can
now support both security.selinux and security.capability, which was not
previously possible.
The basic implementation is that I attach a:
struct shmem_xattr {
struct list_head list; /* anchored by shmem_inode_info->xattr_list */
char *name;
size_t size;
char value[0];
};
Into the struct shmem_inode_info for each xattr that is set. This
implementation could easily support the user.* namespace as well, except
some care needs to be taken to prevent large amounts of unswappable memory
being allocated for unprivileged users.
[mszeredi@suse.cz: new config option, support trusted.*, support symlinks]
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Jordi Pujol <jordipujolp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change each shrinker's API by consolidating the existing parameters into a
shrink_control struct. This will simplify adding further features without
touching every shrinker's file.
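As a rough user-space illustration of the idea (not the kernel API), a shrinker callback that takes one control struct instead of separate arguments might look like this. The struct layout, the field names and the toy cache are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for the consolidated parameters. */
struct sk_shrink_control {
	unsigned int gfp_mask;		/* allocation context of the caller */
	unsigned long nr_to_scan;	/* objects to try to free; 0 = just count */
};

/* Toy cache with nothing but an object count. */
struct sk_cache {
	unsigned long nr_objects;
};

/*
 * Free up to sc->nr_to_scan objects and return how many are left.
 * New knobs can later be added to sk_shrink_control without changing
 * this signature.
 */
static unsigned long sk_cache_shrink(struct sk_cache *cache,
				     const struct sk_shrink_control *sc)
{
	unsigned long n = sc->nr_to_scan;

	if (n > cache->nr_objects)
		n = cache->nr_objects;
	cache->nr_objects -= n;
	return cache->nr_objects;
}

static void sk_shrink_example(void)
{
	struct sk_cache cache = { 10 };
	struct sk_shrink_control sc = { 0, 4 };

	assert(sk_cache_shrink(&cache, &sc) == 6);
}
```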
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate the existing parameters to shrink_slab() into a new
shrink_control struct. This is needed later to pass the same struct to
shrinkers.
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Straightforward conversion of i_mmap_lock to a mutex.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh says:
"The only significant loser, I think, would be page reclaim (when
concurrent with truncation): could spin for a long time waiting for
the i_mmap_mutex it expects would soon be dropped? "
Counter points:
- cpu contention makes the spin stop (need_resched())
- zap pages should be freeing pages at a higher rate than reclaim
ever can
I think the simplification of the truncate code is definitely worth it.
Effectively reverts: 2aa15890f3 ("mm: prevent concurrent
unmap_mapping_range() on the same inode") and takes out the code that
caused its problem.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the existing mmu_gather infrastructure.
The direct purpose of these patches was to allow preemptible mmu_gather,
but even without that I think these patches provide an improvement to the
status quo.
The first 9 patches rework the mmu_gather infrastructure. For review
purpose I've split them into generic and per-arch patches with the last of
those a generic cleanup.
The next patch provides generic RCU page-table freeing, and the followup
is a patch converting s390 to use this. I've also got 4 patches from
DaveM lined up (not included in this series) that use this to implement
gup_fast() for sparc64.
Then there is one patch that extends the generic mmu_gather batching.
After that follow the mm preemptibility patches, these make part of the mm
a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to
mutexes which together with the mmu_gather rework makes mmu_gather
preemptible as well.
Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
This also allows for preemptible mmu_notifiers, something that XPMEM I
think wants.
Furthermore, it removes the new and universally detested unmap_mutex.
This patch:
Remove the first obstacle towards a fully preemptible mmu_gather.
The current scheme assumes mmu_gather is always done with preemption
disabled and uses per-cpu storage for the page batches. Change this to
try and allocate a page for batching and in case of failure, use a small
on-stack array to make some progress.
Preemptible mmu_gather is desired in general and usable once i_mmap_lock
becomes a mutex. Doing it before the mutex conversion saves us from
having to rework the code by moving the mmu_gather bits inside the
pte_lock.
Also avoid flushing the tlb batches from under the pte lock, this is
useful even without the i_mmap_lock conversion as it significantly reduces
pte lock hold times.
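The "try to allocate a page, else fall back to a small on-stack array" scheme can be sketched in user space as below. All names are hypothetical, malloc() stands in for the page allocation, and the allow_alloc parameter exists only to simulate an allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

#define SK_LOCAL_SLOTS 8u	/* size of the fallback array (assumed) */
#define SK_PAGE_BYTES  4096u	/* assumed page size */

struct sk_gather {
	void **slots;			/* where pending pages are recorded */
	size_t max;			/* capacity of slots */
	size_t nr;			/* entries currently batched */
	int heap;			/* non-zero if slots was allocated */
	void *local[SK_LOCAL_SLOTS];	/* small fallback, lives with the struct */
};

static void sk_gather_init(struct sk_gather *g, int allow_alloc)
{
	void **page = allow_alloc ? malloc(SK_PAGE_BYTES) : NULL;

	if (page) {
		g->slots = page;
		g->max = SK_PAGE_BYTES / sizeof(void *);
		g->heap = 1;
	} else {
		/* Allocation failed: still make progress with the small array. */
		g->slots = g->local;
		g->max = SK_LOCAL_SLOTS;
		g->heap = 0;
	}
	g->nr = 0;
}

/* Record one page; returns non-zero when the batch is full and must be flushed. */
static int sk_gather_add(struct sk_gather *g, void *page)
{
	assert(g->nr < g->max);
	g->slots[g->nr++] = page;
	return g->nr == g->max;
}

static void sk_gather_finish(struct sk_gather *g)
{
	if (g->heap)
		free(g->slots);
	g->slots = NULL;
	g->max = 0;
	g->nr = 0;
}

static void sk_gather_example(void)
{
	struct sk_gather g;
	size_t i;
	int full = 0;

	sk_gather_init(&g, 0);		/* pretend the page allocation failed */
	for (i = 0; i < SK_LOCAL_SLOTS; i++)
		full = sk_gather_add(&g, &g);
	assert(full);			/* fallback array filled: time to flush */
	sk_gather_finish(&g);
}
```

Because nothing here depends on per-cpu storage, the batch state travels with the caller, which is the property that makes a preemptible gather possible.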
[akpm@linux-foundation.org: fix comment tpyo]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have expand_upwards exported while expand_downwards is
accessible only via expand_stack or expand_stack_downwards.
check_stack_guard_page is a nice example of the asymmetry. It uses
expand_stack for VM_GROWSDOWN while expand_upwards is called for
VM_GROWSUP case.
Let's clean this up by exporting both functions and make those names
consistent. Let's use expand_{upwards,downwards} because expanding
doesn't always involve stack manipulation (an example is
ia64_do_page_fault which uses expand_upwards for registers backing store
expansion). expand_downwards has to be defined for both
CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
version in the early process initialization phase for growsup
configuration.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dentry fid gets clunked via the dput.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The 9p client is currently undergoing regular regression and
stress testing as a by-product of the virtfs work. I think it's
finally time to take off the experimental tags from the well-tested
code paths.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Currently, an fallocate request of size slightly larger than a power of
2 is turned into two block requests, each a power of 2, with the extra
blocks pre-allocated for future use. When an application calls
fallocate, it already has an idea about how large the file may grow so
there is usually little benefit to reserve extra blocks on the
preallocation list. Not reserving them reduces disk fragmentation.
Tested: fsstress. Also verified manually that fallocate'd files are
contiguously laid out with this change (whereas without it they begin at
power-of-2 boundaries, leaving blocks in between). CPU usage of
fallocate is not appreciably higher. In a tight fallocate loop, CPU
usage hovers between 5%-8% with this change, and 5%-7% without it.
Using a simulated file system aging program which fills the file system to
70%, the percentage of free extents larger than 8MB (as measured by
e2freefrag) increased from 38.8% without this change, to 69.4% with
this change.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds new routines: "ext4_punch_hole" "ext4_ext_punch_hole"
and "ext4_ext_check_cache"
fallocate has been modified to call ext4_punch_hole when the punch hole
flag is passed. At the moment, we only support punching holes in
extents, so this routine is pretty much a wrapper for the ext4_ext_punch_hole
routine.
The ext4_ext_punch_hole routine first completes all outstanding writes
with the associated pages, and then releases them. The non-block-aligned
data is zeroed, and all blocks in between are punched out.
The ext4_ext_check_cache routine is very similar to ext4_ext_in_cache
except it accepts an ext4_ext_cache parameter instead of an ext4_extent
parameter. This routine is used by ext4_ext_punch_hole to check whether
a block is in a hole that has been cached. The ext4_ext_cache
parameter is necessary because the members of the ext4_extent structure are
not large enough to hold a 32 bit value. The existing
ext4_ext_in_cache routine has become a wrapper to this new function.
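The split between zeroed partial blocks and punched whole blocks described above can be worked out with a little arithmetic. The sketch below only computes the plan for a byte range; the struct and function names are hypothetical and no filesystem state is touched.

```c
#include <assert.h>
#include <stdint.h>

struct sk_punch_plan {
	uint64_t first_block;	/* first whole block inside the hole */
	uint64_t nr_blocks;	/* whole blocks to punch out */
	uint64_t head_zero;	/* bytes to zero before the first whole block */
	uint64_t tail_zero;	/* bytes to zero after the last whole block */
};

static struct sk_punch_plan sk_plan_punch(uint64_t offset, uint64_t length,
					  uint32_t block_size)
{
	struct sk_punch_plan plan = { 0, 0, 0, 0 };
	uint64_t end, first, last;

	assert(block_size > 0);
	end = offset + length;
	first = (offset + block_size - 1) / block_size;	/* round start up */
	last = end / block_size;			/* round end down */

	plan.first_block = first;
	if (first > last) {
		/* The hole lies inside a single block: only zeroing is needed. */
		plan.head_zero = length;
		return plan;
	}
	plan.head_zero = first * block_size - offset;
	plan.tail_zero = end - last * block_size;
	plan.nr_blocks = last - first;
	return plan;
}

static void sk_punch_example(void)
{
	/* Bytes 100..10099 with 4k blocks: zero 3996, punch block 1, zero 1908. */
	struct sk_punch_plan p = sk_plan_punch(100, 10000, 4096);

	assert(p.head_zero == 3996 && p.nr_blocks == 1 && p.tail_zero == 1908);
	assert(p.head_zero + p.nr_blocks * 4096 + p.tail_zero == 10000);
}
```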
[ext4 punch hole patch series 5/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch adds a new flag to ext4_map_blocks() that specifies the
given range of blocks should be punched out. Extents are first
converted to uninitialized extents before they are punched
out. Because punching a hole may require that the extent be split, it
is possible that the splitting may need more blocks than are
available. To deal with this, use of reserved blocks is enabled to
allow the split to proceed.
The routine then returns the number of blocks successfully
punched out.
[ext4 punch hole patch series 4/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch modifies the truncate routines to support hole punching.
Below is a brief summary of the patch's changes:
- Added end param to ext4_ext_rm_leaf
This function has been modified to accept an end parameter
which enables it to punch holes in leaves instead of just
truncating them.
- Implemented the "remove head" case in the ext4_remove_blocks routine
This routine is used by ext4_ext_rm_leaf to remove the tail
of an extent during a truncate. The new ext4_ext_rm_leaf
routine will now also use it to remove the head of an extent in the
case that the hole covers a region of blocks at the beginning
of an extent.
- Added "end" param to ext4_ext_remove_space routine
This function has been modified to accept an end parameter, which
is passed through to ext4_ext_rm_leaf.
[ext4 punch hole patch series 3/5 v6]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch modifies the existing ext4_block_truncate_page() function
which was used by the truncate code path, and which zeroes out block
unaligned data, by adding a new length parameter, and renames it to
ext4_block_zero_page_range(). This function can now be used to zero out the
head of a block, the tail of a block, or the middle
of a block.
The ext4_block_truncate_page() function is now a wrapper to
ext4_block_zero_page_range().
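As a rough illustration of the head, tail and middle cases, the user-space sketch below zeroes a byte range that is clamped to a single block. The names zero_block_range and BLOCK_SIZE are hypothetical, and the file is modelled as a flat buffer; the real function operates on page-cache buffers under a journal handle.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096u  /* assumed block size for this sketch */

/*
 * Zero `length` bytes starting at byte offset `from`, clamped so that the
 * range never crosses the end of the block containing `from`.
 * `data` models the file contents as a flat buffer.
 * Returns the number of bytes actually zeroed.
 */
static size_t zero_block_range(unsigned char *data, size_t from, size_t length)
{
    size_t offset = from % BLOCK_SIZE;   /* position inside the block */
    size_t max = BLOCK_SIZE - offset;    /* bytes left in this block  */

    if (length > max)
        length = max;                    /* clamp to the block end */
    memset(data + from, 0, length);
    return length;
}
```

In this model the old truncate behaviour corresponds to passing a length that reaches the end of the block, which is why the truncate helper can become a thin wrapper.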
[ext4 punch hole patch series 2/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks. This will allow a
punch hole to proceed even if the disk is full. Punching a hole may
require additional blocks to first split the extents.
Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:
ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks
[ext4 punch hole patch series 1/5 v7]
Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
This patch fixes a race in the GFS2 glock state machine that may
result in lockups. The symptom is that all nodes but one will
hang, waiting for a particular glock. All the holder records
will have the "W" (Waiting) bit set. The other node will
typically have the glock stuck in Exclusive mode (EX) with no
holder records, but the dinode will be cached. In other words,
an entry with "I:" will appear in the glock dump for that glock,
but nothing else.
The race has to do with the glock "Pending Demote" bit, which
can be set, then immediately reset, thus losing the fact that
another node needs the glock. The sequence of events is:
1. Something schedules the glock workqueue (e.g. glock request from fs)
2. The glock workqueue gets to the point between the test of the reply pending
bit and the spin lock:
    if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
        finish_xmote(gl, gl->gl_reply);
        drop_ref = 1;
    }
    down_read(&gfs2_umount_flush_sem); <---- i.e. here
    spin_lock(&gl->gl_spin);
3. In comes (a) the reply to our EX lock request setting GLF_REPLY_PENDING and
(b) the demote request which sets GLF_PENDING_DEMOTE
4. The following test is executed:
    if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
        gl->gl_state != LM_ST_UNLOCKED &&
        gl->gl_demote_state != LM_ST_EXCLUSIVE) {
This resets the pending demote flag, and gl->gl_demote_state is not equal to
exclusive. However, because the reply from the dlm arrived after we checked for
the GLF_REPLY_PENDING flag, gl->gl_state is still equal to unlocked. So
although we reset the GLF_PENDING_DEMOTE flag, we did not then set the
GLF_DEMOTE flag or reinstate the GLF_PENDING_DEMOTE flag.
The patch closes the timing window by only transitioning the
"Pending demote" bit to the "demote" flag once we know the
other conditions (not unlocked and not exclusive) are met.
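The idea of the fix can be sketched in user space as follows. This is a simplified model, not the GFS2 code: struct glock_model, the ST_* states and the FLAG_* bits are hypothetical stand-ins, and plain integer flags are used because the caller is assumed to hold the glock spin lock.

```c
#include <stdbool.h>

enum { ST_UNLOCKED, ST_SHARED, ST_EXCLUSIVE };            /* simplified states */
enum { FLAG_PENDING_DEMOTE = 1u << 0, FLAG_DEMOTE = 1u << 1 };

struct glock_model {
    unsigned flags;
    int state;          /* current lock state */
    int demote_state;   /* state the other node asked us to demote to */
};

/*
 * Move "pending demote" to "demote" only when every condition holds.
 * If a condition fails, the pending bit is left untouched so that a
 * later pass can still see that another node wants the lock.
 */
static bool promote_pending_demote(struct glock_model *gl)
{
    if (!(gl->flags & FLAG_PENDING_DEMOTE))
        return false;
    if (gl->state == ST_UNLOCKED || gl->demote_state == ST_EXCLUSIVE)
        return false;                       /* keep the pending bit set */

    gl->flags &= ~(unsigned)FLAG_PENDING_DEMOTE;
    gl->flags |= FLAG_DEMOTE;
    return true;
}
```

The difference from the racy form is the ordering: the bit is cleared after the other tests pass, instead of being cleared as a side effect of the first test.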
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Support partial extent moving, which may split the movement of an entire
extent into pieces to cope with insufficient allocations. This eases the
'ENOSPC' pain and makes the whole move much less likely to fail. The downside
is that it may leave the fs even more fragmented than before the move, so
userspace is left to make the trade-off.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The basic logic of moving extents for a file is much like the hole-punching
sequence: walk the extents within the range specified by the user, calculate
an appropriate length to defrag/move, then let ocfs2_defrag/move_extent()
do the actual moving.
This function ends up returning 'OCFS2_MOVE_EXT_FL_COMPLETE' to userspace if
the operation completes successfully.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The helper calculates the defrag length for one run according to a threshold.
Defragmentation proceeds until the threshold is met, and a LARGE extent, if
any, is skipped.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
ocfs2_move_extent() validates goal_offset_in_block, the location the extents
are to be moved to. It also compromises a bit by probing for an appropriate
region around the given goal_offset when the original goal cannot fit the
movement.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Before moving extents, probe the alloc group starting from 'goal_blk' for a
contiguous region that fits the wanted movement. As a best-effort fallback,
accept a region within a threshold around the given goal.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Make a first best-effort attempt to validate and adjust the goal (physical
address in blocks). This cannot guarantee that the later operation always
succeeds, since global_bitmap may change over time.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
This function tries to locate the alloc group in which a given physical block
resides. Given the block number, it returns to the caller a buffer_head for the
victim group descriptor and the offset of the block within that group.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
This is a relatively complete function that performs defragmentation of an
entire or partial extent. One journal handle is kept during the operation.
Logically it does one more thing than ocfs2_move_extent(): it claims the new
clusters itself ;-)
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The moving range of __ocfs2_move_extent() is always within one extent. It
consists of the following parts:
1. Duplicate the clusters in pages to new_blkoffset, where the extent is to be moved.
2. Split the original extent with the new extent, coalescing nearby extents if possible.
3. Append the old clusters to the truncate log, or decrease the refcount if the extent was refcounted.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
ocfs2_lock_allocators_move_extents() is like the common ocfs2_lock_allocators():
it locks the metadata and data allocators during extent moving, reserves the
appropriate metadata blocks and data clusters, and makes a best-effort
calculation of the journal transaction credits needed for one run of movement.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
These functions are made common to benefit future defragmentation and
extent-moving code, since reflink and defragmentation share the same
Copy-On-Write mechanism.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
This new code is a bit more complicated than the earlier ones. The goal is to
show the user all the statistics needed to understand how fragmented the
filesystem is on disk.
This is achieved by scanning the global bitmap from (cluster) group to group
to determine the following:
- How many free chunks there are of the fixed size requested by the user.
- How many real free chunks there are of any size.
- Min/Max/Avg size (in clusters) of free chunks.
- How free chunks are distributed (by size) as a histogram,
like the following:
---------------------------------------------------------
Extent Size Range :  Free extents  Free Clusters  Percent
 32K...   64K-    :             1              1    0.00%
  1M...    2M-    :             9            288    0.03%
  8M...   16M-    :             2            831    0.09%
 32M...   64M-    :             1           2047    0.23%
128M...  256M-    :             1           8191    0.92%
256M...  512M-    :             2          21706    2.43%
512M... 1024M-    :            27         858623   96.29%
---------------------------------------------------------
Userspace ioctl() call eventually gets the above info returned by passing
a 'struct ocfs2_info_freefrag' with the chunk_size being specified first.
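A minimal user-space sketch of this kind of scan is shown below. It is not the ocfs2 implementation: struct freefrag_stats, account_chunk and scan_bitmap are hypothetical names, the bitmap is modelled as one byte per cluster, and chunks are bucketed by power-of-two length in clusters.

```c
#include <stddef.h>
#include <stdint.h>

#define HIST_BUCKETS 32

struct freefrag_stats {
    uint32_t chunks;               /* number of free chunks found */
    uint32_t min, max;             /* smallest / largest chunk, in clusters */
    uint64_t total;                /* total free clusters */
    uint32_t hist[HIST_BUCKETS];   /* hist[i]: chunks with 2^i <= len < 2^(i+1) */
};

/* Record one free chunk of `len` clusters (len >= 1). */
static void account_chunk(struct freefrag_stats *st, uint32_t len)
{
    unsigned bucket = 0;

    for (uint32_t n = len; n > 1; n >>= 1)
        bucket++;                  /* bucket = floor(log2(len)) */
    st->hist[bucket]++;
    st->chunks++;
    st->total += len;
    if (st->chunks == 1 || len < st->min)
        st->min = len;
    if (len > st->max)
        st->max = len;
}

/* used[i] != 0 means cluster i is allocated; st must start zeroed. */
static void scan_bitmap(const unsigned char *used, size_t nclusters,
                        struct freefrag_stats *st)
{
    uint32_t run = 0;

    for (size_t i = 0; i < nclusters; i++) {
        if (!used[i]) {
            run++;                 /* extend the current free run */
        } else if (run) {
            account_chunk(st, run);
            run = 0;
        }
    }
    if (run)
        account_chunk(st, run);    /* free run that reaches the end */
}
```

The average chunk size follows as total / chunks, and each histogram row's percentage as the bucket's clusters over the total.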
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
The new code calculates the number of free inodes in all inode_allocs,
then returns the info to userspace as an array.
In particular, the flag 'OCFS2_INFO_FL_NON_COHERENT', controlled by
'--cluster-coherent' from userspace, now comes into play. Setting the flag
means cluster coherency is not considered; userspace tools usually choose the
non-coherent strategy by default for the sake of performance.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Replace direct call to kmalloc for a potentially large, contiguous
buffer allocation with one to mtd_kmalloc_up_to which helps ensure the
operation can succeed under low-memory, highly-fragmented situations
albeit somewhat more slowly.
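The fallback strategy can be sketched in user space roughly as below. This illustrates the general "ask for less on failure" pattern only and is not the mtd_kmalloc_up_to source; alloc_up_to, alloc_fn and limited_alloc are hypothetical names, and limited_alloc merely simulates a fragmented allocator that cannot satisfy large requests.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t);

/*
 * Try to allocate *size bytes. On failure, halve the request until it
 * reaches min_size. On success *size holds the size actually obtained,
 * so the caller can process its data in pieces of that size.
 */
static void *alloc_up_to(alloc_fn alloc, size_t *size, size_t min_size)
{
    while (*size > min_size) {
        void *buf = alloc(*size);

        if (buf)
            return buf;
        *size /= 2;
        if (*size < min_size)
            *size = min_size;
    }
    return alloc(*size);           /* last attempt at the minimum size */
}

/* Simulated allocator: pretends no contiguous region above 4 KiB exists. */
static void *limited_alloc(size_t n)
{
    return n > 4096 ? NULL : malloc(n);
}
```

The cost is that a caller handed a smaller buffer needs more iterations to cover the same data, which matches the "somewhat more slowly" caveat above.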
Signed-off-by: Grant Erickson <marathon96@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
I am working on a patch to add quota as a built-in feature for ext4
filesystem. The implementation is based on the design given at
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4.
This patch reserves the inode numbers 3 and 4 for quota purposes and
also reserves EXT4_FEATURE_RO_COMPAT_QUOTA feature code.
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated are displayed.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
File: "/mnt/mp1/"
ID: e175ccb83a872efe Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 258022 Free: -15 Available: -13122
Inodes: Total: 65536 Free: 63029
f_bfree in struct statfs will go negative when the filesystem has
few free blocks, because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.
CASE 1:
ext4_da_writepages
mpage_da_map_and_submit
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_diskspace_used
percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
<--- interrupt statfs systemcall --->
ext4_da_update_reserve_space
percpu_counter_sub(&sbi->s_dirtyblocks_counter,
used + ei->i_allocated_meta_blocks);
CASE 2:
ext4_write_begin
__block_write_begin
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_diskspace_used
percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
<--- interrupt statfs systemcall --->
percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);
To avoid the issue, this patch ensures that f_bfree is non-negative.
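A minimal sketch of such a clamp, assuming the reported value is computed as free blocks minus dirty blocks (statfs_free_blocks is a hypothetical helper, not the ext4 function):

```c
#include <stdint.h>

/*
 * The two counters are updated at different times, so a reader that runs
 * between the updates can see dirty > free. Report zero in that window
 * instead of a negative count.
 */
static uint64_t statfs_free_blocks(int64_t free_blocks, int64_t dirty_blocks)
{
    int64_t bfree = free_blocks - dirty_blocks;

    return bfree > 0 ? (uint64_t)bfree : 0;
}
```

This does not close the window between the two counter updates; it only keeps the transient inconsistency from reaching userspace as a negative number.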
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do the right thing is really simple, so let's do
that.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently we call ext4_mb_load_buddy() for every block
group we go through in ext4_trim_fs(), in many cases just to find
out that there is not enough free space to bother with. As Amir Goldstein
suggested we can use bb_free information directly from ext4_group_info.
This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and instead
gets the ext4_group_info via ext4_get_group_info() and uses the bb_free
information directly from that. This avoids an unnecessary call to load
the buddy when the group does not have enough free space to trim.
Loading the buddy is now moved to ext4_trim_all_free().
Tested by me with xfstests 251.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
jbd: Fix comment to match the code in journal_start()
jbd/jbd2: remove obsolete summarise_journal_usage.
jbd: Fix forever sleeping process in do_get_write_access()
ext2: fix error msg when mounting fs with too-large blocksize
jbd: fix fsync() tid wraparound bug
ext3: Fix fs corruption when make_indexed_dir() fails
ext3: Fix lock inversion in ext3_symlink()
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.
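For readers unfamiliar with the convention, the sketch below re-creates the ERR_PTR() idea in user space. The three helpers mirror the kernel macros of the same names in simplified form; start_handle is a hypothetical stand-in for a function that reports failure this way.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in the top, never-valid pointer range. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical function that returns ERR_PTR() rather than NULL on failure. */
static void *start_handle(int fail)
{
    static int handle;

    return fail ? ERR_PTR(-ENOMEM) : (void *)&handle;
}
```

A caller that only tests the result against NULL would treat the encoded error as a valid handle, which is why the check has to be IS_ERR().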
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits)
TOMOYO: Fix wrong domainname validation.
SELINUX: add /sys/fs/selinux mount point to put selinuxfs
CRED: Fix load_flat_shared_library() to initialise bprm correctly
SELinux: introduce path_has_perm
flex_array: allow 0 length elements
flex_arrays: allow zero length flex arrays
flex_array: flex_array_prealloc takes a number of elements, not an end
SELinux: pass last path component in may_create
SELinux: put name based create rules in a hashtable
SELinux: generic hashtab entry counter
SELinux: calculate and print hashtab stats with a generic function
SELinux: skip filename trans rules if ttype does not match parent dir
SELinux: rename filename_compute_type argument to *type instead of *con
SELinux: fix comment to state filename_compute_type takes an objname not a qstr
SMACK: smack_file_lock can use the struct path
LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH
LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE
SELINUX: Make selinux cache VFS RCU walks safe
SECURITY: Move exec_permission RCU checks into security modules
SELinux: security_read_policy should take a size_t not ssize_t
...
In e9964c10 we changed cap flushing to do a delicate dance because some
inodes on the cap_dirty list could be in a migrating state (got EXPORT but
not IMPORT) in which we couldn't actually flush and move from
dirty->flushing, breaking the while (!empty) { process first } loop
structure. It worked for a single sync thread, but was not reentrant and
triggered infinite loops when multiple syncers came along.
Instead, move inodes with dirty caps to a separate cap_dirty_migrating list
when in the limbo export-but-no-import state, allowing us to go back to
the simple loop structure (which was reentrant). This is cleaner and more
robust.
Audited the cap_dirty users and this looks fine:
list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we
have dirty caps (which list we're on is irrelevant) and list_del_init()
calls still do the right thing.
Signed-off-by: Sage Weil <sage@newdream.net>
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (52 commits)
UBIFS: switch to dynamic printks
UBIFS: fix kernel-doc comments
UBIFS: fix extremely rare mount failure
UBIFS: simplify LEB recovery function further
UBIFS: always cleanup the recovered LEB
UBIFS: clean up LEB recovery function
UBIFS: fix-up free space on mount if flag is set
UBIFS: add the fixup function
UBIFS: add a superblock flag for free space fix-up
UBIFS: share the next_log_lnum helper
UBIFS: expect corruption only in last journal head LEBs
UBIFS: synchronize write-buffer before switching to the next bud
UBIFS: remove BUG statement
UBIFS: change bud replay function conventions
UBIFS: substitute the replay tree with a replay list
UBIFS: simplify replay
UBIFS: store free and dirty space in the bud replay entry
UBIFS: remove unnecessary stack variable
UBIFS: double check that buds are replied in order
UBIFS: make 2 functions static
...
Blocks for the allocation btree are allocated from and released to
the AGFL, and thus frequently reused. Even worse we do not have an
easy way to avoid using an AGFL block when it is discarded due to
the simple FILO list of free blocks, and thus can frequently stall
on blocks that are currently undergoing a discard.
Add a flag to the busy extent tracking structure to skip the discard
for allocation btree blocks. In normal operation these blocks are
reused frequently enough that there is no need to discard them
anyway, but if they spill over to the allocation btree as part of a
balance we "leak" blocks that we would otherwise discard. We could
fix this by adding another flag and keeping these blocks in the
rbtree even after they aren't busy any more so that we could discard
them when they migrate out of the AGFL. Given that this would cause
significant overhead I don't think it's worthwhile for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that we have reliable tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.
The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard. Note that we don't bother
supporting discard for the non-delaylog mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
jbd2_log_start_commit() returns 1 only when we really start a
transaction. But we also need to wait for a transaction when the
commit is already running. Fix this problem by waiting for
transaction commit unconditionally (which is just a quick check if the
transaction is already committed).
Also we have to be more careful with sending of a barrier because when
transaction is being committed in parallel to ext4_sync_file()
running, we cannot be sure that the barrier the journalling code sends
happens after we wrote all the data for fsync (note that not every
data writeout needs to trigger metadata changes thus commit of some
metadata changes can be running while other data is still written
out). So use jbd2_will_send_data_barrier() helper to detect the common
cases when we can be sure barrier will be issued by the commit code
and issue the barrier ourselves in the remaining cases.
Reported-by: Edward Goggin <egoggin@vmware.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device. The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit. In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).
Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.
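The approach can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the jbd code: struct txn_model and both helpers are hypothetical, devices are represented by plain integers, and the journal device is assumed to be flushed by the commit itself.

```c
#include <stdbool.h>

struct txn_model {
    bool need_data_flush;   /* set once any inode is filed to the transaction */
};

/* Called when an inode is added to the transaction's inode list. */
static void txn_file_inode(struct txn_model *t)
{
    t->need_data_flush = true;
}

/*
 * Commit-time decision: flush the filesystem device only if some inode
 * was filed to the transaction and that device differs from the journal
 * device.
 */
static bool commit_must_flush_fs_dev(const struct txn_model *t,
                                     int fs_dev, int journal_dev)
{
    return t->need_data_flush && fs_dev != journal_dev;
}
```

Because the flag lives in the transaction rather than being derived from the inode list at commit time, it survives the case where the inode has already been written out and reclaimed.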
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
To get delayed-extent information, ext4_ext_fiemap_cb() looks up
the pagecache, and thus collects information starting from a page's
head block.
If blocksize < pagesize, the beginning blocks of a page may lie
before the requested range. So ext4_ext_fiemap_cb() should proceed
ignoring them, because they have been handled before. If no mapped
buffer in the range is found in the 1st page, we need to look up
the 2nd page, otherwise delayed-extents after a hole will be ignored.
Without this patch, xfstests 225 will hang on ext4 with 1K block size.
Reported-by: Amir Goldstein <amir73il@users.sourceforge.net>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Change function param_set_dlmfs_capabilities from 'extern' to 'static' since
function param_get_dlmfs_capabilities is also 'static'.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch cleans up the gunk added by commit 388c4bcb4e.
dlm_is_lockres_migrateable() now returns 1 if lockresource is deemed
migrateable and 0 if not.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Add ocfs2_trim_fs to support trimming freed clusters in the
volume. A range will be given and all the freed clusters greater
than minlen will be discarded to the block layer.
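A simplified model of the range scan is sketched below. It is not the ocfs2 code: trim_free_runs is a hypothetical name, the bitmap is one byte per cluster, and the sketch assumes that a free run qualifies when its length is at least minlen.

```c
#include <stddef.h>

/*
 * Walk clusters [start, end) and count the clusters that belong to free
 * runs of at least `minlen` clusters. used[i] != 0 means cluster i is
 * allocated. Returns the number of clusters that would be discarded.
 */
static size_t trim_free_runs(const unsigned char *used, size_t start,
                             size_t end, size_t minlen)
{
    size_t trimmed = 0, run = 0;

    for (size_t i = start; i <= end; i++) {
        if (i < end && !used[i]) {
            run++;                 /* extend the current free run */
            continue;
        }
        /* Run ended (allocated cluster or end of range). */
        if (run > 0 && run >= minlen)
            trimmed += run;        /* a real implementation would issue the discard here */
        run = 0;
    }
    return trimmed;
}
```

Short runs are skipped because discarding many tiny ranges tends to cost more than it gains on most devices.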
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
As ocfs2 supports relatime and strictatime, we need to update the
relevant documentation. atime_quantum needs to work with strictatime, so only
show it in procfs when mounted with strictatime.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
fat: Fix statfs->f_namelen
fat: Replace all printk with fat_msg()
fat: Add fat_msg() function for preformated FAT messages
fat: Convert fat_fs_error to use %pV
fat: Fix possible null deref in fat_cache_add()
fat: use new setup() for ->dir_ops too
Minor revision to the last version of this patch -- the only difference
is the fix to the cFYI statement in cifs_reconnect.
Holding the spinlock while we call this function means that it can't
sleep, which really limits what it can do. Taking it out from under
the spinlock also means less contention for this global lock.
Change the semantics such that the Global_MidLock is not held when
the callback is called. To do this requires that we take extra care
not to have sync_mid_result remove the mid from the list when the
mid is in a state where that has already happened. This prevents
list corruption when the mid is sitting on a private list for
reconnect or when cifsd is coming down.
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Reorganize the code to parse the mount options first, before getting a superblock.
This lets us use the shared superblock model for identical mounts later on.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
journal_start returns an ERR_PTR() value rather than NULL on failure.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: obey minleft values during extent allocation correctly
xfs: reset buffer pointers before freeing them
xfs: avoid getting stuck during async inode flushes
xfs: fix xfs_itruncate_start tracing
xfs: fix duplicate workqueue initialisation
xfs: kill off xfs_printk()
xfs: fix race condition in AIL push trigger
xfs: make AIL target updates and compares 32bit safe.
xfs: always push the AIL to the target
xfs: exit AIL push work correctly when AIL is empty
xfs: ensure reclaim cursor is reset correctly at end of AG
xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE
xfs: fix compiler warning in xfs_trace.h
xfs: cleanup duplicate initializations
xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy
xfs: exact busy extent tracking
xfs: do not immediately reuse busy extent ranges
xfs: optimize AGFL refills
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
tty/serial: Fix break handling for PORT_TEGRA
tty/serial: Add explicit PORT_TEGRA type
n_tracerouter and n_tracesink ldisc additions.
Intel PTI implementaiton of MIPI 1149.7.
Kernel documentation for the PTI feature.
export kernel call get_task_comm().
tty: Remove to support serial for S5P6442
pch_phub: Support new device ML7223
8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
ASoC: Update cx20442 for TTY API change
pch_uart: Support new device ML7223 IOH
parport: Use request_muxed_region for IT87 probe and lock
tty/serial: add support for Xilinx PS UART
n_gsm: Use print_hex_dump_bytes
drivers/tty/moxa.c: Put correct tty value
TTY: tty_io, annotate locking functions
TTY: serial_core, remove superfluous set_task_state
TTY: serial_core, remove invalid test
Char: moxa, fix locking in moxa_write
...
Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
drivers/tty/serial/Makefile.
I did the hci_ldisc thing as an evil merge, cleaning things up.
In commit c8d46e41 (ext4: Add flag to files with blocks intentionally
past EOF), if the EOFBLOCKS_FL flag is set, we call ext4_truncate()
before calling vmtruncate(). This caused any allocated but unwritten
blocks created by calling fallocate() with the FALLOC_FL_KEEP_SIZE
flag to be dropped. This was done to make sure that
EOFBLOCKS_FL would not be cleared while still leaving blocks past
i_size allocated. This was not necessary, since ext4_truncate()
guarantees that blocks past i_size will be dropped, even in the case
where truncate() has increased i_size before calling ext4_truncate().
So fix this by removing the EOFBLOCKS_FL special case treatment in
ext4_setattr(). In addition, use truncate_setsize() followed by a
call to ext4_truncate() instead of using vmtruncate(). This is more
efficient since it skips the call to inode_newsize_ok(), which has
been checked already by inode_change_ok(). This is also a win in
the case where EOFBLOCKS_FL is set since it avoids calling
ext4_truncate() twice.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use separate functions to compare an existing structure with
what we are requesting, to make the server, session and tcon
search code easier to reuse in the upcoming superblock match call.
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: Reorder clock bases
hrtimers: Avoid touching inactive timer bases
hrtimers: Make struct hrtimer_cpu_base layout less stupid
timerfd: Manage cancelable timers in timerfd
clockevents: Move C3 stop test outside lock
alarmtimer: Drop device refcount after rtc_open()
alarmtimer: Check return value of class_find_device()
timerfd: Allow timers to be cancelled when clock was set
hrtimers: Prepare for cancel on clock was set timers
There's no SMB2 support in the CIFS filesystem driver, so there's no need to
have a config and mount option for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>