Using Eric Sandeen's xfstest for fallocate, you can easily trigger a ENOSPC
panic on btrfs. This is because we do not account for data we may use when
doing the fallocate. This patch fixes the problem by properly reserving space,
and then just freeing it when we are done. The reservation stuff was made with
delalloc in mind, so its a little crude for this case, but it keeps the box
from panicing.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The per-user inotify_devs value is incremented each time a new file is
allocated, but never decremented. This led to inotify_init failing after a
limited number of calls.
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
* git://git.infradead.org/mtd-2.6:
mtd: nand: fix build failure and incorrect return from omap_wait()
mtd: Use BLOCK_NIL consistently in NFTL/INFTL
mtd: m25p80 timeout too short for worst-case m25p16 devices
mtd: atmel_nand: Fix typo s/parititions/partitions/
mtd: cmdlineparts: Use 64-bit format when printing a debug message.
mtd: maps: Remove BUS_ID_SIZE from integrator_flash
jffs2: fix another potential leak on error path in scan.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: invalidation reverse calls
fuse: allow umask processing in userspace
fuse: fix bad return value in fuse_file_poll()
fuse: fix return value of fuse_dev_write()
Check before use it.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs. Each bip slab
has an embedded bio_vec array at the end. This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version. Only the largest bip slab is backed by a mempool. The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
Maximum file size for hostfs mounts defaults to 2GB, so bigger files cannot be
read/written through hostfs. This patch initializes the maximum file size to
MAX_LFS_SIZE.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13531
Signed-off-by: Wolfgang Illmeyer <wolfgang@illmeyer.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext2_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode. This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.
The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.
This patch thus changes ext2_lookup to return -EIO if it receives -ESTALE
from ext2_iget(), as ext2 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.
Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With ELF, at generating coredump, some more headers other than used
vmas are added.
When max_map_count == 65536, a core generated by following kinds of
code can be unreadable because the number of ELF's program header is
written in 16bit in Ehdr (please see elf.h) and the number overflows.
==
... = mmap(); (munmap, mprotect, etc...)
if (failed)
abort();
==
This can happen in mmap/munmap/mprotect/etc...which calls split_vma().
I think 65536 is not safe as _default_ and reduce it to 65530 is good
for avoiding unexpected corrupted core.
Anyway, max_map_count can be enlarged by sysctl if a user is brave..
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the eventfd interface to de-couple the eventfd memory context, from
the file pointer instance.
Without such change, there is no clean way to racely free handle the
POLLHUP event sent when the last instance of the file* goes away. Also,
now the internal eventfd APIs are using the eventfd context instead of the
file*.
This patch is required by KVM's IRQfd code, which is still under
development.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't unlock on vfs_rejected_lock path in afs_do_setlk, since the lock
is unlocked after abort_attempt label.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add notification messages that allow the filesystem to invalidate VFS
caches.
Two notifications are added:
1) inode invalidation
- invalidate cached attributes
- invalidate a range of pages in the page cache (this is optional)
2) dentry invalidation
- try to invalidate a subtree in the dentry cache
Care must be taken while accessing the 'struct super_block' for the
mount, as it can go away while an invalidation is in progress. To
prevent this, introduce a rw-semaphore, that is taken for read during
the invalidation and taken for write in the ->kill_sb callback.
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@zresearch.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This patch lets filesystems handle masking the file mode on creation.
This is needed if filesystem is using ACLs.
- The CREATE, MKDIR and MKNOD requests are extended with a "umask"
parameter.
- A new FUSE_DONT_MASK flag is added to the INIT request/reply. With
this the filesystem may request that the create mode is not masked.
CC: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
exposes a bug due to a non-careful return of an int or unsigned value:
implement a FUSE filesystem which sends an unsolicited notification to
the kernel with invalid opcode. The respective write to /dev/fuse
will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
errno == EINVAL.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
This patch fixes an imbalance message as reported by J.R. Okajima.
The IMA file counters are incremented in ima_path_check. If the
actual open fails, such as ETXTBSY, decrement the counters to
prevent unnecessary imbalance messages.
Reported-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Fixes a regression caused by commit a6ce4932fb
When this lock was converted to a mutex, the locks were turned into
unlocks and vice-versa.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] remove unknown mount option warning message
[CIFS] remove bkl usage from umount begin
cifs: Fix incorrect return code being printed in cFYI messages
[CIFS] cleanup asn handling for ntlmssp
[CIFS] Copy struct *after* setting the port, instead of before.
cifs: remove rw/ro options
cifs: fix problems with earlier patches
cifs: have cifs parse scope_id out of IPv6 addresses and use it
[CIFS] Do not send tree disconnect if session is already disconnected
[CIFS] Fix build break
cifs: display scopeid in /proc/mounts
cifs: add new routine for converting AF_INET and AF_INET6 addrs
cifs: have cifs_show_options show forceuid/forcegid options
cifs: remove unneeded NULL checks from cifs_show_options
Jeff's previous patch which removed the unneeded rw/ro
parsing can cause a minor warning in dmesg (about the
unknown rw or ro mount option) at mount time. This
patch makes cifs ignore them in kernel to remove the warning
(they are already handled in the mount helper and VFS).
Signed-off-by: Steve French <sfrench@us.ibm.com>
The lock_kernel call moved into the fs for umount_begin
is not needed. This adds a check to make sure we don't
call umount_begin twice on the same fs.
umount_begin for cifs is probably not needed and
may eventually be able to be removed, but in
the meantime this smaller patch is safe and
gets rid of the bkl from this path which provides
some benefit.
Acked-by: Jeff Layton <redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
FreeXid() along with freeing Xid does add a cifsFYI debug message that
prints rc (return code) as well. In some code paths where we set/return
error code after calling FreeXid(), incorrect error code is being
printed when cifsFYI is enabled.
This could be misleading in few cases. For eg.
In cifs_open() if cifs_fill_filedata() returns a valid pointer to
cifsFileInfo, FreeXid() prints rc=-13 whereas 0 is actually being
returned. Fix this by setting rc before calling FreeXid().
Basically convert
FreeXid(xid); rc = -ERR;
return -ERR; => FreeXid(xid);
return rc;
[Note that Christoph would like to replace the GetXid/FreeXid
calls, which are primarily used for debugging. This seems
like a good longer term goal, but although there is an
alternative tracing facility, there are no examples yet
available that I know of that we can use (yet) to
convert this cifs function entry/exit logging, and for
creating an identifier that we can use to correlate
all dmesg log entries for a particular vfs operation
(ie identify all log entries for a particular vfs
request to cifs: e.g. a particular close or read or write
or byte range lock call ... and just using the thread id
is harder). Eventually when a replacement
for this is available (e.g. when NFS switches over and various
samples to look at in other file systems) we can remove the
GetXid/FreeXid macro but in the meantime multiple people
use this run time configurable logging all the time
for debugging, and Suresh's patch fixes a problem
which made it harder to notice some low
memory problems in the log so it is worthwhile
to fix this problem until a better logging
approach is able to be used]
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Also removes obsolete distinction between rawntlmssp and ntlmssp (in asn/SPNEGO)
since as jra noted we can always send raw ntlmssp in session setup now.
remove check for experimental runtime flag (/proc/fs/cifs/Experimental) in
ntlmssp path.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs: remove rw/ro options
These options are handled at the VFS layer. They only ever set the
option in the smb_vol struct. Nothing was ever done with them afterward
anyway.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs: fix problems with earlier patches
cifs_show_address hasn't been introduced yet, and fix a typo that was
silently fixed by a later patch in the series.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch has CIFS look for a '%' in an IPv6 address. If one is
present then it will try to treat that value as a numeric interface
index suitable for stuffing into the sin6_scope_id field.
This should allow people to mount servers on IPv6 link-local addresses.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Holder <david@erion.co.uk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Noticed this when tree connect timed out (due to Samba server crash) -
we try to send a tree disconnect for a tid that does not exist
since we don't have a valid tree id yet. This checks that the
session is valid before sending the tree disconnect to handle
this case.
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
switch xfs to generic acl caching helpers
helpers for acl caching + switch to those
switch shmem to inode->i_acl
switch reiserfs to inode->i_acl
switch reiserfs to usual conventions for caching ACLs
reiserfs: minimal fix for ACL caching
switch nilfs2 to inode->i_acl
switch btrfs to inode->i_acl
switch jffs2 to inode->i_acl
switch jfs to inode->i_acl
switch ext4 to inode->i_acl
switch ext3 to inode->i_acl
switch ext2 to inode->i_acl
add caching of ACLs in struct inode
fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls
cleanup __writeback_single_inode
... and the same for vfsmount id/mount group id
Make allocation of anon devices cheaper
update Documentation/filesystems/Locking
devpts: remove module-related code
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: remove redundant tests on unsigned
udf: Use device size when drive reported bogus number of written blocks
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).
ubifs/xattr.c needed includes reordered, the rest is a plain switchover.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
reiserfs uses NULL as "unknown" and ERR_PTR(-ENODATA) as "no ACL";
several codepaths store the former instead of the latter.
All those codepaths go through iset_acl() and all cases when it's
called with NULL acl are for the second variety, so the minimal
fix is to teach iset_acl() to deal with that.
Proper fix is to switch to more usual conventions and avoid back
and forth between internally used ERR_PTR(-ENODATA) and NULL
expected by the rest of the kernel.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch adds ioctls to vfs for compatibility with legacy XFS
pre-allocation ioctls (XFS_IOC_*RESVP*). The implementation
effectively invokes sys_fallocate for the new ioctls.
Also handles the compat_ioctl case.
Note: These legacy ioctls are also implemented by OCFS2.
[AV: folded fixes from hch]
Signed-off-by: Ankit Jain <me@ankitjain.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no reason to for the split between __writeback_single_inode and
__sync_single_inode, the former just does a couple of checks before
tail-calling the latter. So merge the two, and while we're at it split
out the I_SYNC waiting case for data integrity writers, as it's
logically separate function. Finally rename __writeback_single_inode to
writeback_single_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Standard trick - add a new variable (start) such that
for each n < start n is known to be busy. Allocation can
skip checking everything in [0..start) and if it returns
n, we can set start to n + 1. Freeing below start sets
start to what we'd just freed.
Of course, it still sucks if we do something like
free 0
allocate
allocate
in a loop - still O(n^2) time. However, on saner loads it
improves the things a lot and the entire thing is not worth
the trouble of switching to something with better worst-case
behaviour.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These days, the devpts filesystem is closely integrated with the pty
memory management, and cannot be built as a module, even less removed
from the kernel. Accordingly, remove all module-related stuff from
this filesystem.
[ v2: only remove code that's actually dead ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
commit 2a73787110 "Cache root in nameidata"
introduced a new member nd->root, but forgot to put it in do_filp_open().
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reiserfs doesn't use lock_super anywhere internally, and ->remount_fs
which calls reiserfs_resize does have it currently but also expects it
to be held on return, so there's no business for the unlock_super here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked by Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
first_block and goal are unsigned. When negative they are wrapped and caught by
the other test.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2/trivial: Wrap ocfs2_sysfile_cluster_lock_key within define.
ocfs2: Add lockdep annotations
vfs: Set special lockdep map for dirs only if not set by fs
ocfs2: Disable orphan scanning for local and hard-ro mounts
ocfs2: Do not initialize lvb in ocfs2_orphan_scan_lock_res_init()
ocfs2: Stop orphan scan as early as possible during umount
ocfs2: Fix ocfs2_osb_dump()
ocfs2: Pin journal head before accessing jh->b_committed_data
ocfs2: Update atime in splice read if necessary.
ocfs2: Provide the ocfs2_dlm_lvb_valid() stack API.
As noted in the previous patch, the NFSv4 client mount code currently
has several limitations. If the mount path contains symlinks, or
referrals, or even if it just contains a '..', then the client code in
nfs4_path_walk() will fail with an error.
This patch replaces the nfs4_path_walk()-based lookup with a helper
function that sets up a private namespace to represent the namespace on the
server, then uses the ordinary VFS and NFS path lookup code to walk down the
mount path in that namespace.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of this patch is to improve the remote mount path lookup
support for distributed filesystems such as the NFSv4 client.
When given a mount command of the form "mount server:/foo/bar /mnt", the
NFSv4 client is required to look up the filehandle for "server:/", and
then look up each component of the remote mount path "foo/bar" in order
to find the directory that is actually going to be mounted on /mnt.
Following that remote mount path may involve following symlinks,
crossing server-side mount points and even following referrals to
filesystem volumes on other servers.
Since the standard VFS path lookup code already supports walking paths
that contain all these features (using in-kernel automounts for
following referrals) we would like to be able to reuse that rather than
duplicate the full path traversal functionality in the NFSv4 client code.
This patch therefore defines a VFS helper function create_mnt_ns(), that
sets up a temporary filesystem namespace and attaches a root filesystem to
it. It exports the create_mnt_ns() and put_mnt_ns() function for use by
filesystem modules.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to allow modules to use it without having to export vfsmount_lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.infradead.org/mtd-2.6: (63 commits)
mtd: OneNAND: Allow setting of boundary information when built as module
jffs2: leaking jffs2_summary in function jffs2_scan_medium
mtd: nand: Fix memory leak on txx9ndfmc probe failure.
mtd: orion_nand: use burst reads with double word accesses
mtd/nand: s3c6400 support for s3c2410 driver
[MTD] [NAND] S3C2410: Use DIV_ROUND_UP
[MTD] [NAND] S3C2410: Deal with unaligned lengths in S3C2440 buffer read/write
[MTD] [NAND] S3C2410: Allow the machine code to get the BBT table from NAND
[MTD] [NAND] S3C2410: Added a kerneldoc for s3c2410_nand_set
mtd: physmap_of: Add multiple regions and concatenation support
mtd: nand: max_retries off by one in mxc_nand
mtd: nand: s3c2410_nand_setrate(): use correct macros for 2412/2440
mtd: onenand: add bbt_wait & unlock_all as replaceable for some platform
mtd: Flex-OneNAND support
mtd: nand: add OMAP2/OMAP3 NAND driver
mtd: maps: Blackfin async: fix memory leaks in probe/remove funcs
mtd: uclinux: mark local stuff static
mtd: uclinux: do not allow to be built as a module
mtd: uclinux: allow systems to override map addr/size
mtd: blackfin NFC: fix hang when using NAND on BF527-EZKITs
...
Actually ocfs2_sysfile_cluster_lock_key is only used if we enable
CONFIG_DEBUG_LOCK_ALLOC. Wrap it so that we can avoid a building
warning.
fs/ocfs2/sysfile.c:53: warning: ‘ocfs2_sysfile_cluster_lock_key’
defined but not used
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Add lockdep support to OCFS2. The support also covers all of the cluster
locks except for open locks, journal locks, and local quotafile locks. These
are special because they are acquired for a node, not for a particular process
and lockdep cannot deal with such type of locking.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Some filesystems need to set lockdep map for i_mutex differently for
different directories. For example OCFS2 has system directories (for
orphan inode tracking and for gathering all system files like journal
or quota files into a single place) which have different locking
locking rules than standard directories. For a filesystem setting
lockdep map is naturaly done when the inode is read but we have to
modify unlock_new_inode() not to overwrite the lockdep map the filesystem
has set.
Acked-by: peterz@infradead.org
CC: mingo@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Local and Hard-RO mounts do not need orphan scanning.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We don't access the LVB in our ocfs2_*_lock_res_init() functions.
Since the LVB can become invalid during some cluster recovery
operations, the dlmglue must be able to handle an uninitialized
LVB.
For the orphan scan lock, we initialized an uninitialzed LVB with our
scan sequence number plus one. This starts a normal orphan scan
cycle.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Currently if the orphan scan fires a tick before the user issues the umount,
the umount will wait for the queued orphan scan tasks to complete.
This patch makes the umount stop the orphan scan as early as possible so as
to reduce the probability of the queued tasks slowing down the umount.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Skip printing information that is not valid for local mounts.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch adds jbd_lock_bh_state() and jbd_unlock_bh_state() around accessses
to jh->b_committed_data.
Fixes oss bugzilla#1131
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1131
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We should call ocfs2_inode_lock_atime instead of ocfs2_inode_lock
in ocfs2_file_splice_read like we do in ocfs2_file_aio_read so
that we can update atime in splice read if necessary.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The Lock Value Block (LVB) of a DLM lock can be lost when nodes die and
the DLM cannot reconstruct its state. Clients of the DLM need to know
this.
ocfs2's internal DLM, o2dlm, explicitly zeroes out the LVB when it loses
track of the state. This is not a standard behavior, but ocfs2 has
always relied on it. Thus, an o2dlm LVB is always "valid".
ocfs2 now supports both o2dlm and fs/dlm via the stack glue. When
fs/dlm loses track of an LVBs state, it sets a flag
(DLM_SBF_VALNOTVALID) on the Lock Status Block (LKSB). The contents of
the LVB may be garbage or merely stale.
ocfs2 doesn't want to try to guess at the validity of the stale LVB.
Instead, it should be checking the VALNOTVALID flag. As this is the
'standard' way of treating LVBs, we will promote this behavior.
We add a stack glue API ocfs2_dlm_lvb_valid(). It returns non-zero when
the LVB is valid. o2dlm will always return valid, while fs/dlm will
check VALNOTVALID.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
* 'for-2.6.31' of git://fieldses.org/git/linux-nfsd: (60 commits)
SUNRPC: Fix the TCP server's send buffer accounting
nfsd41: Backchannel: minorversion support for the back channel
nfsd41: Backchannel: cleanup nfs4.0 callback encode routines
nfsd41: Remove ip address collision detection case
nfsd: optimise the starting of zero threads when none are running.
nfsd: don't take nfsd_mutex twice when setting number of threads.
nfsd41: sanity check client drc maxreqs
nfsd41: move channel attributes from nfsd4_session to a nfsd4_channel_attr struct
NFS: kill off complicated macro 'PROC'
sunrpc: potential memory leak in function rdma_read_xdr
nfsd: minor nfsd_vfs_write cleanup
nfsd: Pull write-gathering code out of nfsd_vfs_write
nfsd: track last inode only in use_wgather case
sunrpc: align cache_clean work's timer
nfsd: Use write gathering only with NFSv2
NFSv4: kill off complicated macro 'PROC'
NFSv4: do exact check about attribute specified
knfsd: remove unreported filehandle stats counters
knfsd: fix reply cache memory corruption
knfsd: reply cache cleanups
...
* 'for-2.6.31' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (128 commits)
nfs41: sunrpc: xprt_alloc_bc_request() should not use spin_lock_bh()
nfs41: Move initialization of nfs4_opendata seq_res to nfs4_init_opendata_res
nfs: remove unnecessary NFS_INO_INVALID_ACL checks
NFS: More "sloppy" parsing problems
NFS: Invalid mount option values should always fail, even with "sloppy"
NFS: Remove unused XDR decoder functions
NFS: Update MNT and MNT3 reply decoding functions
NFS: add XDR decoder for mountd version 3 auth-flavor lists
NFS: add new file handle decoders to in-kernel mountd client
NFS: Add separate mountd status code decoders for each mountd version
NFS: remove unused function in fs/nfs/mount_clnt.c
NFS: Use xdr_stream-based XDR encoder for MNT's dirpath argument
NFS: Clean up MNT program definitions
lockd: Don't bother with RPC ping for NSM upcalls
lockd: Update NSM state from SM_MON replies
NFS: Fix false error return from nfs_callback_up() if ipv6.ko is not available
NFS: Return error code from nfs_callback_up() to user space
NFS: Do not display the setting of the "intr" mount option
NFS: add support for splice writes
nfs41: Backchannel: CB_SEQUENCE validation
...
I happened to find that fs/minix/minix.h doesn't guard double include.
Yes, I know this never cause something destructive because this is
self-evidence that no source file includes minix.h twice, but I think
fixing this is better than disregarding it.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nfs4_open_recover_helper clears opendata->o_res
before calling nfs4_init_opendata_res, thus causing
NFSv4.0 OPEN operations to be sent rather than nfsv4.1.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
perfcounter: Handle some IO return values
perf_counter: Push perf_sample_data through the swcounter code
perf_counter tools: Define and use our own u64, s64 etc. definitions
perf_counter: Close race in perf_lock_task_context()
perf_counter, x86: Improve interactions with fast-gup
perf_counter: Simplify and fix task migration counting
perf_counter tools: Add a data file header
perf_counter: Update userspace callchain sampling uses
perf_counter: Make callchain samples extensible
perf report: Filter to parent set by default
perf_counter tools: Handle lost events
perf_counter: Add event overlow handling
fs: Provide empty .set_page_dirty() aop for anon inodes
perf_counter: tools: Makefile tweaks for 64-bit powerpc
perf_counter: powerpc: Add processor back-end for MPC7450 family
perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
perf_counter: powerpc: Change how processor-specific back-ends get selected
perf_counter: powerpc: Use unsigned long for register and constraint values
perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
perf_counter tools: Add and use isprint()
...
(ce3b0f8d5c2203301fc87f3aaaed73e5819e2a48: New helper - current_umask())
is removing the opts->fs_dmask, probably it's a cut-and-paste
miss or something.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
inotify_destroy_mark_entry could get called twice for the same mark since it
is called directly in inotify_rm_watch and when the mark is being destroyed for
another reason. As an example assume that the file being watched was just
deleted so inotify_destroy_mark_entry would get called from the path
fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() ->
fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry(). If this
happened at the same time as userspace tried to remove a watch via
inotify_rm_watch we could attempt to remove the mark from the idr twice and
could thus double dec the ref cnt and potentially could be in a use after
free/double free situation. The fix is to have inotify_rm_watch use the
generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the
inotify_destroy_mark_entry() function can only be called one.
This patch also renames the function to inotify_ingored_remove_idr() so it is
clear what is actually going on in the function.
Hopefully this fixes:
[ 20.342058] idr_remove called for id=20 which is not allocated.
[ 20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077
[ 20.353933] Call Trace:
[ 20.356410] [<ffffffff811a82b7>] idr_remove+0x115/0x18f
[ 20.361737] [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75
[ 20.367061] [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf
[ 20.373771] [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf
[ 20.380306] [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10
[ 20.386238] [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170
[ 20.393293] [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf
[ 20.399829] [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6
[ 20.405850] [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Tested-by: Peter Ziljlstra <peterz@infradead.org>
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".
Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Prepare to share backchannel code with NFSv4.1.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[nfsd41: use nfsd4_cb_sequence for callback minorversion]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Mimic the client and prepare to share the back channel xdr with NFSv4.1.
Bump the number of operations in each encode routine, then backfill the
number of operations.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Verified that cthon and pynfs exchange id tests pass (except for the
two expected fails: EID8 and EID50)
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: clean up jbd2_journal_try_to_free_buffers()
ext4: Don't update ctime for non-extent-mapped inodes
ext4: Fix up whitespace issues in fs/ext4/inode.c
ext4: Fix 64-bit block type problem on 32-bit platforms
ext4: teach the inode allocator to use a goal inode number
ext4: Use a hash of the topdir directory name for the Orlov parent group
ext4: document the "abort" mount option
ext4: move the abort flag from s_mount_opts to s_mount_flags
ext4: update the s_last_mounted field in the superblock
ext4: change s_mount_opt to be an unsigned int
ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
ext4: avoid unnecessary spinlock in critical POSIX ACL path
ext3: avoid unnecessary spinlock in critical POSIX ACL path
ext4: convert instrumentation from markers to tracepoints
jbd2: convert instrumentation from markers to tracepoints
seq_write() can be used to construct seq_files containing arbitrary data.
Required by the gcov-profiling interface to synthesize binary profiling
data files.
Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In theory it is not safe to dereference ->parent/real_parent without
tasklist or rcu lock, we can race with re-parenting.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several code paths in reiserfs have a construct like:
if (is_direntry_le_ih(ih = B_N_PITEM_HEAD(src, item_num))) ...
which, in addition to being ugly, end up causing compiler warnings with
gcc 4.4.0. Previous compilers didn't issue a warning.
fs/reiserfs/do_balan.c:1273: warning: operation on `aux_ih' may be undefined
fs/reiserfs/lbalance.c:393: warning: operation on `ih' may be undefined
fs/reiserfs/lbalance.c:421: warning: operation on `ih' may be undefined
fs/reiserfs/lbalance.c:777: warning: operation on `ih' may be undefined
I believe this is due to the ih being passed to macros which evaluate the
argument more than once. This is old code and we haven't seen any
problems with it, but this patch eliminates the warnings.
It converts the multiple evaluation macros to static inlines and does a
preassignment for the cases that were causing the warnings because that
code is just ugly.
Reported-by: Chris Mason <mason@oracle.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused variables from isofs_sb_info (used to be some mount
options), unify variables for option to use 0/1 (some options used
'y'/'n'), use bit fields for option flags in superblock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
isofs allows setting of default uid and gid of files but value 0 was used
to indicate that user did not specify any uid/gid mount option. Since
this option also overrides uid/gid set in Rock Ridge extension, it makes
sense to allow forcing uid/gid 0. Fix option processing to allow this.
Cc: <Hans-Joachim.Baader@cjt.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So far, permissions set via 'mode' and/or 'dmode' mount options were
effective only if the medium had no rock ridge extensions (or was mounted
without them). Add 'overriderockmode' mount option to indicate that these
options should override permissions set in rock ridge extensions. Maybe
this should be default but the current behavior is there since mount
options were created so I think we should not change how they behave.
Cc: <Hans-Joachim.Baader@cjt.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Ted pointed out, it can happen that ext3_truncate() returns without
removing inode from orphan list. This way we could in some rare cases
(like when we get ENOMEM from an allocation in ext3_truncate called
because of failed ext3_write_begin) leave the inode on orphan list and
that triggers assertion failure on umount.
So make ext3_truncate() always remove inode from in-memory orphan list.
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I delete the following patch
"commit 3f31fddfa2
Author: Mingming Cao <cmm@us.ibm.com>
Date: Fri Jul 25 01:46:22 2008 -0700
jbd: fix race between free buffer and commit transaction
This patch is no longer needed because if race between freeing buffer and
committing transaction functionality occurs and dio gets error, currently
dio falls back to buffered IO by the following patch.
commit 6ccfa806a9
Author: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Date: Tue Sep 2 14:35:40 2008 -0700
VFS: fix dio write returning EIO when try_to_release_page fails
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chain verification in ext3_get_blocks() has been hosed since it called
verify_chain(chain, NULL) which always returns success. As a result
readers could in theory race with truncate. On the other hand the race
probably cannot happen with the current locking scheme, since by the
time ext3_truncate() is called all the pages are already removed and
hence get_block() shouldn't be called on such pages...
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
One of our users is complaining that his backup tool is upset on ext2
(while it's happy on ext3, xfs, ...) because of the mtime change.
The problem is:
mkdir foo
mkdir bar
mkdir foo/a
Now under ext2:
mv foo/a foo/b
changes mtime of 'foo/a' (foo/b after the move). That does not really
make sense and it does not happen under any other filesystem I've seen.
More complicated is:
mv foo/a bar/a
This changes mtime of foo/a (bar/a after the move) and it makes some
sense since we had to update parent directory pointer of foo/a. But
again, no other filesystem does this. So after some thoughts I'd vote
for consistency and change ext2 to behave the same as other filesystems.
Do not update mtime of a moved directory. Specs don't say anything
about it (neither that it should, nor that it should not be updated) and
other common filesystems (ext3, ext4, xfs, reiserfs, fat, ...) don't do
it. So let's become more consistent.
Spotted by ronny.pretzsch@dfs.de, initial fix by Jörn Engel.
Reported-by: <ronny.pretzsch@dfs.de>
Cc: <hare@suse.de>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CHECK fs/proc/proc_devtree.c
fs/proc/proc_devtree.c:197:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:203:34: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:210:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:223:26: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:226:14: warning: Using plain integer as NULL pointer
Signed-off-by: Michal Simek <monstr@monstr.eu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a regression in 2.6.30.
I unfortunately accepted a patch time ago, to drop the "current" usage
from possible IRQ context, w/out proper thought over it. The patch
switched to using the CPU id by bounding the nested call callback with a
get_cpu()/put_cpu().
Unfortunately the ep_call_nested() function can be called with a callback
that grabs sleepy locks (from own f_op->poll()), that results in epic
fails. The following patch uses the proper "context" depending on the
path where it is called, and on the kind of callback.
This has been reported by Stefan Richter, that has also verified the patch
is his previously failing environment.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Reported-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export statistics for softirq in /proc/softirqs and /proc/stat.
1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.
2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.
[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, if we ask to set then number of nfsd threads to zero when
there are none running, we set up all the sockets and register the
service, and then tear it all down again.
This is pointless.
So detect that case and exit promptly.
(also remove an assignment to 'error' which was never used.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Currently when we write a number to 'threads' in nfsdfs,
we take the nfsd_mutex, update the number of threads, then take the
mutex again to read the number of threads.
Mostly this isn't a big deal. However if we are write '0', and
portmap happens to be dead, then we can get unpredictable behaviour.
If the nfsd threads all got killed quickly and the last thread is
waiting for portmap to respond, then the second time we take the mutex
we will block waiting for the last thread.
However if the nfsd threads didn't die quite that fast, then there
will be no contention when we try to take the mutex again.
Unpredictability isn't fun, and waiting for the last thread to exit is
pointless, so avoid taking the lock twice.
To achieve this, get nfsd_svc return a non-negative number of active
threads when not returning a negative error.
Signed-off-by: NeilBrown <neilb@suse.de>
.set_page_dirty() is one of those a_ops that defaults to the
buffer implementation when not set. Therefore provide a dummy
function to make it do nothing.
(Uncovered by perfcounters fd's which can now be writable-mmap-ed.)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some drives report 0 as the number of written blocks when there are some blocks
recorded. Use device size in such case so that we can automagically mount such
media.
Signed-off-by: Jan Kara <jack@suse.cz>
Unless I'm mistaken, NFS_INO_INVALID_ACL is being checked twice during
getacl calls (i.e. first via nfs_revalidate_inode() and then by each all
site).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Specifying "port=-5" with the kernel's current mount option parser
generates "unrecognized mount option". If "sloppy" is set, this
causes the mount to succeed and use the default values; the desired
behavior is that, since this is a valid option with an invalid value,
the mount should fail, even with "sloppy."
To properly handle "sloppy" parsing, we need to distinguish between
correct options with invalid values, and incorrect options. We will
need to parse integer values by hand, therefore, and not rely on
match_token().
For instance, these must all fail with "invalid value":
port=12345678
port=-5
port=samuel
and not with "unrecognized option," as they do currently.
Thus, for the sake of match_token() we need to treat the values for
these options as strings, and do the conversion to integers using
strict_strtol().
This is basically the same solution we used for the earlier "retry="
fix (commit ecbb3845), except in this case the kernel actually has to
parse the value, rather than ignore it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ian Kent reports:
"I've noticed a couple of other regressions with the options vers
and proto option of mount.nfs(8).
The commands:
mount -t nfs -o vers=<invalid version> <server>:/<path> /<mountpoint>
mount -t nfs -o proto=<invalid proto> <server>:/<path> /<mountpoint>
both immediately fail.
But if the "-s" option is also used they both succeed with the
mount falling back to defaults (by the look of it).
In the past these failed even when the sloppy option was given, as
I think they should. I believe the sloppy option is meant to allow
the mount command to still function for mount options (for example
in shared autofs maps) that exist on other Unix implementations but
aren't present in the Linux mount.nfs(8). So, an invalid value
specified for a known mount option is different to an unknown mount
option and should fail appropriately."
See RH bugzilla 486266.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Remove xdr_decode_fhstatus() and xdr_decode_fhstatus3(), now
that they are unused.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Solder xdr_stream-based XDR decoding functions into the in-kernel mountd
client that are more careful about checking data types and watching for
buffer overflows. The new MNT3 decoder includes support for auth-flavor
list decoding.
The "_sz" macro for MNT3 replies was missing the size of the file handle.
I've added this back, and included the size of the auth flavor array.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce an xdr_stream-based XDR decoder that can unpack the auth-
flavor list returned in a MNT3 reply.
The nfs_mount() function's caller allocates an array, and passes the
size and a pointer to it. The decoder decodes all the flavors it can
into the array, and returns the number of decoded flavors.
If the caller is not interested in the auth flavors, it can pass a
value of zero as the size of the pre-allocated array.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce xdr_stream-based XDR file handle decoders to the in-kernel
mountd client. These are more careful than the existing decoder
functions about buffer overflows and data type and range checking.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce data structures and xdr_stream-based decoding functions for
unmarshalling mountd status codes properly.
Mountd version 3 uses specific standard error return codes that are
not errno values and not NFS3ERR_ values. These have a well-defined
standard mapping to local errno values. Introduce data structures
and a decoder function that map these status codes to local errno
values properly. This is new functionality (but not used yet).
Version 1 mountd status values are defined by RFC 1094 as UNIX error
values (errno values). Errno values on heterogeneous systems do not
necessarily match each other. To avoid exposing possibly incorrect
errno values to upper layers, the current XDR decoder converts all
non-zero MNT version 1 status codes to -EACCES.
The OpenGroup XNFS standard provides a mapping similar to but smaller
than the version 3 error codes. Implement a decoder that uses the XNFS
error codes, replacing the current decoder.
For both mountd protocol versions, map unrecognized errors to -EACCES.
Finally we introduce a replacement data structure for mnt_fhstatus
at this time, which is used by the new XDR decoders. In addition to
documenting that the status value returned by the XDR decoders is
always an errno, this new structure will be expanded in subsequent
patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: remove xdr_encode_dirpath() now that it has been replaced.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Check the length of the supplied dirpath, and see that it fits
properly in the RPC buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Relocate MNT program procedure number definitions to the
only file that uses them. Relocate the version number definitions,
which are shared, to nfs.h. Remove duplicate program number
definitions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cut NSM upcall RPC traffic in half -- don't do a NULL call first.
The cases where a ping would be helpful are rare.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When rpc.statd starts up in user space at boot time, it attempts to
write the latest NSM local state number into
/proc/sys/fs/nfs/nsm_local_state.
If lockd.ko isn't loaded yet (as is the case in most configurations),
that file doesn't exist, thus the kernel's NSM state remains set to
its initial value of zero during lockd operation.
This is a problem because rpc.statd and lockd use the NSM state number
to prevent repeated lock recovery on rebooted hosts. If lockd sends
a zero NSM state, but then a delayed SM_NOTIFY with a real NSM state
number is received, there is no way for lockd or rpc.statd to
distinguish that stale SM_NOTIFY from an actual reboot. Thus lock
recovery could be performed after the rebooted host has already
started reclaiming locks, and those locks will be lost.
We could change /etc/init.d/nfslock so it always modprobes lockd.ko
before starting rpc.statd. However, if lockd.ko is ever unloaded
and reloaded, we are back at square one, since the NSM state is not
preserved across an unload/reload cycle. This may happen frequently
on clients that use automounter. A period of NFS inactivity causes
lockd.ko to be unloaded, and the kernel loses its NSM state setting.
Instead, let's use the fact that rpc.statd plants the local system's
NSM state in every SM_MON (and SM_UNMON) reply. lockd performs a
synchronous SM_MON upcall to the local rpc.statd _before_ sending its
first NLM request to a new remote. This would permit rpc.statd to
provide the current NSM state to lockd, even after lockd.ko had been
unloaded and reloaded.
Note that NLMPROC_LOCK arguments are constructed before the
nsm_monitor() call, so we have to rearrange argument construction very
slightly to make this all work out.
And, the kernel appears to treat NSM state as a u32 (see struct
nlm_args and nsm_res). Make nsm_local_state a u32 as well, to ensure
we don't get bogus comparison results.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clear "ret" if the error return from svc_create_xprt(AF_INET6) was
-EAFNOSUPORT. Otherwise, callback start-up will succeed, but
nfs_callback_up() will return -EAFNOSUPPORT anyway, and the first
NFSv4 mount attempt after a reboot will fail.
Bug introduced by commit f738f517 in 2.6.30-rc1.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the kernel cannot start the NFSv4 callback service during a mount
request, it returns -ENOMEM to user space, resulting in this message:
mount.nfs4: Cannot allocate memory
Adjust nfs_alloc_client() and nfs_get_client() to pass NFSv4 callback
start-up errors back to user space so a less mysterious error message
can be displayed by the mount command.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The "intr" mount option has been deprecated for a while, but
/proc/mounts continues to display "nointr" whether "intr" or "nointr"
has been specified for a mount point.
Since these options do not have any effect, simply do not display
them.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adds support for splice writes. It effectively calls
generic_file_splice_write() to do the writes.
We need not worry about O_APPEND case as the combination of splice()
writes and O_APPEND is disallowed. This patch propagates NFS write
errors back to the caller. The number of bytes written via splice are
being added to NFSIO_NORMALWRITTENBYTES as these are effectively
cached writes.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch reverts 3f31fddf, which is no longer needed because if a
race between freeing buffer and committing transaction functionality
occurs and dio gets error, currently dio falls back to buffered IO due
to the commit 6ccfa806.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Validates the callback's sessionID, the slot number, and the sequence ID.
Increments the slot's sequence.
Detects replays, but simply prints a debug message (if debugging is enabled
since we don't yet implement a duplicate request cache for the backchannel.
This should not present a problem, since only idempotent callbacks are
currently implemented.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Backchannel: Be more obvious about the return value]
[nfs41: Backchannel: dprink in host order]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Finds the 'struct nfs_client' that matches the server's address, major
version number, and session ID.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Defines a new 'struct nfs4_slot_table' in the 'struct nfs4_session'
for use by the backchannel. Initializes, resets, and destroys the backchannel
slot table in the same manner the forechannel slot table is initialized,
reset, and destroyed.
The sequenceid for each slot in the backchannel slot table is initialized
to 0, whereas the forechannel slotid's sequenceid is set to 1.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Generalize nfs4_init_slot_table() so it can be used to initialize the
backchannel slot table in addition to the forechannel slot table.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Generalize nfs4_reset_slot_table() so it can be used to reset the
backchannel slot table in addition to the forechannel slot table.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Change the type of cs_addr and csr_status to 'struct sockaddr' and
'__be32' since the cb_sequence processing function will use existing
functionality that expects these types.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
CB_SEQUENCE must appear first in the callback compound RPC.
If it is not the first operation NFS4ERR_SEQUENCE_POS must be returned.
If the first operation ni the CB_COMPOUND is not CB_SEQUENCE then
NFS4ERR_OP_NOT_IN_SESSION must be returned.
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: refactor op preprocessing out of process_op]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get rid of READMEM and COPYMEM for callback_xdr.c]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get rid of READ64 in callback_xdr.c]
See http://linux-nfs.org/pipermail/pnfs/2009-June/007846.html
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Note that this patch changes the nfsv4.0 behavior also when
CONFIG_NFS_V4_1 is not defined where NFS4ERR_MINOR_VERS_MISMATCH
will be returned if the client received a CB_COMPOUND
with minorversion != 0. Previously, it would have
returned NFS4ERR_OP_ILLEGAL for CB_SEQUENCE.
(or if the server is broken and sent OP_CB_GETATTR or OP_CB_RECALL
with minorversion!=0, they would have been processed normally.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: refactor op preprocessing out of process_op]
See http://linux-nfs.org/pipermail/pnfs/2009-June/007845.html
[nfs41: define CB_NOTIFY_DEVICEID as not supported]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Frees the preallocated backchannel resources that are associated with
this session when the session is destroyed.
A backchannel is currently created once per session. Destroy the backchannel
only when the session is destroyed.
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Set the SESSION4_BACK_CHAN flag to indicate the client supports a backchannel.
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
The NFS v4.1 callback service has already been setup, and
rpc_xprt->serv points to the svc_serv structure describing it.
Invoke the xprt_setup_backchannel() initialization to pre-
allocate the necessary backchannel structures.
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: change nfs4_put_session(nfs4_session**) to nfs4_destroy_session(nfs_session*)]
Signed-off-by: Alexandros Batsakis <Alexandros.Batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved xprt_setup_backchannel from nfs4_init_session to nfs4_init_backchannel]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Tracks the nfs_callback_info for both versions, enabling the callback
service for v4 and v4.1 to run concurrently and be stopped independently
of each other.
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
nfs41_callback_up() initializes the necessary queues and creates the new
nfs41_callback_svc thread. This thread executes the callback service which
waits for requests to arrive on the svc_serv->sv_cb_list.
NFS41_BC_MIN_CALLBACKS is set to 1 because we expect callbacks to not
cause substantial latency.
The actual processing of the callback will be implemented as a separate patch.
There is only one NFSv4.1 callback service. The first caller of
nfs4_callback_up() creates the service, subsequent callers increment a
reference count on the service. The service is destroyed when the last
caller invokes nfs_callback_down().
The transport needs to hold a reference to the callback service in order
to invoke it during callback processing. Currently this reference is only
obtained when the service is first created. This is incorrect, since
subsequent registrations for other transports will leave the xprt->serv
pointer uninitialized, leading to an oops when a callback arrives on
the "unreferenced" transport.
This patch fixes the problem by ensuring that a reference to the service
is saved in xprt->serv, either because the service is created by this
invocation to nfs4_callback_up() or by a prior invocation.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Add a reference to svc_serv during callback service bring up]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Type check arguments of nfs_callback_up]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: save svc_serv in nfs_callback_info]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Removal of ugly #ifdefs]
[nfs41: Update to removal of ugly #ifdefs]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
It is possible for servers to return NFS4ERR_BAD_STATEID when
the state management code is recovering locks or is reclaiming state when
returning a delegation. Ensure that we handle that case.
While we're at it, add in handlers for NFS4ERR_STALE,
NFS4ERR_ADMIN_REVOKED, NFS4ERR_OPENMODE, NFS4ERR_DENIED and
NFS4ERR_STALE_STATEID, since the protocol appears to allow for them too.
Also handle ENOMEM...
Finally, rather than add new NFSv4.0-specific errors and error handling into
the generic delegation code, move that open file and locking state error
handling into the NFSv4 layer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 delegation recovery code is required by the protocol to handle
more errors. Rather than add NFSv4.0 specific errors into 'generic'
delegation code, we should move the error handling into the NFSv4 layer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>