Replace references to system_utsname to the per-process uts namespace
where appropriate. This includes things like uname.
Changes: Per Eric Biederman's comments, use the per-process uts namespace
for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c
[jdike@addtoit.com: UML fix]
[clg@fr.ibm.com: cleanup]
[akpm@osdl.org: build fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It isn't needed as it is available in rqstp->rq_server, and dropping it allows
some local vars to be dropped.
[akpm@osdl.org: build fix]
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently lockd listens on UDP always, and TCP if CONFIG_NFSD_TCP is set.
However as lockd performs services of the client as well, this is a problem.
If CONFIG_NfSD_TCP is not set, and a tcp mount is used, the server will not be
able to call back to lockd.
So:
- add an option to lockd_up saying which protocol is needed
- Always open sockets for which an explicit port was given, otherwise
only open a socket of the type required
- Change nfsd to do one lockd_up per socket rather than one per thread.
This
- removes the dependancy on CONFIG_NFSD_TCP
- means that lockd may open sockets other than at startup
- means that lockd will *not* listen on UDP if the only
mounts are TCP mount (and nfsd hasn't started).
The latter is the only one that concerns me at all - I don't know if this
might be a problem with some servers.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nfsd has some cleanup that it wants to do when the last thread exits, and
there will shortly be some more. So collect this all into one place and
define a callback for an rpc service to call when the service is about to be
destroyed.
[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some filesystems, instead of simply decrementing i_nlink, simply zero it
during an unlink operation. We need to catch these in addition to the
decrement operations.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a filesystem decrements i_nlink to zero, it means that a write must be
performed in order to drop the inode from the filesystem.
We're shortly going to have keep filesystems from being remounted r/o between
the time that this i_nlink decrement and that write occurs.
So, add a little helper function to do the decrements. We'll tie into it in a
bit to note when i_nlink hits zero.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove inclusions of linux/mpage.h that are no longer necessary due to the
transfer of generic_writepages().
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This eliminates the i_blksize field from struct inode. Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.
Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.
[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* Rougly half of callers already do it by not checking return value
* Code in drivers/acpi/osl.c does the following to be sure:
(void)kmem_cache_destroy(cache);
* Those who check it printk something, however, slab_error already printed
the name of failed cache.
* XFS BUGs on failed kmem_cache_destroy which is not the decision
low-level filesystem driver should make. Converted to ignore.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some file systems want to manually d_move() the dentries involved in a
rename. We can do this by making use of the FS_ODD_RENAME flag if we just
have nfs_rename() unconditionally do the d_move(). While there, we rename
the flag to be more descriptive.
OCFS2 uses this to protect that part of the rename operation with a cluster
lock.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Comments-only change to clarify a detail of the NFS protocol and how it is
implemented in Linux.
Test plan:
None.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC3530 states that the registered port 2049 for the NFS protocol should be
the default configuration in order to allow clients not to use the RPC
binding protocols.
If the mount program sends us a port=0, we therefore substitute port=2049.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Retry a few times before we give up: the error is usually due to ordering
issues with asynchronous RPC calls.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
And slight optimisation of nfs_end_data_update(): directories never have
delegations anyway.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, a read() request will return EIO even if the file has been
deleted on the server, simply because that is what the VM will return
if the call to readpage() fails to update the page.
Ensure that readpage() marks the inode as stale if it receives an ESTALE.
Then return that error to userland.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the open intents tell us that a given lookup is going to result in a,
exclusive create, we currently optimize away the lookup call itself. The
reason is that the lookup would not be atomic with the create RPC call, so
why do it in the first place?
A problem occurs, however, if the VFS aborts the exclusive create operation
after the lookup, but before the call to create the file/directory: in this
case we will end up with a hashed negative dentry in the dcache that has
never been looked up.
Fix this by only actually hashing the dentry once the create operation has
been successfully completed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix an oops when the referral server is not responding.
Check the error return from nfs4_set_client() in nfs4_create_referral_server.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Teach NFS_ROOT to use the new rpc_create API instead of the old two-call
API for creating an RPC transport.
Test plan:
Compile the kernel with the NFS client build-in, and set CONFIG_NFS_ROOT.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up warnings from compiling on ppc64.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a copy of the symlink path in the page cache, we can pass
a struct page down to the XDR routines instead of a string buffer.
Test plan:
Connectathon, all NFS versions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently the NFS client does not cache symlinks it creates. They get
cached only when the NFS client reads them back from the server.
Copy the symlink into the page cache before sending it.
Test plan:
Connectathon, all NFS versions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the LOOKUP or GETATTR in nfs_instantiate fail, nfs_instantiate will do a
d_drop before returning. But some callers already do a d_drop in the case
of an error return. Make certain we do only one d_drop in all error paths.
This issue was introduced because over time, the symlink proc API diverged
slightly from the create/mkdir/mknod proc API. To prevent other coding
mistakes of this type, change the symlink proc API to be more like
create/mkdir/mknod and move the nfs_instantiate call into the symlink proc
routines so it is used in exactly the same way for create, mkdir, mknod,
and symlink.
Test plan:
Connectathon, all versions of NFS.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the early days of NFS, there was no duplicate reply cache on the server.
Thus retransmitted non-idempotent requests often found that the request had
already completed on the server. To avoid passing an unanticipated return
code to unsuspecting applications, NFS clients would often shunt error
codes that implied the request had been retried but already completed.
Thanks to NFS over TCP, duplicate reply caches on the server, and network
performance and reliability improvements, it is safe to remove such checks.
Test plan:
None.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Convert NFS client mount logic to use rpc_create() instead of the old
xprt_create_proto/rpc_create_client API.
Test plan:
Mount stress tests.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
include/linux/sunrpc/clnt.h already includes include/linux/sunrpc/xprt.h.
We can remove xprt.h from source files that already include clnt.h.
Likewise include/linux/sunrpc/timer.h.
Test plan:
Compile kernel with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Invoke security_d_instantiate() on root dentries after allocating them with
dentry_alloc_anon(). Normally dentry_alloc_root() would do that, but we don't
call that as we don't want to assign a name to the root dentry at this point
(we may discover the real name later).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix an error handling problem: nfs_put_client() can be given a NULL pointer if
nfs_free_server() is asked to destroy a partially initialised record.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make two new proc files available:
/proc/fs/nfsfs/servers
/proc/fs/nfsfs/volumes
The first lists the servers with which we are currently dealing (struct
nfs_client), and the second lists the volumes we have on those servers (struct
nfs_server).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The attached patch makes NFS share superblocks between mounts from the same
server and FSID over the same protocol.
It does this by creating each superblock with a false root and returning the
real root dentry in the vfsmount presented by get_sb(). The root dentry set
starts off as an anonymous dentry if we don't already have the dentry for its
inode, otherwise it simply returns the dentry we already have.
We may thus end up with several trees of dentries in the superblock, and if at
some later point one of anonymous tree roots is discovered by normal filesystem
activity to be located in another tree within the superblock, the anonymous
root is named and materialises attached to the second tree at the appropriate
point.
Why do it this way? Why not pass an extra argument to the mount() syscall to
indicate the subpath and then pathwalk from the server root to the desired
directory? You can't guarantee this will work for two reasons:
(1) The root and intervening nodes may not be accessible to the client.
With NFS2 and NFS3, for instance, mountd is called on the server to get
the filehandle for the tip of a path. mountd won't give us handles for
anything we don't have permission to access, and so we can't set up NFS
inodes for such nodes, and so can't easily set up dentries (we'd have to
have ghost inodes or something).
With this patch we don't actually create dentries until we get handles
from the server that we can use to set up their inodes, and we don't
actually bind them into the tree until we know for sure where they go.
(2) Inaccessible symbolic links.
If we're asked to mount two exports from the server, eg:
mount warthog:/warthog/aaa/xxx /mmm
mount warthog:/warthog/bbb/yyy /nnn
We may not be able to access anything nearer the root than xxx and yyy,
but we may find out later that /mmm/www/yyy, say, is actually the same
directory as the one mounted on /nnn. What we might then find out, for
example, is that /warthog/bbb was actually a symbolic link to
/warthog/aaa/xxx/www, but we can't actually determine that by talking to
the server until /warthog is made available by NFS.
This would lead to having constructed an errneous dentry tree which we
can't easily fix. We can end up with a dentry marked as a directory when
it should actually be a symlink, or we could end up with an apparently
hardlinked directory.
With this patch we need not make assumptions about the type of a dentry
for which we can't retrieve information, nor need we assume we know its
place in the grand scheme of things until we actually see that place.
This patch reduces the possibility of aliasing in the inode and page caches for
inodes that may be accessed by more than one NFS export. It also reduces the
number of superblocks required for NFS where there are many NFS exports being
used from a server (home directory server + autofs for example).
This in turn makes it simpler to do local caching of network filesystems, as it
can then be guaranteed that there won't be links from multiple inodes in
separate superblocks to the same cache file.
Obviously, cache aliasing between different levels of NFS protocol could still
be a problem, but at least that gives us another key to use when indexing the
cache.
This patch makes the following changes:
(1) The server record construction/destruction has been abstracted out into
its own set of functions to make things easier to get right. These have
been moved into fs/nfs/client.c.
All the code in fs/nfs/client.c has to do with the management of
connections to servers, and doesn't touch superblocks in any way; the
remaining code in fs/nfs/super.c has to do with VFS superblock management.
(2) The sequence of events undertaken by NFS mount is now reordered:
(a) A volume representation (struct nfs_server) is allocated.
(b) A server representation (struct nfs_client) is acquired. This may be
allocated or shared, and is keyed on server address, port and NFS
version.
(c) If allocated, the client representation is initialised. The state
member variable of nfs_client is used to prevent a race during
initialisation from two mounts.
(d) For NFS4 a simple pathwalk is performed, walking from FH to FH to find
the root filehandle for the mount (fs/nfs/getroot.c). For NFS2/3 we
are given the root FH in advance.
(e) The volume FSID is probed for on the root FH.
(f) The volume representation is initialised from the FSINFO record
retrieved on the root FH.
(g) sget() is called to acquire a superblock. This may be allocated or
shared, keyed on client pointer and FSID.
(h) If allocated, the superblock is initialised.
(i) If the superblock is shared, then the new nfs_server record is
discarded.
(j) The root dentry for this mount is looked up from the root FH.
(k) The root dentry for this mount is assigned to the vfsmount.
(3) nfs_readdir_lookup() creates dentries for each of the entries readdir()
returns; this function now attaches disconnected trees from alternate
roots that happen to be discovered attached to a directory being read (in
the same way nfs_lookup() is made to do for lookup ops).
The new d_materialise_unique() function is now used to do this, thus
permitting the whole thing to be done under one set of locks, and thus
avoiding any race between mount and lookup operations on the same
directory.
(4) The client management code uses a new debug facility: NFSDBG_CLIENT which
is set by echoing 1024 to /proc/net/sunrpc/nfs_debug.
(5) Clone mounts are now called xdev mounts.
(6) Use the dentry passed to the statfs() op as the handle for retrieving fs
statistics rather than the root dentry of the superblock (which is now a
dummy).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Start rpciod in the server common (nfs_client struct) management code rather
than in the superblock management code. This means we only need to "start" it
once per server instead of once per superblock.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate nfs_server::client_sys in favour of nfs_client::cl_rpcclient as we
only really need one per server that we're talking to since it doesn't have any
security on it.
The retransmission management variables are also moved to the common struct as
they're required to set up the cl_rpcclient connection.
The NFS2/3 client and client_acl connections are thenceforth derived by cloning
the cl_rpcclient connection and post-applying the authorisation flavour.
The code for setting up the initial common connection has been moved to
client.c as nfs_create_rpc_client(). All the NFS program definition tables are
also moved there as that's where they're now required rather than super.c.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the rpc_ops from the nfs_server struct to the nfs_client struct as they're
common to all server records of a particular NFS protocol version.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make better use of inode* dereferencing macros to hide dereferencing chains
(including NFS_PROTO and NFS_CLIENT).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Maintain a common server record for NFS2/3 as well as for NFS4 so that common
stuff can be moved there from struct nfs_server.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add some extra const qualifiers into NFS.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use the nominated dentry's superblock directly in the NFS statfs() op to get a
file handle, rather than using s_root (which will become a dummy dentry in a
future patch).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Generalise the nfs_client structure by:
(1) Moving nfs_client to a more general place (nfs_fs_sb.h).
(2) Renaming its maintenance routines to be non-NFS4 specific.
(3) Move those maintenance routines to a new non-NFS4 specific file (client.c)
and move the declarations to internal.h.
(4) Make nfs_find/get_client() take a full sockaddr_in to include the port
number (will be required for NFS2/3).
(5) Make nfs_find/get_client() take the NFS protocol version (again will be
required to differentiate NFS2, 3 & 4 client records).
Also:
(6) Make nfs_client construction proceed akin to inodes, marking them as under
construction and providing a function to indicate completion.
(7) Make nfs_get_client() wait interruptibly if it finds a client that it can
share, but that client is currently being constructed.
(8) Make nfs4_create_client() use (6) and (7) instead of locking cl_sem.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a set_capabilities NFS RPC op so that the server capabilities can be set.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a lookup filehandle NFS RPC op so that a file handle can be looked up
without requiring dentries and inodes and other VFS stuff when doing an NFS4
pathwalk during mounting.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Return an error when starting the idmapping pipe so that we can detect it
failing.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Rename nfs_server::nfs4_state to nfs_client as it will be used to represent the
client state for NFS2 and NFS3 also.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Rename struct nfs4_client to struct nfs_client so that it can become the basis
for a general client record for NFS2 and NFS3 in addition to NFS4.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the nfs_callback_up()/down() prototypes just do nothing if NFS4 is not
enabled. Also make the down function void type since we can't really do
anything if it fails.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Rename the NFS4 version of nfs_stat_to_errno() so that it doesn't conflict with
the common one used by NFS2 and NFS3.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix ups for the splitting of the superblock stuff out of fs/nfs/inode.c,
including:
(*) Move the callback tcpport module param into callback.c.
(*) Move the idmap cache timeout module param into idmap.c.
(*) Changes to internal.h:
(*) namespace-nfs4.c was renamed to nfs4namespace.c.
(*) nfs_stat_to_errno() is in nfs2xdr.c, not nfs4xdr.c.
(*) nfs4xdr.c is contingent on CONFIG_NFS_V4.
(*) nfs4_path() is only uses if CONFIG_NFS_V4 is set.
Plus also:
(*) The sec_flavours[] table should really be const.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A pinned inode may in theory end up filling memory with cached ACCESS
calls. This patch ensures that the VM may shrink away the cache in these
particular cases.
The shrinker works by iterating through the list of inodes on the global
nfs_access_lru_list, and removing the least recently used access
cache entry until it is done (or until the entire cache is empty).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The current access cache only allows one entry at a time to be cached for each
inode. Add a per-inode red-black tree in order to allow more than one to
be cached at a time.
Should significantly cut down the time spent in path traversal for shared
directories such as ${PATH}, /usr/share, etc.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The logic in nfs_direct_read_schedule and nfs_direct_write_schedule can
allow data->npages to be one larger than rpages. This causes a page
pointer to be written beyond the end of the pagevec in nfs_read_data (or
nfs_write_data).
Fix this by making nfs_(read|write)_alloc() calculate the size of the
pagevec array, and initialise data->npages.
Also get rid of the redundant argument to nfs_commit_alloc().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is needed in order to handle any NFS4ERR_DELAY errors that might be
returned by the server. It also ensures that we map the NFSv4 errors before
they are returned to userland.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 71c12b3f0abc7501f6ed231a6d17bc9c05a238dc commit)
Check the bounds of length specifiers more thoroughly in the XDR decoding of
NFS4 readdir reply data.
Currently, if the server returns a bitmap or attr length that causes the
current decode point pointer to wrap, this could go undetected (consider a
small "negative" length on a 32-bit machine).
Also add a check into the main XDR decode handler to make sure that the amount
of data is a multiple of four bytes (as specified by RFC-1014). This makes
sure that we can do u32* pointer subtraction in the NFS client without risking
an undefined result (the result is undefined if the pointers are not correctly
aligned with respect to one another).
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 5861fddd64a7eaf7e8b1a9997455a24e7f688092 commit)
The problem is that we may be caching writes that would extend the file and
create a hole in the region that we are reading. In this case, we need to
detect the eof from the server, ensure that we zero out the pages that
are part of the hole and mark them as up to date.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 856b603b01b99146918c093969b6cb1b1b0f1c01 commit)
rpc_unlink() and rpc_rmdir() will dput the dentry reference for you.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from a05a57effa71a1f67ccbfc52335c10c8b85f3f6a commit)
nfs_wb_page() waits on request completion and, as a result, is not safe to be
called from nfs_release_page() invoked by VM scanner as part of GFP_NOFS
allocation. Fix possible deadlock by analyzing gfp mask and refusing to
release page if __GFP_FS is not set.
Signed-off-by: Nikita Danilov <danilov@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 374d969debfb290bafcb41d28918dc6f7e43ce31 commit)
nfs_writedata_free() and nfs_readdata_free() can now become static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from 5e1ce40f0c3c8f67591aff17756930d7a18ceb1a commit)
In one of the error paths of nfs_path, it may return with dcache_lock still
held; fix this by adding and using a new error path Elong_unlock which unlocks
dcache_lock.
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(cherry picked from f4b90b43677fb23297c56802c3056fc304f988d9 commit)
In the case when compiling via a symlink tree, we want to ensure that the
close-to-open GETATTR call is applied only to the final file, and not to
the symlink.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The introduction of the FLUSH_INVALIDATE argument to nfs_sync_inode_wait()
does not clear the nr_unstable page state counter for pages that are being
released.
Also fix a longstanding similar bug when nfs_commit_list() fails.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use FL_ACCESS flag to test and/or wait for local locks before we try
requesting a lock from the server
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use the new behaviour of {flock,posix}_file_lock(F_UNLCK) to determine if
we held a lock, and only send the RPC request to the server if this was the
case.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This fixes a bug in fs/nfs which makes it impossible to build nfs
without having procfs enabled.
Signed-off-by: Dominik Hackl <dominik@hackl.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conversion of nr_unstable to a per zone counter
We need to do some special modifications to the nfs code since there are
multiple cases of disposition and we need to have a page ref for proper
accounting.
This converts the last critical page state of the VM and therefore we need to
remove several functions that were depending on GET_PAGE_STATE_LAST in order
to make the kernel compile again. We are only left with event type counters
in page state.
[akpm@osdl.org: bugfixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes nr_dirty a per zone counter. Looping over all processors is
avoided during writeback state determination.
The counter aggregation for nr_dirty had to be undone in the NFS layer since
we summed up the page counts from multiple zones. Someone more familiar with
NFS should probably review what I have done.
[akpm@osdl.org: bugfix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Same as with already do with the file operations: keep them in .rodata and
prevents people from doing runtime patching.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts ccf01ef7aa commit.
No idea how git managed this one: when I asked it to merge the odirect
topic branch it actually generated a patch which reverted the change.
Reverting the 'merge' will once again reveal Chuck's recent NFS/O_DIRECT
work to the world.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Builds on ARM report link problems with common configurations like
statically linked NFS (for nfsroot). The symptom is that __init
section code references __exit section code; that won't work since
the exit sections are discarded (since they can never be called).
The best fix for these particular cases would be an "__init_or_exit"
section annotation.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Trond had apparently merged the same patch twice, causing a duplicate
include of the "internal.h" file, with resulting obvious confusion.
Tssk. I'm the only one allowed to send out trees that don't even
compile! Who does this Trond guy think he is?
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix various problems with nfs4 disabled. And various other things.
In file included from fs/nfs/inode.c:50:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
fs/nfs/inode.c: In function 'init_once':
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'open_states'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'delegation'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'delegation_state'
fs/nfs/inode.c:1116: error: 'struct nfs_inode' has no member named 'rwsem'
distcc[26452] ERROR: compile fs/nfs/inode.c on g5/64 failed
make[1]: *** [fs/nfs/inode.o] Error 1
make: *** [fs/nfs/inode.o] Error 2
make: *** Waiting for unfinished jobs....
In file included from fs/nfs/nfs3xdr.c:26:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
distcc[26486] ERROR: compile fs/nfs/nfs3xdr.c on g5/64 failed
make[1]: *** [fs/nfs/nfs3xdr.o] Error 1
make: *** [fs/nfs/nfs3xdr.o] Error 2
In file included from fs/nfs/nfs3proc.c:24:
fs/nfs/internal.h:24: error: static declaration of 'nfs_do_refmount' follows non-static declaration
include/linux/nfs_fs.h:320: error: previous declaration of 'nfs_do_refmount' was here
fs/nfs/internal.h:65: warning: 'struct nfs4_fs_locations' declared inside parameter list
fs/nfs/internal.h:65: warning: its scope is only this definition or declaration, which is probably not what you want
fs/nfs/internal.h: In function 'nfs4_path':
fs/nfs/internal.h:97: error: 'struct nfs_server' has no member named 'mnt_path'
distcc[26469] ERROR: compile fs/nfs/nfs3proc.c on bix/32 failed
make[1]: *** [fs/nfs/nfs3proc.o] Error 1
make: *** [fs/nfs/nfs3proc.o] Error 2
**FAILED**
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Cc: Andy Adamson <andros@citi.umich.edu>
Cc: Chuck Lever <cel@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Manoj Naik <manoj@almaden.ibm.com>
Cc: Marc Eshel <eshel@almaden.ibm.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Re-arrange the logic in the NFS direct I/O path so that nfs_read/write_data
structs are allocated just before they are scheduled, rather than
allocating them all at once before we start scheduling requests.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown observed that the kmalloc() in nfs_get_user_pages() is more
likely to fail if the I/O is large enough to require the allocation of more
than a single page to keep track of all the pinned pages in the user's
buffer.
Instead of tracking one large page array per dreq/iocb, track pages per
nfs_read/write_data, just like the cached I/O path does. An array for
pages is already allocated for us by nfs_readdata_alloc() (and the write
and commit equivalents).
This is also required for adding support for vectored I/O to the NFS direct
I/O path.
The original reason to pin the user buffer and allocate all the NFS data
structures before trying to schedule I/O was to ensure all needed resources
are allocated on the client before starting to send requests. This reduces
the chance that resource exhaustion on the client will cause a short read
or write.
On the other hand, for an application making very large application I/O
requests, this means that it will be nearly impossible for the application
to make forward progress on a resource-limited client.
Thus, moving the buffer pinning functionality into the I/O scheduling
loops should be good for scalability. The next patch will do the same for
NFS data structure allocation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up and fix a minor bug: the logic was dirtying page cache pages on
both read and write operations.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the user_addr, user_count, and pos parameters explicit to the
scheduler routines, and remove the fields from nfs_direct_req. The
iovec API will be passing in a series of these, not just one set.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An NFSv3/v4 client must reschedule on-the-wire writes if the writes are
UNSTABLE, and the server reboots before the client can complete a
subsequent COMMIT request.
To support direct asynchronous scatter-gather writes, the write
rescheduler in fs/nfs/direct.c must not depend on the I/O parameters
in the controlling nfs_direct_req structure. iovecs can be somewhat
arbitrarily complex, so there could be an unbounded amount of information
to save for a rarely encountered requirement.
Refactor the direct write rescheduler so it uses information from each
nfs_write_data structure to reschedule writes, instead of caching that
information in the controlling nfs_direct_req structure.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the logic that increments and decrements the outstanding I/O
count. This will be a commonly used bit of code in upcoming patches.
Also make this an atomic_t again, since it will be very often manipulated
outside dreq->spin lock.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the POSIX lock owner ID to the flush operation.
This is useful for filesystems which don't want to store any locking state
in inode->i_flock but want to handle locking/unlocking POSIX locks
internally. FUSE is one such filesystem but I think it possible that some
network filesystems would need this also.
Also add a flag to indicate that a POSIX locking request was generated by
close(), so filesystems using the above feature won't send an extra locking
request in this case.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.
This complements the get_sb() patch. That reduced the significance of
sb->s_root, allowing NFS to place a fake root there. However, NFS does
require a dentry to use as a target for the statfs operation. This permits
the root in the vfsmount to be used instead.
linux/mount.h has been added where necessary to make allyesconfig build
successfully.
Interest has also been expressed for use with the FUSE and XFS filesystems.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.
The filesystem is then required to manually set the superblock and root dentry
pointers. For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).
The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.
This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing. In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.
The patch also makes the following changes:
(*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
pointer argument and return an integer, so most filesystems have to change
very little.
(*) If one of the convenience function is not used, then get_sb() should
normally call simple_set_mnt() to instantiate the vfsmount. This will
always return 0, and so can be tail-called from get_sb().
(*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
dcache upon superblock destruction rather than shrink_dcache_anon().
This is required because the superblock may now have multiple trees that
aren't actually bound to s_root, but that still need to be cleaned up. The
currently called functions assume that the whole tree is rooted at s_root,
and that anonymous dentries are not the roots of trees which results in
dentries being left unculled.
However, with the way NFS superblock sharing are currently set to be
implemented, these assumptions are violated: the root of the filesystem is
simply a dummy dentry and inode (the real inode for '/' may well be
inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
with child trees.
[*] Anonymous until discovered from another tree.
(*) The documentation has been adjusted, including the additional bit of
changing ext2_* into foo_* in the documentation.
[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As fs/nfs/inode.c is rather large, heterogenous and unwieldy, the attached
patch splits it up into a number of files:
(*) fs/nfs/inode.c
Strictly inode specific functions.
(*) fs/nfs/super.c
Superblock management functions for NFS and NFS4, normal access, clones
and referrals. The NFS4 superblock functions _could_ move out into a
separate conditionally compiled file, but it's probably not worth it as
there're so many common bits.
(*) fs/nfs/namespace.c
Some namespace-specific functions have been moved here.
(*) fs/nfs/nfs4namespace.c
NFS4-specific namespace functions (this could be merged into the previous
file). This file is conditionally compiled.
(*) fs/nfs/internal.h
Inter-file declarations, plus a few simple utility functions moved from
fs/nfs/inode.c.
Additionally, all the in-.c-file externs have been moved here, and those
files they were moved from now includes this file.
For the most part, the functions have not been changed, only some multiplexor
functions have changed significantly.
I've also:
(*) Added some extra banner comments above some functions.
(*) Rearranged the function order within the files to be more logical and
better grouped (IMO), though someone may prefer a different order.
(*) Reduced the number of #ifdefs in .c files.
(*) Added missing __init and __exit directives.
Signed-Off-By: David Howells <dhowells@redhat.com>
Respond to a moved error on NFS lookup by setting up the referral.
Note: We don't actually follow the referral during lookup/getattr, but
later when we detect fsid mismatch in inode revalidation (similar to the
processing done for cloning submounts). Referrals will have fake attributes
until they are actually followed or traversed.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up mountpoint when hitting a referral on moved error by getting
fs_locations.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move existing code into a separate function so that it can be also used by
referral code.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is (similar to getattr bitmap) but includes fs_locations and
mounted_on_fileid attributes. Use this bitmap for encoding in fs_locations
requests.
Note: We can probably do better by requesting locations as part of fsinfo
itself.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Per referral draft, only fs_locations, fsid, and mounted_on_fileid can be
requested in a GETATTR on referrals.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is ignored if fileid is also requested. This will be used on referrals
(fs_locations).
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use component4-style formats for decoding list of servers and pathnames in
fs_locations.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 allows for the fact that filesystems may be replicated across
several servers or that they may be migrated to a backup server in case of
failure of the primary server.
fs_locations is an NFSv4 operation for retrieving information about the
location of migrated and/or replicated filesystems.
Based on an initial implementation by Jiaying Zhang <jiayingz@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make automounted partitions expire using the mark_mounts_for_expiry()
function. The timeout is controlled via a sysctl.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This should enable us to detect if we are crossing a mountpoint in the
case where the server is exporting "nohide" mounts.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow filesystems to decide to perform pre-umount processing whether or not
MNT_FORCE is set.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a real nfs_invalidate_page() to ensure that
truncate_inode_pages() does the right thing when there are pending dirty
pages, we can get rid of nfs_delete_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the case of a call to truncate_inode_pages(), we should really try to
cancel any pending writes on the page.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We just set *acl_len to zero, and attrlen is unsigned, so this comparison
is clearly bogus. I have no idea what I was thinking.
Fixes a bug that caused getacl to fail over krb5p.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix two errors in the client-side acl cache: First, when nfs3_proc_getacl
requests only the default acl of a file and the access acl is not cached
already, a NULL access acl entry is cached instead of ERR_PTR(-EAGAIN)
("not cached").
Second, update the cached acls in nfs3_proc_setacls: nfs_refresh_inode does
not always invalidate the cached acls, and when it does not, the cached acls
get out of sync.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, we are accounting for all calls to nfs_revalidate_inode(), but not
to nfs_revalidate_mapping(), or nfs_lookup_verify_inode(), etc...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Separate out the function of revalidating the inode metadata, and
revalidating the mapping. The former may be called by lookup(),
and only really needs to check that permissions, ctime, etc haven't changed
whereas the latter needs only done when we want to read data from the page
cache, and may need to sync and then invalidate the mapping.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Whenever the directory changes, we want to make sure that we always
invalidate its page cache. Fix up update_changeattr() and
nfs_mark_for_revalidate() so that they do so.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up a bug in the handling of NFS_INO_REVAL_PAGECACHE: make sure that
nfs_update_inode() clears it when we're sure we're not racing with other
updates.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up use of page_array, and fix an off-by-one error noticed by Tom
Talpey which causes kmalloc calls in cases where using the page_array
is sufficient.
Test plan:
Normal client functional testing with r/wsize=32768.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The Linux NFSv4 server violates RFC3530 in that the change attribute is not
guaranteed to be updated for every change to the inode. Our optimisation
for checking whether or not the inode metadata has changed or not is broken
too. Grr....
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code that is supposed to zero the uninitialised partial pages when the
server returns a short read is currently broken: it looks at the nfs_page
wb_pgbase and wb_bytes fields instead of the equivalent nfs_read_data
values when deciding where to start truncating the page.
Also ensure that we are more careful about setting PG_uptodate
before retrying a short read: the retry will change the nfs_read_data
args.pgbase and args.count.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Local variable res was initialized to 0 - no check needed here.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Convert a for-loop that explicitly references "NR_CPUS" into the
potentially more efficient for_each_possible_cpu() construct.
Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the call to nfs_intent_set_file() fails to open a file in
nfs4_proc_create(), we should return an error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is a conversion to make the various file_operations structs in fs/
const. Basically a regexp job, with a few manual fixups
The goal is both to increase correctness (harder to accidentally write to
shared datastructures) and reducing the false sharing of cachelines with
things that get dirty in .data (while .rodata is nicely read only and thus
cache clean)
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify well over a dozen mempool users to call mempool_create_slab_pool()
rather than calling mempool_create() with extra arguments, saving about 30
lines of code and increasing readability.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The return value of this function is never used, so let's be honest and
declare it as void.
Some places where invalidatepage returned 0, I have inserted comments
suggesting a BUG_ON.
[akpm@osdl.org: JBD BUG fix]
[akpm@osdl.org: rework for git-nfs]
[akpm@osdl.org: don't go BUG in block_invalidate_page()]
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Eric Van Hensbergen <ericvh@ericvh.myip.org>
Cc: Robert Love <rml@tech9.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (103 commits)
SUNRPC,RPCSEC_GSS: spkm3--fix config dependencies
SUNRPC,RPCSEC_GSS: spkm3: import contexts using NID_cast5_cbc
LOCKD: Make nlmsvc_traverse_shares return void
LOCKD: nlmsvc_traverse_blocks return is unused
SUNRPC,RPCSEC_GSS: fix krb5 sequence numbers.
NFSv4: Dont list system.nfs4_acl for filesystems that don't support it.
SUNRPC,RPCSEC_GSS: remove unnecessary kmalloc of a checksum
SUNRPC: Ensure rpc_call_async() always calls tk_ops->rpc_release()
SUNRPC: Fix memory barriers for req->rq_received
NFS: Fix a race in nfs_sync_inode()
NFS: Clean up nfs_flush_list()
NFS: Fix a race with PG_private and nfs_release_page()
NFSv4: Ensure the callback daemon flushes signals
SUNRPC: Fix a 'Busy inodes' error in rpc_pipefs
NFS, NLM: Allow blocking locks to respect signals
NFS: Make nfs_fhget() return appropriate error values
NFSv4: Fix an oops in nfs4_fill_super
lockd: blocks should hold a reference to the nlm_file
NFSv4: SETCLIENTID_CONFIRM should handle NFS4ERR_DELAY/NFS4ERR_RESOURCE
NFSv4: Send the delegation stateid for SETATTR calls
...
Rewrap the overly long source code lines resulting from the previous
patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch
contains only formatting changes, and no function change.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
memory spreading.
If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
from such a slab cache, the allocations are spread evenly over all the
memory nodes (task->mems_allowed) allowed to that task, instead of favoring
allocation on the node local to the current cpu.
The following inode and similar caches are marked SLAB_MEM_SPREAD:
file cache
==== =====
fs/adfs/super.c adfs_inode_cache
fs/affs/super.c affs_inode_cache
fs/befs/linuxvfs.c befs_inode_cache
fs/bfs/inode.c bfs_inode_cache
fs/block_dev.c bdev_cache
fs/cifs/cifsfs.c cifs_inode_cache
fs/coda/inode.c coda_inode_cache
fs/dquot.c dquot
fs/efs/super.c efs_inode_cache
fs/ext2/super.c ext2_inode_cache
fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr
fs/ext3/super.c ext3_inode_cache
fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr
fs/fat/cache.c fat_cache
fs/fat/inode.c fat_inode_cache
fs/freevxfs/vxfs_super.c vxfs_inode
fs/hpfs/super.c hpfs_inode_cache
fs/isofs/inode.c isofs_inode_cache
fs/jffs/inode-v23.c jffs_fm
fs/jffs2/super.c jffs2_i
fs/jfs/super.c jfs_ip
fs/minix/inode.c minix_inode_cache
fs/ncpfs/inode.c ncp_inode_cache
fs/nfs/direct.c nfs_direct_cache
fs/nfs/inode.c nfs_inode_cache
fs/ntfs/super.c ntfs_big_inode_cache_name
fs/ntfs/super.c ntfs_inode_cache
fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache
fs/ocfs2/super.c ocfs2_inode_cache
fs/proc/inode.c proc_inode_cache
fs/qnx4/inode.c qnx4_inode_cache
fs/reiserfs/super.c reiser_inode_cache
fs/romfs/inode.c romfs_inode_cache
fs/smbfs/inode.c smb_inode_cache
fs/sysv/inode.c sysv_inode_cache
fs/udf/super.c udf_inode_cache
fs/ufs/super.c ufs_inode_cache
net/socket.c sock_inode_cache
net/sunrpc/rpc_pipe.c rpc_inode_cache
The choice of which slab caches to so mark was quite simple. I marked
those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
inode_cache, and buffer_head, which were marked in a previous patch. Even
though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
potentially large file system i/o related slab caches as we need for memory
spreading.
Given that the rule now becomes "wherever you would have used a
SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
Future file system writers will just copy one of the existing file system
slab cache setups and tend to get it right without thinking.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use ARRAY_SIZE macro instead of sizeof(x)/sizeof(x[0]) and remove a
duplicate of ARRAY_SIZE. Some trailing whitespaces are also deleted.
Signed-off-by: Tobias Klauser <tklauser@nuerscht.ch>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The meaning of MS_VERBOSE is backwards; if the bit is set, it really means,
"don't be verbose". This is confusing and counter-intuitive.
In addition, there is also no way to set the MS_VERBOSE flag in the
mount(8) program in util-linux, but interesting, it does define options
which would do the right thing if MS_SILENT were defined, which
unfortunately we do not:
#ifdef MS_SILENT
{ "quiet", 0, 0, MS_SILENT }, /* be quiet */
{ "loud", 0, 1, MS_SILENT }, /* print out messages. */
#endif
So the obvious fix is to deprecate the use of MS_VERBOSE and replace it
with MS_SILENT.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thanks to Frank Filz for pointing out that we list system.nfs4_acl extended
attribute even on filesystems where we don't actually support nfs4_acl.
This is inconsistent with the e.g. ext3 POSIX ACL behaviour, and seems to
annoy cp.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently this will not happen if we exit before rpc_new_task() was called.
Also fix up rpc_run_task() to do the same (for consistency).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Kudos to Neil Brown for spotting the problem:
"in nfs_sync_inode, there is effectively the sequence:
nfs_wait_on_requests
nfs_flush_inode
nfs_commit_inode
This seems a bit racy to me as if the only requests are on the
->commit list, and nfs_commit_inode is called separately after
nfs_wait_on_requests completes, and before nfs_commit_inode start
(say: by nfs_write_inode) then none of these function will return
>0, yet there will be some pending request that aren't waited for."
The solution is to search for requests to wait upon, search for dirty
requests, and search for uncommitted requests while holding the
nfsi->req_lock
The patch also cleans up nfs_sync_inode(), getting rid of the redundant
FLUSH_WAIT flag. It turns out that we were always setting it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't need to set PG_private for readahead pages, since they never get
unlocked while I/O is in progress. However there is a small race in
nfs_readpage_release() whereby the page may be unlocked, and have
PG_private set.
Fix is to have PG_private set only for the case of writes...
Also fix a bug in nfs_clear_page_writeback(): Don't attempt to clear the
radix_tree tag if we've already deleted the radix tree entry.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the callback daemon is signalled, but is unable to exit because it still
has users, then we need to flush signals. If not, then svc_recv() can
never sleep, and so we hang.
If we flush signals, then we also have to be prepared to resend them when
we want the thread to exit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently it returns NULL, which usually gets interpreted as ENOMEM. In
fact it can mean a host of issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The mount statistics patches introduced a call to nfs_free_iostats that is
not only redundant, but actually causes an oops.
Also fix a memory leak due to the lack of a call to nfs_free_iostats on
unmount.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The struct file_lock does not carry a properly initialised lock,
so don't copy it as if it were.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have aio writes, it is possible for dreq->outstanding to be
zero, but for the I/O not to have completed. Convert struct nfs_direct_req
to use a completion to signal when the I/O is done.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduced by NFS aio+dio patches.
Test plan:
Compile kernel with CONFIG_NFS enabled on 64-bit hardware.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The struct nfs_direct_req currently keeps a pointer to the file descriptor
without referencing it. This may cause problems if the parent process is
killed.
The nfs_open_context should normally have all the information that we're
currently using the filp for, and unlike fput(), is safe to release from
an rpciod process context.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently NFS O_DIRECT writes use FILE_SYNC so that a COMMIT is not
necessary. This simplifies the internal logic, but this could be a
difficult workload for some servers.
Instead, let's send UNSTABLE writes, and after they all complete, send a
COMMIT for the dirty range. After the COMMIT returns successfully, then do
the wake_up or fire off aio_complete().
Test plan:
Async direct I/O tests against Solaris (or any server that requires
committed unstable writes). Reboot server during test.
Based on an earlier patch by Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
^C against "iozone -I" is hitting the assertion in nfs_clear_inode().
Test plan:
"iozone -i0 -I -a -c" against a slow server, then control C. This should
not cause an oops.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Three atomic_t variables cause a lot of bus locking. Because they are all
used in the same places in the code, just use a single spin lock.
Now that the atomic_t variables are gone, we can remove the request size
limitation since the code no longer depends on the limited width of atomic_t
on some platforms.
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions of fsx
operations, iozone, OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up tab damage and comments. Replace "file_offset" with more commonly
used "pos".
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct write path now returns EIOCBQUEUED,
and calls aio_complete when all the requested writes are finished. The
synchronous part of the NFS direct write path behaves exactly as it
was before.
Shared mapped NFS files will have some coherency difficulties when
accessed concurrently with aio+dio. Will need to explore how this
is handled in the local file system case.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct write request
scheduler, and make it available in nfs_direct_write_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the
NFS client's direct write path to pave the way for introducing support
for aio against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Duplicate infrastructure from direct read path that will allow write
path to generate multiple write requests concurrently. This will
enable us to add support for aio in this path.
Temporarily we will lose the ability to do UNSTABLE writes followed by
a COMMIT in the direct write path. However, all applications I am
aware of that use NFS O_DIRECT currently write in relatively small
chunks, so this should not be inconvenient in any way.
Test plan:
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the common piece of completing an NFS direct I/O request.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out a small common piece of the path that allocate nfs_direct_req
structures.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're about to add asynchrony to the NFS direct write path. Begin by
abstracting out the common pieces in the read path.
The first piece is nfs_direct_read_wait, which works the same whether the
process is waiting for a read or a write.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct read path should return EIOCBQUEUED and
call aio_complete when all the requested reads are finished. The
synchronous part of the NFS direct read path behaves exactly as it was
before.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct read request
scheduler, and make it available in nfs_direct_read_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the NFS
client's direct read path to pave the way for introducing support for aio
against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
size_t is used for holding byte counts, so use it for variables storing rsize.
Note that the write path will be updated as we add support for async
O_DIRECT writes.
Test plan:
Need to verify that existing comparisons against new size_t variables behave
correctly.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Update to latest coding style standards. Remove block comments on
statically defined functions, and place function definitions all on
one line.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS client's a_ops->direct_IO method, nfs_direct_IO, is required to
be present to allow NFS files to be opened with O_DIRECT, but is never
called because the NFS client shunts reads and writes to files opened
with O_DIRECT directly to its own routines.
Gut the nfs_direct_IO function. This eliminates the only part of the
NFS client's direct I/O path that requires support for multi-segment
iovs, allowing further simplification in subsequent patches.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions
of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Same callback hierarchy inversion as for the NFS write calls. This patch is
not strictly speaking needed by the O_DIRECT code, but avoids confusing
differences between the asynchronous read and write code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch inverts the callback hierarchy for NFS write calls.
Instead of having the NFSv2/v3/v4-specific code set up the RPC callback
ops, we allow the original caller to do so. This allows for more
flexibility w.r.t. how to set up and tear down the nfs_write_data
structure while still allowing the NFSv3/v4 code to perform error
handling.
The greater flexibility is needed by the asynchronous O_DIRECT code, which
wants to be able to hold on to the original nfs_write_data structures after
the WRITE RPC call has completed in order to be able to replay them if the
COMMIT call determines that the server has rebooted.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
posix_test_lock() returns a pointer to a struct file_lock which is unprotected
and can be removed while in use by the caller. Move the conflicting lock from
the return to a parameter, and copy the conflicting lock.
In most cases the caller ends up putting the copy of the conflicting lock on
the stack. On i386, sizeof(struct file_lock) appears to be about 100 bytes.
We're assuming that's reasonable.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reuse NFSDBG_DIRCACHE and NFSDBG_LOOKUPCACHE to provide additional
diagnostic messages that trace the operation of the NFS client's
directory behavior. A few new messages are now generated when NFSDBG_VFS
is active, as well, to trace normal VFS activity. This compromise
provides better trace debugging for those who use pre-built kernels,
without adding a lot of extra noise to the standard debug settings.
Test-plan:
Enable NFS trace debugging with flags 1, 2, or 4. You should be able to
see different types of trace messages with each flag setting.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up: replace rpc_call() helper with direct call to rpc_call_sync.
This makes NFSv2 and NFSv3 synchronous calls more computationally
efficient, and reduces stack consumption in functions that used to
invoke rpc_call more than once.
Test plan:
Compile kernel with CONFIG_NFS enabled. Connectathon on NFS version 2,
version 3, and version 4 mount points.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add fields to the rpc_procinfo struct that allow the display of a
human-readable name for each procedure in the rpc_iostats output.
Also fix it so that the NFSv4 stats are broken up correctly by
sub-procedure number. NFSv4 uses only two real RPC procedures:
NULL, and COMPOUND.
Test plan:
Mount with NFSv2, NFSv3, and NFSv4, and do "cat /proc/self/mountstats".
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS client now shows various RPC I/O metrics in /proc/self/mountstats.
Test plan:
Mount/umount while doing "cat /proc/self/mountstats", multiple iterations
of connectathon locking suite. Test with NFS version 2, 3, and 4.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a field in nfs_server to record a timestamp when a mount succeeds.
Report the number of seconds the file system has been mounted via
nfs_show_stats().
Test plan:
Mount an NFS file system, watch the mountstats reports and compare with
clock time.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make an inode or an nfs_server struct available in the logic that handles
JUKEBOX/DELAY type errors so the NFS client can account for them.
This patch is split out from the main nfs iostat patch to highlight minor
architectural changes required to support this statistic.
Test plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Invoke the byte and event counter macros where we want to count bytes and
events.
Clean-up: fix a possible NULL dereference in nfs_lock, and simplify
nfs_file_open.
Test-plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a per-superblock performance counter facility to the NFS client. This
facility mimics the counters available for block devices and for
networking. Expose these new counters via the new /proc/self/mountstats
interface.
Thanks to Andrew Morton and Trond Myklebust for their review and comments.
Test plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Get rid of "lock" and "posix", and spell out "vers=".
Test plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sometimes it's important to know the exact RPC retransmit settings the
kernel is using for an NFS mount point. Add this facility to the NFS
client's show_options method.
Test plan:
Set various retransmit settings via the mount command, and check that the
settings are reflected in /proc/mounts.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
semaphore to mutex conversion.
the conversion was generated via scripts, and the result was validated
automatically via a script as well.
build and boot tested.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
this converts fs/nfs to kzalloc() usage.
compile tested with make allyesconfig
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_open_revalidate: 'res' may be used uninitialized
nfs4_callback_compound: ‘hdr_res.nops’ may be used uninitialized
'op_nr’ may be used uninitialized
encode_getattr_res: ‘savep’ may be used uninitialized
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If not, we cannot guarantee that idmap->idmap_dentry, gss_auth->dentry and
clnt->cl_dentry are valid dentries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
My previous "const static" vs "static const" cleanup missed a single case,
patch below takes care of it.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we flush out writes in the case when someone calls utimes() in
order to set the file times.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I've been reading through fs/nfs/write.c trying to track down a bug
that seems to be related to pages loosing a refcount and getting
freed too early (you interested in detail??) and I spotted a little
bug which the following patch should fix.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, there is no serialisation between NFS asynchronous writebacks
and truncation at the page level due to the fact that nfs_sync_inode()
cannot lock the pages that it is about to write out.
This means that it is possible to be flushing out data (and calling something
like set_page_writeback()) while the page cache is busy evicting the page.
Oops...
Use the hooks provided in try_to_release_page() to ensure that dirty pages
are always written back to storage before we evict them.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfs_open_context may live longer than the file descriptor that spawned
it, so it needs to carry a reference to the vfsmount. If not, then
generic_shutdown_super() may end up being called before reads and writes
have been flushed out.
Make a couple of functions static while we're at it...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It turns out that nfs4_proc_get_root() may return raw NFSv4 errors instead of
mapping them to kernel errors. Problem spotted by Neil Horman
<nhorman@tuxdriver.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on an original patch by Mike O'Connor and Greg Banks of SGI.
Mike states:
A normal user can panic an NFS client and cause a local DoS with
'judicious'(?) use of O_DIRECT. Any O_DIRECT write to an NFS file where the
user buffer starts with a valid mapped page and contains an unmapped page,
will crash in this way. I haven't followed the code, but O_DIRECT reads with
similar user buffers will probably also crash albeit in different ways.
Details: when nfs_get_user_pages() calls get_user_pages(), it detects and
correctly handles get_user_pages() returning an error, which happens if the
first page covered by the user buffer's address range is unmapped. However,
if the first page is mapped but some subsequent page isn't, get_user_pages()
will return a positive number which is less than the number of pages requested
(this behaviour is sort of analagous to a short write() call and appears to be
intentional). nfs_get_user_pages() doesn't detect this and hands off the
array of pages (whose last few elements are random rubbish from the newly
allocated array memory) to it's caller, whence they go to
nfs_direct_write_seg(), which then totally ignores the nr_pages it's given,
and calculates its own idea of how many pages are in the array from the user
buffer length. Needless to say, when it comes to transmit those uninitialised
page* pointers, we see a crash in the network stack.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Direct backport of 2.4 fix that didn't get propagated to 2.6; original
comment follows:
<quote>
When I specify the NFS port for nfsroot (e.g.,
nfsroot=<dir>,port=2049), the
kernel uses the wrong port. In my case it tries to use 264 (0x108)
instead
of 2049 (0x801).
This patch adds the missing htons().
Eric
</quote>
Patch got applied in 2.4.21-pre6. Author: Eric Lammerts (<eric@lammerts.org>,
AFAICS).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Turn noatime and nodiratime into per-mount instead of per-sb flags.
After all the preparations this is a rather trivial patch. The mount code
needs to treat the two options as per-mount instead of per-superblock, and
touch_atime needs to be changed to check the new MNT_ flags in addition to
the MS_ flags that are kept for filesystems that are always
noatime/nodiratime but not user settable anymore. Besides that core code
only nfs needed an update because it's leaving atime updates to the server
and thus sets the S_NOATIME flag on every inode, but needs to know whether
it's a real noatime mount for an getattr optimization.
While we're at it I've killed the IS_NOATIME/IS_NODIRATIME macros that were
only used by touch_atime.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch converts the inode semaphore to a mutex. I have tested it on
XFS and compiled as much as one can consider on an ia64. Anyway your
luck with it might be different.
Modified-by: Ingo Molnar <mingo@elte.hu>
(finished the conversion)
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It would be helpful if the kernel did not silently stop parsing
nfs options, but instead warned about any he does not recognize. The
attached patch adds one printk to do just that.
It took me a couple of hours to find my configuration mistake.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch add EXPORT_SYMBOL(filemap_write_and_wait) and use it.
See mm/filemap.c:
And changes the filemap_write_and_wait() and filemap_write_and_wait_range().
Current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
returns error. However, even if filemap_fdatawrite() returned an
error, it may have submitted the partially data pages to the device.
(e.g. in the case of -ENOSPC)
<quotation>
Andrew Morton writes,
If filemap_fdatawrite() returns an error, this might be due to some
I/O problem: dead disk, unplugged cable, etc. Given the generally
crappy quality of the kernel's handling of such exceptions, there's a
good chance that the filemap_fdatawait() will get stuck in D state
forever.
</quotation>
So, this patch doesn't wait if filemap_fdatawrite() returns the -EIO.
Trond, could you please review the nfs part? Especially I'm not sure,
nfs must use the "filemap_fdatawrite(inode->i_mapping) == 0", or not.
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If someone changes the uid/gid mapping in userland, then we do eventually
want those changes to be propagated to the kernel. Currently the kernel
assumes that it may cache entries forever.
Add an expiration time + garbage collector for idmap entries.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
inode->i_mode contains a lot more than just the mode bits. Make sure that
we mask away this extra stuff in SETATTR calls to the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Every ULP that uses the in-kernel RPC client, except the NLM
client, sets cl_chatty. There's no reason why NLM shouldn't set it, so
just get rid of cl_chatty and always be verbose.
Test-plan:
Compile with CONFIG_NFS enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Thanks to Ed Keizer for bug and root cause. He says: "... we could only mount
the top-level Solaris share. We could not mount deeper into the tree.
Investigation showed that Solaris allows UNIX authenticated FSINFO only on the
top level of the share. This is a problem because we share/export our home
directories one level higher than we mount them. I.e. we share the partition
and not the individual home directories. This prevented access to home
directories."
We still may need to try auth_sys for the case where the client doesn't have
appropriate credentials.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Upon return of a write delegation, the server will almost always bump the
change attribute. Ensure that we pick up that change so that we don't
invalidate our data cache unnecessarily.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
According to RFC3530 we're supposed to cache the change attribute
at the time the client receives a write delegation.
If the inode is clean, a CB_GETATTR callback by the server to the
client is supposed to return the cached change attribute.
If, OTOH, the inode is dirty, the client should bump the cached
change attribute by 1.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The SuS states that a call to write() will cause mtime to be updated on
the file. In order to satisfy that requirement, we need to flush out
any cached writes in nfs_getattr().
Speed things up slightly by not committing the writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.
Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.
Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To help NFS users and server developers, make the "inode number mismatch"
message display more useful information.
Test-plan:
None.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_statfs() generates a log message when GETATTR returns an error. This
is usually a useless message. Make it a dprintk.
Test plan:
None
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Red Hat found a problem in the error recovery logic in __init_nfs.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace ad hoc write parameter sanity checking in nfs_file_direct_write()
with a call to generic_write_checks(). This should make the proper checks
modulo the O_LARGEFILE flag, and should catch NFSv2-specific limitations by
virtue of i_sb->s_maxbytes.
Test plan:
Posix compliance testing with both NFSv2 and NFSv3.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In RFC3530, the RENEW operation is allowed to use either
the same principal, RPC security flavour and (if RPCSEC_GSS), the same
mechanism and service that was used for SETCLIENTID_CONFIRM
OR
Any principal, RPC security flavour and service combination that
currently has an OPEN file on the server.
Choose the latter since that doesn't require us to keep credentials for
the same principal for the entire duration of the mount.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Convert private implementations in NFSv4 state recovery and delegation
code to use kthreads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When recovering from a delegation recall or a network partition, we need
to replay open(O_RDWR), open(O_RDONLY) and open(O_WRONLY) separately.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A closer reading of RFC3530 reveals that OPEN_DOWNGRADE must always
specify a access modes that have been the argument of a previous OPEN
operation.
IOW: doing OPEN(O_RDWR) and then OPEN_DOWNGRADE(O_WRONLY) is forbidden
unless the user called OPEN(O_WRONLY)
In order to fix that, we really need to track the three possible open
states separately.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
OPEN is a stateful operation, so we must ensure that it always
completes. In order to allow users to interrupt the operation,
we need to make the RPC call asynchronous, and then wait on
completion (or cancel).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.
The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.
Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we always initiate flushing of data before we exit
a single-page ->writepage() call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFS client prevents mandatory lock, but there is a flaw on it; Locks are
possibly left if the mode is changed while locking.
This permits unlocking even if the mandatory lock bits are set.
Signed-off-by: ASANO Masahiro <masano@tnes.nec.co.jp>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ensure we call unmap_mapping_range() and sync dirty pages to disk before
doing an NFS direct write.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- Missing initialisation of attribute bitmask in _nfs4_proc_write()
- On success, _nfs4_proc_write() must return number of bytes written.
- Missing post_op_update_inode() in _nfs4_proc_write()
- Missing initialisation of attribute bitmask in _nfs4_proc_commit()
- Missing post_op_update_inode() in _nfs4_proc_commit()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we use set_page_writeback() in the appropriate places
to help the VM in keeping its page radix_tree in sync.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Steve Dickson writes:
Doing the following:
1. On server:
$ mkdir ~/t
$ echo Hello > ~/t/tmp
2. On client, wait for a string to appear in this file:
$ until grep -q foo t/tmp ; do echo -n . ; sleep 1 ; done
3. On server, create a *new* file with the same name containing that
string:
$ mv ~/t/tmp ~/t/tmp.old; echo foo > ~/t/tmp
will show how the client will never (and I mean never ;-) ) see
the updated file.
The problem is that we do not update nfsi->cache_change_attribute when the
file changes on the server (we only update it when our client makes the
changes). This again means that functions like nfs_check_verifier() will
fail to register when the parent directory has changed and should trigger
a dentry lookup revalidation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make sure cache_change_attribute is initialized to jiffies
so when the mtime changes on directory, the directory
will be refreshed.
Signed-off by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In cases where the server has gone insane, nfs_update_inode() may end
up calling nfs_invalidate_inode(), which again calls stuff that takes
the inode->i_lock that we're already holding.
In addition, given the sort of things we have in NFS these days that
need to be cleaned up on inode release, I'm not sure we should ever
be calling make_bad_inode().
Fix up spinlock recursion, and limit nfs_invalidate_inode() to clearing
the caches, and marking the inode as being stale.
Thanks to Steve Dickson <SteveD@redhat.com> for spotting this.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When caching locks due to holding a file delegation, we must always
check against local locks before sending anything to the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is the fs/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in fs/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix some dprintk's so that NLM, NFS client, and RPC client compile
cleanly if CONFIG_SYSCTL is disabled.
Test plan:
Compile kernel with CONFIG_NFS enabled and CONFIG_SYSCTL disabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have a method of dealing with delegation recalls, actually
enable the caching of posix and BSD locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Delegations allow us to cache posix and BSD locks, however when the
delegation is recalled, we need to "flush the cache" and send
the cached LOCK requests to the server.
This patch sets up the mechanism for doing so.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I missed this one... Any form of rename will result in a delegation
recall, so it is more efficient to return the one we hold before
trying the rename.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC 3530 states that for OPEN_DOWNGRADE "The share_access and share_deny
bits specified must be exactly equal to the union of the share_access and
share_deny bits specified for some subset of the OPENs in effect for
current openowner on the current file.
Setattr is currently violating the NFSv4 rules for OPEN_DOWNGRADE in that
it may cause a downgrade from OPEN4_SHARE_ACCESS_BOTH to
OPEN4_SHARE_ACCESS_WRITE despite the fact that there exists no open file
with O_WRONLY access mode.
Fix the problem by replacing nfs4_find_state() with a modified version of
nfs_find_open_context().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We must not remove the nfs4_state structure from the inode open lists
before we are in sequence lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Optimise attribute revalidation when hardlinking. Add post-op attributes
for the directory and the original inode.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
"Optional" means that the close call will not fail if the getattr
at the end of the compound fails.
If it does succeed, try to refresh inode attributes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since the directory attributes change every time we CREATE a file,
we might as well pick up the new directory attributes in the same
compound.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_lookup() used to consult a lookup cache before trying an actual wire
lookup operation. The lookup cache would be invalid, of course, if the
parent directory's mtime had changed, so nfs_lookup performed an inode
revalidation on the parent.
Since nfs_lookup() doesn't use a cache anymore, the revalidation is no
longer necessary. There are cases where it will generate a lot of
unnecessary GETATTR traffic.
See http://bugzilla.linux-nfs.org/show_bug.cgi?id=9
Test-plan:
Use lndir and "rm -rf" and watch for excess GETATTR traffic or application
level errors.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since we almost always call nfs_end_data_update() after we called
nfs_refresh_inode(), we now end up marking the inode metadata
as needing revalidation immediately after having updated it.
This patch rearranges things so that we mark the inode as needing
revalidation _before_ we call nfs_refresh_inode() on those operations
that need it.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow nfs_refresh_inode() also to update attributes on the inode if the
RPC call was sent after the last call to nfs_update_inode().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
resp_len is passed in as buffer size to decode routine; make sure it's
set right in case where userspace provides less than a page's worth of
buffer.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Stop handing garbage to userspace in the case where a weird server clears the
acl bit in the getattr return (despite the fact that they've already claimed
acl support.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Storing a pointer to the struct rpc_task in the nfs_seqid is broken
since the nfs_seqid may be freed well after the task has been destroyed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If someone tries to rename a directory onto an empty directory, we
currently fail and return EBUSY.
This patch ensures that we try the rename if both source and target
are directories, and that we fail with a correct error of EISDIR if
the source is not a directory.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server is in the unconfirmed OPEN state for a given open owner
and receives a second OPEN for the same open owner, it will cancel the
state of the first request and set up an OPEN_CONFIRM for the second.
This can cause a race that is discussed in rfc3530 on page 181.
The following patch allows the client to recover by retrying the
original open request.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make NFSv4 return the fully initialized file pointer with the
stateid that it created in the lookup w/intent.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We no longer need to worry about collisions between close() and the state
recovery code, since the new close will automatically recheck the
file state once it is done waiting on its sequence slot.
Ditto for the nfs4_proc_locku() procedure.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Once the state_owner and lock_owner semaphores get removed, it will be
possible for other OPEN requests to reopen the same file if they have
lower sequence ids than our CLOSE call.
This patch ensures that we recheck the file state once
nfs_wait_on_sequence() has completed waiting.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 file state-changing functions such as OPEN, CLOSE, LOCK,... are all
labelled with "sequence identifiers" in order to prevent the server from
reordering RPC requests, as this could cause its file state to
become out of sync with the client.
Currently the NFS client code enforces this ordering locally using
semaphores to restrict access to structures until the RPC call is done.
This, of course, only works with synchronous RPC calls, since the
user process must first grab the semaphore.
By dropping semaphores, and instead teaching the RPC engine to hold
the RPC calls until they are ready to be sent, we can extend this
process to work nicely with asynchronous RPC calls too.
This patch adds a new list called "rpc_sequence" that defines the order
of the RPC calls to be sent. We add one such list for each state_owner.
When an RPC call is ready to be sent, it checks if it is top of the
rpc_sequence list. If so, it proceeds. If not, it goes back to sleep,
and loops until it hits top of the list.
Once the RPC call has completed, it can then bump the sequence id counter,
and remove itself from the rpc_sequence list, and then wake up the next
sleeper.
Note that the state_owner sequence ids and lock_owner sequence ids are
all indexed to the same rpc_sequence list, so OPEN, LOCK,... requests
are all ordered w.r.t. each other.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Oopsable since nfs_wait_on_inode() can get called as part of iput_final().
Unnecessary since the caller had better be damned sure that the inode won't
disappear from underneath it anyway.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the data cache has been marked as potentially invalid by nfs_refresh_inode,
we should invalidate it rather than assume that changes are due to our own
activity.
Also ensure that we always start with a valid cache before declaring it
to be protected by a delegation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently rpc_mkdir/rpc_rmdir and rpc_mkpipe/mk_unlink have an API that's
a little unfortunate. They take a path relative to the rpc_pipefs root and
thus need to perform a full lookup. If you look at debugfs or usbfs they
always store the dentry for directories they created and thus can pass in
a dentry + single pathname component pair into their equivalents of the
above functions.
And in fact rpc_pipefs actually stores a dentry for all but one component so
this change not only simplifies the core rpc_pipe code but also the callers.
Unfortuntately this code path is only used by the NFS4 idmapper and
AUTH_GSSAPI for which I don't have a test enviroment. Could someone give
it a spin? It's the last bit needed before we can rework the
lookup_hash API
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Each transport implementation can now set unique bind, connect,
reestablishment, and idle timeout values. These are variables,
allowing the values to be modified dynamically. This permits
exponential backoff of any of these values, for instance.
As an example, we implement exponential backoff for the connection
reestablishment timeout.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement a best practice: don't use exponential backoff when computing
retransmit timeout values on TCP connections, but simply retransmit
at regular intervals.
This also fixes a bug introduced when xprt_reset_majortimeo() was added.
Test-plan:
Enable RPC debugging and watch timeout behavior on a NFS/TCP mount.
Version: Thu, 11 Aug 2005 16:02:19 -0400
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fixes a condition whereby the kernel is returning the non-POSIX error
EBADCOOKIE to userspace.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When doing a rename on top of an existing file that is not in use,
the inode of the overwritten file will remain in the icache.
The fix is to decrement i_nlink of the overwritten inode, like we
do for unlink, rmdir etc already.
Problem diagnosed by Olaf Kirch. This patch is a slight variation
on his fix.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_readpage_release() causes an oops while accessing a file with NFS
debugging turned on (echo 32767 > /proc/sys/sunrpc/nfs_debug) and a kernel
built with CONFIG_DEBUG_SLAB.
This patch moves the debugging statement above nfs_release_request() to
avoid accessing freed memory.
Signed-off-by: Nick Wilson <njw@osdl.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use schedule_timeout_{,un}interruptible() instead of
set_current_state()/schedule_timeout() to reduce kernel size. Also use helper
functions to convert between human time units and jiffies rather than constant
HZ division to avoid rounding errors.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update the file systems in fs/ implementing a delete_inode() callback to
call truncate_inode_pages(). One implementation note: In developing this
patch I put the calls to truncate_inode_pages() at the very top of those
filesystems delete_inode() callbacks in order to retain the previous
behavior. I'm guessing that some of those could probably be optimized.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This bug could cause oopses and page state corruption, because ncpfs
used the generic page-cache symlink handlign functions. But those
functions only work if the page cache is guaranteed to be "stable", ie a
page that was installed when the symlink walk was started has to still
be installed in the page cache at the end of the walk.
We could have fixed ncpfs to not use the generic helper routines, but it
is in many ways much cleaner to instead improve on the symlink walking
helper routines so that they don't require that absolute stability.
We do this by allowing "follow_link()" to return a error-pointer as a
cookie, which is fed back to the cleanup "put_link()" routine. This
also simplifies NFS symlink handling.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Down the road we want to eliminate the use of the global kernel lock entirely
from the NFS client. To do this, we need to protect the fields in the
nfs_inode structure adequately. Start by serializing updates to the
"cache_validity" field.
Note this change addresses an SMP hang found by njw@osdl.org, where processes
deadlock because nfs_end_data_update and nfs_revalidate_mapping update the
"cache_validity" field without proper serialization.
Test plan:
Millions of fsx ops on SMP clients. Run Nick Wilson's breaknfs program on
large SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce atomic bitops to manipulate the bits in the nfs_inode structure's
"flags" field.
Using bitops means we can use a generic wait_on_bit call instead of an ad hoc
locking scheme in fs/nfs/inode.c, so we can remove the "nfs_i_wait" field from
nfs_inode at the same time.
The other new flags field will continue to use bitmask and logic AND and OR.
This permits several flags to be set at the same time efficiently. The
following patch adds a spin lock to protect these flags, and this spin lock
will later cover other fields in the nfs_inode structure, amortizing the cost
of using this type of serialization.
Test plan:
Millions of fsx ops on SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Certain bits in nfsi->flags can be manipulated with atomic bitops, and some
are better manipulated via logical bitmask operations.
This patch splits the flags field into two. The next patch introduces atomic
bitops for one of the fields.
Test plan:
Millions of fsx ops on SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the client performs an exclusive create and opens the file for writing,
a Netapp filer will first create the file using the mode 01777. It does this
since an NFSv3/v4 exclusive create cannot immediately set the mode bits.
The 01777 mode then gets put into the inode->i_mode. After the file creation
is successful, we then do a setattr to change the mode to the correct value
(as per the NFS spec).
The problem is that nfs_refresh_inode() no longer updates inode->i_mode, so
the latter retains the 01777 mode. A bit later, the VFS notices this, and calls
remove_suid(). This of course now resets the file mode to inode->i_mode & 0777.
Hey presto, the file mode on the server is now magically changed to 0777. Duh...
Fixes http://bugzilla.linux-nfs.org/show_bug.cgi?id=32
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Looks like it sneaked back with the NFS ACL merge..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following patch removes the f_error field and all checks of f_error.
Trond said:
f_error was introduced for NFS, and made sense when we were guaranteed
always to have a file pointer around when write errors occurred. Since
then, we have (for various reasons) had to introduce the nfs_open_context in
order to track the file read/write state, and it made sense to move our
f_error tracking there too.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Request RDATTR_ERROR as an attribute in readdir to distinguish between a
directory being within an absent filesystem or one (or more) of its entries.
Signed-off-by: Manoj Naik <manoj@almaden.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Basically copies the VFS's method for tracking writebacks and applies
it to the struct nfs_page.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Unless we're doing O_APPEND writes, we really don't care about revalidating
the file length. Just make sure that we catch any page cache invalidations.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of looking at whether or not the file is open for writes before
we accept to update the length using the server value, we should rather
be looking at whether or not we are currently caching any writes.
Failure to do so means in particular that we're not updating the file
length correctly after obtaining a POSIX or BSD lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we do not hold a valid stateid that is open for writes, there is little
point in doing an extra open of the file, as the RFC does not appear to
mandate this...
Make setattr use the correct stateid if we're holding mandatory byte
range locks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv3 currently returns the unsigned 64-bit cookie directly to
userspace. The following patch causes the kernel to generate
loff_t offsets for the benefit of userland.
The current server-generated READDIR cookie is cached in the
nfs_open_context instead of in filp->f_pos, so we still end up work
correctly under directory insertions/deletion.
Signed-off-by: Olivier Galibert <galibert@pobox.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The changeset "trond.myklebust@fys.uio.no|ChangeSet|20050322152404|16979"
(RPC: Ensure XDR iovec length is initialized correctly in call_header)
causes the NFSv4 callback code to BUG() due to an incorrectly initialized
scratch buffer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Older gcc's don't like this.
fs/nfs/nfs4proc.c:2194: field `data' has incomplete type
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The Coverity checker noticed that such a simplification was possible.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Pointer arithmetic bug: p is in word units. This fixes a memory
corruption with big acls.
* Initialize pg_class to prevent a NULL pointer access.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attach acls to inodes in the icache to avoid unnecessary GETACL RPC
round-trips. As long as the client doesn't retrieve any acls itself, only the
default acls of exiting directories and the default and access acls of new
directories will end up in the cache, which preserves some memory compared to
always caching the access and default acl of all files.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv3 has no concept of a umask on the server side: The client applies
the umask locally, and sends the effective permissions to the server.
This behavior is wrong when files are created in a directory that has a
default ACL. In this case, the umask is supposed to be ignored, and
only the default ACL determines the file's effective permissions.
Usually its the server's task to conditionally apply the umask. But
since the server knows nothing about the umask, we have to do it on the
client side. This patch tries to fetch the parent directory's default
ACL before creating a new file, computes the appropriate create mode to
send to the server, and finally sets the new file's access and default
acl appropriately.
Many thanks to Buck Huppmann <buchk@pobox.com> for sending the initial
version of this patch, as well as for arguing why we need this change.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This adds acl support fo nfs clients via the NFSACL protocol extension, by
implementing the getxattr, listxattr, setxattr, and removexattr iops for the
system.posix_acl_access and system.posix_acl_default attributes. This patch
implements a dumb version that uses no caching (and thus adds some overhead).
(Another patch in this patchset adds caching as well.)
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently we return -ENOMEM for every single failure to create a new auth.
This is actually accurate for auth_null and auth_unix, but for auth_gss it's a
bit confusing.
Allow rpcauth_create (and the ->create methods) to return errors. With this
patch, the user may sometimes see an EINVAL instead. Whee.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add nfs4_acl field to the nfs_inode, and use it to cache acls. Only cache
acls of size up to a page. Also prepare for up to a page of acl data even
when the user doesn't pass in a buffer, as when they want to get the acl
length to decide what size buffer to allocate.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side write support for NFSv4 ACLs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 acls: xdr encoding and decoding routines for
writing acls
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 ACLs. Exports the raw xdr code via the
system.nfs4_acl extended attribute. It is up to userspace to decode the acl
(and to provide correctly xdr'd acls on setxattr), and to convert to/from
POSIX ACLs if desired.
This patch provides only the read support.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Client-side support for NFSv4 acls: xdr encoding and decoding routines for
reading acls
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make nfs4 fattr size calculations more explicit, revising them downward a
bit in the process.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add {get,set,list}xattr methods for nfs4. The new methods are no-ops, to be
used by subsequent ACL patch.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
ACL support will require supporting additional inode operations in v4
(getxattr, setxattr, listxattr). This patch allows different protocol versions
to support different inode operations by adding a file_inode_ops to the
nfs_rpc_ops (to match the existing dir_inode_ops).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we fix up the missing fields in the nfs_mount_data with
sane defaults for older versions of mount, and return errors in the
cases where we cannot.
Convert a bunch of annoying warnings into dprintks()
Return -EPROTONOSUPPORT rather than EIO if mount() tries to set NFSv3
without it actually being compiled in.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This fixes a data corruption error for mail delivery applications that
expect to be able to do posix locking and then append writes on NFS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We should never apply a lookup intent to anything other than the last
path component in an open(), create() or access() call.
Introduce the helper nfs_lookup_check_intent() which always returns
zero if LOOKUP_CONTINUE or LOOKUP_PARENT are set, and returns the
intent flags if we're on the last component of the lookup.
By doing so, we fix a bug in open(O_EXCL), where we may end up
optimizing away a real lookup of the parent directory.
Problem noticed by Linda Dunaphant <linda.dunaphant@ccur.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes some needlessly global identifiers static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Arjan van de Ven <arjanv@infradead.org>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!