The server should return NFS4ERR_ATTRNOTSUPP if a specified attribute is
not supported in the current environment.
The CREATE, NVERIFY, OPEN, SETATTR and VERIFY operations should perform this check.
This bug was found while running the newpynfs tests. The following tests
failed:
CR12 NVF7a NVF7b NVF7c NVF7d NVF7f NVF7r NVF7s
OPEN15 VF7a VF7b VF7c VF7d VF7f VF7r VF7s
Add function do_check_fattr() to do the exact check:
1. Check whether the specified attribute is supported by the NFSv4 server.
2. Check whether FATTR4_WORD0_ACL and FATTR4_WORD0_FS_LOCATIONS are supported
in the current environment.
3. Check whether the specified attribute is writable.
Steps 1 and 3 were previously done in nfsd4_decode_fattr(), but have been
moved into this function.
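The three checks can be sketched in user-space C as follows. This is a minimal illustration, not the code of do_check_fattr(): the bit values, the two availability flags, the function name check_fattr() and the choice of an "invalid" error for a non-writable attribute are all assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical attribute bits, for illustration only. */
#define ATTR_TYPE          (1u << 1)
#define ATTR_SIZE          (1u << 4)
#define ATTR_ACL           (1u << 12)
#define ATTR_FS_LOCATIONS  (1u << 24)

#define SUPPORTED_ATTRS (ATTR_TYPE | ATTR_SIZE | ATTR_ACL | ATTR_FS_LOCATIONS)
#define WRITABLE_ATTRS  (ATTR_SIZE | ATTR_ACL)

/* Illustrative status codes (assumed values). */
enum { CHECK_OK = 0, ERR_INVAL = 22, ERR_ATTRNOTSUPP = 10032 };

static int check_fattr(uint32_t requested, int acl_available,
                       int fs_locations_available, int writing)
{
    /* Step 1: is every requested attribute known to the server at all? */
    if (requested & ~SUPPORTED_ATTRS)
        return ERR_ATTRNOTSUPP;

    /* Step 2: are the environment-dependent attributes available here? */
    if ((requested & ATTR_ACL) && !acl_available)
        return ERR_ATTRNOTSUPP;
    if ((requested & ATTR_FS_LOCATIONS) && !fs_locations_available)
        return ERR_ATTRNOTSUPP;

    /* Step 3: when setting attributes, is every one of them writable? */
    if (writing && (requested & ~WRITABLE_ATTRS))
        return ERR_INVAL;

    return CHECK_OK;
}
```

Keeping all three steps in one function means every operation that carries an attribute bitmap can apply the same policy with a single call.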
Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The file nfsfh.c contains two static variables nfsd_nr_verified and
nfsd_nr_put. These are counters which are incremented as a side
effect of the fh_verify(), fh_compose() and fh_put() operations,
i.e. at least twice per NFS call for any non-trivial workload.
Needless to say this makes the cacheline that contains them (and any
other innocent victims) a very hot contention point indeed under high
call-rate workloads on a multiprocessor NFS server. It also turns out
that these counters are not used anywhere. They're not reported to
userspace, they're not used in logic, they're not even exported from
the object file (let alone the module). All they do is waste CPU time.
So this patch removes them.
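The cacheline effect described above can be illustrated with a small layout sketch. The struct names, the field cache_policy and the 64-byte line size are assumptions for the example. The patch itself simply deletes the counters; the padded layout is only shown as the usual alternative when a hot counter has to stay.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64  /* assumed line size; real hardware varies */

/* Problem layout: a counter written on every call sits next to
 * read-mostly data, so each write invalidates the line for all readers. */
struct shared_line {
    unsigned long nr_verified;   /* written on every call */
    unsigned long cache_policy;  /* only read in the fast path */
};

/* Alternative layout: each field starts on its own cacheline. */
struct split_lines {
    alignas(CACHELINE) unsigned long nr_verified;
    alignas(CACHELINE) unsigned long cache_policy;
};

/* True when two byte offsets fall into the same cacheline. */
static int same_cacheline(size_t off_a, size_t off_b)
{
    return off_a / CACHELINE == off_b / CACHELINE;
}
```

In the first struct both fields share a line, which is the situation the profile shows for nfsd_choose_ofc(); in the second they do not.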
Tests on a 16 CPU Altix A4700 with 2 10gige Myricom cards, configured
separately (no bonding). Workload is 640 client threads doing directory
traversals with random small reads, from server RAM.
Before
======
Kernel profile:
     % cumulative     self              self   total
  time    samples  samples    calls   1/call  1/call  name
  6.05    2716.00  2716.00    30406     0.09    1.02  svc_process
  4.44    4706.00  1990.00     1975     1.01    1.01  spin_unlock_irqrestore
  3.72    6376.00  1670.00     1666     1.00    1.00  svc_export_put
  3.41    7907.00  1531.00     1786     0.86    1.02  nfsd_ofcache_lookup
  3.25    9363.00  1456.00    10965     0.13    1.01  nfsd_dispatch
  3.10   10752.00  1389.00     1376     1.01    1.01  nfsd_cache_lookup
  2.57   11907.00  1155.00     4517     0.26    1.03  svc_tcp_recvfrom
...
  2.21   15352.00  1003.00     1081     0.93    1.00  nfsd_choose_ofc <----
^^^^
Here the function nfsd_choose_ofc() reads a global variable
which by accident happened to be located in the same cacheline as
nfsd_nr_verified.
Call rate:
nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 00:15:27 184780.663
Thu Dec 13 00:15:28 184885.881
Thu Dec 13 00:15:29 184449.215
Thu Dec 13 00:15:30 184971.058
Thu Dec 13 00:15:31 185036.052
Thu Dec 13 00:15:32 185250.475
Thu Dec 13 00:15:33 184481.319
Thu Dec 13 00:15:34 185225.737
Thu Dec 13 00:15:35 185408.018
Thu Dec 13 00:15:36 185335.764
After
=====
kernel profile:
     % cumulative     self              self   total
  time    samples  samples    calls   1/call  1/call  name
  6.33    2813.00  2813.00    29979     0.09    1.01  svc_process
  4.66    4883.00  2070.00     2065     1.00    1.00  spin_unlock_irqrestore
  4.06    6687.00  1804.00     2182     0.83    1.00  nfsd_ofcache_lookup
  3.20    8110.00  1423.00    10932     0.13    1.00  nfsd_dispatch
  3.03    9456.00  1346.00     1343     1.00    1.00  nfsd_cache_lookup
  2.62   10622.00  1166.00     4645     0.25    1.01  svc_tcp_recvfrom
[...]
  0.10   42586.00    44.00       74     0.59    1.00  nfsd_choose_ofc <--- HA!!
^^^^
Call rate:
nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 01:45:28 194677.118
Thu Dec 13 01:45:29 193932.692
Thu Dec 13 01:45:30 194294.364
Thu Dec 13 01:45:31 194971.276
Thu Dec 13 01:45:32 194111.207
Thu Dec 13 01:45:33 194999.635
Thu Dec 13 01:45:34 195312.594
Thu Dec 13 01:45:35 195707.293
Thu Dec 13 01:45:36 194610.353
Thu Dec 13 01:45:37 195913.662
Thu Dec 13 01:45:38 194808.675
i.e. about a 5.3% improvement in call rate.
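The 5.3% figure can be reproduced from the two pmdumptext samples listed above by comparing their means. The helper names below are made up for this check; only the sample values come from the measurements.

```c
#include <stddef.h>

/* Calls per second, copied from the "Before" and "After" samples. */
static const double before_rate[] = {
    184780.663, 184885.881, 184449.215, 184971.058, 185036.052,
    185250.475, 184481.319, 185225.737, 185408.018, 185335.764,
};
static const double after_rate[] = {
    194677.118, 193932.692, 194294.364, 194971.276, 194111.207,
    194999.635, 195312.594, 195707.293, 194610.353, 195913.662,
    194808.675,
};

static double mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative change of the mean call rate, in percent. */
static double improvement_percent(void)
{
    double b = mean(before_rate, sizeof before_rate / sizeof before_rate[0]);
    double a = mean(after_rate, sizeof after_rate / sizeof after_rate[0]);
    return (a / b - 1.0) * 100.0;
}
```

The means come out at roughly 184982 and 194849 calls per second, a ratio of about 1.053.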
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Reviewed-by: David Chinner <dgc@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix a regression in the reply cache introduced when the code was
converted to use proper Linux lists. When a new entry needs to be
inserted, the case where all the entries are currently being used
by threads is not correctly detected. This can result in memory
corruption and a crash. In the current code this is an extremely
unlikely corner case; it would require the machine to have 1024
nfsd threads and all of them to be busy at the same time. However,
upcoming reply cache changes make this more likely; a crash due to
this problem was actually observed in the field.
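As a rough sketch of the "all entries busy" case, the following user-space code walks a circular LRU list looking for a reusable entry. The struct and function names are hypothetical and the list is hand-rolled rather than the kernel's list_head. The point of the example is that on a circular list the cursor never becomes NULL, so exhaustion has to be detected by noticing that the walk has returned to the list head.

```c
#include <stddef.h>

struct entry {
    struct entry *next, *prev;  /* links in a circular LRU list */
    int in_progress;            /* nonzero while a thread is using it */
};

static void lru_init(struct entry *head)
{
    head->next = head;
    head->prev = head;
}

static void lru_add_tail(struct entry *head, struct entry *e)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

/* Returns the least recently used idle entry, or NULL if all are busy. */
static struct entry *find_reusable(struct entry *head)
{
    struct entry *e;

    for (e = head->next; e != head; e = e->next)
        if (!e->in_progress)
            return e;

    /* The walk came back to the head: every entry is in use.
     * The cursor equals head here, never NULL, so the caller must
     * not treat it as a valid entry. */
    return NULL;
}
```

A caller that receives NULL has to give up on caching this request instead of overwriting an entry that another thread is still using.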
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Make REQHASH() an inline function. Rename hash_list to cache_hash.
Fix an obsolete comment.
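For illustration, a hash macro of this kind might look as follows once turned into an inline function. The table size of 64 and the particular fold (XOR of the top byte into the low bits) are assumptions for the sketch, and request_hash() is a hypothetical name.

```c
#include <stdint.h>

#define HASHSIZE 64  /* assumed power-of-two number of hash buckets */

/* Inline function instead of a macro: the argument is evaluated once
 * and has a checked type. */
static inline uint32_t request_hash(uint32_t xid)
{
    uint32_t h = xid;

    h ^= (xid >> 24);          /* fold the top byte into the low bits */
    return h & (HASHSIZE - 1); /* reduce to a bucket index */
}
```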
Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
lockd/svclock.c is missing a header file <linux/fs.h>.
<linux/fs.h> is missing a definition of locks_release_private()
for the config case of FILE_LOCKING=n, causing a build error:
fs/lockd/svclock.c:330: error: implicit declaration of function 'locks_release_private'
lockd without FILE_LOCKING doesn't make sense, so make LOCKD and LOCKD_V4
depend on FILE_LOCKING, and make NFS depend on FILE_LOCKING.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Eliminate 56 sparse warnings like this one:
fs/nfsd/nfs4xdr.c:1331:15: warning: obsolete array initializer, use C99 syntax
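The warning refers to the old GNU array initializer form, which omits the "=" after the index. A small sketch of the difference, using a made-up enum and table rather than the actual nfs4xdr.c contents:

```c
enum op { OP_ACCESS, OP_CLOSE, OP_COMMIT, OP_COUNT };

/* Obsolete GNU form that sparse complains about:
 *     [OP_ACCESS] "access",
 * C99 designated initializer form: */
static const char *const op_names[OP_COUNT] = {
    [OP_ACCESS] = "access",
    [OP_CLOSE]  = "close",
    [OP_COMMIT] = "commit",
};

/* Bounds-checked lookup helper (hypothetical). */
static const char *op_name(int op)
{
    if (op < 0 || op >= OP_COUNT)
        return "unknown";
    return op_names[op];
}
```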
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Move this out of a local variable into the nfs4_delegation object in
preparation for making this an async rpc call (at which point we'll need
any state like this in a common object that's preserved across function
calls).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's no point in keeping this field around--it's always zero.
(Background: the protocol allows you to tell the client that the file is
about to be truncated, as an optimization to save the client from
writing back dirty pages that will just be discarded. We don't
implement this hint. If we do some day, adding this field back in will
be the least of the work involved.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The nfs4_cb_recall struct is used only in nfs4_delegation, so its
pointer to the containing delegation is unnecessary--we could just use
container_of().
But there's no real reason to have this a separate struct at all--just
move these fields to nfs4_delegation.
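The container_of() alternative mentioned above works as in this user-space sketch. The struct names and fields are simplified stand-ins, and the macro is a plain offsetof-based version without the kernel's type checking.

```c
#include <stddef.h>

/* Simplified user-space container_of(): recover the enclosing struct
 * from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct cb_recall {
    int ident;                 /* hypothetical field */
};

struct delegation {
    int id;                    /* hypothetical field */
    struct cb_recall recall;   /* embedded, so no back pointer is needed */
};

static struct delegation *recall_to_delegation(struct cb_recall *cbr)
{
    return container_of(cbr, struct delegation, recall);
}
```

Once the fields live directly in the delegation, even this conversion goes away, which is the simplification the patch chooses.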
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
I want to use the name for a struct that actually does represent a
single callback.
(Actually, I've never been sure it helps to have a separate struct for the
callback information. Some day maybe those fields could just be dumped
into struct nfs4_client. I don't know.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We don't really need a synchronous rpc, and moving to an asynchronous
rpc allows us to do without this extra kthread.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The code is a little simpler, and it should be easier to avoid races, if
we just do all rpc client creation/destruction from nfsd or laundromat
threads and do only the rpc calls themselves asynchronously. The rpc
creation doesn't involve any significant waiting (it doesn't call the
client, for example), so there's no reason not to do this.
Also don't bother destroying the client on failure of the rpc null
probe. We may want to retry the probe later anyway.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We tried to do something overly complicated with the callback rpc
timeouts here. And they're wrong--the result is that by the time a
single callback times out, it's already too late to tell the client
(using the cb_path_down return to RENEW) that the callback is down.
Use a much shorter, simpler timeout.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This setclientid_confirm case should allow the client to change
callbacks, but it currently has a dummy implementation that just turns
off callbacks completely. That dummy implementation isn't completely
correct either, though:
- There's no need to remove any client recovery directory in
this case.
- New clientid confirm verifiers should be generated (and
returned) in setclientid; there's no need to generate a new
one here.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Stephen Rothwell said:
"Today's linux-next build (powerpc ppc64_defconfig) produced this new
warning:
fs/nfsd/nfs4state.c: In function 'EXPIRED_STATEID':
fs/nfsd/nfs4state.c:2757: warning: comparison of distinct pointer types lacks a cast
Caused by commit 78155ed75f ("nfsd4:
distinguish expired from stale stateids")."
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
ext4 supports a real NFSv4 change attribute, which is bumped whenever
the ctime would be updated, including times when two updates arrive
within a jiffy of each other. (Note that although ext4 has space for
nanosecond-precision ctime, the real resolution is lower: it actually
uses jiffies as the time-source.) This ensures clients will invalidate
their caches when they need to.
There is some fear that keeping the i_version up-to-date could have
performance drawbacks, so for now it's turned on only by a mount option.
We hope to do something better eventually.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Theodore Tso <tytso@mit.edu>
We don't need comments to tell us these macros are ugly. And we're long
past trying to share any of this code with the BSDs.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: For consistency, handle output buffer size checking in
other nfsctl functions the same way it's done for write_versions().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
While it's not likely today that there are enough NFS versions to
overflow the output buffer in write_versions(), we should be more
careful about detecting the end of the buffer.
The number of NFS versions will only increase as NFSv4 minor versions
are added.
Note that this API doesn't behave the same as portlist. Here we
attempt to display as many versions as will fit in the buffer, and do
not provide any indication that an overflow would have occurred. I
don't have any good rationale for that.
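The "display as many as will fit" behaviour can be sketched in user-space C as below. This is an illustration of the bounds-checking pattern only; format_versions() and its output format are assumptions, not the write_versions() code.

```c
#include <stddef.h>
#include <stdio.h>

/* Append "+N" for each version while it still fits. Entries that do
 * not fit are silently dropped. Returns the number of bytes written,
 * not counting the terminating NUL. */
static int format_versions(char *buf, size_t size, const int *vers, size_t n)
{
    size_t used = 0;

    if (size == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        size_t remaining = size - used;
        int len = snprintf(buf + used, remaining, "%s+%d",
                           i ? " " : "", vers[i]);

        if (len < 0 || (size_t)len >= remaining) {
            buf[used] = '\0';  /* drop the partially written entry */
            break;
        }
        used += (size_t)len;
    }
    return (int)used;
}
```

Comparing the snprintf() return value against the remaining space is what detects the end of the buffer; the caller gets no indication that entries were dropped.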
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
While it's not likely a pathname will be longer than
SIMPLE_TRANSACTION_SIZE, we should be more careful about just
plopping it into the output buffer without bounds checking.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Adjust the synopsis of svc_sock_names() to pass in the size of the
output buffer. Add a documenting comment.
This is a cosmetic change for now. A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Adjust the synopsis of svc_addsock() to pass in the size of the output
buffer. Add a documenting comment.
This is a cosmetic change for now. A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The svc_xprt_names() function can overflow its buffer if it's so near
the end of the passed in buffer that the "name too long" string still
doesn't fit. Of course, it could never tell if it was near the end
of the passed in buffer, since its only caller passes in zero as the
buffer length.
Let's make this API a little safer.
Change svc_xprt_names() so it *always* checks for a buffer overflow,
and change its only caller to pass in the correct buffer length.
If svc_xprt_names() does overflow its buffer, it now fails with an
ENAMETOOLONG errno, instead of trying to write a message at the end
of the buffer. I don't like this much, but I can't figure out a clean
way that's always safe to return some of the names, *and* an
indication that the buffer was not long enough.
The displayed error when doing a 'cat /proc/fs/nfsd/portlist' is
"File name too long".
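A minimal sketch of the "fail instead of truncating" approach, assuming a hypothetical list_names() helper and one name per line. It is not the svc_xprt_names() implementation; clearing the buffer on failure is a choice made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Write each name followed by a newline. Returns the number of bytes
 * written, or -ENAMETOOLONG if any name does not fit. */
static int list_names(char *buf, size_t buflen,
                      const char *const *names, size_t n)
{
    size_t pos = 0;

    if (buflen == 0)
        return -ENAMETOOLONG;
    buf[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        size_t remaining = buflen - pos;
        int len = snprintf(buf + pos, remaining, "%s\n", names[i]);

        if (len < 0 || (size_t)len >= remaining) {
            buf[0] = '\0';        /* do not hand back a partial list */
            return -ENAMETOOLONG;
        }
        pos += (size_t)len;
    }
    return (int)pos;
}
```

When the error reaches user space, strerror(ENAMETOOLONG) yields the "File name too long" message quoted above.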
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up.
A couple of years ago, a series of commits, finishing with commit
5680c446, swapped the order of the lockd_up() and svc_addsock() calls
in __write_ports(). At that time lockd_up() needed to know the
transport protocol of the passed-in socket to start a listener on the
same transport protocol.
These days, lockd_up() doesn't take a protocol argument; it always
starts both a UDP and TCP listener. It's now more straightforward to
try the lockd_up() first, then do a lockd_down() if the svc_addsock()
fails.
Careful review of this code shows that the svc_sock_names() call is
used only to close the just-opened socket in case lockd_up() fails.
So it is no longer needed if lockd_up() is done first.
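The resulting ordering can be sketched as follows. The fake_* functions are hypothetical stand-ins for lockd_up(), lockd_down() and svc_addsock(), and add_listener() is an invented name; the example only shows the bring-up-then-unwind shape.

```c
#include <errno.h>

/* Counts how many times the fake service is currently held up. */
static int service_users;

static int fake_service_up(void)
{
    service_users++;
    return 0;
}

static void fake_service_down(void)
{
    service_users--;
}

/* Pretend to add a socket; fails for an invalid descriptor. */
static int fake_add_socket(int fd)
{
    return fd < 0 ? -EBADF : fd;
}

/* Bring the service up first, then add the socket. If adding the
 * socket fails, only the service reference needs to be dropped. */
static int add_listener(int fd)
{
    int err = fake_service_up();

    if (err)
        return err;

    err = fake_add_socket(fd);
    if (err < 0)
        fake_service_down();
    return err;
}
```

With this order there is never a just-opened socket to look up and close on failure, which is why the name lookup becomes unnecessary.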
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Refactor transport name listing out of __write_ports() to
make it easier to understand and maintain.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
User space must call listen(3) on SOCK_STREAM sockets passed into
/proc/fs/nfsd/portlist, otherwise that listener is ignored. Document
this.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Refactor the socket creation logic out of __write_ports() to
make it easier to understand and maintain.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Refactor the socket closing logic out of __write_ports() to
make it easier to understand and maintain.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Refactor transport addition out of __write_ports() to make
it easier to understand and maintain.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: Refactor transport removal out of __write_ports() to make it
easier to understand and maintain.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If we encode the time of client creation into the stateid instead of the
time of server boot, then we can determine whether that stateid is from
a previous instance of the server, or from a client that has expired,
and return an appropriate error to the client.
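The distinction can be sketched like this. The enum, the function classify_stateid() and the plain integer comparison are assumptions for illustration; real code would also have to cope with time wraparound and would look the client up rather than take a flag.

```c
#include <stdint.h>

enum stateid_status { STATEID_OK, STATEID_STALE, STATEID_EXPIRED };

/* stamp:        client creation time encoded in the stateid
 * boot_time:    time at which this server instance started
 * client_known: nonzero if the server still has state for the client */
static enum stateid_status classify_stateid(uint32_t stamp,
                                            uint32_t boot_time,
                                            int client_known)
{
    /* Created before this boot: issued by a previous server instance. */
    if (stamp < boot_time)
        return STATEID_STALE;

    /* Created during this boot, but the client is gone: it expired. */
    if (!client_known)
        return STATEID_EXPIRED;

    return STATEID_OK;
}
```

If the stateid carried only the boot time, the second case would be indistinguishable from a valid stateid until the client lookup failed for an unknown reason.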
Signed-off-by: Bian Naimeng <biannm@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
For every lock request lockd creates a new file_lock object
in nlmsvc_setgrantargs() by copying the passed in file_lock with
locks_copy_lock(). A filesystem can attach its own lock_operations
vector to the file_lock. It has to be cleaned up at the end of the
file_lock's life. However, lockd doesn't do it today, yet it
asserts in nlmclnt_release_lockargs() that the per-filesystem
state is clean.
This patch fixes it by exporting locks_release_private() and adding
it to nlmsvc_freegrantargs(), to be symmetrical to creating a
file_lock in nlmsvc_setgrantargs().
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix btrfs fallocate oops and deadlock
Btrfs: use the right node in reada_for_balance
Btrfs: fix oops on page->mapping->host during writepage
Btrfs: add a priority queue to the async thread helpers
Btrfs: use WRITE_SYNC for synchronous writes
This fixes the following BUG:
# mount -o size=MM -t hugetlbfs none /huge
hugetlbfs: Bad value 'MM' for mount option 'size=MM'
------------[ cut here ]------------
kernel BUG at fs/super.c:996!
Due to
BUG_ON(!mnt->mnt_sb);
in vfs_kern_mount().
Also, remove unused #include <linux/quotaops.h>
Cc: William Irwin <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Btrfs fallocate was incorrectly starting a transaction with a lock held
on the extent_io tree for the file, which could deadlock. Strictly
speaking it was using join_transaction which would be safe, but it is better
to move the transaction outside of the lock.
When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
being called on an unlocked buffer. This was triggering an assertion and
oops because the lock is supposed to be held.
The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
been run. btrfs_del_item takes care of dirtying things, so the solution is
to skip the btrfs_mark_buffer_dirty call in this case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Fix page_mkwrite() return code
GFS2: Clear dirty bit at end of inode glock sync
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
reiserfs: fix j_last_flush_trans_id type
fs: Mark get_filesystem_list() as __init function.
kill vfs_stat_fd / vfs_lstat_fd
Separate out common fstatat code into vfs_fstatat
ecryptfs: use memdup_user()
ncpfs: use memdup_user()
xfs: use memdup_user()
sysfs: use memdup_user()
btrfs: use memdup_user()
xattr: use memdup_user()
autofs4: use memchr() in invalid_string()
Documentation/filesystems: remove out of date reference to BKL being held
Fix i_mutex vs. readdir handling in nfsd
fs/compat_ioctl: fix build when !BLOCK
Fix autofs_expire()
No need for crossing to mountpoint in audit_tag_tree()
Safer nfsd_cross_mnt()
Touch all affected namespaces on propagation of mount
Fix AUTOFS_DEV_IOCTL_REQUESTER_CMD
Commit ae46141ff0 (NFSv3: Fix posix ACL code)
introduces a bug in the calculation of the XDR header iovec. In the case
where we are inlining the acls, we need to adjust the length of the iovec
req->rq_svec, in addition to adjusting the total buffer length.
Tested-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"int get_filesystem_list(char * buf)" is called by only
"static void __init get_fs_names(char *page)".
We can mark get_filesystem_list() as "__init".
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There's really no reason to keep vfs_stat_fd and vfs_lstat_fd with
Oleg's vfs_fstatat. Use vfs_fstatat for the few cases having the
directory fd, and switch all others to vfs_stat / vfs_lstat.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is a version incorporating Christoph's suggestion.
Separate out common *fstatat functionality into a single function
instead of duplicating it all over the code.
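A user-space analogy of the same idea, using the POSIX fstatat(2) call: one common helper carries the logic, and the stat and lstat variants differ only in the arguments they pass. The wrapper names are made up for the example and this is not the kernel's vfs_fstatat().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>

/* Single common entry point; all variants funnel through here. */
static int common_fstatat(int dfd, const char *path, struct stat *st,
                          int flags)
{
    return fstatat(dfd, path, st, flags);
}

/* stat(): relative to the current directory, follow symlinks. */
static int my_stat(const char *path, struct stat *st)
{
    return common_fstatat(AT_FDCWD, path, st, 0);
}

/* lstat(): same, but do not follow a trailing symlink. */
static int my_lstat(const char *path, struct stat *st)
{
    return common_fstatat(AT_FDCWD, path, st, AT_SYMLINK_NOFOLLOW);
}
```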
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>