block layout can just make use of generic read/write_done.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also avoid unnecessary lock_page if page is handled by others.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It does not need to manipulate on partial initialized blocks.
Writeback code takes care of it.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
One bio can have at most BIO_MAX_PAGES pages. We should limit it bec otherwise
bio_alloc will fail when there are many pages in one read/write_pagelist.
Cc: <stable@vger.kernel.org> #3.1+
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
bl_free_block_dev() may sleep. We can not call it with spinlock held.
Besides, there is no need to take bm_lock as we are last user freeing bm_devlist.
Cc: <stable@vger.kernel.org> #3.1+
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is wrong to kmalloc in _add_entry() as it is inside
spinlock. memory should be already allocated _add_entry() is called.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To pass the IO status to upper layer.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When calling _add_entry, we should take the im_lock to protect
agains other modifiers.
Cc: <stable@vger.kernel.org> #3.1+
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
compile in nfs-for-3.3 branch shows following warnings. Fix it here.
fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’:
fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 4 has type ‘size_t’
fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 6 has type ‘size_t’
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
On 32 bit, if n is too large then "n * sizeof(*args->devs)" could
overflow and args->devs would be smaller than expected.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
port is supposed to be a __be16 here. The existing code should work
fine, but this is a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This function returns a big endian value. The implementation in
fs/nfs/callback_proc.c is declared with "__be32" but the .h file uses
"unsigned" instead. It makes sparse complain:
fs/nfs/callback_proc.c:232:8: error:
symbol 'nfs4_callback_layoutrecall' redeclared with different
type (originally declared at fs/nfs/callback.h:165) - different
base types
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Change the default setting of the nfs4_disable_idmapping parameter
NFSv4: Save the owner/group name string when doing open
NFS: Remove pNFS bloat from the generic write path
pnfs-obj: Must return layout on IO error
pnfs-obj: pNFS errors are communicated on iodata->pnfs_error
NFS: Cache state owners after files are closed
NFS: Clean up nfs4_find_state_owners_locked()
NFSv4: include bitmap in nfsv4 get acl data
nfs: fix a minor do_div portability issue
NFSv4.1: cleanup comment and debug printk
NFSv4.1: change nfs4_free_slot parameters for dynamic slots
NFSv4.1: cleanup init and reset of session slot tables
NFSv4.1: fix backchannel slotid off-by-one bug
nfs: fix regression in handling of context= option in NFSv4
NFS - fix recent breakage to NFS error handling.
NFS: Retry mounting NFSROOT
SUNRPC: Clean up the RPCSEC_GSS service ticket requests
Now that the use of numeric uids/gids is officially sanctioned in
RFC3530bis, it is time to change the default here to 'enabled'.
By doing so, we ensure that NFSv4 copies the behaviour of NFSv3 when we're
using the default AUTH_SYS authentication (i.e. when the client uses the
numeric uids/gids as authentication tokens), so that when new files are
created, they will appear to have the correct user/group.
It also fixes a number of backward compatibility issues when migrating
from NFSv3 to NFSv4 on a platform where the server uses different uid/gid
mappings than the client.
Note also that this setting has been successfully tested against servers
that do not support numeric uids/gids at several Connectathon/Bakeathon
events at this point, and the fall back to using string names/groups has
been shown to work well in all those test cases.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
PM / Hibernate: Implement compat_ioctl for /dev/snapshot
PM / Freezer: fix return value of freezable_schedule_timeout_killable()
PM / shmobile: Allow the A4R domain to be turned off at run time
PM / input / touchscreen: Make st1232 use device PM QoS constraints
PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
PM / shmobile: Remove the stay_on flag from SH7372's PM domains
PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
PM: Drop generic_subsys_pm_ops
PM / Sleep: Remove forward-only callbacks from AMBA bus type
PM / Sleep: Remove forward-only callbacks from platform bus type
PM: Run the driver callback directly if the subsystem one is not there
PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
PM / Sleep: Merge internal functions in generic_ops.c
PM / Sleep: Simplify generic system suspend callbacks
PM / Hibernate: Remove deprecated hibernation snapshot ioctls
PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
ARM: S3C64XX: Implement basic power domain support
PM / shmobile: Use common always on power domain governor
...
Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
...so that we can do the uid/gid mapping outside the asynchronous RPC
context.
This fixes a bug in the current NFSv4 atomic open code where the client
isn't able to determine what the true uid/gid fields of the file are,
(because the asynchronous nature of the OPEN call denies it the ability
to do an upcall) and so fills them with default values, marking the
inode as needing revalidation.
Unfortunately, in some cases, the VFS will do some additional sanity
checks on the file, and may override the server's decision to allow
the open because it sees the wrong owner/group fields.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We have no business doing any this in the standard write release path.
Get rid of it, and put it in the pNFS layer.
Also, while we're at it, get rid of the completely bogus unlock/relock
semantics that were present in nfs_writeback_release_full(). It is
not only unnecessary, but actually dangerous to release the write lock
just in order to take it again in nfs_page_async_flush(). Better just
to open code the pgio operations in a pnfs helper.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As mandated by the standard. In case of an IO error, a pNFS
objects layout driver must return it's layout. This is because
all device errors are reported to the server as part of the
layout return buffer.
This is implemented the same way PNFS_LAYOUTRET_ON_SETATTR
is done, through a bit flag on the pnfs_layoutdriver_type->flags
member. The flag is set by the layout driver that wants a
layout_return preformed at pnfs_ld_{write,read}_done in case
of an error.
(Though I have not defined a wrapper like pnfs_ld_layoutret_on_setattr
because this code is never called outside of pnfs.c and pnfs IO
paths)
Without this patch 3.[0-2] Kernels leak memory and have an annoying
WARN_ON after every IO error utilizing the pnfs-obj driver.
[This patch is for 3.2 Kernel. 3.1/0 Kernels need a different patch]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some time along the way pNFS IO errors were switched to
communicate with a special iodata->pnfs_error member instead
of the regular RPC members. But objlayout was not switched
over.
Fix that!
Without this fix any IO error is hanged, because IO is not
switched to MDS and pages are never cleared or read.
[Applies to 3.2.0. Same bug different patch for 3.1/0 Kernels]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Servers have a finite amount of memory to store NFSv4 open and lock
owners. Moreover, servers may have a difficult time determining when
they can reap their state owner table, thanks to gray areas in the
NFSv4 protocol specification. Thus clients should be careful to reuse
state owners when possible.
Currently Linux is not too careful. When a user has closed all her
files on one mount point, the state owner's reference count goes to
zero, and it is released. The next OPEN allocates a new one. A
workload that serially opens and closes files can run through a large
number of open owners this way.
When a state owner's reference count goes to zero, slap it onto a free
list for that nfs_server, with an expiry time. Garbage collect before
looking for a state owner. This makes state owners for active users
available for re-use.
Now that there can be unused state owners remaining at umount time,
purge the state owner free list when a server is destroyed. Also be
sure not to reclaim unused state owners during state recovery.
This change has benefits for the client as well. For some workloads,
this approach drops the number of OPEN_CONFIRM calls from the same as
the number of OPEN calls, down to just one. This reduces wire traffic
and thus open(2) latency. Before this patch, untarring a kernel
source tarball shows the OPEN_CONFIRM call counter steadily increasing
through the test. With the patch, the OPEN_CONFIRM count remains at 1
throughout the entire untar.
As long as the expiry time is kept short, I don't think garbage
collection should be terribly expensive, although it does bounce the
clp->cl_lock around a bit.
[ At some point we should rationalize the use of the nfs_server
->destroy method. ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Trond: Fixed a garbage collection race and a few efficiency issues]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There's no longer a need to check the so_server field in the state
owner, because nowadays the RB tree we search for state owners
contains owners for that only server.
Make nfs4_find_state_owners_locked() use the same tree searching logic
as nfs4_insert_state_owner_locked().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 bitmap size is unbounded: a server can return an arbitrary
sized bitmap in an FATTR4_WORD0_ACL request. Replace using the
nfs4_fattr_bitmap_maxsz as a guess to the maximum bitmask returned by a server
with the inclusion of the bitmap (xdr length plus bitmasks) and the acl data
xdr length to the (cached) acl page data.
This is a general solution to commit e5012d1f "NFSv4.1: update
nfs4_fattr_bitmap_maxsz" and fixes hitting a BUG_ON in xdr_shrink_bufhead
when getting ACLs.
Fix a bug in decode_getacl that returned -EINVAL on ACLs > page when getxattr
was called with a NULL buffer, preventing ACL > PAGE_SIZE from being retrieved.
Cc: stable@kernel.org
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This change modifies filelayout_get_dense_offset() to use the functions
in math64.h and thus avoid a 32-bit platform compile error trying to
use do_div() on an s64 type.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We are either initializing or resetting a session. Initialize or reset
the session slot tables accordingly.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Setting the security context of a NFSv4 mount via the context= mount
option is currently broken. The NFSv4 codepath allocates a parsed
options struct, and then parses the mount options to fill it. It
eventually calls nfs4_remote_mount which calls security_init_mnt_opts.
That clobbers the lsm_opts struct that was populated earlier. This bug
also looks like it causes a small memory leak on each v4 mount where
context= is used.
Fix this by moving the initialization of the lsm_opts into
nfs_alloc_parsed_mount_data. Also, add a destructor for
nfs_parsed_mount_data to make it easier to free all of the allocations
hanging off of it, and to ensure that the security_free_mnt_opts is
called whenever security_init_mnt_opts is.
I believe this regression was introduced quite some time ago, probably
by commit c02d7adf.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
From c6d615d2b97fe305cbf123a8751ced859dca1d5e Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Wed, 16 Nov 2011 09:39:05 +1100
Subject: [PATCH] NFS - fix recent breakage to NFS error handling.
commit 02c24a8218 made a small and
presumably unintended change to write error handling in NFS.
Previously an error from filemap_write_and_wait_range would only be of
interest if nfs_file_fsync did not return an error. After this commit,
an error from filemap_write_and_wait_range would mean that (the rest of)
nfs_file_fsync would not even be called.
This means that:
1/ you are more likely to see EIO than e.g. EDQUOT or ENOSPC.
2/ NFS_CONTEXT_ERROR_WRITE remains set for longer so more writes are
synchronous.
This patch restores previous behaviour.
Cc: stable@kernel.org
Cc: Josef Bacik <josef@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of hacking specific service names into gss_encode_v1_msg, we should
just allow the caller to specify the service name explicitly.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's not needed anymore; we used to, back when we had to do
mount_subtree() by hand, complete with put_mnt_ns() in it.
No more... Apparmor didn't need it since the __d_path() fix.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need to duplicate them in both callers; make it return
ERR_PTR(-ENOMEM) on allocation failure instead of NULL and
it'll be able to report rpc_lookup_cred() failures just
fine. Callers are much happier that way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* master: (848 commits)
SELinux: Fix RCU deref check warning in sel_netport_insert()
binary_sysctl(): fix memory leak
mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
ipmi_watchdog: restore settings when BMC reset
oom: fix integer overflow of points in oom_badness
memcg: keep root group unchanged if creation fails
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
nilfs2: unbreak compat ioctl
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mmc: vub300: fix type of firmware_rom_wait_states module parameter
Revert "mmc: enable runtime PM by default"
mmc: sdhci: remove "state" argument from sdhci_suspend_host
x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
IB/qib: Correct sense on freectxts increment and decrement
RDMA/cma: Verify private data length
cgroups: fix a css_set not found bug in cgroup_attach_proc
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
...
Conflicts:
kernel/cgroup_freezer.c
After commit 06222e491e (fs: handle
SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek)
the behaviour of llseek() was changed so that it always revalidates
the file size. The bug appears to be due to a logic error in the
afore-mentioned commit, which always evaluates to 'true'.
Reported-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.1]
...and report the servers that try to return a delegation when the client
is using the CLAIM_DELEG_CUR open mode. That behaviour is explicitly
forbidden in RFC3530.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are currently 2 places in the state recovery code, where we do not
take sufficient precautions before accessing the state->lock_states. In
both cases, we should be holding the state->state_lock.
Reported-by: Pascal Bouchareine <pascal@gandi.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allow the freezer to skip wait_on_bit_killable sleeps in the sunrpc
layer. This should allow suspend and hibernate events to proceed, even
when there are RPC's pending on the wire.
Also, wrap the TASK_KILLABLE sleeps in NFS layer in freezer_do_not_count
and freezer_count calls. This allows the freezer to skip tasks that are
sleeping while looping on EJUKEBOX or NFS4ERR_DELAY sorts of errors.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Currently, the code assumes that the SEQUENCE status bits are mutually
exclusive. They are not...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 2.6.34]
If we handled an error condition, then nfs4_recovery_handle_error should
return '0' so that the state recovery thread can continue.
Also ensure that nfs4_check_lease() continues to abort if we haven't got
any credentials by having it return ENOKEY (which is not handled).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Revert pnfs ugliness from the generic NFS read code path
SUNRPC: destroy freshly allocated transport in case of sockaddr init error
NFS: Fix a regression in the referral code
nfs: move nfs_file_operations declaration to bottom of file.c (try #2)
nfs: when attempting to open a directory, fall back on normal lookup (try #5)
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).
btrfs and nfs switched to it instead of open-coding the sucker.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pNFS-specific code belongs in the pnfs layer. It should not be
hijacking generic NFS read or write code paths.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a regression that was introduced by commit
0c2e53f11a (NFS: Remove the unused
"lookupfh()" version of nfs4_proc_lookup()).
In the case where the lookup gets an NFS4ERR_MOVED, we want to return
the result of nfs4_get_referral(). Instead, that value is getting
clobbered by the call to nfs4_handle_exception()...
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Tested-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return
from an OPEN call. Prior to that patch, that caused the client to fall
back to doing a normal lookup. When that patch went in, the code began
returning that error to userspace. The d_revalidate codepath however
never had the corresponding change, so it was still possible to end up
with a NULL ctx->state pointer after that.
That patch caused a regression. When we attempt to open a directory that
does not have a cached dentry, that open now errors out with EISDIR. If
you attempt the same open with a cached dentry, it will succeed.
Fix this by reverting the change in nfs_atomic_lookup and allowing
attempts to open directories to fall back to a normal lookup
Also, add a NFSv4-specific f_ops->open routine that just returns
-ENOTDIR. This should never be called if things are working properly,
but if it ever is, then the dprintk may help in debugging.
To facilitate this, a new file_operations field is also added to the
nfs_rpc_ops struct.
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This service should not be registered with or unregistered from rpcbind.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The ore need suplied a r4w_get_page/r4w_put_page API
from Filesystem so it can get cache pages to read-into when
writing parial stripes.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Finally remove all the old raid engine, which is by now
dead code.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In this patch we are actually moving to the ORE.
(Object Raid Engine).
objio_state holds a pointer to an ore_io_state. Once
we have an ore_io_state at hand we can call the ore
for reading/writing. We register on the done path
to kick off the nfs io_done mechanism.
Again for Ease of reviewing the old code is "#if 0"
but is not removed so the diff command works better.
The old code will be removed in the next patch.
fs/exofs/Kconfig::ORE is modified to also be auto-included
if PNFS_OBJLAYOUT is set. Since we now depend on ORE.
(See comments in fs/exofs/Kconfig)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For Ease of reviewing I split the move to ore into 3 parts
move to ore 01: ore_layout & ore_components
move to ore 02: move to ORE
move to ore 03: Remove old raid engine
This patch modifies the objio_lseg, layout-segment level
and devices and components arrays to use the ORE types.
Though it will be removed soon, also the raid engine
is modified to actually compile, possibly run, with
the new types. So it is the same old raid engine but
with some new ORE types.
For Ease of reviewing, some of the old code is
"#if 0" but is not removed so the diff command works
better. The old code will be removed in the 3rd patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* All instances of objlayout_io_state => objlayout_io_res
* All instances of state => oir;
* All instances of ol_state => oir;
Big but nothing to it
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is part of moving objio_osd to use the ORE.
objlayout_io_state had two functions:
1. It was used in the error reporting mechanism at layout_return.
This function is kept intact.
(Later patch will rename objlayout_io_state => objlayout_io_res)
2. Carrier of rw io members into the objio_read/write_paglist API.
This is removed in this patch.
The {r,w}data received from NFS are passed directly to the
objio_{read,write}_paglist API. The io_engine is now allocating
it's own IO state as part of the read/write. The minimal
functionality that was part of the generic allocation is passed
to the io_engine.
So part of this patch is rename of:
ios->ol_state.foo => ios->foo
At objlayout_{read,write}_done an objlayout_io_state is passed that
denotes the result of the IO. (Hence the later name change).
If the IO is successful objlayout calls an objio_free_result() API
immediately (Which for objio_osd causes the release of the io_state).
If the IO ended in an error it is hanged onto until reported in
layout_return and is released later through the objio_free_result()
API. (All this is not new just renamed and cleaned)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
objlayout driver was always returning PNFS_ATTEMPTED from it's
read/write_pagelist operations. Even on error. Fix that.
Start by establishing an error return API from io-engine, by
not returning ssize_t (length-or-error) but returning "int"
0=OK, 0>Error. And clean up all return types in io-engine.
Then if io-engine returned error return PNFS_NOT_ATTEMPTED
to generic layer. (With a dprint)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The EOF calculation was done on .read_pagelist(), cached
in objlayout_io_state->eof, and set in objlayout_read_done()
into nfs_read_data->res.eof.
So set it directly into nfs_read_data->res.eof and avoid
the extra member.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When CONFIG_NFS=y and CONFIG_NFS_V3_{,V4}=n we get the following warning.
fs/nfs/write.c: In function ‘nfs_writeback_done’:
fs/nfs/write.c:1246:21: warning: unused variable ‘server’
Remove the variable 'server' to fix the above warning.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Replace direct i_nlink updates with the respective updater function
(inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Some files were using the complete module.h infrastructure without
actually including the header at all. Fix them up in advance so
once the implicit presence is removed, we won't get failures like this:
CC [M] fs/nfsd/nfssvc.o
fs/nfsd/nfssvc.c: In function 'nfsd_create_serv':
fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function)
fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once
fs/nfsd/nfssvc.c:335: error: for each function it appears in.)
fs/nfsd/nfssvc.c: In function 'nfsd':
fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit'
make[3]: *** [fs/nfsd/nfssvc.o] Error 1
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Clean up: the first parameter of nfs_create_request() has been
incorrectly documented since time immemorial (OK, since before
2.6.12).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
craa_type_mask is bitmap4 per RFC5661. We need to expect a length before
extracting bitmap value.
Cc: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Current pnfs_layoutcommit_inode can not handle parallel layoutcommit.
And as Trond suggested , there is no need for client to optimize for
parallel layoutcommit. So add NFS_INO_LAYOUTCOMMITTING flag to
mark inflight layoutcommit and serialize lalyoutcommit with it.
Also mark_inode_dirty_sync if pnfs_layoutcommit_inode fails to issue
layoutcommit.
Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
leases: fix write-open/read-lease race
nfs: drop unnecessary locking in llseek
ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
vfs: add generic_file_llseek_size
vfs: do (nearly) lockless generic_file_llseek
direct-io: merge direct_io_walker into __blockdev_direct_IO
direct-io: inline the complete submission path
direct-io: separate map_bh from dio
direct-io: use a slab cache for struct dio
direct-io: rearrange fields in dio/dio_submit to avoid holes
direct-io: fix a wrong comment
direct-io: separate fields only used in the submission path from struct dio
vfs: fix spinning prevention in prune_icache_sb
vfs: add a comment to inode_permission()
vfs: pass all mask flags check_acl and posix_acl_permission
vfs: add hex format for MAY_* flag values
vfs: indicate that the permission functions take all the MAY_* flags
compat: sync compat_stats with statfs.
vfs: add "device" tag to /proc/self/mountstats
cleanup: vfs: small comment fix for block_invalidatepage
...
Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
This makes NFS follow the standard generic_file_llseek locking scheme.
Cc: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The i_mutex lock use of generic _file_llseek hurts. Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.
Under high utilization this can cause llseek to scale very poorly on larger
systems.
This patch does some rethinking of the llseek locking model:
First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.
Let's look at the different seek variants:
SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.
For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now. The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.
=> Don't need a lock.
SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.
Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET. On non atomic 32bit it's the same
as SEEK_SET.
=> Don't need a lock, but need to use i_size_read()
SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.
=> So still need a lock, but can use a f_lock.
This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (26 commits)
Check validity of cl_rpcclient in nfs_server_list_show
NFS: Get rid of the nfs_rdata_mempool
NFS: Don't rely on PageError in nfs_readpage_release_partial
NFS: Get rid of unnecessary calls to ClearPageError() in read code
NFS: Get rid of nfs_restart_rpc()
NFS: Get rid of the unused nfs_write_data->flags field
NFS: Get rid of the unused nfs_read_data->flags field
NFSv4: Translate NFS4ERR_BADNAME into ENOENT when applied to a lookup
NFS: Remove the unused "lookupfh()" version of nfs4_proc_lookup()
NFS: Use the inode->i_version to cache NFSv4 change attribute information
SUNRPC: Remove unnecessary export of rpc_sockaddr2uaddr
SUNRPC: Fix rpc_sockaddr2uaddr
nfs/super.c: local functions should be static
pnfsblock: fix writeback deadlock
pnfsblock: fix NULL pointer dereference
pnfs: recoalesce when ld read pagelist fails
pnfs: recoalesce when ld write pagelist fails
pnfs: make _set_lo_fail generic
pnfsblock: add missing rpc_put_mount and path_put
SUNRPC/NFS: make rpc pipe upcall generic
...
* 'for-3.2' of git://linux-nfs.org/~bfields/linux: (103 commits)
nfs41: implement DESTROY_CLIENTID operation
nfsd4: typo logical vs bitwise negate for want_mask
nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED
nfsd4: seq->status_flags may be used unitialized
nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid
nfsd4: implement new 4.1 open reclaim types
nfsd4: remove unneeded CLAIM_DELEGATE_CUR workaround
nfsd4: warn on open failure after create
nfsd4: preallocate open stateid in process_open1()
nfsd4: do idr preallocation with stateid allocation
nfsd4: preallocate nfs4_file in process_open1()
nfsd4: clean up open owners on OPEN failure
nfsd4: simplify process_open1 logic
nfsd4: make is_open_owner boolean
nfsd4: centralize renew_client() calls
nfsd4: typo logical vs bitwise negate
nfs: fix bug about IPv6 address scope checking
nfsd4: more robust ignoring of WANT bits in OPEN
nfsd4: move name-length checks to xdr
nfsd4: move access/deny validity checks to xdr code
...
As soon as the nfs_client gets created, its cl_rpcclient is set to
ERR_PTR(-EINVAL). The rpc client structure is allocated later. Check
if the client is ready before using the cl_rpcclient pointer.
Signed-off-by: Malahal Naineni <malahal@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't rely on the PageError flag to tell us if one of the partial reads of
the page failed. Instead, replace that with a dedicated flag in the
struct nfs_page.
Then clean out redundant uses of the PageError flag: the VM no longer
checks it for reads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both LOOKUP and OPEN operations may return NFS4ERR_BADNAME if we send a
an invalid name as a filename argument. As far as the application is
concerned, it just has to know that the file doesn't exist, and so
ENOENT would be the appropriate reply. We should only return EINVAL
if the filename is being used to _create_ a new object on the
remote filesystem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
commit ae50c0b5 "pnfs: client stats" added additional information to
the output of /proc/self/mountstats. The new functions introduced are
only used in this file and should be marked static.
If CONFIG_NFS_V4_1 is not defined, empty stub functions are used. If
CONFIG_NFS_V4 is not defined these stub functions are not used at all.
Adding static for the functions results in compile warnings:
fs/nfs/super.c:743: warning: 'show_sessions' defined but not used
fs/nfs/super.c:756: warning: 'show_pnfs' defined but not used
Fix this by adding a #ifdef CONFIG_NFS_V4 guard around the two
show_ functions.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
bl_add_page_to_bio returns error pointer. bio should be reset to
NULL in failure cases as the out path always calls bl_submit_bio.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For pnfs pagelist read failure, we need to pg_recoalesce and resend IO to
mds.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For pnfs pagelist write failure, we need to pg_recoalesce and resend IO to
mds.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
file layout and block layout both use it to set mark layout io failure
bit. So make it generic.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The same function is used by idmap, gss and blocklayout code. Make it
generic.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the status field explicitly 32 bits. "...it's unlikely that the kernel
and userspace would differ on the size of an int here, but it might be a
good idea to go ahead and make that explicitly 32 bits in case we end up
dealing with more exotic arches at some point in the future."
Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Always return PTR_ERR, not NULL, from nfs4_blk_get_deviceinfo and
nfs4_blk_decode_device.
Check for IS_ERR, not NULL, in bl_set_layoutdriver when calling
nfs4_blk_get_deviceinfo.
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@kernel.org [3.0]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_find_and_lock_request will take a reference to the nfs_page and
will then put it if the req is already locked. It's possible though
that the reference will be the last one. That put then can kick off
a whole series of reference puts:
nfs_page
nfs_open_context
dentry
inode
If the inode ends up being deleted, then the VFS will call
truncate_inode_pages. That function will try to take the page lock, but
it was already locked when migrate_page was called. The code
deadlocks.
Fix this by simply refusing the migration request if PagePrivate is
already set, indicating that the page is already associated with an
active read or write request.
We've had a customer test a backported version of this patch and
the preliminary results seem good.
Cc: stable@kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The result from ipv6_addr_scope() always not be a single SCOPE,
so we can't use equal to compare the result with IPV6_ADDR_SCOPE_LINKLOCAL
at nfs_sockaddr_match_ipaddr6.
This patch fixs the problem, and lets checking address before scope_id.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
commit 420e3646 allowed the kernel to reduce the number of unnecessary
commit calls by skipping the commit when there are a large number of
outstanding pages.
However, the current test in nfs_commit_unstable_pages does not handle
the edge condition properly. When ncommit == 0, then that means that the
kernel doesn't need to do anything more for the inode. The current test
though in the WB_SYNC_NONE case will return true, and the inode will end
up being marked dirty. Once that happens the inode will never be clean
until there's a WB_SYNC_ALL flush.
Fix this by immediately returning from nfs_commit_unstable_pages when
ncommit == 0.
Mike noticed this problem initially in RHEL5 (2.6.18-based kernel) which
has a backported version of 420e3646. The inode cache there was growing
very large. The inode cache was unable to be shrunk since the inodes
were all marked dirty. Calling sync() would essentially "fix" the
problem -- the WB_SYNC_ALL flush would result in the inodes all being
marked clean.
What I'm not clear on is how big a problem this is in mainline kernels
as the writeback code there is very different. Either way, it seems
incorrect to re-mark the inode dirty in this case.
Reported-by: Mike McLean <mikem@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org [2.6.34+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This reverts commit b80c3cb628.
The reverted commit was rendered obsolete by a VFS fix: commit
5547e8aac6 (writeback: Update dirty flags in
two steps). We now no longer need to worry about writeback_single_inode()
missing our marking the inode for COMMIT in 'do_writepages()' call.
Reverting this patch, fixes a performance regression in which the inode
would continuously get queued to the dirty list, causing the writeback
code to unnecessarily try to send a COMMIT.
Signed-off-by: Trond Myklebust <Trond.Myklebust>
Tested-by: Simon Kirby <sim@hostway.ca>
Cc: stable@kernel.org [2.6.35+]
The result from ipv6_addr_scope() is a set of flags, not a single value,
so we can't just compare the result with IPV6_ADDR_SCOPE_LINKLOCAL.
This patch fixs the problem, and checks for unequal addresses before
scope_id.
Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The concensus seems to be that system calls such as stat() etc should
not trigger an automount. Neither should the l* versions.
This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
that _should_ trigger an automount on the last path element.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
force automounting for their own reasons - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not allow multiple mounts on same mountpoint when using -o noac
When you normally attempt to mount a share twice on the same mountpoint,
a check in do_add_mount causes it to return an error
# mount localhost:/nfsv3 /mnt
# mount localhost:/nfsv3 /mnt
mount.nfs: /mnt is already mounted or busy
However when using the option 'noac', the user is able to mount the same
share on the same mountpoint multiple times. This happens because a
share mounted with the noac option is automatically assigned the 'sync'
flag MS_SYNCHRONOUS in nfs_initialise_sb(). This flag is set after the
check for already existing superblocks is done in sget(). The check for
the mount flags in nfs_compare_mount_options() does not take into
account the 'sync' flag applied later on in the code path. This means
that when using 'noac', a new superblock structure is assigned for every
new mount of the same share and multiple shares on the same mountpoint
are allowed.
ie.
# mount -onoac localhost:/nfsv3 /mnt
can be run multiple times.
The patch checks for noac and assigns the sync flag before sget() is
called to obtain an already existing superblock structure.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a typo which causes an Oops in the RPC layer, when using wsize < 4k.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Sricharan R <r.sricharan@ti.com>
The NFSv4 spec does not specify that the server must repeat that error,
so in order to avoid having the delegations revoked, we should handle
it immediately.
Also note that NFS4ERR_CB_PATH_DOWN does in fact renew the lease...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
RFC3530 states that if the client holds a delegation, then it is obliged
to continue to send RENEW calls once every lease period in order to allow
the server to return NFS4ERR_CB_PATH_DOWN if the callback path is
unreachable.
This is not required for NFSv4.1, since the server can at any time set
the SEQ4_STATUS_CB_PATH_DOWN_SESSION in any SEQUENCE operation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use NUMA aware allocations to reduce latencies and increase throughput.
sunrpc kthreads can use kthread_create_on_node() if pool_mode is
"percpu" or "pernode", and svc_prepare_thread()/svc_init_buffer() can
also take into account NUMA node affinity for memory allocations.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Neil Brown <neilb@suse.de>
CC: David Miller <davem@davemloft.net>
Reviewed-by: Greg Banks <gnb@fastmail.fm>
[bfields@redhat.com: fix up caller nfs41_callback_up]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Return NFS4ERR_BADSESSION to callbacks during session resets
NFSv4.1: Fix the callback 'highest_used_slotid' behaviour
pnfs-obj: Fix the comp_index != 0 case
pnfs-obj: Bug when we are running out of bio
nfs: add missing prefetch.h include
Just like files-layout, blocks & objects layouts are part of the
NFS 4.1 protocol and should be automatically selected if NFS_4_1
is selected. The small problem is that these depend on other
Kernel support being present, while files only depends on NFS
itself.
This patch removes from the user choice the presence of objects
and blocks layout. But makes sure these are selected only if
the depended subsystems are present in the Kernel.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
pnfs layout drivers. Seperate it out so others can still build when
BLK_DEV_DM/MD is not enabled.
Also change select to depends on to avoid build failures.
Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Acked-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the client is in the process of resetting the session when it receives
a callback, then returning NFS4ERR_DELAY may cause a deadlock with the
DESTROY_SESSION call.
Basically, if the client returns NFS4ERR_DELAY in response to the
CB_SEQUENCE call, then the server is entitled to believe that the
client is busy because it is already processing that call. In that
case, the server is perfectly entitled to respond with a
NFS4ERR_BACK_CHAN_BUSY to any DESTROY_SESSION call.
Fix this by having the client reply with a NFS4ERR_BADSESSION in
response to the callback if it is resetting the session.
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, there is no guarantee that we will call nfs4_cb_take_slot() even
though nfs4_callback_compound() will consistently call
nfs4_cb_free_slot() provided the cb_process_state has set the 'clp' field.
The result is that we can trigger the BUG_ON() upon the next call to
nfs4_cb_take_slot().
This patch fixes the above problem by using the slot id that was taken in
the CB_SEQUENCE operation as a flag for whether or not we need to call
nfs4_cb_free_slot().
It also fixes an atomicity problem: we need to set tbl->highest_used_slotid
atomically with the check for NFS4_SESSION_DRAINING, otherwise we end up
racing with the various tests in nfs4_begin_drain_session().
Cc: stable@kernel.org [2.6.38+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There were bugs in the case of partial layout where olo_comp_index
is not zero. This used to work and was tested but one of the later
cleanup SQUASHMEs broke it and was not tested since.
Also add a dprint that specify those received layout parameters.
Everything else was already printed.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we have a situation that the number of pages we want
to encode is bigger then the size of the bio. (Which can
currently happen only when all IO is going to a single device
.e.g group_width==1) then the IO is submitted short and we
report back only the amount of bytes we actually wrote/read
and all is fine. BUT ...
There was a bug that the current length counter was advanced
before the fail to add the extra page, and we come to a situation
that the CDB length was one-page longer then the actual bio size,
which is of course rejected by the osd-target.
While here also fix the bio size calculation, in the case
that we received more then one group of devices.
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix this compile error on s390:
CC [M] fs/nfs/blocklayout/blocklayout.o
fs/nfs/blocklayout/blocklayout.c: In function 'bl_end_io_read':
fs/nfs/blocklayout/blocklayout.c:201:4: error: implicit declaration of function 'prefetchw'
Introduced with 9549ec01 "pnfsblock: bl_read_pagelist".
Cc: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
switch posix_acl_chmod() to umode_t
switch posix_acl_from_mode() to umode_t
switch posix_acl_equiv_mode() to umode_t *
switch posix_acl_create() to umode_t *
block: initialise bd_super in bdget()
vfs: avoid call to inode_lru_list_del() if possible
vfs: avoid taking inode_hash_lock on pipes and sockets
vfs: conditionally call inode_wb_list_del()
VFS: Fix automount for negative autofs dentries
Btrfs: load the key from the dir item in readdir into a fake dentry
devtmpfs: missing initialialization in never-hit case
hppfs: missing include
For invalid extents, find other pages in the same fsblock and write them out.
[pnfsblock: write_begin]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: bl_write_pagelist support functions]
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
[SQUASHME: pnfsblock: mds_offset is set in the generic layer]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fixup blksize alignment in bl_setup_layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: bl_write_pagelist adjust for missing PG_USE_PNFS]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new write_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Note: When upper layer's read/write request cannot be fulfilled, the block
layout driver shouldn't silently mark the page as error. It should do
what can be done and leave the rest to the upper layer. To do so, we
should set rdata/wdata->res.count properly.
When upper layer re-send the read/write request to finish the rest
part of the request, pgbase is the position where we should start at.
[pnfsblock: mark IO error with NFS_LAYOUT_{RW|RO}_FAILED]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: read path error handling]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: handle errors when read or write pagelist.]
Signed-off-by: Zhang Jingwang <yyalone@gmail.com>
[pnfs-block: use new read_pagelist api]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: introduce bl_committing list]
Signed-off-by: Peng Tao <peng_tao@emc.com>
[pnfsblock: SQUASHME: adjust to API change]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: cleanup_layoutcommit wants a status parameter]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In blocklayout driver. There are two things happening
while layoutcommit/cleanup.
1. the modified extents are encoded.
2. On cleanup the extents are put back on the layout rw
extents list, for reads.
In the new system where actual xdr encoding is done in
encode_layoutcommit() directly into xdr buffer, these are
the new commit stages:
1. On setup_layoutcommit, the range is adjusted as before
and a structure is allocated for communication with
bl_encode_layoutcommit && bl_cleanup_layoutcommit
(Generic layer provides a void-star to hang it on)
2. bl_encode_layoutcommit is called to do the actual
encoding directly into xdr. The commit-extent-list is not
freed and is stored on above structure.
FIXME: The code is not yet converted to the new XDR cleanup
3. On cleanup the commit-extent-list is put back by a call
to set_to_rw() as before, but with no need for XDR decoding
of the list as before. And the commit-extent-list is freed.
Finally allocated structure is freed.
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[blocklayout: encode_layoutcommit implementation]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfsblock: fix bug setting up layoutcommit.]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
[pnfsblock: prevent commit list corruption]
[pnfsblock: fix layoutcommit with an empty opaque]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Adds working implementations of various support functions
to handle INVAL extents, needed by writes, such as
bl_mark_sectors_init and bl_is_sector_init.
[pnfsblock: fix 64-bit compiler warnings for extent manipulation]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[Implement release_inval_marks]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement bl_find_get_extent(), one of the core extent manipulation
routines.
[pnfsblock: Lookup list entry of layouts and tags in reverse order]
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
pnfsblock: fix print format warnings for sector_t and size_t
gcc spews warnings about these on x86_64, e.g.:
fs/nfs/blocklayout/blocklayout.c:74: warning: format ‘%Lu’ expects type ‘long long unsigned int’, but argument 2 has type ‘sector_t’
fs/nfs/blocklayout/blocklayout.c:388: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘size_t’
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
XDR decodes the block layout payload sent in LAYOUTGET result, storing
the result in an extent list.
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix bug getting pnfs_layout_type in translate_devid().]
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Call GETDEVICELIST during mount, then call and parse GETDEVICEINFO
for each device returned.
[pnfsblock: get rid of deprecated xdr macros]
Signed-off-by: Jim Rees <rees@umich.edu>
[pnfsblock: fix pnfs_deviceid references]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: fix print format warnings for sector_t and size_t]
[pnfs-block: #include <linux/vmalloc.h>]
[pnfsblock: no PNFS_NFS_SERVER]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfsblock: fix bug determining size of striped volume]
[pnfsblock: fix oops when using multiple devices]
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[pnfsblock: get rid of vmap and deviceid->area structure]
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace a stub, so that extents underlying the layouts are properly
added, merged, or ignored as necessary.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
[pnfsblock: delete the new node before put it]
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This gives layout driver a chance to cleanup structures they put in at
encode_layoutcommit.
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for layoutcommit]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
[rm inode and pnfs_layout_hdr args from cleanup_layoutcommit()]
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
To allow layout driver to issue getdevicelist at mount time, and clean up
at umount time.
[fixup non NFS_V4_1 set_pnfs_layoutdriver definition]
[pnfs: pass mntfh down the init_pnfs path]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Using NFS4_MAX_UINT64 will break current protocol.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There can be multiple lseg per file, so layoutcommit should be
able to handle it.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need to save it for every lseg.
No need to save it at every pnfs_set_layoutcommit.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need to save it for every lseg.
[Needed in v3.0]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Jim Rees <rees@umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Print out the name of the file that triggers the cookie loop message to
make it slightly easier to track down the cause.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the directory contents change, then we have to accept that the
file->f_pos value may shrink if we do a 'search-by-cookie'. In that
case, we should turn off the loop detection and let the NFS client
try to recover.
The patch also fixes a second loop detection bug by ensuring
that after turning on the ctx->duped flag, we read at least one new
cookie into ctx->dir_cookie before attempting to match with
ctx->dup_cookie.
Reported-by: Petr Vandrovec <petr@vandrovec.name>
Cc: stable@kernel.org [2.6.39+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'nfs-for-3.1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (44 commits)
NFSv4: Don't use the delegation->inode in nfs_mark_return_delegation()
nfs: don't use d_move in nfs_async_rename_done
RDMA: Increasing RPCRDMA_MAX_DATA_SEGS
SUNRPC: Replace xprt->resend and xprt->sending with a priority queue
SUNRPC: Allow caller of rpc_sleep_on() to select priority levels
SUNRPC: Support dynamic slot allocation for TCP connections
SUNRPC: Clean up the slot table allocation
SUNRPC: Initalise the struct xprt upon allocation
SUNRPC: Ensure that we grab the XPRT_LOCK before calling xprt_alloc_slot
pnfs: simplify pnfs files module autoloading
nfs: document nfsv4 sillyrename issues
NFS: Convert nfs4_set_ds_client to EXPORT_SYMBOL_GPL
SUNRPC: Convert the backchannel exports to EXPORT_SYMBOL_GPL
SUNRPC: sunrpc should not explicitly depend on NFS config options
NFS: Clean up - simplify the switch to read/write-through-MDS
NFS: Move the pnfs write code into pnfs.c
NFS: Move the pnfs read code into pnfs.c
NFS: Allow the nfs_pageio_descriptor to signal that a re-coalesce is needed
NFS: Use the nfs_pageio_descriptor->pg_bsize in the read/write request
NFS: Cache rpc_ops in struct nfs_pageio_descriptor
...
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
nfs_mark_return_delegation() is usually called without any locking, and
so it is not safe to dereference delegation->inode. Since the inode is
only used to discover the nfs_client anyway, it makes more sense to
have the callers pass a valid pointer to the nfs_server as a parameter.
Reported-by: Ian Kent <raven@themaw.net>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the task that initiated the sillyrename ends up being killed by a
fatal signal, then it will eventually return back to userspace and end
up releasing the i_mutex. d_move however needs to be done while holding
the i_mutex.
Instead of using d_move here, just unhash the old and new dentries to
prevent them from being found by lookups. With this change though, the
dentries are now incorrect post-rename and do not reflect the actual
name of the file on the server. I'm proceeding under the assumption
that since they are unhashed that this isn't really a problem.
In order for the sillydelete to still work though, the dname must be
copied earlier when setting up the sillydelete info, and the name must
be recopied if the sillydelete info has to be moved to a new dentry.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error. All callers of posix_acl_create_masq()
switched.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
we just return -EINVAL, in others we do the normal generic thing, and in others
we're simply making sure that the properly due-dilligence is done. For example
in NFS/CIFS we need to make sure the file size is update properly for the
SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
that is all we have to do. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE)
b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0
c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN;
otherwise we need to issue CLOSE, lest we leak stateid on server.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Embed the necessary alias into the module rather than waiting for
someone to add it to /etc/modprobe.conf
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somebody working on this code asked what the deal was with NFSv4, since
this comment notes that it's v2/v3's statelessness that requires
sillyrename. Shouldn't hurt to document the answer.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Change explicit references to CONFIG_NFS_V4_1 to implicit ones
Get rid of the unnecessary defines in backchannel_rqst.c and
bc_svc.c: the Makefile takes care of those dependency.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use nfs_pageio_reset_read_mds and nfs_pageio_reset_write_mds instead of
completely reinitialising the struct nfs_pageio_descriptor.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
differences in block sizes when falling back to write through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalese to take into account differences in
block sizes when falling back to read through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If an attempt to do pNFS fails, and we have to fall back to writing through
the MDS, then we may want to re-coalesce the requests that we already have
since the block size for the MDS read/writes may be different to that of
the DS read/writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Instead of looking up the rsize and wsize, the routines that generate the
RPC requests should really be using the pg_bsize, since that is what we
use when deciding whether or not to coalesce write requests...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix use of static variable in rpcb_getport_async
NFSv4.1: update nfs4_fattr_bitmap_maxsz
SUNRPC: Fix a race between work-queue and rpc_killall_tasks
pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs
Split them up into two parts: one which sets up the struct nfs_read/write_data,
the other which sets up the actual RPC call or pNFS call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're writing back data, and the FLUSH_STABLE flag is set, then we
always want to use NFS_FILE_SYNC, since we're always in a situation where
we're doing page reclaim, and so we want to free up the page as quickly
as possible.
If we're in the FLUSH_COND_STABLE case, then we either want to use another
unstable write (if we have to do a commit anyway) or again, we want to
use NFS_FILE_SYNC because we know that we have no more pages to write
out.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Mark all deviceids established under an expired MDS clientid as invalid.
Stop all new i/o through DS and send through the MDS.
Don't use any new LAYOUTGETs that use the invalid deviceid. Purge all layouts
established under the expired MDS clientid.
Remove the MDS clientid deviceid and data servers reference
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since we take a reference to it, we really ought to pass the a pointer to
the layout header in the arguments instead of assuming that
NFS_I(inode)->layout will forever point to the correct object.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ask for whole file layouts. Until support for layout segments is fully
supported in the file layout code, discard non-whole file layouts.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The fact that the global device id cache holds a reference to the
nfs4_deviceid_node until it is invisible to rcu lookups implies that
we can always assume that the reference count is non-zero in
_find_get_deviceid.
Also clean up nfs4_put_deviceid_node and the removal of the device id
from the cache.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to ensure that the layouts are set up before we can decide to
coalesce requests. To do so, we want to further split up the struct
nfs_pageio_descriptor operations into an initialisation callback, a
coalescing test callback, and a 'do i/o' callback.
This patch cleans up the existing callback methods before adding the
'initialisation' callback.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When recovering open files and locks, the stateid should be tested
against the server and freed if it is invalid. This patch adds new
recovery functions for NFS v4.1.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
FREE_STATEID is used to tell the server that we want to free a stateid
that no longer has any locks associated with it. This allows the client
to reclaim locks without encountering edge conditions documented in
section 8.4.3 of RFC 5661.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds in the xdr for doing a TEST_STATEID call with a single
stateid. RFC 5661 allows multiple stateids to be tested in a single
call, but only testing one keeps things simpler for now.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the client is using NFS v4.1, then we can use SECINFO_NO_NAME to find
the secflavor for the initial mount. If the server doesn't support
SECINFO_NO_NAME then I fall back on the "guess and check" method used
for v4.0 mounts.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Layouts should be tracked per nfs_server (aka superblock)
instead of per struct nfs_client, which may have multiple FSIDs associated
with it.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
can be skipped if the "eir_server_scope" from the exchange_id proc differs from
previous calls.
Also, in the future server_scope will be useful for determining whether client
trunking is available
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't just use the first addr in the multipath list - instead, loop
over addresses when calling nfs4_set_ds_client() (which calls connect)
until it is successful.
Although this is not real multipath support, it's a quick fix to handle when
an MDS sends a list of addresses for a DS and some of the addr families are
unsupported or misconfigured (like no routable ipv6 addr assigned).
This will attempt all paths to the DS before giving up, instead of immediately
falling back to the MDS.
As before, an error encountered after a successful connect() will cause all
i/o to fall back to the MDS.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This parses and stores all addresses associated with each data server,
laying the groundwork for supporting multipath to data servers.
- Skips over addresses that cannot be parsed (ie IPv6 addrs if v6 is not
enabled). Only fails if none of the addresses are recognizable
- Currently only uses the first address that parsed cleanly
- Tested against pynfs server (modified to support multipath)
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Handle ipv6 remote addresses from GETDEVICEINFO
- supports netid "tcp" for ipv4 and "tcp6" for ipv6 as rfc 5665 specifies
- added ds_remotestr to avoid having to handle different AFs in every dprintk
- tested against pynfs 4.1 server, submitting ipv6 support patch to pynfs
- tested with IPv6 disabled, it compiles cleanly and relies on rpc_pton to
refuse to accept IPv6 addresses
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attribute IDs assigned in RFC 5661 now require three bitmaps.
Fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs.
Signed-off-by: Andy Adamson <andros@netapp.com>
Cc:stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add an FS-Cache helper to bulk uncache pages on an inode. This will
only work for the circumstance where the pages in the cache correspond
1:1 with the pages attached to an inode's page cache.
This is required for CIFS and NFS: When disabling inode cookie, we were
returning the cookie and setting cifsi->fscache to NULL but failed to
invalidate any previously mapped pages. This resulted in "Bad page
state" errors and manifested in other kind of errors when running
fsstress. Fix it by uncaching mapped pages when we disable the inode
cookie.
This patch should fix the following oops and "Bad page state" errors
seen during fsstress testing.
------------[ cut here ]------------
kernel BUG at fs/cachefiles/namei.c:201!
invalid opcode: 0000 [#1] SMP
Pid: 5, comm: kworker/u:0 Not tainted 2.6.38.7-30.fc15.x86_64 #1 Bochs Bochs
RIP: 0010: cachefiles_walk_to_object+0x436/0x745 [cachefiles]
RSP: 0018:ffff88002ce6dd00 EFLAGS: 00010282
RAX: ffff88002ef165f0 RBX: ffff88001811f500 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000282
RBP: ffff88002ce6dda0 R08: 0000000000000100 R09: ffffffff81b3a300
R10: 0000ffff00066c0a R11: 0000000000000003 R12: ffff88002ae54840
R13: ffff88002ae54840 R14: ffff880029c29c00 R15: ffff88001811f4b0
FS: 00007f394dd32720(0000) GS:ffff88002ef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffcb62ddf8 CR3: 000000001825f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:0 (pid: 5, threadinfo ffff88002ce6c000, task ffff88002ce55cc0)
Stack:
0000000000000246 ffff88002ce55cc0 ffff88002ce6dd58 ffff88001815dc00
ffff8800185246c0 ffff88001811f618 ffff880029c29d18 ffff88001811f380
ffff88002ce6dd50 ffffffff814757e4 ffff88002ce6dda0 ffffffff8106ac56
Call Trace:
cachefiles_lookup_object+0x78/0xd4 [cachefiles]
fscache_lookup_object+0x131/0x16d [fscache]
fscache_object_work_func+0x1bc/0x669 [fscache]
process_one_work+0x186/0x298
worker_thread+0xda/0x15d
kthread+0x84/0x8c
kernel_thread_helper+0x4/0x10
RIP cachefiles_walk_to_object+0x436/0x745 [cachefiles]
---[ end trace 1d481c9af1804caa ]---
I tested the uncaching by the following means:
(1) Create a big file on my NFS server (104857600 bytes).
(2) Read the file into the cache with md5sum on the NFS client. Look in
/proc/fs/fscache/stats:
Pages : mrk=25601 unc=0
(3) Open the file for read/write ("bash 5<>/warthog/bigfile"). Look in proc
again:
Pages : mrk=25601 unc=25601
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-Tested-by: Suresh Jayaraman <sjayaraman@suse.de>
cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In current pnfs tree, all the layouts set mds_offset in their
.write_pagelist member.
mds_offset is only used by generic layer and should be handled by it.
This patch is for upstream. It is needed in this -rc series to fix a
bug in objects layout_commit.
I'll send patches for objects and blocks to be
squashed into current pnfs tree.
TODO: It looks like the read path needs the same patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I initially did the calculation in bytes, and not words
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
1. If the intention is to coalesce requests 'prev' and 'req' then we
have to ensure at least that we have a layout starting at
req_offset(prev).
2. If we're only requesting a minimal layout of length desc->pg_count,
we need to test the length actually returned by the server before
we allow the coalescing to occur.
3. We need to deal correctly with (pgio->lseg == NULL)
4. Fixup the test guarding the pnfs_update_layout.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise we end up overflowing the rpc buffer size on the receive end.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Many stupid corrections of duplicated includes based on the output of
scripts/checkincludes.pl.
Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Andy's last device_cache patches, already take an extra
reference on the newly inserted device_id. So we can remove it
from obj-io.
Without this patch the device_ids are leaked.
Andy's patches are not in Linus tree yet. So I'm not sure if they are
scheduled for this Kernel or the next. This patch should be added as
part of these.
CC: Andy Adamson <andros@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 7ebb9315 (NFS: use secinfo when crossing mountpoints) introduces
a regression when decoding an NFSv4 readdir entry that sets the
rdattr_error field.
By treating the resulting value as if it is a decoding error, the current
code may cause us to skip valid readdir entries.
Reported-by: Andy Adamson <andros@netapp.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
mark_matching_lsegs_invalid could put the last ref to the layout, so
the get_layout_hdr needs to be called first.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We always get a reference on the layout header and we rely on
nfs4_layoutreturn_release to put it. If we hit an allocation error
before starting the rpc proc we bail out early without dereferncing
the layout header properly.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
(d)printks should use %zd for ssize_t arguments not %ld, otherwise they might
get a warning. I see the following with MN10300.
fs/nfs/objlayout/objlayout.c: In function 'objlayout_read_done':
fs/nfs/objlayout/objlayout.c:294: warning: format '%ld' expects type 'long int', but argument 3 has type 'ssize_t'
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <Trond.Myklebust@netapp.com>
cc: linux-nfs@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The break condition to skip out of the loop got broken when cmp_layout
was change. Essentially, we want to stop looking once we know no layout
on the remainder of the list can match the first byte of the looked-up
range.
Reported-by: Peng Tao <peng_tao@emc.com>
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
_pnfs_return_layout had the following problems:
- it did not call pnfs_free_lseg_list on all paths
- it unintentionally did a forgetful return when there was no outstanding io
- it raced with concurrent LAYOUTGETS
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 28331a46d8 "Ensure we request the
ordinary fileid when doing readdirplus"
changed the meaning of NFS_ATTR_FATTR_FILEID which used to be set when
FATTR4_WORD1_MOUNTED_ON_FILED was requested.
Allow nfs_fhget to succeed with only a mounted on fileid when crossing
a mountpoint or a referral.
Ask for the fileid of the absent file system if mounted_on_fileid is not
supported.
Signed-off-by: Andy Adamson <andros@netapp.com>
cc:stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we add something to the global device id cache, we need to bump the
reference count, so that the cache itself holds a reference.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't support header padding yet so better off ditching it
Reported-by: Sid Moore <learnmost@gmail.com>
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_update_inode will update isize if there is no queued pages. For pNFS,
layoutcommit is supposed to change file size on server, the same effect as queued
pages. nfs_update_inode may be called when dirty pages are written back (nfsi->npages==0)
but layoutcommit is not sent, and it will change client file size according to server
file size. Then client ends up losing what it just writes back in pNFS path.
So we should skip updating client file size if file needs layoutcommit.
Signed-off-by: Peng Tao <peng_tao@emc.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Unmounting a pnfs filesystem hangs using filelayout and possibly others.
This fixes the use of the rcu protected node by making use of a new 'tmpnode'
for the temporary purge list. Also, the spinlock shouldn't be held when calling
synchronize_rcu().
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Remove two unused struct writeback_control fields:
.encountered_congestion (completely unused)
.nonblocking (never set, checked/showed in XFS,NFS/btrfs)
The .for_background check in nfs_write_inode() is also removed btw,
as .for_background implies WB_SYNC_NONE.
Reviewed-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Support for RPC over AF_LOCAL transports
SUNRPC: Remove obsolete comment
SUNRPC: Use AF_LOCAL for rpcbind upcalls
SUNRPC: Clean up use of curly braces in switch cases
NFS: Revert NFSROOT default mount options
SUNRPC: Rename xs_encode_tcp_fragment_header()
nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
nfs41: Correct offset for LAYOUTCOMMIT
NFS: nfs_update_inode: print current and new inode size in debug output
NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
NFSv4: Handle expired stateids when the lease is still valid
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
Implement pg_test vector to test for max IO sizes. We calculate
a max_io_size member only once, and cache it in lseg so to not
do so on every page insert.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[simplify logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
By default, unless pnfs is used coalesce pages until pg_bsize
(rsize or wsize) is reached.
pnfs layout drivers define their own pg_test methods that use
pnfs_generic_pg_test and need to define their own I/O size
limits (e.g. based on the file stripe size).
[Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Use common code for pnfs_pageio_init_{read,write} and use
a common generic pg_test function.
Note that this function always assumes the the layout driver's
pg_test method is implemented.
[Fix BUG]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Define API for io-engines to report delta_space_used in IOs
* Encode the osd-layout specific information of the layoutcommit
XDR buffer.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add a layout driver method to encode the layout type specific
opaque part of layout commit in-line in the xdr stream.
Currently, the pnfs-objects layout driver uses it to encode metadata hints
to the MDS and the blocks layout driver to commit provisionally allocated
extents to the file.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
An io_state pre-allocates an error information structure for each
possible osd-device that might error during IO. When IO is done if all
was well the io_state is freed. (as today). If the I/O has ended with an
error, the io_state is queued on a per-layout err_list. When eventually
encode_layoutreturn() is called, each error is properly encoded on the
XDR buffer and only then the io_state is removed from err_list and
de-allocated.
It is up to the io_engine to fill in the segment that fault and the type
of osd_error that occurred. By calling objlayout_io_set_result() for
each failing device.
In objio_osd:
* Allocate io-error descriptors space as part of io_state
* Use generic objlayout error reporting at end of io.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add a layout driver method to encode the layout type specific
opaque part of layout return in-line in the xdr stream.
Currently the pnfs-objects layout driver uses it to encode i/o error
information on LAYOUTRETURN.
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixup layout header pointer for encode_layoutreturn]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
With the objects layout security model, we have object capabilities
that are associated with the layout and we anticipate that the server
will issue a cb_layoutrecall for any setattr that changes security
related attributes (user/group/mode/acl) or truncates the file.
Therefore, the layout is returned before issuing the setattr to avoid
the anticipated cb_layoutrecall.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
NFSv4.1 LAYOUTRETURN implementation
Currently, does not support layout-type payload encoding.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed recaim member of nfs4_layoutreturn_args]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
With the use of the in-kernel osd library. Implement read/write
of data from/to osd-objects according to information specified
in the objects-layout.
Support for stripping over mirrors with a received stripe_unit.
There are however a few constrains which are not supported:
1. Stripe Unit must be a multiple of PAGE_SIZE
2. stripe length (stripe_unit * number_of_stripes) can not be
bigger then 32bit.
Also support raid-groups and partial-layout. Partial-layout is
when not all the groups are received on the line, addressing
only a partial range of the file.
TODO:
Only raid0! raid 4/5/6 support will come at later stage
A none supported layout will send IO through the MDS
[Important fallout from the last rebase]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Non-rpc layout driver such as for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.
[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
When a new layout is received in objio_alloc_lseg all device_ids
referenced are retrieved. The device information is queried for from MDS
and then the osd_device is looked-up from the osd-initiator library. The
devices are cached in a per-mount-point list, for later use. At unmount
all devices are "put" back to the library.
objlayout_get_deviceinfo(), objlayout_put_deviceinfo() middleware
API for retrieving device information given a device_id.
TODO: The device cache can get big. Cap its size. Keep an LRU and start
to return devices which were not used, when list gets to big, or
when new entries allocation fail.
[pnfs-obj: Bugs in new global-device-cache code]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[use global device cache]
[use layout driver in global device cache]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
objlayout_alloc_lseg prepares an xdr_stream and calls the
raid engins objio_alloc_lseg() to allocate a private
pnfs_layout_segment.
objio_osd.c::objio_alloc_lseg() uses passed xdr_stream to
decode and store the layout_segment information in an
objio_segment struct, using the pnfs_osd_xdr.h API for
the actual parsing the layout xdr.
objlayout_free_lseg calls objio_free_lseg() to free the
allocated space.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[removed "extern" from function definitions]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Add the fs/nfs/objlayout/pnfs_osd_xdr_cli.c file, which will
include the XDR encode/decode implementations for the pNFS
client objlayout driver.
[Wrong type in comments]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
* Define the PNFS_OBJLAYOUT Kconfig option in the nfs
master Kconfig file.
* Add the objlayout driver to the Kernel's Kbuild system.
* Add the fs/nfs/objlayout/Kbuild file for building the
objlayoutdriver.ko driver
* Define fs/nfs/objlayout/objio_osd.c, register the driver on module
initialization and unregister on exit.
[pnfs-obj: remove of CONFIG_PNFS fallout]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[added "unsure" clause]
[depend on NFS_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
A pNFS client auto-negotiates a lot of features (minorversion level,
pNFS layout type, etc.). This is convenient, but makes certain kinds of
failures hard for a user to detect.
For example, if the client falls back on 4.0, or falls back to MDS IO
because the user didn't connect to the right iscsi disks before
mounting, the only symptoms may be reduced performance, which may not be
noticed till long after the actual failure, and may be difficult for a
user to diagnose.
However, such "failures" may also be perfectly normal in some cases, so
we don't want to spam the system logs with them.
One approach would be to put some more information into
/proc/self/mountstats.
Signed-off-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: add commit client stats]
[fixup data types for "ret" variables in pnfs_try_to* inline funcs.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[fix definition of show_pnfs for !CONFIG_PNFS]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Fix show_sessions in the not CONFIG_NFS_V4_1 case]
There is a build error when CONFIG_NFS_V4 is set but
CONFIG_NFS_V4_1 is *not* set. show_sessions() prototype
was unbalanced between the two cases.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfs: super.c remove CONFIG_PNFS]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Add offset and count parameters to pnfs_update_layout and use them to get
the layout in the pageio path.
Order cache layout segments in the following order:
* offset (ascending)
* length (descending)
* iomode (RW before READ)
Test byte range against the layout segment in use in pnfs_{read,write}_pg_test
so not to coalesce pages not using the same layout segment.
[fix lseg ordering]
[clean up pnfs_find_lseg lseg arg]
[remove unnecessary FIXME]
[fix ordering in pnfs_insert_layout]
[clean up pnfs_insert_layout]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
pnfs deviceids are unique per server, per layout type.
struct nfs_client is currently used to distinguish deviceids from
different nfs servers, yet these may clash between different layout
types on the same server. Therefore, use the layout driver associated
with each deviceid at insertion time to look it up, unhash, or
delete it.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Note: This functionlaity is incomplete as all layout segments referring to
the 'to be removed device id' need to be reaped, and all in flight I/O drained.
[use be32 res in nfs4_callback_devicenotify]
[use nfs_client to qualify deviceid for cb_notify_deviceid]
[use global deviceid cache for CB_NOTIFY_DEVICEID]
[refactor device cache _lookup_deviceid]
[refactor device cache _find_get_deviceid]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Bug in new global-device-cache code]
[layout_driver MUST set free_deviceid_node if using dev-cache]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Use the pnfs_layoutdriver_type both as a qualifier for the deviceid,
distinguishing deviceid from different layout types on the server,
and for freeing the layout-driver allocated structure containing the
nfs4_deviceid_node.
[BUG in _deviceid_purge_client]
[layout_driver MUST set free_deviceid_node if using dev-cache]
[let ver < 4.1 compile]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[removed EXPORT_SYMBOL_GPL(nfs4_deviceid_purge_client)]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Move deviceid cache from the pnfs files layout driver to the
generic layer in preparation for the objects layout driver.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
deviceids are unique per server, per layout type.
Therefore, in the global cache in the files layout driver
deviceids from different servers may clash so we need
to qualify them with a struct nfs_client that represents
the nfs server that returned the deviceid.
Introduced in 2.6.39 commit ea8eecdd
"NFSv4.1 move deviceid cache to filelayout driver"
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Marek Belisko <marek.belisko@gmail.com> reports that recent attempts
to fix regressions in NFSROOT have broken his configuration:
> After update from 2.6.38-rc8 to 2.6.38 is mounting rootfs over nfs not possible.
> Log:
> VFS: Mounted root (nfs filesystem) on device 0:14.
> Freeing init memory: 132K
> nfs: server 10.146.1.21 not responding, still trying
> nfs: server 10.146.1.21 not responding, still trying
>
> This is never ending. I make short bisect (not too much commits
> between versions)
> and bad commit was reported: 53d4737580
>
> NFS: NFSROOT should default to "proto=udp"
>
> I've tested on mini2440 board (DM9000, static IP).
> Is there some missing option or something else to be checked?
An examination of a network trace captured during the failure shows
that the mount is actually succeeding, but that the client is not
seeing READ replies larger than 16KB. This could be a local packet
filtering issue on the client, but we didn't troubleshoot this
further because of the reported "git bisect" result.
Last fall we removed the ad hoc mount option parser in
fs/nfs/nfsroot.c in favor of using the main parser in fs/nfs/super.c
(see commit 56463e50 "NFS: Use super.c for NFSROOT mount option
parsing"). That commit changed the default NFSROOT mount options to
be the same as those employed by user space mounts.
As it turns out, these new default mount options are not tolerated by
many embedded systems. So far these problems have been due to
specific behavior of certain embedded NICs. The NFS community does
not have such hardware on hand for running tests.
Commit 53d47375 recently introduced a clean way to specify default
mount options for NFSROOT, so we can now easily restore the
traditional defaults for NFSROOT:
vers=2,udp,rsize=4096,wsize=4096
This should revert the new default NFSROOT mount options introduced
with commit 56463e50.
Tested-by: Marek Belisto <marek.belisto@open-nandra.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rcu callback nfs_free_delegation_callback() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(nfs_free_delegation_callback).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A client sends offset to MDS as it was seen by DS. As result,
file size after copy is only half of original file size in case
of 2 DS.
Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Cc: stable@kernel.org [2.6.39]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi Trond,
In nfs_update_inode debug output, print the current and new inode
size when the file size changes on the NFS server.
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the call to nfs4_schedule_session_recovery() will actually just
result in a test of the lease when what we really want is to force a
session reset.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Currently, if the server returns NFS4ERR_EXPIRED in reply to a READ or
WRITE, but the RENEW test determines that the lease is still active, we
fail to recover and end up looping forever in a READ/WRITE + RENEW death
spiral.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Change each shrinker's API by consolidating the existing parameters into
shrink_control struct. This will simplify any further features added w/o
touching each file of shrinker.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, writebacks may end up recursing back into the filesystem due to
GFP_KERNEL direct reclaims in the pnfs subsystem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prevents an infinite loop as list was never emptied.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Free the slot and resend the RPC with new session <slot#,seq#>.
For nfs4_async_handle_error, return -EAGAIN and set the task->tk_status to 0
to restart the async rpc in the rpc_restart_call_prepare state which resets
the slot.
For nfs4_handle_exception, retrying a call that uses nfs4_call_sync will
reset the slot via nfs41_call_sync_prepare.
For open/close/lock/locku/delegreturn/layoutcommit/unlink/rename/write
cachethis is true, so these operations will not trigger an
NFS4ERR_RETRY_UNCACHED_REP.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
On a remount, the VFS layer will clear the MS_SYNCHRONOUS bit on the
assumption that the flags on the mount syscall will have it set if the
remounted fs is supposed to keep it.
In the case of "noac" though, MS_SYNCHRONOUS is implied. A remount of
such a mount will lose the MS_SYNCHRONOUS flag since "sync" isn't part
of the mount options.
Reported-by: Max Matveev <makc@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When compiling, I was getting this warning:
fs/nfs/nfs4xdr.c: In function ‘decode_secinfo’:
fs/nfs/nfs4xdr.c:4839:6: warning: variable ‘status’ set but not used
[-Wunused-but-set-variable]
We were unconditionally returning 0 as long as there wasn't an error
coming out of xdr_inline_decode(). We probably want to check the error
status coming out of decode_op_hdr() and decode_secinfo_gss(), rather
than assuming that everything is OK all the time.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When readdir() returns a directory entry for the root of a mounted
filesystem, Linux follows the old convention of returning the inode
number of the covered directory (despite newer versions of POSIX declaring
that this is a bug).
To ensure this continues to work, the NFSv4 readdir implementation requests
the 'mounted-on-fileid' from the server.
However, readdirplus also needs to instantiate an inode for this entry, and
for that, we also need to request the real fileid as per this patch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The following patch ensures that we do not get permanently trapped in
the RPC layer when trying to establish a new client id or session.
This again ensures that the state manager can finish in a timely
fashion when the last filesystem to reference the nfs_client exits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a server for some reason keeps sending NFS4ERR_DELAY errors, we can end
up looping forever inside nfs4_proc_create_session, and so the usual
mechanisms for detecting if the nfs_client is dead don't work.
Fix this by ensuring that we loop inside the nfs4_state_manager thread
instead.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I only want to try other secflavors during an initial mount if
NFS4ERR_WRONGSEC is returned. nfs4_handle_exception() could
potentially map other errors to EPERM, so we should handle this
error specially for correctness.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we fail to contact the gss upcall program, then no message will
be sent to the server. The client still updated the sequence number,
however, and this lead to NFS4ERR_SEQ_MISMATCH for the next several
RPC calls.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, the state manager may continue to try recovering state forever
even after the last filesystem to reference that nfs_client has umounted.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
While decoding a secinfo reply, I store the list of supported sec
flavors on a page accessible through res->flavors. Before reading
each new flavor, I do some math to determine if there is enough
space left on this page, and I break out of my read look if there
isn't. In order to perform this check correctly, I need to use the
address of res->flavors, rather than the address of res.
When this loop was broken early I lied to the caller and told them
that the entire list had been decoded. This could lead to problems
if the caller tries to use any the garbage data claiming to be a
valid sec flavor. I fixed this by using res->flavors->num_flavors
as a counter, incrementing it every time a sec flavor is
successfully decoded.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We were always attempting sec flavor negotiation, even if the user
told us a specific sec flavor to use. If that sec flavor fails,
we should return an error rather than continuing with sec flavor
negotiation.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_lookup_root() is already configured to use either RPC_AUTH_UNIX
or a user specified flavor (through -o sec=<whatever>). We should
use this flavor first, and only attempt negotiation if it fails
with -EPERM.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS_MAX_SECFLAVORS should already take into account RPC_AUTH_UNIX
and RPC_AUTH_NULL, so we don't need to set aside extra slots
for them.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There can be an infinite loop if gss_create_upcall() is called without
the userspace program running. To prevent this, we return -EACCES if
we notice that pipe_version hasn't changed (indicating that the pipe
has not been opened).
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
mark_inode_dirty_sync() grabs the same inode lock!
race conditions between holding the lock in pnfs_set_layoutcommit() and in
mark_inode_dirty_sync() can result in a second call to pnfs_layoutcommit_inode(), but
this will be a noop as NFS_INO_LAYOUTCOMMIT won't be set in the second call
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The inode was used in an earlier version of the code, but it isn't
used anymore.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In fs/nfs/super.c::nfs_fs_mount() we test for a NULL 'data':
...
if (data == NULL || mntfh == NULL)
goto out_free_fh;
...
and then further down in the function we test 'data' again:
...
nfs_fscache_get_super_cookie(
s, data ? data->fscache_uniq : NULL, NULL);
...
this second check is just dead code since there is no way 'data' could
possibly be NULL here.
We also rely on a non-NULL 'data' in more than one location between these
two tests, further proving the point that the second test is bogus.
This patch removes the dead code.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_scan_commit() is called with the inode->i_lock held, but it then
calls __mark_inode_dirty() while still holding the lock. This causes
a deadlock.
Push the inode->i_lock into nfs_scan_commit() so it can protect only
the parts of the code it needs to and can be dropped before the call
to __mark_inode_dirty() to avoid the deadlock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Will Simoneau <simoneau@ele.uri.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC
NFS: Fix a signed vs. unsigned secinfo bug
Revert "net/sunrpc: Use static const char arrays"
When attempting an initial mount, we should only attempt other
authflavors if AUTH_UNIX receives a NFS4ERR_WRONGSEC error.
This allows other errors to be passed back to userspace programs.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
rpc_authflavor_t is cast from an unsigned int, but the
initial code tried to use it as a signed int. I fix
this by passing an rpc_authflavor_t pointer around, and
returning signed integers from functions.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix the incorrect use of igrab() inside the i_lock in NFS and Ceph‥
If we are already holding the i_lock, we have a reference to the
inode so we can safely use ihold() to gain an extra reference. This
avoids hangs due to lock recursion on the i_lock now that the
inode_lock is gone and igrab() uses the i_lock itself.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Ryan Mallon <ryan@bluewatersys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the inode scalability patches have been merged, it is no longer
safe to call igrab() under the inode->i_lock.
Now that we no longer call nfs_clear_request() until the nfs_page is
being freed, we know that we are always holding a reference to the
nfs_open_context, which again holds a reference to the path, and so
the inode cannot be freed until the last nfs_page has been removed
from the radix tree and freed.
We can therefore skip the igrab()/iput() altogether.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
changes LAYOUTGET and GETDEVICEINFO XDR parsing to:
- not use vmap, which doesn't work on incoherent archs
- use xdr_stream parsing for all xdr
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Test NFS_INO_LAYOUTCOMMIT before kzalloc
Mark inode dirty to retry LAYOUTCOMMIT on kzalloc failure.
Add comments.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When sec=<something> is not presented as a mount option,
we should attempt to determine what security flavor the
server is using.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A submount may use different security than the parent
mount does. We should figure out what sec flavor the
submount uses at mount time.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A later patch will need to perform a lookup using an
alternate client with a different security flavor.
This patch adds support for doing that on NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch changes nfs4_call_sync() from a macro into a
static inline function. As a macro, the call_sync()
function will not do any type checking and depends
on the sequence arguments always having the same name.
As a function, we get to have type checking and can
rename the arguments if we so choose.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that all the infrastructure is in place, we will do the
right thing if we remove this special casing.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to
the data server (as opposed to the MDS) and the data server WRITE
is not NFS_FILE_SYNC.
Only whole file layout support means that there is only one IOMODE_RW layout
segment.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Tested-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement all the hooks created in the previous patches.
This requires exporting quite a few functions and adding a few
structure fields.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Any COMMIT compound directed to a data server needs to have the
GETATTR calls suppressed. We here, make sure the field we are testing
(data->lseg) is set and refcounted correctly.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We create three major hooks for the pnfs code.
pnfs_mark_request_commit() is called during writeback_done from
nfs_mark_request_commit, which gives the driver an opportunity to
claim it wants control over commiting a particular req.
pnfs_choose_commit_list() is called from nfs_scan_list
to choose which list a given req should be added to, based on
where we intend to send it for COMMIT. It is up to the driver
to have preallocated list headers for each destination it may need.
pnfs_commit_list() is how the driver actually takes control, it is
used instead of nfs_commit_list().
In order to pass information between the above functions, we create
a union in nfs_page to hold a lseg (which is possible because the req is
not on any list while in transition), and add some flags to indicate
if we need to use the pnfs code.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Create a preallocated list header to hold nfs_pages for each
non-MDS COMMIT destination. Note this is not necessarily each DS,
but is basically each <DS, fh> pair.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move it up to avoid forward declaration in later patch.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Create a separate support function for later use by data server
commit code.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Create a separate support function for later use by data server
commit code.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a callback that the pnfs layout driver can use to do its own
error handling of the data server's COMMIT response.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Based on consensus reached in Feb 2011 interim IETF meeting regarding
use of LAYOUTCOMMIT, it has been decided that a NFS_DATA_SYNC return
from a WRITE to data server should not initiate a COMMIT.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Some filesystems (such as ext4) can return the same cookie value for
multiple files. If we try to start a readdir with one of these cookies,
the server will return the first file found with a cookie of the same
value. This can cause the client to enter an infinite loop.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs_opendir() created a context that held much more information than we
need for a readdir. This patch introduces a slimmed-down
nfs_open_dir_context that contains only the cookie and the cred used for
RPC operations. The new context will eventually be used to help detect
readdir loops.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're doing a search by readdir cookie, we need to ensure that the
resulting f_pos is updated. To do so, we need to update the
desc->current_index, in the same way that we do in the search by
file offset case.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
d_alloc_and_lookup() calls i_op->lookup method due to
rootfh changes his fsid.
During mount i_op of NFS root inode is set to
nfs_mountpoint_inode_operations, if rpc_ops->getroot()
and rpc_ops->getattr() return different fsid.
After that nfs_follow_remote_path() raised oops:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
stack trace:
d_alloc_and_lookup+0x4c/0x74
do_lookup+0x1e3/0x280
link_path_walk+0x12e/0xab0
nfs4_remote_get_sb+0x56/0x2c0 [nfs]
path_walk+0x67/0xe0
vfs_path_lookup+0x8e/0x100
nfs_follow_remote_path+0x16f/0x3e0 [nfs]
nfs4_try_mount+0x6f/0xd0 [nfs]
nfs_get_sb+0x269/0x400 [nfs]
vfs_kern_mount+0x8a/0x1f0
do_kern_mount+0x52/0x130
do_mount+0x20a/0x260
sys_mount+0x90/0xe0
system_call_fastpath+0x16/0x1b
So just refresh fsid, as RFC3530 doesn't specify behavior
in case of rootfh changes fsid.
Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When one of the two waits in nfs_commit_inode() is interrupted, it
returns a non-negative value, which causes nfs_wb_page() to think
that the operation was successful causing it to busy-loop rather
than exiting.
It also causes nfs_file_fsync() to incorrectly report the file as
being successfully committed to disk.
This patch fixes both problems by ensuring that we return an error
if the attempts to wait fail.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
If we're only doing a single write, and there are no other unstable
writes being queued up, we might want to just flip to using a stable
write RPC call.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
These should be spin_unlock() instead of spin_lock(). It's a typo.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (54 commits)
RPC: killing RPC tasks races fixed
xprt: remove redundant check
SUNRPC: Convert struct rpc_xprt to use atomic_t counters
SUNRPC: Ensure we always run the tk_callback before tk_action
sunrpc: fix printk format warning
xprt: remove redundant null check
nfs: BKL is no longer needed, so remove the include
NFS: Fix a warning in fs/nfs/idmap.c
Cleanup: Factor out some cut-and-paste code.
cleanup: save 60 lines/100 bytes by combining two mostly duplicate functions.
NFS: account direct-io into task io accounting
gss:krb5 only include enctype numbers in gm_upcall_enctypes
RPCRDMA: Fix FRMR registration/invalidate handling.
RPCRDMA: Fix to XDR page base interpretation in marshalling logic.
NFSv4: Send unmapped uid/gids to the server when using auth_sys
NFSv4: Propagate the error NFS4ERR_BADOWNER to nfs4_do_setattr
NFSv4: cleanup idmapper functions to take an nfs_server argument
NFSv4: Send unmapped uid/gids to the server if the idmapper fails
NFSv4: If the server sends us a numeric uid/gid then accept it
NFSv4.1: reject zero layout with zeroed stripe unit
...
* 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: bury ->get_sb()
nfs: switch NFS from ->get_sb() to ->mount()
nfs: stop mangling ->mnt_devname on NFS
vfs: new superblock methods to override /proc/*/mount{s,info}
nfs: nfs_do_{ref,sub}mount() superblock argument is redundant
nfs: make nfs_path() work without vfsmount
nfs: store devname at disconnected NFS roots
nfs: propagate devname to nfs{,4}_get_root()
The last remaining instances of ->get_sb() can be converted ->mount()
now - nothing in them uses new vfsmount anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) ->show_devname(m, mnt) - what to put into devname columns in mounts,
mountinfo and mountstats
b) ->show_path(m, mnt) - what to put into relative path column in mountinfo
Leaving those NULL gives old behaviour. NFS switched to using those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
part 3: now we have everything to get nfs_path() just by dentry -
just follow to (disconnected) root and pick the rest of the thing
there.
Start killing propagation of struct vfsmount * on the paths that
used to bring it to nfs_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
part 2: make sure that disconnected roots have corresponding mnt_devname
values stashed into them.
Have nfs*_get_root() stuff a copy of devname into ->d_fsdata of the
found root, provided that it is disconnected.
Have ->d_release() free it when dentry goes away.
Have the places where NFS uses ->d_fsdata for sillyrename (and that
can *never* happen to a disconnected root - dentry will be attached
to its parent) free old devname copies if they find those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix build failure introduced by s/freezeable/freezable/
workqueue: add system_freezeable_wq
rds/ib: use system_wq instead of rds_ib_fmr_wq
net/9p: replace p9_poll_task with a work
net/9p: use system_wq instead of p9_mux_wq
xfs: convert to alloc_workqueue()
reiserfs: make commit_wq use the default concurrency level
ocfs2: use system_wq instead of ocfs2_quota_wq
ext4: convert to alloc_workqueue()
scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
misc/iwmc3200top: use system_wq instead of dedicated workqueues
i2o: use alloc_workqueue() instead of create_workqueue()
acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
fs/aio: aio_wq isn't used in memory reclaim path
input/tps6507x-ts: use system_wq instead of dedicated workqueue
cpufreq: use system_wq instead of dedicated workqueues
wireless/ipw2x00: use system_wq instead of dedicated workqueues
arm/omap: use system_wq in mailbox
workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
Factor out some cut-and-paste code in options parsing.
Saves about 800 bytes on x86-64.
Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate two mostly duplicate functions (nfs_parse_simple_hostname()
and nfs_parse_protected_hostname()) and instead just make the calling
function (nfs_parse_devname()) do everything.
Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Account NFS direct-io reads and writes into Task I/O Accounting.
Do it before complition to handle aio.
NFS have unusual direct-io implementation,
thus accounting in generic code does not work.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new behaviour is enabled using the new module parameter
'nfs4_disable_idmapping'.
Note that if the server rejects an unmapped uid or gid, then
the client will automatically switch back to using the idmapper.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will be required in order to switch uid/gid mapping back on if the
admin has tried to disable it.
Note that we also propagate NFS4ERR_BADNAME at the same time, in order to
work around a Linux server bug.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allowing stripe_unit==0 causes the client to crash later on
when dividing by zero.
Reported-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we have access to the pointer, clear it immediately after
the put, instead of in caller.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will make it possible to clear the lseg pointer in the same
function as it is put, instead of in the caller nfs_pageio_doio().
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Allows the pnfs filelayout driver to write to the data servers.
Note that COMMIT to data servers will be implemented in a future
patch. To avoid improper behavior, for the moment any WRITE to a data
server that would also require a COMMIT to the data server is sent
NFS_FILE_SYNC.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Any WRITE compound directed to a data server needs to have the
GETATTR calls suppressed.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We grab the lseg sent in from the doio function and attach it to
each struct nfs_write_data created. This is how the lseg will be
sent to the layout driver.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add callback that pnfs layout driver can use to do its own handling
of data server WRITE response.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reorder nfs_write_rpcsetup, preparing for a pnfs entry point.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a data server is unavailable, go through MDS.
Mark the deviceid containing the data server as a negative cache entry.
Do not try to connect to any data server on a deviceid marked as a negative
cache entry. Mark any layout that tries to use the marked deviceid as failed.
Inodes with a layout marked as fails will not use the layout for I/O, and will
not perform any more layoutgets.
Inodes without a layout will still do layoutget, but the layout will get
marked immediately.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
No need for generic cache with only one user.
Keep a simple hash of deviceids in the filelayout driver.
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use our own async error handler.
Mark the layout as failed and retry i/o through the MDS on specified errors.
Update the mds_offset in nfs_readpage_retry so that a failed short-read retry
to a DS gets correctly resent through the MDS.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Attempt a pNFS file layout read by setting up the nfs_read_data struct and
calling nfs_initiate_read with the data server rpc client and the
filelayout rpc call ops.
Error handling is implemented in a subsequent patch.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Tested-by: Guo Mingyang <guomingyang@nrchpc.ac.cn>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare for filelayout_read_pagelist with helper functions that find the correct
data server, filehandle, and offset.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Tigran Mkrtchyan <tigran@anahit.desy.de>
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a data server set_client and init session following the
nfs4_set_client and nfs4_init_session convention.
Once a new nfs_client is on the nfs_client_list, the nfs_client cl_cons_state
serializes access to creating an nfs_client struct with matching properties.
Use the new nfs_get_client() that initializes new clients.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Separate the rpc run portion of nfs_read_rpcsetup into a new function
nfs_initiate_read that is called for normal NFS I/O.
Add a pNFS read_pagelist function that is called instead of nfs_intitate_read
for pNFS reads.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the pnfs_update_layout call location to nfs_pageio_do_add_request().
Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach
it to each nfs_read_data so it can be sent to the layout driver.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a pg_test layout driver hook which is used to avoid coelescing I/O across
layout stripes.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prepare put_lseg and get_lseg to be called from the pNFS I/O code.
Pull common code from pnfs_lseg_locked to call from pnfs_lseg.
Inline pnfs_lseg_locked into it's only caller.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers cannot send nfs4_proc_get_lease_time. but still need to setup
state renewal. Add the NFS_CS_CHECK_LEASE_TIME bit to indicate if the lease
time can be checked.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers not sharing a session with the mount MDS always have an empty
cl_superblocks list.
Replace the cl_superblocks empty list check to see if it is time to shut down
renewd with the NFS_CS_STOP_RENEW bit which is not set by such a data server.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Data servers require a zero stateid seqid, and there is no advantage to not
doing the same for all NFSv4.1
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now nfs_get_client returns an nfs_client ready to be used no matter if it was
found or created.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Prevents an Oops triggered by CB_LAYOUTRECALL and LAYOUTGET race on a
pnfs_layout_hdr first pnfs_layout_segment.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The return values are not used by any callers.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code was doing nothing more in either branch of the if.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The pnfs code was using throughout the lock order i_lock, cl_lock.
This conflicts with the nfs delegation code. Rework the pnfs code
to avoid taking both locks simultaneously.
Currently the code takes the double lock to add/remove the layout to a
nfs_client list, while atomically checking that the list of lsegs is
empty. To avoid this, we rely on existing serializations. When a
layout is initialized with lseg count equal zero, LAYOUTGET's
openstateid serialization is in effect, making it safe to assume it
stays zero unless we change it. And once a layout's lseg count drops
to zero, it is set as DESTROYED and so will stay at zero.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We do not need to clear the NFS_LAYOUT_BULK_RECALL, as setting it
guarantees that NFS_LAYOUT_DESTROYED will be set once any outstanding
io is finished.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The code could violate the following from RFC5661, section 12.5.3:
"Once a client has no more layouts on a file, the layout stateid is no
longer valid and MUST NOT be used."
This can occur when a layout already has a lseg, starts another
non-everlapping LAYOUTGET, and a CB_LAYOUTRECALL for the existing lseg
is processed before we hit pnfs_layout_process().
Solve by setting, each time the client has no more lsegs for a file, a
flag which blocks further use of the layout and triggers its removal.
This also fixes a second bug which occurs in the same instance as
above. If we actually use pnfs_layout_process, we add the new lseg to
the layout, but the layout has been removed from the nfs_client list
by the intervening CB_LAYOUTRECALL and will not be added back. Thus
the newly acquired lseg will not be properly returned in the event of
a subsequent CB_LAYOUTRECALL.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.
Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing". Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.
TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.
When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation. Usually user space knows how
and when to retry it. The network layer does not report a distinct
error code for this particular failure mode. Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.
Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.
To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.
root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists. This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.
Tested-by: Brian Downing <bdowning@lavos.net>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> # 2.6.37
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().
This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix bug where we currently retry the EXCHANGEID call again, eventhough
we already have a valid clientid. Instead, delay and retry the CREATE_SESSION
call.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
add kmalloc return value check in decode_and_add_ds
Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:
mkdir("DIR");
fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
write(fd, "abcdefg", 7);
close(fd);
unlink("DIR/FILE");
rmdir("DIR");
The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.
nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference get put in rpc_release task and gets queued to a workqueue.
In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.
Fix this by having rpc_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.
For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.
This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.
In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.
Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The "bad_page()" page allocator sanity check was reported recently (call
chain as follows):
bad_page+0x69/0x91
free_hot_cold_page+0x81/0x144
skb_release_data+0x5f/0x98
__kfree_skb+0x11/0x1a
tcp_ack+0x6a3/0x1868
tcp_rcv_established+0x7a6/0x8b9
tcp_v4_do_rcv+0x2a/0x2fa
tcp_v4_rcv+0x9a2/0x9f6
do_timer+0x2df/0x52c
ip_local_deliver+0x19d/0x263
ip_rcv+0x539/0x57c
netif_receive_skb+0x470/0x49f
:virtio_net:virtnet_poll+0x46b/0x5c5
net_rx_action+0xac/0x1b3
__do_softirq+0x89/0x133
call_softirq+0x1c/0x28
do_softirq+0x2c/0x7d
do_IRQ+0xec/0xf5
default_idle+0x0/0x50
ret_from_intr+0x0/0xa
default_idle+0x29/0x50
cpu_idle+0x95/0xb8
start_kernel+0x220/0x225
_sinittext+0x22f/0x236
It occurs because an skb with a fraglist was freed from the tcp
retransmit queue when it was acked, but a page on that fraglist had
PG_Slab set (indicating it was allocated from the Slab allocator (which
means the free path above can't safely free it via put_page.
We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to fill convert the passed in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in
xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer
to a page struct via virt_to_page, but the vfs allocates the buffer via
kmalloc, meaning the PG_slab bit is set. We can't create a buffer with
kmalloc and free it later in the tcp ack path with put_page, so we need
to either:
1) ensure that when we create the list of pages, no page struct has
PG_Slab set
or
2) not use a page list to send this data
Given that these buffers can be multiple pages and arbitrarily sized, I
think (1) is the right way to go. I've written the below patch to
allocate a page from the buddy allocator directly and copy the data over
to it. This ensures that we have a put_page free-able page for every
entry that winds up on an skb frag list, so it can be safely freed when
the frame is acked. We do a put page on each entry after the
rpc_call_sync call so as to drop our own reference count to the page,
leaving only the ref count taken by tcp_sendpages. This way the data
will be properly freed when the ack comes in
Successfully tested by myself to solve the above oops.
Note, as this is the result of a setacl operation that exceeded a page
of data, I think this amounts to a local DOS triggerable by an
uprivlidged user, so I'm CCing security on this as well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On recent 2.6.38-rc kernels, connectathon basic test 6 fails on
NFSv4 mounts of OpenSolaris with something like:
> ./test6: readdir
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0
> ./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors
> basic tests failed
> Tests failed, leaving /mnt/klimt mounted
> [cel@matisse cthon04]$
I narrowed the problem down to nfs4_decode_dirent() reporting that the
decode buffer had overflowed while decoding the entries for those
missing files.
verify_attr_len() assumes both it's pointer arguments reside on the
same page. When these arguments point to locations on two different
pages, verify_attr_len() can report false errors. This can happen now
that a large NFSv4 readdir result can span pages.
We have reasonably good checking in nfs4_decode_dirent() anyway, so
it should be safe to simply remove the extra checking.
At a guess, this was introduced by commit 6650239a, "NFS: Don't use
vm_map_ram() in readdir".
Cc: stable@kernel.org [2.6.37]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the decoding of NFSv4 directory entries slightly more efficient
by:
1. Avoiding unnecessary byte swapping when checking XDR booleans,
and
2. Not bumping "p" when its value will be immediately replaced by
xdr_inline_decode()
This commit makes nfs4_decode_dirent() consistent with similar logic
in the other two decode_dirent() functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There is no reason to be freeing the delegation cred in the rcu callback,
and doing so is resulting in a lockdep complaint that rpc_credcache_lock
is being called from both softirq and non-softirq contexts.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
As stated in section 2.4 of RFC 5661, subsequent instances of the client need
to present the same co_ownerid. Concatinate the client's IP dot address,
host name, and the rpc_auth pseudoflavor to form the co_ownerid.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>