* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Avoid IO in udf_clear_inode
udf: Try harder when looking for VAT inode
udf: Fix compilation with UDFFS_DEBUG enabled
It is not very good to do IO in udf_clear_inode. First, VFS does not really
expect inode to become dirty there and thus we have to write it ourselves,
second, memory reclaim gets blocked waiting for IO when it does not really
expect it, third, the IO pattern (e.g. on umount) resulting from writes in
udf_clear_inode is bad and it slows down writing a lot.
The reason why UDF needed to do IO in udf_clear_inode is that UDF standard
mandates extent length to exactly match inode size. But when we allocate
extents to a file or directory, we don't really know what exactly the final
file size will be and thus temporarily set it to block boundary and later
truncate it to exact length in udf_clear_inode. Now, this is changed to
truncate to final file size in udf_release_file for regular files. For
directories and symlinks, we do the truncation at the moment when we learn
what the final file size will be.
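A rough user-space illustration of the size bookkeeping described above
(not the UDF code itself): while the final size is unknown the extent
length is kept at a block boundary, and once the size is known the excess
is trimmed so that the extent length matches the inode size exactly. The
helper names are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extent length while the final size is unknown,
 * i.e. the file size rounded up to the next block boundary. */
static uint64_t provisional_extent_len(uint64_t size, uint32_t blocksize)
{
    assert(blocksize != 0);
    return (size + blocksize - 1) / blocksize * blocksize;
}

/* Hypothetical helper: number of bytes to trim from the provisional
 * extent so that its length exactly matches the inode size. */
static uint64_t bytes_to_trim(uint64_t size, uint32_t blocksize)
{
    return provisional_extent_len(size, blocksize) - size;
}
```

For a 5000-byte file on 2048-byte blocks, the provisional length is 6144
and 1144 bytes are trimmed when the final size becomes known.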
Signed-off-by: Jan Kara <jack@suse.cz>
Some disks do not contain VAT inode in the last recorded block as required
by the standard but a few blocks earlier (or the number of recorded blocks
is wrong). So look for the VAT inode a bit before the end of the media.
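The search strategy can be sketched as a bounded backwards scan. This is
only an illustration under assumptions: the predicate callback, the
max_tries limit and all names below are hypothetical, and the toy
predicate stands in for actually reading a block and checking whether it
holds a VAT inode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate type: does this block hold a VAT inode? */
typedef bool (*is_vat_fn)(uint32_t block, void *ctx);

/* Toy predicate for demonstration: ctx points at the "right" block. */
static bool block_equals(uint32_t block, void *ctx)
{
    return block == *(uint32_t *)ctx;
}

/* Scan backwards from last_block, examining at most max_tries blocks and
 * never going below first_block. On success store the block number in
 * *found and return true. */
static bool find_vat_block(uint32_t first_block, uint32_t last_block,
                           unsigned int max_tries, is_vat_fn is_vat,
                           void *ctx, uint32_t *found)
{
    uint32_t block = last_block;

    assert(first_block <= last_block);
    for (unsigned int i = 0; i < max_tries; i++) {
        if (is_vat(block, ctx)) {
            *found = block;
            return true;
        }
        if (block == first_block)
            break;
        block--;
    }
    return false;
}
```

Keeping the number of tries small bounds the extra reads on media that
have no VAT near the end at all.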
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://neil.brown.name/md: (27 commits)
md: add 'recovery_start' per-device sysfs attribute
md: rcu_read_lock() walk of mddev->disks in md_do_sync()
md: integrate spares into array at earliest opportunity.
md: move compat_ioctl handling into md.c
md: revise Kconfig help for MD_MULTIPATH
md: add MODULE_DESCRIPTION for all md related modules.
raid: improve MD/raid10 handling of correctable read errors.
md/raid10: print more useful messages on device failure.
md/bitmap: update dirty flag when bitmap bits are explicitly set.
md: Support write-intent bitmaps with externally managed metadata.
md/bitmap: move setting of daemon_lastrun out of bitmap_read_sb
md: support updating bitmap parameters via sysfs.
md: factor out parsing of fixed-point numbers
md: support bitmap offset appropriate for external-metadata arrays.
md: remove needless setting of thread->timeout in raid10_quiesce
md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
md: move offset, daemon_sleep and chunksize out of bitmap structure
md: collect bitmap-specific fields into one structure.
md/raid1: add takeover support for raid5->raid1
md: add honouring of suspend_{lo,hi} to raid1.
...
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (75 commits)
NFS: Fix nfs_migrate_page()
rpc: remove unneeded function parameter in gss_add_msg()
nfs41: Invoke RECLAIM_COMPLETE on all new client ids
SUNRPC: IS_ERR/PTR_ERR confusion
NFSv41: Fix a potential state leakage when restarting nfs4_close_prepare
nfs41: Handle NFSv4.1 session errors in the delegation recall code
nfs41: Retry delegation return if it failed with session error
nfs41: Handle session errors during delegation return
nfs41: Mark stateids in need of reclaim if state manager gets stale clientid
NFS: Fix up the declaration of nfs4_restart_rpc when NFSv4 not configured
nfs41: Don't clear DRAINING flag on NFS4ERR_STALE_CLIENTID
nfs41: nfs41_setup_state_renewal
NFSv41: More cleanups
NFSv41: Fix up some bugs in the NFS4CLNT_SESSION_DRAINING code
NFSv41: Clean up slot table management
NFSv41: Fix nfs4_proc_create_session
nfs41: Invoke RECLAIM_COMPLETE
nfs41: RECLAIM_COMPLETE functionality
nfs41: RECLAIM_COMPLETE XDR functionality
Cleanup some NFSv4 XDR decode comments
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
m68k: rename global variable vmalloc_end to m68k_vmalloc_end
percpu: add missing per_cpu_ptr_to_phys() definition for UP
percpu: Fix kdump failure if booted with percpu_alloc=page
percpu: make misc percpu symbols unique
percpu: make percpu symbols in ia64 unique
percpu: make percpu symbols in powerpc unique
percpu: make percpu symbols in x86 unique
percpu: make percpu symbols in xen unique
percpu: make percpu symbols in cpufreq unique
percpu: make percpu symbols in oprofile unique
percpu: make percpu symbols in tracer unique
percpu: make percpu symbols under kernel/ and mm/ unique
percpu: remove some sparse warnings
percpu: make alloc_percpu() handle array types
vmalloc: fix use of non-existent percpu variable in put_cpu_var()
this_cpu: Use this_cpu_xx in trace_functions_graph.c
this_cpu: Use this_cpu_xx for ftrace
this_cpu: Use this_cpu_xx in nmi handling
this_cpu: Use this_cpu operations in RCU
this_cpu: Use this_cpu ops for VM statistics
...
Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c
The RAID ioctls are only implemented in md.c, so the
handling for them should also be moved there from
fs/compat_ioctl.c.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Andre Noll <maan@systemlinux.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (151 commits)
powerpc: Fix usage of 64-bit instruction in 32-bit altivec code
MAINTAINERS: Add PowerPC patterns
powerpc/pseries: Track previous CPPR values to correctly EOI interrupts
powerpc/pseries: Correct pseries/dlpar.c build break without CONFIG_SMP
powerpc: Make "intspec" pointers in irq_host->xlate() const
powerpc/8xx: DTLB Miss cleanup
powerpc/8xx: Remove DIRTY pte handling in DTLB Error.
powerpc/8xx: Start using dcbX instructions in various copy routines
powerpc/8xx: Restore _PAGE_WRITETHRU
powerpc/8xx: Add missing Guarded setting in DTLB Error.
powerpc/8xx: Fixup DAR from buggy dcbX instructions.
powerpc/8xx: Tag DAR with 0x00f0 to catch buggy instructions.
powerpc/8xx: Update TLB asm so it behaves as linux mm expects.
powerpc/8xx: Invalidate non present TLBs
powerpc/pseries: Serialize cpu hotplug operations during deactivate Vs deallocate
pseries/pseries: Add code to online/offline CPUs of a DLPAR node
powerpc: stop_this_cpu: remove the cpu from the online map.
powerpc/pseries: Add kernel based CPU DLPAR handling
sysfs/cpu: Add probe/release files
powerpc/pseries: Kernel DLPAR Infrastructure
...
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (58 commits)
tty: split the lock up a bit further
tty: Move the leader test in disassociate
tty: Push the bkl down a bit in the hangup code
tty: Push the lock down further into the ldisc code
tty: push the BKL down into the handlers a bit
tty: moxa: split open lock
tty: moxa: Kill the use of lock_kernel
tty: moxa: Fix modem op locking
tty: moxa: Kill off the throttle method
tty: moxa: Locking clean up
tty: moxa: rework the locking a bit
tty: moxa: Use more tty_port ops
tty: isicom: fix deadlock on shutdown
tty: mxser: Use the new locking rules to fix setserial properly
tty: mxser: use the tty_port_open method
tty: isicom: sort out the board init logic
tty: isicom: switch to the new tty_port_open helper
tty: tty_port: Add a kref object to the tty port
tty: istallion: tty port open/close methods
tty: stallion: Convert to the tty_port_open/close methods
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
ext3: Fix data / filesystem corruption when write fails to copy data
ext4: Support for 64-bit quota format
ext3: Support for vfsv1 quota format
quota: Implement quota format with 64-bit space and inode limits
quota: Move definition of QFMT_OCFS2 to linux/quota.h
ext2: fix comment in ext2_find_entry about return values
ext3: Unify log messages in ext3
ext2: clear uptodate flag on super block I/O error
ext2: Unify log messages in ext2
ext3: make "norecovery" an alias for "noload"
ext3: Don't update the superblock in ext3_statfs()
ext3: journal all modifications in ext3_xattr_set_handle
ext2: Explicitly assign values to on-disk enum of filetypes
quota: Fix WARN_ON in lookup_one_len
const: struct quota_format_ops
ubifs: remove manual O_SYNC handling
afs: remove manual O_SYNC handling
kill wait_on_page_writeback_range
vfs: Implement proper O_SYNC semantics
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: Fix error return for fallocate() on XFS
xfs: cleanup dmapi macros in the umount path
xfs: remove incorrect sparse annotation for xfs_iget_cache_miss
xfs: kill the STATIC_INLINE macro
xfs: uninline xfs_get_extsz_hint
xfs: rename xfs_attr_fetch to xfs_attr_get_int
xfs: simplify xfs_buf_get / xfs_buf_read interfaces
xfs: remove IO_ISAIO
xfs: Wrapped journal record corruption on read at recovery
xfs: cleanup data end I/O handlers
xfs: use WRITE_SYNC_PLUG for synchronous writeout
xfs: reset the i_iolock lock class in the reclaim path
xfs: I/O completion handlers must use NOFS allocations
xfs: fix mmap_sem/iolock inversion in xfs_free_eofblocks
xfs: simplify inode teardown
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (27 commits)
Driver core: fix race in dev_driver_string
Driver Core: Early platform driver buffer
sysfs: sysfs_setattr remove unnecessary permission check.
sysfs: Factor out sysfs_rename from sysfs_rename_dir and sysfs_move_dir
sysfs: Propagate renames to the vfs on demand
sysfs: Gut sysfs_addrm_start and sysfs_addrm_finish
sysfs: In sysfs_chmod_file lazily propagate the mode change.
sysfs: Implement sysfs_getattr & sysfs_permission
sysfs: Nicely indent sysfs_symlink_inode_operations
sysfs: Update s_iattr on link and unlink.
sysfs: Fix locking and factor out sysfs_sd_setattr
sysfs: Simplify iattr time assignments
sysfs: Simplify sysfs_chmod_file semantics
sysfs: Use dentry_ops instead of directly playing with the dcache
sysfs: Rename sysfs_d_iput to sysfs_dentry_iput
sysfs: Update sysfs_setxattr so it updates secdata under the sysfs_mutex
debugfs: fix create mutex racy fops and private data
Driver core: Don't remove kobjects in device_shutdown.
firmware_class: make request_firmware_nowait more useful
Driver-Core: devtmpfs - set root directory mode to 0755
...
devpts_get_tty() assumes that the inode passed in is associated with a valid
pty. But if the only reference to the pty is via a bind-mount, the inode
passed to devpts_get_tty() while valid, would refer to a pty that no longer
exists.
With a lot of debug effort, Grzegorz Nosek developed a small program (see
below) to reproduce a crash on recent kernels. This crash is a regression
introduced by the commit:
commit 527b3e4773
Author: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Date: Mon Oct 13 10:43:08 2008 +0100
To fix, ensure that the dentry associated with the inode has not yet been
deleted/unhashed by devpts_pty_kill().
See also:
https://lists.linux-foundation.org/pipermail/containers/2009-July/019273.html
tty-bug.c:
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/signal.h>
#include <unistd.h>
#include <stdio.h>
#include <linux/fs.h>

void dummy(int sig)
{
}

static int child(void *unused)
{
    int fd;

    signal(SIGINT, dummy); signal(SIGHUP, dummy);
    pause(); /* cheesy synchronisation to wait for /dev/pts/0 to appear */
    mount("/dev/pts/0", "/dev/console", NULL, MS_BIND, NULL);
    sleep(2);
    fd = open("/dev/console", O_RDWR);
    dup(0); dup(0);
    write(1, "Hello world!\n", sizeof("Hello world!\n")-1);
    return 0;
}

int main(void)
{
    pid_t pid;
    char *stack;
    int fd;

    stack = malloc(16384);
    pid = clone(child, stack+16384, CLONE_NEWNS|SIGCHLD, NULL);
    fd = open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK);
    unlockpt(fd); grantpt(fd);
    sleep(2);
    kill(pid, SIGHUP);
    sleep(1);
    return 0; /* exit before child opens /dev/console */
}
Reported-by: Grzegorz Nosek <root@localdomain.pl>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Noticed that through glibc fallocate would return 28 rather than -1
and errno = 28 for ENOSPC. The xfs routines use XFS_ERROR format
positive return error codes while the syscalls use negative return
codes. Fixup the two cases in xfs_vn_fallocate syscall to convert to
negative.
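To illustrate why the sign matters, here is a simplified user-space model,
not the XFS or glibc code. It assumes the usual Linux convention that a
raw syscall return in the range [-4095, -1] is turned into -1 plus errno
by the C library, while any other value is passed through unchanged. All
function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an internal routine that reports failure as
 * a positive error code and success as 0. */
static int internal_alloc_space(int out_of_space)
{
    return out_of_space ? ENOSPC : 0;
}

/* Method called on behalf of a syscall: negate at the boundary so the
 * caller sees the negative-errno convention. */
static long syscall_method(int out_of_space)
{
    int error = internal_alloc_space(out_of_space);

    assert(error >= 0);
    return -error;
}

/* Simplified model of a libc syscall wrapper. */
static long libc_style_wrapper(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

Passing the positive code straight through makes the wrapper return
ENOSPC as if it were a valid result and leaves errno untouched; negating
it first yields -1 with errno set to ENOSPC.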
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
Stop the flag saving as we never mangle those in the unmount path, and
hide all the weird arguments to the dmapi code inside the
XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
xfs_iget_cache_miss does not get called with the pag_ici_lock held, so
the __releases annotation is incorrect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Remove our own STATIC_INLINE macro. For small functions inside
implementation files just use STATIC and let gcc inline it, and for
those in headers do the normal static inline - they are all small
enough to be inlined for debug builds, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
This function is too large to efficiently be inlined.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Using a totally different name for the low-level get operation does
not fit the _int convention used in the rest of the attr code, so
rename it.
While we're at it also fix the prototype to use the normal convention
and mark it static as it's never used outside of xfs_attr.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently the low-level buffer cache interfaces are highly confusing
as we have a _flags variant of each that does actually respect the
flags, and one without _flags which has a flags argument that gets
ignored and overridden with a default set. Given that very few places
use the default arguments get rid of the duplication and convert all
callers to pass the flags explicitly. Also remove the now confusing
_flags postfix.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
We set the IO_ISAIO flag for all read/write I/O since early Linux
2.6.x. Remove it as it has lost its purpose long ago.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
Summary of problem:
If a journal record wraps at the physical end of the journal, it has to be
read in two parts in xlog_do_recovery_pass(): a read at the physical end and a
read at the physical beginning. If xlog_bread() has to re-align the first
read, the second read request does not take that re-alignment into account.
If the first read was re-aligned, the second read over-writes the end of the
data from the first read, effectively corrupting it. This can happen either
when reading the record header or reading the record data.
The first sanity check in xlog_recover_process_data() is to check for a valid
clientid, so that is the error reported.
Summary of fix:
If there was a first read at the physical end, XFS_BUF_PTR() returns where the
data was requested to begin. Conversely, because it is the result of
xlog_align(), offset indicates where the requested data for the first read
actually begins - whether or not xlog_bread() has re-aligned it.
Using offset as the base for the calculation of where to place the second read
data ensures that it will be correctly placed immediately following the data
from the first read instead of sometimes over-writing the end of it.
The attached patch has resolved the reported problem of occasional inability
to recover the journal (reporting "bad clientid").
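The placement problem can be reproduced with a small user-space model.
This is a sketch under assumptions, not the xlog code: SECT is a made-up
alignment, read_aligned stands in for a read that may be widened down to
an alignment boundary, and the returned pointer plays the role of offset.
The buffer passed in must have room for the record plus SECT bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECT 4  /* hypothetical device alignment, in bytes */

/* Read len bytes at pos into buf. The read is widened down to a SECT
 * boundary, so the requested data may begin after the start of buf; the
 * returned pointer says where it actually begins. */
static char *read_aligned(const char *log, size_t pos, size_t len, char *buf)
{
    size_t start = pos - pos % SECT;

    memcpy(buf, log + start, (pos - start) + len);
    return buf + (pos - start);
}

/* Read a record of len bytes that starts at pos and wraps past the end
 * of a circular log of log_size bytes. Returns a pointer to the record
 * inside buf. use_offset_base selects where the second part lands:
 *   0: buf + first-part length (ignores the re-alignment)
 *   1: offset + first-part length (directly after the first part) */
static char *read_wrapped(const char *log, size_t log_size, size_t pos,
                          size_t len, char *buf, int use_offset_base)
{
    size_t first = log_size - pos;  /* bytes before the physical end */
    char *offset;

    assert(pos < log_size && len > first);
    offset = read_aligned(log, pos, first, buf);
    /* The second part starts at the physical beginning of the log. */
    memcpy((use_offset_base ? offset : buf) + first, log, len - first);
    return offset;
}
```

With a 16-byte log "abcdefghijklmnop" and a 6-byte record at position 13,
using offset as the base yields "nopabc", while using the buffer start
overwrites the "p" that ended the first read.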
Signed-off-by: Andy Poling <andy@realbig.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently we have different end I/O handlers for read vs the different
types of write I/O. But they are all very similar so we could just
use one with a few conditionals and reduce code size a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The VM and I/O schedulers now expect us to use WRITE_SYNC_PLUG for
synchronous writeout. Right now I can't see any changes in performance
numbers with this, but we're getting some beating for not using it,
and the knowledge definitely could help the block code to make better
decisions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
The iolock is used for protecting reads, writes and block truncates
against each other. We have two classes of callers, the first one is
induced by a file operation and requires a reference to the inode be
held and not dropped after the operation is done:
- xfs_vm_vmap, xfs_vn_fallocate, xfs_read, xfs_write, xfs_splice_read,
xfs_splice_write and xfs_setattr are all implementations of VFS
methods that require a live inode
- xfs_getbmap and xfs_swap_extents are ioctl subcommand for which the
same is true
- xfs_truncate_file is only called on quota inodes just returned from
xfs_iget
- xfs_sync_inode_data does the lock just after an igrab()
- xfs_filestream_associate and xfs_filestream_new_ag take the iolock
on the parent inode of an inode which by VFS rules must be referenced
And we have various calls to truncate blocks past EOF or the whole
file when dropping the last reference to an inode. Unfortunately
lockdep complains when we do memory allocations that can recurse into
the filesystem in the first class because the second class happens to
take the same lock. To avoid this re-init the iolock in the beginning
of xfs_fs_clear_inode to get a new lock class.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
When completing I/O requests we must not allow the memory allocator to
recurse into the filesystem, as we might deadlock on waiting for the
I/O completion otherwise. The only thing currently allocating normal
GFP_KERNEL memory is the allocation of the transaction structure for
the unwritten extent conversion. Add a memflags argument to
_xfs_trans_alloc to allow controlling the allocator behaviour.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Thomas Neumann <tneumann@users.sourceforge.net>
Tested-by: Thomas Neumann <tneumann@users.sourceforge.net>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
When xfs_free_eofblocks is called from ->release the VM might already
hold the mmap_sem, but in the write path we take the iolock before
taking the mmap_sem in the generic write code.
Switch xfs_free_eofblocks to only trylock the iolock if called from
->release and skip trimming the preallocated blocks in that case.
We'll still free them later on the final iput.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Currently the reclaim code for the case where we don't reclaim the
final reclaim is overly complicated. We know that the inode is clean
but instead of just directly reclaiming the clean inode we go through
the whole process of marking the inode reclaimable just to directly
reclaim it from the calling context. Besides being overly complicated
this introduces a race where iget could recycle an inode between being
marked reclaimable and actually being reclaimed, leading to panics.
This patch gets rid of the existing reclaim path, and replaces it with
a simple call to xfs_ireclaim if the inode was clean. While we're at
it we also use the slightly more lax xfs_inode_clean check we'd use
later to determine if we need to flush the inode here.
Finally get rid of xfs_reclaim function and place the remaining small
bits of reclaim code directly into xfs_fs_destroy_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Patrick Schreurs <patrick@news-service.com>
Reported-by: Tommy van Leeuwen <tommy@news-service.com>
Tested-by: Patrick Schreurs <patrick@news-service.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (49 commits)
nilfs2: separate wait function from nilfs_segctor_write
nilfs2: add iterator for segment buffers
nilfs2: hide nilfs_write_info struct in segment buffer code
nilfs2: relocate io status variables to segment buffer
nilfs2: do not return io error for bio allocation failure
nilfs2: use list_splice_tail or list_splice_tail_init
nilfs2: replace mark_inode_dirty as nilfs_mark_inode_dirty
nilfs2: delete mark_inode_dirty in nilfs_delete_entry
nilfs2: delete mark_inode_dirty in nilfs_commit_chunk
nilfs2: change return type of nilfs_commit_chunk
nilfs2: split nilfs_unlink as nilfs_do_unlink and nilfs_unlink
nilfs2: delete redundant mark_inode_dirty
nilfs2: expand inode_inc_link_count and inode_dec_link_count
nilfs2: delete mark_inode_dirty from nilfs_set_link
nilfs2: delete mark_inode_dirty in nilfs_new_inode
nilfs2: add norecovery mount option
nilfs2: add helper to get if volume is in a valid state
nilfs2: move recovery completion into load_nilfs function
nilfs2: apply readahead for recovery on mount
nilfs2: clean up get/put function of a segment usage
...
inode_change_ok already clears the SGID bit when necessary
so there is no reason for sysfs_setattr to carry code to do
the same, and it is good to kill the extra copy because when
I moved the code last in certain corner cases the code will
look at the wrong gid.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
These two functions do 90% of the same work and it doesn't significantly
obfuscate the function to allow both the parent dir and the name to change
at the same time. So merge them together to simplify maintenance, and
increase testing.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
By teaching sysfs_revalidate to hide a dentry for
a sysfs_dirent if the sysfs_dirent has been renamed,
and by teaching sysfs_lookup to return the original
dentry if the sysfs dirent has been renamed, I can
show the results of renames correctly without having to
update the dcache during the directory rename.
This massively simplifies the rename logic allowing a lot
of weird sysfs special cases to be removed along with
a lot of now unnecessary helper code.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With lazy inode updates and dentry operations bringing everything
into sync on demand there is no longer any need to immediately
update the vfs or grab i_mutex to protect those updates as we
make changes to sysfs.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that sysfs_getattr and sysfs_permission refresh the vfs
inode there is no need to immediately push the mode change
into the vfs cache. This reduces the amount of work needed and
simplifies the locking.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With the implementation of sysfs_getattr and sysfs_permission
sysfs becomes able to lazily propagate inode attribute changes
from the sysfs_dirents to the vfs inodes. This paves the way
for deleting significant chunks of now unnecessary code.
While doing this we did not reference sysfs_setattr from
sysfs_symlink_inode_operations so I added it along with
sysfs_getattr and sysfs_permission.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Lining up the functions in sysfs_symlink_inode_operations
follows the pattern in the rest of sysfs and makes things
slightly more readable.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently sysfs updates the timestamps on the vfs directory
inode when we create or remove a directory entry but doesn't
update the cached copy on the sysfs_dirent, fix that oversight.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cleanly separate the work that is specific to setting the
attributes of a sysfs_dirent from what is needed to update
the attributes of a vfs inode.
Additionally grab the sysfs_mutex to keep any nasties from
surprising us when updating the sysfs_dirent.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The granularity of sysfs time when we keep it is 1 ns, which
when passed to timestamp_trunc results in a nop. So remove
the unnecessary function call making sysfs_setattr slightly
easier to read.
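Why a 1 ns granularity makes truncation a nop can be seen in a small
user-space model. This is a sketch and not the kernel helper; the function
name and its handling of the one-second case are assumptions made for
illustration.

```c
#include <assert.h>
#include <time.h>

/* Model of truncating a timestamp to a filesystem's time granularity,
 * given in nanoseconds (1 .. 1000000000). */
static struct timespec trunc_to_granularity(struct timespec t, long gran_ns)
{
    assert(gran_ns >= 1 && gran_ns <= 1000000000L);
    if (gran_ns == 1000000000L)
        t.tv_nsec = 0;
    else
        t.tv_nsec -= t.tv_nsec % gran_ns;
    return t;
}
```

With gran_ns equal to 1 the remainder is always zero, so the timestamp
comes back unchanged.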
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently every caller of sysfs_chmod_file happens at either
file creation time to set a non-default mode or in response
to a specific user requested space change in policy, making
timestamps of when the chmod happens and notification of
a file changing mode uninteresting.
Remove the unnecessary time stamp and filesystem change
notification, and remove the last of the explicit inotify
and dnotify support from sysfs.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Calling d_drop unconditionally when a sysfs_dirent is deleted has
the potential to leak mounts, so instead implement dentry delete
and revalidate operations that cause sysfs dentries to be removed
at the appropriate time.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Using dentry instead of d in the function name is what
several other filesystems are doing and it seems to be
a more readable convention.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The sysfs_mutex is required to ensure updates are and will remain
atomic with respect to other inode iattr updates, that do not happen
through the filesystem.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Setting fops and private data outside of the mutex at debugfs file
creation introduces a race where the files can be opened with the wrong
file operations and private data. It is easy to trigger with a process
waiting on file creation notification.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Half the compat_ioctl handling is in devio.c, the other
half is in fs/compat_ioctl.c. This moves everything into
one place for consistency.
As a positive side-effect, push down the BKL into the
ioctl methods.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Alon Bar-Lev <alon.barlev@gmail.com>
Cc: David Vrabel <david.vrabel@csr.com>
Cc: linux-usb@vger.kernel.org
Handling for LPSETTIMEOUT can easily be done in lp_ioctl, which
is the only user. As a positive side-effect, push the BKL
into the ioctl methods.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Instead of having each handler call compat_ptr, we can now
convert the pointer once and pass that to each handler.
This saves a little bit of both source and object code size.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The compat_ioctl table now only contains entries for
COMPATIBLE_IOCTL, so we only need to know if a number
is listed in it or not.
As an optimization, we hash the table entries with a
reversible transformation to get a more uniform distribution
over it, sort the table at startup and then guess the
position in the table when an ioctl number gets called
to do a linear search from there.
With the current set of ioctl numbers and the chosen
transformation function, we need an average of four
steps to find if a number is in the set, all of the
accesses within one or two cache lines.
This is at least as good as the previous hash table
approach but saves 8.5 kb of kernel memory.
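The lookup described above can be sketched in user space roughly as
follows. This is an illustration, not the kernel code: the particular
mixing function and the names xform, table_init and table_contains are
assumptions; any reversible transformation that spreads the numbers
evenly over the 32-bit range would serve.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reversible mixing step (illustrative choice): XOR with left-shifted
 * copies of the value is invertible, so distinct numbers stay distinct. */
static uint32_t xform(uint32_t x)
{
    return x ^ (x << 27) ^ (x << 17);
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

    return (x > y) - (x < y);
}

/* One-time setup: transform every entry, then sort the table. */
static void table_init(uint32_t *tab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tab[i] = xform(tab[i]);
    qsort(tab, n, sizeof(tab[0]), cmp_u32);
}

/* Guess a starting index assuming a uniform distribution, then search
 * linearly up and down from there. */
static int table_contains(const uint32_t *tab, size_t n, uint32_t cmd)
{
    uint32_t x = xform(cmd);
    size_t max = n - 1;
    size_t i;

    assert(n > 0);
    i = (size_t)(((uint64_t)x * max) >> 32);
    while (tab[i] < x && i < max)
        i++;
    while (tab[i] > x && i > 0)
        i--;
    return tab[i] == x;
}
```

The better the transformation spreads the entries, the closer the initial
guess lands to the right slot and the shorter the linear search becomes.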
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The compat_ioctl array now contains only entries for ioctl numbers
that do not require a separate handler. By special-casing the
ULONG_IOCTL case in the do_ioctl_trans function, we can kill the
final use of a function pointer in the array.
   text    data     bss     dec     hex filename
   7539   13352    2080   22971    59bb before/fs/compat_ioctl.o
   7910    8552    2080   18542    486e after/fs/compat_ioctl.o
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This makes all ioctl conversion handlers called from
a single switch statement, leaving only COMPATIBLE_IOCTL
and ULONG_IOCTL statements in the table. This is somewhat
more space efficient and also lets us simplify the
handling of the lookup table significantly.
before:
   text    data     bss     dec     hex filename
   7619   14024    2080   23723    5cab obj/fs/compat_ioctl.o
after:
   7567   13352    2080   22999    59d7 obj/fs/compat_ioctl.o
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
We have always called ioctl conversion handlers under the big kernel lock,
although that is generally not necessary. In particular it is not needed
for conversion of data structures and for calling sys_ioctl or
do_vfs_ioctl, which will get the BKL again if needed.
Handlers doing more than those two have been moved out, so we can kill off
the BKL from compat_sys_ioctl. This may significantly improve latencies
with 32 bit applications, and it avoids a common scenario where a thread
acquires the BKL twice.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The VT driver now handles all of these ioctls directly, so we can remove
the handlers from common code.
These are the only handlers that require the BKL because they directly
perform the ioctl action rather than just converting the data structures.
Once they are gone, we can remove the BKL from the remaining ioctl
conversion handlers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
ext4: Do not override ext2 or ext3 if built they are built as modules
jbd2: Export jbd2_log_start_commit to fix ext4 build
ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
ext4: Wait for proper transaction commit on fsync
ext4: fix incorrect block reservation on quota transfer.
ext4: quota macros cleanup
ext4: ext4_get_reserved_space() must return bytes instead of blocks
ext4: remove blocks from inode prealloc list on failure
ext4: wait for log to commit when umounting
ext4: Avoid data / filesystem corruption when write fails to copy data
ext4: Use ext4 file system driver for ext2/ext3 file system mounts
ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
ext4: remove unused parameter wbc from __ext4_journalled_writepage()
ext4: remove encountered_congestion trace
ext4: move_extent_per_page() cleanup
ext4: initialize moved_len before calling ext4_move_extents()
ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
ext4: use ext4_data_block_valid() in ext4_free_blocks()
...
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
exofs: Multi-device mirror support
exofs: Move all operations to an io_engine
exofs: move osd.c to ios.c
exofs: statfs blocks is sectors not FS blocks
exofs: Prints on mount and unmout
exofs: refactor exofs_i_info initialization into common helper
exofs: dbg-print less
exofs: More sane debug print
trivial: some small fixes in exofs documentation
The call to migrate_page() will cause the page->private field to be
cleared.
Also fix up the locking around the page->private transfer, so that we ensure
that calls to nfs_page_find_request() don't end up racing.
Finally, fix up a double free bug: nfs_unlock_request() already calls
nfs_release_request() for us...
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Andi Kleen <andi@firstfloor.org>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When ext3_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks already
instantiated beyond i_size. Although these blocks were never inside i_size, we
have to truncate pagecache of these blocks so that corresponding buffers get
unmapped. Otherwise subsequent __block_prepare_write (called because we are
retrying the write) will find the buffers mapped, not call ->get_block, and
thus the page will be backed by already freed blocks leading to filesystem and
data corruption.
Reported-by: James Y Knight <foom@fuhm.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.
Signed-off-by: Jan Kara <jack@suse.cz>
We just have to add proper mount options handling. The rest is handled by
the generic quota code.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
So far the maximum quota space limit was 4TB. Apparently this isn't enough
for Lustre guys anymore. So implement new quota format which raises block
limits to 2^64 bytes. Also store number of inodes and inode limits in
64-bit variables as 2^32 files isn't that insanely high anymore.
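The 4TB figure above can be reproduced with a little arithmetic. The sketch below assumes the old format stores block limits as 32-bit counts of 1 KiB quota blocks; that layout is an assumption for illustration, not something stated in this message, and the demo_ names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the old format counts space in 1 KiB quota blocks. */
#define DEMO_QUOTABLOCK_BITS 10

/* Largest space limit, in bytes, expressible with a 32-bit block count. */
uint64_t demo_old_max_limit_bytes(void)
{
	return ((uint64_t)1 << 32) << DEMO_QUOTABLOCK_BITS;
}
```

Under that assumption the old ceiling is 2^42 bytes, i.e. 4 TiB, which is why 64-bit limits are needed to go beyond it.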
The first version of the patch has been developed by Andrew Perepechko
<Andrew.Perepechko@Sun.COM>.
CC: Andrew.Perepechko@Sun.COM
Signed-off-by: Jan Kara <jack@suse.cz>
Move definition of this constant to linux/quota.h so that it
cannot clash with other format IDs.
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Make messages produced by ext3 more unified. It should be
easy to parse.
dmesg before patch:
[ 4893.684892] reservations ON
[ 4893.684896] xip option not supported
[ 4893.684964] EXT3-fs warning: maximal mount count reached, running
e2fsck is recommended
dmesg after patch:
[ 873.300792] EXT3-fs (loop0): using internal journaln
[ 873.300796] EXT3-fs (loop0): mounted filesystem with writeback data mode
[ 924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.
[ 723.755642] EXT3-fs (loop0): error: bad blocksize 8192
[ 357.874687] EXT3-fs (loop0): error: no journal found. mounting ext3 over ext2?
[ 873.300764] EXT3-fs (loop0): warning: maximal mount count reached, running e2fsck is recommended
[ 924.163657] EXT3-fs (loop0): error: can't find ext3 filesystem on dev loop0.
Signed-off-by: Alexey Fisher <bug-track@fisher-privat.net>
Signed-off-by: Jan Kara <jack@suse.cz>
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during
unmount when a USB or floppy device is removed. I reported this as a kernel
regression, but it looks like it might have been there for longer
than that.
The super block update from a previous operation has marked the buffer
as in error, and the flag has to be cleared before doing the update.
(Similar code already exists in ext4).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Users on the list recently complained about differences across
filesystems w.r.t. how to mount without a journal replay.
In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext3.
Also show this status in /proc/mounts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
commit a71ce8c6c9 updated ext3_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change. This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.
The modifications were originally to keep the sb "more" in sync,
so that a readonly fsck of the device didn't flag this as an
error (as often), but apparently e2fsprogs deals with this differently
now, anyway.
Based on Ted's patch for ext4, which was in turn based on my
work on that bug and another preliminary patch...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ext3_xattr_set_handle() was zeroing out an inode outside
of journaling constraints; this is one of the accesses that
was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.
Although ext3 doesn't have the crc issue, modifications
out of journal control are a Bad Thing.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We should hold i_mutex when looking up quota files for journaled quotas,
otherwise a WARN_ON in lookup_one_len triggers. The fact that we didn't
hold i_mutex previously probably could not lead to a real bug since the
filesystem is just being mounted / remounted read-write and thus the
root directory cannot change anyway but it's definitely cleaner with
i_mutex.
Reported-by: Bastien ROUCARIES <roucaries.bastien@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate call to ubifs_sync_wbufs_by_inode which is already
covered by ubifs_fsync.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
generic_file_aio_write already calls into ->fsync to handle O_SYNC/O_DSYNC.
Remove the duplicate manual invocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
While Linux has provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems.
Since that day we have had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment. This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
patches, which are required before this patch, it's actually surprisingly
simple: we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.
This patch renames the existing O_SYNC flag to O_DSYNC while keeping its
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag. To guarantee backwards compatibility it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.
This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actually care need to check __O_SYNC in addition. Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set. The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.
We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.
Note that parisc really screwed up their headers as they already define an
O_DSYNC that has always been a no-op. We try to repair it by using it for
the new O_DSYNC and redefining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.
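The flag scheme described above can be sketched as follows. The DEMO_ names and the numeric values are hypothetical and chosen only for illustration (real values are per-architecture); the point is that O_SYNC expands to both bits, so a plain O_DSYNC test also catches O_SYNC.

```c
/* Hypothetical values for illustration only. */
#define DEMO_O_DSYNC    010000u   /* the old O_SYNC bit, renamed */
#define DEMO_O_SYNC_BIT 04000000u /* stands in for the new __O_SYNC bit */
#define DEMO_O_SYNC     (DEMO_O_SYNC_BIT | DEMO_O_DSYNC)

/* Open-path fixup: the extra bit never appears without O_DSYNC. */
unsigned int demo_fixup_open_flags(unsigned int flags)
{
	if (flags & DEMO_O_SYNC_BIT)
		flags |= DEMO_O_DSYNC;
	return flags;
}

/* Callers that do not care about the difference test O_DSYNC only. */
int demo_wants_data_sync(unsigned int flags)
{
	return (flags & DEMO_O_DSYNC) != 0;
}

/* Only callers that need full O_SYNC semantics test the extra bit. */
int demo_wants_full_sync(unsigned int flags)
{
	return (flags & DEMO_O_SYNC_BIT) != 0;
}
```

A binary built against old headers passes only the old bit and therefore keeps getting the O_DSYNC behaviour it always had.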
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch changes on-disk format, it is accompanied with a parallel
patch to mkfs.exofs that enables multi-device capabilities.
After this patch, old exofs will refuse to mount a newly formatted FS and
new exofs will refuse an old format. This is done by moving the magic
field offset inside the FSCB. A new FSCB *version* field was added. In
the future, exofs will refuse to mount unmatched FSCB version. To
up-grade or down-grade an exofs one must use mkfs.exofs --upgrade option
before mounting.
Introduced a new object that contains a *device-table*. This object
contains the default *data-map* and a linear array of device
information, which identifies the devices used in the filesystem. This
object is only written to offline by mkfs.exofs. This is why it is kept
separate from the FSCB, since the latter is written to while mounted.
The same partition number and the same object number are used on all
devices; only the device varies.
* Define the new format, then load the device table at mount time and make
sure everything is supported.
* Change the I/O engine to support Mirror IO, i.e. write the same data
to multiple devices, read from a random device to spread the
read-load from multiple clients (TODO: stripe read)
Implementation notes:
A few points introduced in previous patch should be mentioned here:
* Special care was taken so that absolutely all operations that have any
chance of failing are done before any osd-request is executed. This is to
minimize the need for data consistency recovery to only real IO
errors.
* Each IO state has a kref. It starts at 1, any osd-request executed
will increment the kref, finally when all are executed the first ref
is dropped. At IO-done, each request completion decrements the kref,
the last one to return executes the internal _last_io() routine.
_last_io() will call the registered io_state_done. In sync mode a
caller does not supply a done method, indicating a synchronous
request; the caller is put to sleep and a special io_state_done is
registered that will awaken the caller. Even in sync mode, all
operations are executed in parallel.
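The reference counting described in the notes above can be sketched in plain userspace C. This is a single-threaded model with hypothetical demo_ names; a plain int stands in for the kref, which in the kernel is atomic, and nothing here is the exofs code itself.

```c
#include <stddef.h>

struct demo_io_state {
	int refs;                                /* stands in for the kref */
	int done_calls;                          /* how often done() ran */
	void (*done)(struct demo_io_state *ios); /* registered io_state_done */
};

/* Example completion callback: just counts invocations. */
void demo_count_done(struct demo_io_state *ios)
{
	ios->done_calls++;
}

/* The io state starts with one reference, held by the submitter. */
void demo_ios_init(struct demo_io_state *ios,
		   void (*done)(struct demo_io_state *ios))
{
	ios->refs = 1;
	ios->done_calls = 0;
	ios->done = done;
}

/* Drop one reference; the last drop plays the role of _last_io(). */
void demo_ios_put(struct demo_io_state *ios)
{
	if (--ios->refs == 0 && ios->done != NULL)
		ios->done(ios);
}

/* Each executed request takes a reference; then the first one is dropped. */
void demo_submit_all(struct demo_io_state *ios, int nr_requests)
{
	int i;

	for (i = 0; i < nr_requests; i++)
		ios->refs++; /* request i would be executed here */
	demo_ios_put(ios);
}

/* Called once per request at IO-done time. */
void demo_request_complete(struct demo_io_state *ios)
{
	demo_ios_put(ios);
}
```

Holding the initial reference until every request has been issued keeps the done callback from firing while submission is still in progress, even if early requests complete immediately.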
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In anticipation of multi-device operations, we separate osd operations
into an abstract I/O API. Currently only one device is used but later
when adding more devices, we will drive all devices in parallel according
to a "data_map" that describes how data is arranged on multiple devices.
The file system level operates, like before, as if there is one object
(inode-number) and an i_size. The io engine will split this to the same
object-number but on multiple devices.
At first we introduce the Mirror (raid 1) layout. But as the final outcome
we intend to fully implement the pNFS-Objects data-map, including
raid 0,4,5,6 over mirrored devices, over multiple device-groups. And
more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12
* Define an io_state based API for accessing osd storage devices
in an abstract way.
Usage:
First a caller allocates an io state with:
	exofs_get_io_state(struct exofs_sb_info *sbi,
			   struct exofs_io_state** ios);
Then calls one of:
	exofs_sbi_create(struct exofs_io_state *ios);
	exofs_sbi_remove(struct exofs_io_state *ios);
	exofs_sbi_write(struct exofs_io_state *ios);
	exofs_sbi_read(struct exofs_io_state *ios);
	exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);
And when done:
	exofs_put_io_state(struct exofs_io_state *ios);
* Convert all source files to use this new API
* Convert from bio_alloc to bio_kmalloc
* In io engine we make use of the now fixed osd_req_decode_sense
There are no functional changes or on disk additions after this patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If I do a "git mv" together with a massive code change
and commit in one patch, git loses the rename and
records a delete/new instead. This is bad because I want
a rename recorded so later rebased/cherry-picked patches
to the old name will work. Also the --follow is lost.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Even though exofs has a 4k block size, statfs blocks
is in sectors (512 bytes).
Also if target returns 0 for capacity then make it
ULLONG_MAX. df does not like zero-size filesystems
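The unit conversion described above might look roughly like this. It is a hedged sketch with hypothetical names (demo_statfs_blocks, DEMO_SECTOR_SIZE), assuming the target reports its capacity in bytes; it is not the exofs implementation.

```c
#include <limits.h>

#define DEMO_SECTOR_SIZE 512ULL /* statfs blocks are reported in sectors */

/* Convert a capacity in bytes to a statfs block count in 512-byte sectors. */
unsigned long long demo_statfs_blocks(unsigned long long capacity_bytes)
{
	/* A target reporting 0 is treated as "unknown, effectively unlimited". */
	if (capacity_bytes == 0)
		capacity_bytes = ULLONG_MAX;
	return capacity_bytes / DEMO_SECTOR_SIZE;
}
```

With this convention a single 4k filesystem block shows up as 8 statfs blocks, and a zero capacity never produces a zero-size filesystem.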
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
It is important to print in the logs when a filesystem was
mounted and eventually unmounted.
Print the osd-device's osd_name and pid the FS was
mounted/unmounted on.
TODO: How to also print the namespace path the filesystem was
mounted on?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
There are two places that initialize inodes: exofs_iget() and
exofs_new_inode()
As more members of exofs_i_info that need initialization are
added this code will grow. (soon)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Inner-loop printing is converted to EXOFS_DBG2, which is #defined
to nothing.
It is now almost bearable to just leave debug on. Every operation
is printed once, with most relevant info (I hope).
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
ext2 or ext3 file systems if those file system drivers are
configured to be built as modules.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: (31 commits)
kill-the-bkl/reiserfs: turn GFP_ATOMIC flag to GFP_NOFS in reiserfs_get_block()
kill-the-bkl/reiserfs: drop the fs race watchdog from _get_block_create_0()
kill-the-bkl/reiserfs: definitely drop the bkl from reiserfs_ioctl()
kill-the-bkl/reiserfs: always lock the ioctl path
kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency
kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency
kill-the-bkl/reiserfs: panic in case of lock imbalance
kill-the-bkl/reiserfs: fix recursive reiserfs write lock in reiserfs_commit_write()
kill-the-bkl/reiserfs: fix recursive reiserfs lock in reiserfs_mkdir()
kill-the-bkl/reiserfs: fix "reiserfs lock" / "inode mutex" lock inversion dependency
kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock
kill-the-bkl/reiserfs: acquire the inode mutex safely
kill-the-bkl/reiserfs: unlock only when needed in search_by_key
kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe
kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end()
kill-the-bkl/reiserfs: reduce number of contentions in search_by_key()
kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup()
kill-the-bkl/reiserfs: lock only once on reiserfs_get_block()
kill-the-bkl/reiserfs: conditionaly release the write lock on fs_changed()
kill-the-BKL/reiserfs: add reiserfs_cond_resched()
...
The NFSv4.1 spec indicates RECLAIM_COMPLETE is to be issued
whenever a client establishes a new client id, not only after
detecting the server has rebooted.
Set the NFS4CLNT_RECLAIM_REBOOT bit after every new client id has
been established. This enables us to issue RECLAIM_COMPLETE
during the wrap up of the NFS4CLNT_RECLAIM_REBOOT state.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
cfq-iosched: Do not access cfqq after freeing it
block: include linux/err.h to use ERR_PTR
cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
blkio: Allow CFQ group IO scheduling even when CFQ is a module
blkio: Implement dynamic io controlling policy registration
blkio: Export some symbols from blkio as its user CFQ can be a module
block: Fix io_context leak after failure of clone with CLONE_IO
block: Fix io_context leak after clone with CLONE_IO
cfq-iosched: make nonrot check logic consistent
io controller: quick fix for blk-cgroup and modular CFQ
cfq-iosched: move IO controller declerations to a header file
cfq-iosched: fix compile problem with !CONFIG_CGROUP
blkio: Documentation
blkio: Wait on sync-noidle queue even if rq_noidle = 1
blkio: Implement group_isolation tunable
blkio: Determine async workload length based on total number of queues
blkio: Wait for cfq queue to get backlogged if group is empty
blkio: Propagate cgroup weight updation to cfq groups
blkio: Drop the reference to queue once the task changes cgroup
blkio: Provide some isolation between groups
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
security/tomoyo: Remove now unnecessary handling of security_sysctl.
security/tomoyo: Add a special case to handle accesses through the internal proc mount.
sysctl: Drop & in front of every proc_handler.
sysctl: Remove CTL_NONE and CTL_UNNUMBERED
sysctl: kill dead ctl_handler definitions.
sysctl: Remove the last of the generic binary sysctl support
sysctl net: Remove unused binary sysctl code
sysctl security/tomoyo: Don't look at ctl_name
sysctl arm: Remove binary sysctl support
sysctl x86: Remove dead binary sysctl support
sysctl sh: Remove dead binary sysctl support
sysctl powerpc: Remove dead binary sysctl support
sysctl ia64: Remove dead binary sysctl support
sysctl s390: Remove dead sysctl binary support
sysctl frv: Remove dead binary sysctl support
sysctl mips/lasat: Remove dead binary sysctl support
sysctl drivers: Remove dead binary sysctl support
sysctl crypto: Remove dead binary sysctl support
sysctl security/keys: Remove dead binary sysctl support
sysctl kernel: Remove binary sysctl logic
...
Currently, if the call to nfs4_setup_sequence() in nfs4_close_prepare
fails, any later retries will fail to launch an RPC call, due to the fact
that the &state->flags will have been cleared.
Ditto if nfs4_close_done() triggers a call to the NFSv4.1 version of
nfs_restart_rpc().
We therefore move the actual clearing of the state->flags to
nfs4_close_done(), when we know that the RPC call was successful.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Return the PTR_ERR of the correct pointer. This fixes the debugging code.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Update nfs4_delegreturn_done() to retry the operation after setting the
NFS4CLNT_SESSION_SETUP bit to indicate the need to reset the session.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The state manager was not marking the stateids as needing to be reclaimed
after reestablishing the clientid.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also rename it: it is used in generic code, and so should not have a 'nfs4'
prefix.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch fixes three problems in the handling of the
EXT4_IOC_MOVE_EXT ioctl:
1. In current EXT4_IOC_MOVE_EXT, there are read access mode checks for
original and donor files, but they allow the illegal write access to
donor file, since donor file is overwritten by original file data. To
fix this problem, change access mode checks of original (r->r/w) and
donor (r->w) files.
2. Disallow the use of donor files that have setuid or setgid bits.
3. Call mnt_want_write() and mnt_drop_write() before and after
calling ext4_move_extents() to get write access to a mount.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.
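The bookkeeping described above can be modelled in a few lines. The struct, field, and function names below are hypothetical, and picking the allocation transaction for a data-only sync is an assumption of this sketch rather than something stated in this message.

```c
struct demo_inode {
	unsigned int sync_tid;     /* last transaction that changed the inode */
	unsigned int datasync_tid; /* last transaction that changed allocation */
};

/* Record which transaction touched the inode. */
void demo_note_change(struct demo_inode *inode, unsigned int tid,
		      int changed_allocation)
{
	inode->sync_tid = tid;
	if (changed_allocation)
		inode->datasync_tid = tid;
}

/* Transaction that fsync has to force to disk (assumed selection rule). */
unsigned int demo_fsync_target(const struct demo_inode *inode, int datasync)
{
	return datasync ? inode->datasync_tid : inode->sync_tid;
}
```

Because the decision depends on recorded transaction IDs and not on buffer dirty bits, it no longer matters whether pdflush cleared those bits before fsync ran.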
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Inside a ->setattr() call both ATTR_UID and ATTR_GID may be valid.
This means that we may end up transferring all quotas. And
we have to reserve QUOTA_DEL_BLOCKS for all quotas, as we do in
case of QUOTA_INIT_BLOCKS.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently all quota block reservation macros contain a hard-coded "2",
aka the MAXQUOTAS value. This is no good because in some places it is not
obvious what this digit represents. Let's introduce a
new macro with a self-descriptive name.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This fixes a leak of blocks in an inode prealloc list if device failures
cause ext4_mb_mark_diskspace_used() to fail.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is a potential race when a transaction is committing right when
the file system is being unmounted. This could result in a race
because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
before the commit code calls a callback so the mballoc code can
release freed blocks in the transaction, resulting in a panic trying
to access the freed s_group_info.
The fix is to wait for the transaction to finish committing before we
shutdown the multiblock allocator.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When ext4_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks
already instantiated beyond i_size. Although these blocks were never
inside i_size, we have to truncate the pagecache of these blocks so
that corresponding buffers get unmapped. Otherwise subsequent
__block_prepare_write (called because we are retrying the write) will
find the buffers mapped, not call ->get_block, and thus the page will
be backed by already freed blocks leading to filesystem and data
corruption.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add a new config option, CONFIG_EXT4_USE_FOR_EXT23 which if enabled,
will cause ext4 to be used for either ext2 or ext3 file system mounts
when ext2 or ext3 is not enabled in the configuration.
This allows minimalist kernel fanatics to drop two file system drivers
from their compiled kernel without losing functionality.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, don't clear the
NFS4CLNT_SESSION_DRAINING flag and don't wake RPCs waiting for the
session to be reestablished. We don't have a session yet, so there
is no reason to wake other RPCs.
This avoids sending spurious compounds with bogus sequenceID during
session and state recovery.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[Trond.Myklebust@netapp.com: cleaned up patch by adding the
nfs41_begin/end_drain_session() helpers]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move call to get the lease time and the setup of the state
renewal out of nfs4_create_session so that it can be called
after clearing the DRAINING flag. We use the getattr RPC
to obtain the lease time, which requires a sequence slot.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We no longer need to maintain a distinction between nfs41_sequence_done and
nfs41_sequence_free_slot.
This fixes a number of slot table leakages in the NFSv4.1 code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We should not assume that nfs41_init_clientid() will always want to
initialise the session. If it is being called due to a server reboot, then
we just want to reset the session after re-establishing the clientid.
Fix this by getting rid of the 'reset' parameter in
nfs4_proc_create_session(), and instead relying on whether or not the
session slot table pointer is non-NULL.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch invokes RECLAIM_COMPLETE after the client is done
reclaiming state.
There are interpretations of the spec that suggest that
RECLAIM_COMPLETE should also be issued after a new clientid
has been obtained from the server and even if there is no
state to reclaim. This tells the server that the client
has no state to reclaim even if the client isn't aware the
server may have rebooted.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implements RECLAIM_COMPLETE as an asynchronous RPC.
NFS4ERR_DELAY is retried, NFS4ERR_DEADSESSION invokes the error handling
but does not result in a retry, since we don't want to have a lingering
RECLAIM_COMPLETE call sent in the middle of a possible new state recovery
cycle. If a session reset occurs, a new wave of reclaim operations will
follow, containing their own RECLAIM_COMPLETE call. We don't want a
retry to get in the way of recovery by incorrectly indicating to the
server that we're done reclaiming state.
A subsequent patch invokes the functionality.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
XDR encoding and decoding for RECLAIM_COMPLETE. Implements the necessary
encoding to indicate reclaim complete for the entire client. In the future,
it can be extended to provide reclaim complete functionality for a single
file system after migration.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise we have no guarantees that other processes won't start another
RPC call while we're resetting the session.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In NFSv4.1 the seqid part of a stateid in CB_RECALL must be 0.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The server can indicate a number of error conditions by setting the
appropriate bits in the SEQUENCE operation. The client re-establishes
state with the server when it receives one of those, with the action
depending on the specific case.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The v4.1 client should take into account the desired rsize, wsize when
negotiating the max size in CREATE_SESSION. Accordingly, it should use
rsize, wsize that are smaller than the session negotiated values.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In v4.1 the client MUST/SHOULD use the EXCLUSIVE4_1 flag instead of
EXCLUSIVE4, and GUARDED when the server supports persistent sessions.
For now (and until we support suppattr_exclcreat), we don't send any
attributes with EXCLUSIVE4_1, relying on the subsequent SETATTR as in v4.0.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For now the client returns _all_ the delegations of the specified type
it holds.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4.1 spec-29 (18.36.3) says that the server MUST use an ONC RPC
(program) version number equal to 4 in callbacks sent to the client.
For now we allow both versions 1 and 4.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (31 commits)
GFS2: Fix glock refcount issues
writeback: remove unused nonblocking and congestion checks (gfs2)
GFS2: drop rindex glock to refresh rindex list
GFS2: Tag all metadata with jid
GFS2: Locking order fix in gfs2_check_blk_state
GFS2: Remove dirent_first() function
GFS2: Display nobarrier option in /proc/mounts
GFS2: add barrier/nobarrier mount options
GFS2: remove division from new statfs code
GFS2: Improve statfs and quota usability
GFS2: Use dquot_send_warning()
VFS: Export dquot_send_warning
GFS2: Add set_xquota support
GFS2: Add get_xquota support
GFS2: Clean up gfs2_adjust_quota() and do_glock()
GFS2: Remove constant argument from qd_get()
GFS2: Remove constant argument from qdsb_get()
GFS2: Add proper error reporting to quota sync via sysfs
GFS2: Add get_xstate quota function
GFS2: Remove obsolete code in quota.c
...
"Definition" is misspelled "defintion" in several comments; this
patch fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
"Journaled" is misspelled "journlaled" in an output string; this patch
fixes it. No changes in functionality.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Replace sync and async handlers setting of the NFS4CLNT_SESSION_SETUP bit with
setting NFS4CLNT_CHECK_LEASE, and let the state manager decide to reset the session.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not wake up the next slot_tbl_waitq task in nfs4_free_slot because we
may be draining the slot. Either signal the state manager that the session
is drained (the state manager wakes up tasks) OR wake up the next task.
In nfs41_sequence_done, the slot dereference is only needed in the sequence
operation success case.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the session is reset during state recovery, the state manager thread can
sleep on the slot_tbl_waitq causing a deadlock.
Add a completion framework to the session. Have the state manager thread set
a new session state (NFS4CLNT_SESSION_DRAINING) and wait for the session slot
table to drain.
Signal the state manager thread in nfs41_sequence_free_slot when the
NFS4CLNT_SESSION_DRAINING bit is set and the session is drained.
Reported-by: Trond Myklebust <trond@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
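The slot-release rule described in the two session-draining patches above can be sketched as a small decision function. This is a simplified userspace illustration, not the kernel code: the struct, the enum and the function name are hypothetical, and the "draining" flag only stands in for NFS4CLNT_SESSION_DRAINING.

```c
#include <stdbool.h>

/* Hypothetical, simplified session state; not the kernel's nfs4_session. */
struct session_sketch {
	unsigned int slots_in_use;	/* slots currently held by RPC tasks */
	bool draining;			/* stands in for NFS4CLNT_SESSION_DRAINING */
};

enum free_slot_action {
	WAKE_NEXT_WAITER,	/* normal case: let the next queued task run */
	SIGNAL_STATE_MANAGER,	/* draining, and the last slot was released */
	DO_NOTHING		/* draining, but other slots are still in flight */
};

/* Release one slot and report which wakeup, if any, the caller should issue. */
static enum free_slot_action free_slot_sketch(struct session_sketch *s)
{
	if (s->slots_in_use > 0)
		s->slots_in_use--;

	if (!s->draining)
		return WAKE_NEXT_WAITER;
	if (s->slots_in_use == 0)
		return SIGNAL_STATE_MANAGER;
	return DO_NOTHING;
}
```

The point of the sketch is that the two wakeups are mutually exclusive: while the session drains, no new task is woken, and only the release of the last slot signals the state manager.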
nfs4_recover_session can put rpciod to sleep. Just use nfs4_schedule_recovery.
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not fall through and set NFS4CLNT_SESSION_RESET bit on NFS4ERR_EXPIRED
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not fall through and call nfs4_delay on session error handling.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_read_done returns zero on unhandled errors. nfs_readpage_result will
return on a negative tk_status without freeing the slot.
Call nfs4_sequence_free_slot on unhandled errors in nfs4_read_done.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs41_sequence_free_slot can be called multiple times on SEQUENCE operation
errors.
No reason to inline nfs4_restart_rpc
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
nfs_writeback_done and nfs_readpage_retry call nfs4_restart_rpc outside the
error handler, and the slot is not freed prior to restarting in the rpc_prepare
state during session reset.
Fix this by moving the call to nfs41_sequence_free_slot from the error
path of nfs41_sequence_done into nfs4_restart_rpc, and by removing the test
for NFS4CLNT_SESSION_SETUP.
Always free slot and goto the rpc prepare state on async errors.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make this clear by calling rpc_restart_call.
Prepare for nfs4_restart_rpc() to free slots.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The bit is no longer used for session setup, only for session reset.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reported-by: Trond Myklebust <trond.myklebust@netapp.com>
Resetting the clientid from the state manager could result in not confirming
the clientid due to create session not being called.
Move the create session call from the NFS4CLNT_SESSION_SETUP state manager
initialize session case into the NFS4CLNT_LEASE_EXPIRED case establish_clid
call.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS4ERR_FILE_OPEN is returned by the server when an operation cannot be
performed because the file is currently open and local (to the server)
semantics prohibit the operation while the file is open.
A typical case is a RENAME operation on an MS-Windows platform, which
prevents rename while the file is open.
While it is possible that such a condition is transitory, it is also
very possible that the file will be held open for an extended period
of time thus preventing the operation.
The current behaviour of Linux/NFS is to retry the operation
indefinitely. This is not appropriate - we do not expect a rename to
take an arbitrary amount of time to complete.
Rather, an error should be returned. The most obvious error code
would be EBUSY, which is legal at least for 'rename' and 'unlink',
and accurately captures the reason for the error.
This patch allows a few retries until about 2 seconds have elapsed,
then returns EBUSY.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
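The bounded-retry policy for NFS4ERR_FILE_OPEN described above might be sketched as follows. This is only an illustration under stated assumptions: the helper name and the millisecond bookkeeping are hypothetical, and the 2000 ms limit simply encodes the "about 2 seconds" from the text.

```c
#include <errno.h>

/* "About 2 seconds", per the commit text; the exact value is an assumption. */
#define FILE_OPEN_RETRY_LIMIT_MS 2000

/*
 * Hypothetical helper: decide what to do after the server reports
 * NFS4ERR_FILE_OPEN. elapsed_ms is how long the caller has been retrying.
 * Returns 0 to retry again, or -EBUSY once the retry budget is spent.
 */
static int file_open_retry_sketch(unsigned int elapsed_ms)
{
	if (elapsed_ms < FILE_OPEN_RETRY_LIMIT_MS)
		return 0;
	return -EBUSY;
}
```

A caller would sleep between attempts, add the sleep time to elapsed_ms, and pass any non-zero return value back to the application instead of looping indefinitely.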
The d_instantiate(new_dentry, NULL) is superfluous, the dentry is
already negative. Rehashing this dummy dentry isn't needed either,
d_move() works fine on an unhashed target.
The re-checking for busy after a failed nfs_sillyrename() is bogus
too: new_dentry->d_count < 2 would be a bug here.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move unhashing the target to after the check for existence and being a
non-directory.
If renaming a directory then the VFS already unhashes the target if it
is not busy. If it's busy then acquiring more references during the
rename makes no difference.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Comments are wrong or out of date. In particular, d_drop() doesn't
free the inode; it just unhashes the dentry. And if target is a
directory then it is not checked for being busy.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
VFS already checks if both source and target are directories.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When the "rsize=" or "wsize=" mount options are not specified,
text-based mounts have slightly different behavior than legacy binary
mounts. Text-based mounts use the smaller of the server's maximum
and the client's maximum, but binary mounts use the smaller of the
server's _preferred_ size and the client's maximum.
This difference is actually pretty subtle. Most servers advertise
the same value as their maximum and their preferred transfer size, so
the end result is the same in most cases.
The reason for this difference is that for text-based mounts, if
r/wsize are not specified, they are set to the largest value supported
by the client. For legacy mounts, the values are set to zero if these
options are not specified.
nfs_server_set_fsinfo() can negotiate the transfer size defaults
correctly in any case. There's no need to specify any particular
value as default in the text-based option parsing logic.
Note that nfs4 doesn't use nfs_server_set_fsinfo(), but the mount.nfs4
command does set rsize and wsize to 0 if the user didn't specify these
options. So, make the same change for text-based NFSv4 mounts.
Thanks to James Pearson <james-p@moving-picture.com> for reporting and
diagnosing the problem.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
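The difference between the two defaults in the rsize/wsize patch above can be shown with a small sketch. It is a simplified model of the negotiation, not the body of nfs_server_set_fsinfo(); the function and parameter names are hypothetical, and a requested size of 0 means that no rsize=/wsize= option was given.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical sketch of choosing a transfer size.
 * requested == 0: start from the server's preferred size.
 * requested != 0: start from the requested size.
 * Either way, clamp to the server's maximum and the client's maximum.
 */
static uint32_t pick_xfer_size_sketch(uint32_t requested,
				      uint32_t server_pref,
				      uint32_t server_max,
				      uint32_t client_max)
{
	uint32_t size = requested ? requested : server_pref;

	size = min_u32(size, server_max);
	return min_u32(size, client_max);
}
```

With a default of 0 the result follows the server's preferred size. With the old text-based default (the client's maximum passed as the requested size) the result is the server's maximum instead, which is the subtle difference the patch removes.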
Recent changes to snprintf() introduced the %pI6c formatter, which can
display an IPv6 address with standard shorthanding. Use this new
formatter when displaying IPv6 server addresses in /proc/mounts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
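For readers unfamiliar with IPv6 shorthand, the effect of the compressed form in the %pI6c patch above can be reproduced in userspace with inet_ntop(). This is only an analogue: it does not use the kernel's %pI6c formatter, and the helper name is hypothetical.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Userspace analogue only: parse a textual IPv6 address and write it back
 * in shorthand form. Returns 0 on success, -1 on a parse or buffer error.
 */
static int ipv6_shorthand_sketch(const char *in, char *out, size_t outlen)
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, in, &addr) != 1)
		return -1;
	if (inet_ntop(AF_INET6, &addr, out, (socklen_t)outlen) == NULL)
		return -1;
	return 0;
}
```

A fully expanded address such as 2001:0db8:0000:0000:0000:0000:0000:0001 comes back with leading zeros dropped and the run of zero groups collapsed to "::".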
Solaris uses netids as values for the proto= option, so that when
someone specifies "tcp6" they get traffic over TCP + IPv6. Until
recently, this has never really been an issue for Linux since it didn't
support NFS over IPv6. The netid and the protocol name were generally
always the same (modulo any strange configuration in /etc/netconfig).
The Solaris manpage documents their proto= option as:
proto= _netid_ | rdma
This patch is intended to bring Linux closer to how the Solaris proto=
option works, by declaring a static netid mapping in the kernel and
converting the proto= and mountproto= options to follow it and display
the proper values in /proc/mounts.
Much of this functionality will need to be provided by a userspace
mount.nfs patch. Chuck Lever has a patch to change mount.nfs in
the same way. In principle, we could do *all* of this in userspace but
that would mean that the options in /proc/mounts may not match the
options used by userspace.
The alternative to the static mapping here is to add a mechanism to
upcall to userspace for netid's. I'm not opposed to that option, but
it'll probably mean more overhead (and quite a bit more code). Rather
than shoot for that at first, I figured it was probably better to
start simply.
Comments welcome.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
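A static netid mapping of the kind described in the proto= patch above could look roughly like this. The table layout, type names and lookup helper are hypothetical and are not taken from the patch; only the netid strings themselves (udp, udp6, tcp, tcp6) are the conventional ones.

```c
#include <stddef.h>
#include <string.h>

enum xprt_sketch { XPRT_UDP, XPRT_TCP };
enum family_sketch { FAMILY_IPV4, FAMILY_IPV6 };

/* Hypothetical table entry: one netid maps to a transport plus a family. */
struct netid_map_sketch {
	const char *netid;
	enum xprt_sketch xprt;
	enum family_sketch family;
};

static const struct netid_map_sketch netid_table[] = {
	{ "udp",  XPRT_UDP, FAMILY_IPV4 },
	{ "udp6", XPRT_UDP, FAMILY_IPV6 },
	{ "tcp",  XPRT_TCP, FAMILY_IPV4 },
	{ "tcp6", XPRT_TCP, FAMILY_IPV6 },
};

/* Look up a proto=/mountproto= value; returns NULL for an unknown netid. */
static const struct netid_map_sketch *lookup_netid_sketch(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(netid_table) / sizeof(netid_table[0]); i++) {
		if (strcmp(netid_table[i].netid, name) == 0)
			return &netid_table[i];
	}
	return NULL;
}
```

The same table can be walked in the other direction when printing /proc/mounts, so that the displayed proto= value matches what was given at mount time.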
The nfs4_state_manager should not be looking at the error values when
deciding whether or not to loop round in order to handle a higher priority
state recovery task. It should rather be looking at the clp->cl_state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If our lease expires, and the server reboots while we're recovering, we
need to be able to wait until the grace period is over.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_recovery_handle_error() will correctly handle errors such as
NFS4ERR_CB_PATH_DOWN, however because they are still passed back to the
main loop in nfs4_state_manager(), they can cause the latter to exit
prematurely.
Fix this by letting nfs4_recovery_handle_error() change the error value in
cases where there is no action required by the caller.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In practice, we need to ensure that we call nfs4_state_end_reclaim_reboot
in 2 cases:
- If we lose the lease while we were reclaiming state
OR
- After we're done with reboot recovery
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The nfsv4 state manager could potentially deadlock inside
__nfs_inode_return_delegation() if the server reboots, so that the calls to
nfs_msync_inode() end up waiting on state recovery to complete.
Also ensure that if a server reboot or network partition causes us to have
to stop returning delegations, that NFS4CLNT_DELEGRETURN is set so that
the state manager can resume any outstanding delegation returns after it
has dealt with the state recovery situation.
Finally, ensure that the state manager doesn't wait for the DELEGRETURN
call to complete. It doesn't need to, and that too can cause a deadlock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: [PATCH] nfs: fix acl decoding
Commit 28f566942c "NFS: use dynamically
computed compound_hdr.replen for xdr_inline_pages offset" accidentally
changed the amount of space to allow for the acl reply, resulting in an
IO error on attempts to get an acl.
Reported-by: Paul Rudin <paul@rudin.co.uk>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- no one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
congestion wait
So remove the congestion checks as suggested by Chris.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Alex Elder <aelder@sgi.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It will lower the flush priority for NFS, and maybe more in future.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is dead code because no bdi flush thread will be started for
!bdi_cap_writeback_dirty bdi.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes some ref counting issues. Firstly by moving
the point at which we drop the ref count after a dlm lock
operation has completed we ensure that we never call
gfs2_glock_hold() on a lock with a zero ref count.
Secondly, by using atomic_dec_and_lock() in gfs2_glock_put()
we ensure that at no time will a glock with zero ref count
appear on the lru_list. That means that we can remove the
check for this in our shrinker (which was racy).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
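The atomic_dec_and_lock() pattern mentioned in the glock refcount patch above can be sketched in userspace with C11 atomics and a pthread mutex. This is an illustration of the idea, not the kernel implementation or the GFS2 code; the function name is hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace sketch of the atomic_dec_and_lock() idea: drop one reference.
 * Returns true, with the lock held, only if the count reached zero;
 * otherwise returns false with the lock not held.
 */
static bool dec_and_lock_sketch(atomic_int *cnt, pthread_mutex_t *lock)
{
	int old = atomic_load(cnt);

	/* Fast path: not the last reference, so drop it without the lock. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}

	/* Slow path: possibly the last reference; decide under the lock. */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* count hit zero; caller now holds the lock */
	pthread_mutex_unlock(lock);
	return false;
}
```

Because the transition to zero only ever happens with the lock held, an object whose list membership is protected by that lock is never visible on the list with a zero count, which is the property the patch relies on for the lru_list.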
No one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more. And lumpy pageout will want to do
nonblocking writeback without the congestion wait.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a gfs2 filesystem is grown, it needs to rebuild the rindex list to be able
to use the new space. gfs2 does this when the rindex is marked not uptodate,
which happens when the rindex glock is dropped. However, on a single node
setup, there is never any reason to drop the rindex glock, so gfs2 never
invalidates the rindex. This patch makes gfs2 automatically drop the
rindex glock after the filesystem grows, so it can refresh the rindex list.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are two spare fields in the header common to all GFS2
metadata. One is just the right size to fit a journal id
in it, and this patch updates the journal code so that each
time a metadata block is modified, we tag it with the journal
id of the node which is performing the modification.
The reason for this is that it should make it much easier to
debug issues which arise if we can tell which node was the
last to modify a particular metadata block.
Since the field is updated before the block is written into
the journal, each journal should only contain metadata which
is tagged with its own journal id. The one exception to this
is the journal header block, which might have a different node's
id in it, if that journal was recovered by another node in the
cluster.
Thus each journal will contain a record of which nodes recovered
it, via the journal header.
The other field in the metadata header could potentially be
used to hold information about what kind of operation was
performed, but for the time being we just zero it on each
transaction so that if we use it for that in future, we'll
know that the information (where it exists) is reliable.
I did consider using the other field to hold the journal
sequence number, however since in GFS2's journaling we write
the modified data into the journal and not the original
data, this gives no information as to what action caused the
modification, so I think we can probably come up with a better
use for those 64 bits in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This function only had one caller left, and that caller only
called it for leaf blocks, hence one branch of the "if" was
never taken. In addition the call to get_left had already
verified the metadata type, so the function can be reduced
to a single line of code in its caller.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently gfs2 issues barrier unconditionally. There are various reasons
to disable them, be that just for testing or for stupid devices flushing
large battery-backed caches. Add a nobarrier option that matches xfs and
btrfs for this. Also add a symmetric barrier option to turn it back on
at remount time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
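The symmetric barrier/nobarrier pair from the patch above can be illustrated with a minimal option handler. This is a sketch only: the function name and the boolean flag are hypothetical and do not reflect how GFS2 parses or stores its mount arguments.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical handler for one mount option token.
 * Returns true if the token was a barrier option and *barriers was updated,
 * false if the token is something else and the flag was left alone.
 */
static bool apply_barrier_opt_sketch(const char *opt, bool *barriers)
{
	if (strcmp(opt, "barrier") == 0) {
		*barriers = true;
		return true;
	}
	if (strcmp(opt, "nobarrier") == 0) {
		*barriers = false;
		return true;
	}
	return false;
}
```

Having both spellings means a remount can turn barriers back on explicitly instead of relying on the absence of nobarrier.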
It's not necessary to do any 64bit division for the statfs sync code, so
remove it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 now has three new mount options, statfs_quantum, quota_quantum and
statfs_percent. statfs_quantum and quota_quantum simply allow you to
set the tunables of the same name. Setting statfs_quantum to 0
will also turn on the statfs_slow tunable. statfs_percent accepts an
integer between 0 and 100. Numbers between 1 and 100 will cause GFS2 to
do an early sync when the local number of blocks free changes by at
least statfs_percent from the total number of blocks free. Setting
statfs_percent to 0 disables this.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
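The statfs_percent rule described above can be sketched as a threshold check. This is one possible reading of the commit text, not the GFS2 code: the function and parameter names are hypothetical, the comparison is cross-multiplied so that no division is needed, and overflow for very large block counts is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: should an early statfs sync be triggered?
 * local_change: change in the local free-block count (may be negative)
 * total_free:   total number of free blocks
 * percent:      statfs_percent; 0 disables the early sync
 */
static bool need_early_sync_sketch(int64_t local_change, int64_t total_free,
				   unsigned int percent)
{
	if (percent == 0)
		return false;
	if (local_change < 0)
		local_change = -local_change;

	/* local_change / total_free >= percent / 100, without dividing. */
	return local_change * 100 >= total_free * (int64_t)percent;
}
```

For example, with statfs_percent=10 and 100 blocks free in total, a local change of 10 blocks crosses the threshold, while a change of 9 blocks does not.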
This adds support to GFS2 to send quota warnings via netlink.
Also it removes a stray \r which was left over from when the
code used to print warnings on the console.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Sending a message to userspace in a generic format to warn
of events (e.g. quota exceeded) in the quota subsystem is
a generically useful feature. This patch makes some minor
changes to the send_message function from dquot.c renaming
it quota_send_message, moving it to quota.c and exporting it
for use by filesystems which do not use the dquot code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support for viewing the current GFS2 quota settings
via the XFS quota API. The setting of quotas will be addressed
in a later patch. Fields which are not supported here are left
set to zero.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Both of these functions contained confusing and in one case
duplicate code. This patch adds a new check in do_glock()
so that we report -ENOENT if we are asked to sync a quota
entry which doesn't exist. Due to the previous patch this is
now reported correctly to userspace.
Also there are a few new comments, and I hope that the code
is easier to understand now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>