hppfs_read_file() may return (ssize_t) -ENOMEM, or -EFAULT. When stored
in size_t 'count', these errors will not be noticed, a large value will be
added to *ppos.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes block_write_begin() can map buffers in a page but later we
fail to copy data into those buffers (because the source page has been
paged out in the mean time). We then end up with !uptodate mapped
buffers. To add a bit more to the confusion, block_write_end() does
not commit any data (and thus does not any mark buffers as uptodate) if
we didn't succeed with copying all the data.
Commit f4fc66a894 (ext3: convert to new
aops) missed these cases and thus we were inserting non-uptodate
buffers to transaction's list which confuses JBD code and it reports IO
errors, aborts a transaction and generally makes users afraid about
their data ;-P.
This patch fixes the problem by reorganizing ext3_..._write_end() code
to first call block_write_end() to mark buffers with valid data
uptodate and after that we file only uptodate buffers to transaction's
lists.
We also fix a problem where we could leave blocks allocated beyond i_size
(i_disksize in fact) because of failed write. We now add inode to orphan
list when write fails (to be safe in case we crash) and then truncate blocks
beyond i_size in a separate transaction.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly. However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode. This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.
The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.
This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
from ext3_iget(), as ext3 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.
Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use unsigned instead of int for the parameter which carries a blocksize.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On 32-bit system with CONFIG_LBD getblk can fail because provided block
number is too big. Make JBD gracefully handle that.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <dmaciejak@fortinet.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove
the BKL around ext3_ioctl().
Signed-off-by: Cyrus Massoumi <cyrusm@gmx.net>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check bh->b_blocknr only if BH_Mapped is set.
akpm: I doubt if b_blocknr is ever uninitialised here, but it could
conceivably cause a problem if we're doing a lookup for block zero.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirtied_when value on an inode is supposed to represent the first time
that an inode has one of its pages dirtied. This value is in units of
jiffies. It's used in several places in the writeback code to determine
when to write out an inode.
The problem is that these checks assume that dirtied_when is updated
periodically. If an inode is continuously being used for I/O it can be
persistently marked as dirty and will continue to age. Once the time
compared to is greater than or equal to half the maximum of the jiffies
type, the logic of the time_*() macros inverts and the opposite of what is
needed is returned. On 32-bit architectures that's just under 25 days
(assuming HZ == 1000).
As the least-recently dirtied inode, it'll end up being the first one that
pdflush will try to write out. sync_sb_inodes does this check:
/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;
...but now dirtied_when appears to be in the future. sync_sb_inodes bails
out without attempting to write any dirty inodes. When this occurs,
pdflush will stop writing out inodes for this superblock. Nothing can
unwedge it until jiffies moves out of the problematic window.
This patch fixes this problem by changing the checks against dirtied_when
to also check whether it appears to be in the future. If it does, then we
consider the value to be far in the past.
This should shrink the problematic window of time to such a small period
(30s) as not to matter.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
_outside_ of inode_lock. So any I_FREEING testing is incomplete without a
coupled testing of I_CLEAR.
So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().
Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.
Masayoshi MIZUMA has a nice panic flow:
=====================================================================
[process A] | [process B]
| |
| prune_icache() | drop_pagecache()
| spin_lock(&inode_lock) | drop_pagecache_sb()
| inode->i_state |= I_FREEING; | |
| spin_unlock(&inode_lock) | V
| | | spin_lock(&inode_lock)
| V | |
| dispose_list() | |
| list_del() | |
| clear_inode() | |
| inode->i_state = I_CLEAR | |
| | | V
| | | if (inode->i_state & (I_FREEING|I_WILL_FREE))
| | | continue; <==== NOT MATCH
| | |
| | | (DANGER from here on! Accessing disposing inode!)
| | |
| | | __iget()
| | | list_move() <===== PANIC on poisoned list !!
V V |
(time)
=====================================================================
Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a number of issues with the per-MM VMA patch:
(1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
system.
(2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
lest it overflow.
(3) Move the allocation of the vm_area_struct slab back for fork.c.
(4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
(5) Use BUG_ON() rather than if () BUG().
(6) Make the default validate_nommu_regions() a static inline rather than a
#define.
(7) Make free_page_series()'s objection to pages with a refcount != 1 more
informative.
(8) Adjust the __put_nommu_region() banner comment to indicate that the
semaphore must be held for writing.
(9) Limit the number of warnings about munmaps of non-mmapped regions.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove an unneeded return statement and conditional
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We get this on 32 builds:
fs/built-in.o: In function `extent_fiemap':
(.text+0x1019f2): undefined reference to `__ucmpdi2'
Happens because of a switch statement with a 64 bit argument.
Convert this to an if statement to fix this.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_new_inode doesn't call iput to free the inode
when it fails.
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Need to check kthread_should_stop after schedule_timeout() before calling
schedule(). This causes threads to sleep with potentially no one to wake them
up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop.
Signed-off-by: Amit Gud <gud@ksu.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit. This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations). This was previously the behavior only when a snapshot is
created.
This is used by Ceph to ensure that completed writes make it to the
platter along with the metadata operations they are bound to (by
BTRFS_IOC_TRANS_{START,END}).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a 'notreelog' mount option to disable the tree log (used by fsync,
O_SYNC writes). This is much slower, but the tree logging produces
inconsistent views into the FS for ceph.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs options can change at times other than mount, yet /proc/mounts shows the
options string used when the fs was mounted (an example would be when btrfs
determines that barriers aren't useful and turns them off.) This patch
instead outputs the actual options in use by btrfs.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because btrfs is copy-on-write, we end up picking new locations for
blocks very often. This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.
This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation. The end result is that we tend to write sequentially to
the drive.
This happens all the time for metadata and it happens for data
when mounted -o ssd. But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.
This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.
The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents. This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_next_leaf was using blocking locks when it could have been using
faster spinning ones instead. This adds a few extra checks around
the pieces that block and switches over to spinning locks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes the pinned_mutex. The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there. We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch removes the block group alloc mutex used to protect the free space
tree for allocations and replaces it with a spin lock which is used only to
protect the free space rb tree. This means we only take the lock when we are
directly manipulating the tree, which makes us a touch faster with
multi-threaded workloads.
This patch also gets rid of btrfs_find_free_space and replaces it with
btrfs_find_space_for_alloc, which takes the number of bytes you want to
allocate, and empty_size, which is used to indicate how much free space should
be at the end of the allocation.
It will return an offset for the allocator to use. If we don't end up using it
we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to
kill the alloc_mutex, since we need to make sure nobody else comes along and
takes our space.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
I've replaced the strange looping constructs with a list_for_each_entry on
space_info->block_groups. If we have a hint we just jump into the loop with
the block group and start looking for space. If we don't find anything we
start at the beginning and start looking. We never come out of the loop with a
ref on the block_group _unless_ we found space to use, then we drop it after we
set the trans block_group.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch cleans up the free space cache code a bit. It better documents the
idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
I took out the info allocation at the start of __btrfs_add_free_space and put it
where it makes more sense. This was left over cruft from when alloc_mutex
existed. Also all of the re-searches we do to make sure we inserted properly.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Btrfs pages being written get set to writeback, and then may go through
a number of steps before they hit the block layer. This includes compression,
checksumming and async bio submission.
The end result is that someone who writes a page and then does
wait_on_page_writeback is likely to unplug the queue before the bio they
cared about got there.
We could fix this by marking bios sync, or by doing more frequent unplugs,
but this commit just changes the async bio submission code to unplug
after it has processed all the bios for a device. The async bio submission
does a fair job of collection bios, so this shouldn't be a huge problem
for reducing merging at the elevator.
For streaming O_DIRECT writes on a 5 drive array, it boosts performance
from 386MB/s to 460MB/s.
Thanks to Hisashi Hifumi for helping with this work.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses async helper threads to submit write bios so the checksumming
helper threads don't block on the disk.
The submit bio threads may process bios for more than one block device,
so when they find one device congested they try to move on to other
devices instead of blocking in get_request_wait for one device.
This does a pretty good job of keeping multiple devices busy, but the
congested flag has a number of problems. A congested device may still
give you a request, and other procs that aren't backing off the congested
device may starve you out.
This commit uses the io_context stored in current to decide if our process
has been made a batching process by the block layer. If so, it keeps
sending IO down for at least one batch. This helps make sure we do
a good amount of work each time we visit a bdev, and avoids large IO
stalls in multi-device workloads.
It's also very ugly. A better solution is in the works with Jens Axboe.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allow MAP_PRIVATE mmaps of "direct_io" files. This is necessary for
execute support.
MAP_SHARED mappings require some sort of coherency between the
underlying file and the mapping. With "direct_io" it is difficult to
provide this, so for the moment just disallow shared (read-write and
read-only) mappings altogether.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Allow the kernel read and write on "direct_io" files. This is
necessary for nfs export and execute support.
The implementation is simple: if an access from the kernel is
detected, don't perform get_user_pages(), just use the kernel address
provided by the requester to copy from/to the userspace filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We update information in logical volume integrity descriptor after each
allocation (as LVID contains free space, number of directories and files on
disk etc.). If the filesystem is on some phase change media, this leads to its
quick degradation as such media is able to handle only 10000 overwrites or so.
We solve the problem by writing new information into LVID only on umount,
remount-ro and sync. This solves the problem at the price of longer media
inconsistency (previously media became consistent after pdflush flushed dirty
LVID buffer) but that should be acceptable.
Report by and patch written in cooperation with
Rich Coe <Richard.Coe@med.ge.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Anchor block can be located at several places on the medium. Two of the
locations are relative to media end which is problematic to detect. Also
some drives report some block as last but are not able to read it or any
block nearby before it. So let's first try block 256 and if it is all fine,
don't look at other possible locations of anchor blocks to avoid IO errors.
This change required a larger reorganization of code but the new code is
hopefully more readable and definitely shorter.
Signed-off-by: Jan Kara <jack@suse.cz>
Make udf_check_valid() return 1 if the validity check passed and 0 otherwise.
So far it was the other way around which was a bit confusing. Also make
udf_vrs() return loff_t which is really the type it should return (not int).
Signed-off-by: Jan Kara <jack@suse.cz>
This patch makes the UDF FS driver use the hardware sector size as the
default logical block size, which is required by the UDF specifications.
While the previous default of 2048 bytes was correct for optical disks,
it was not for hard disks or USB storage devices, and made it impossible
to use such a device with the default mount options. (The Linux mkudffs
tool uses a default block size of 2048 bytes even on devices with
smaller hardware sectors, so this bug is unlikely to be noticed unless
UDF-formatted USB storage devices are exchanged with other OSs.)
To avoid regressions for people who use loopback optical disk images or
who used the (sometimes wrong) defaults of mkudffs, we also try with
a block size of 2048 bytes if no anchor was found with the hardware
sector size.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Functions udf_CS0toNLS() and udf_NLStoCS0() didn't count with the fact that
NLS can return negative length when invalid character is given to it for
conversion. Thus interesting things could happen (such as overwriting random
memory with the rest of filename). Add appropriate checks.
Signed-off-by: Jan Kara <jack@suse.cz>
This patch makes udf return f_fsid info for statfs(2).
Signed-off-by: Coly Li <coly.li@suse.de>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
On x86 (and several other archs) mode_t is defined as "unsigned short"
and comparing unsigned shorts to negative ints is broken (because short
is promoted to int and then compared). Fix it.
Reported-and-tested-by: Laurent Riffard <laurent.riffard@free.fr>
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
"dmode" allows overriding permissions of directories and
"mode" allows overriding permissions of files.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Allocate strings with kmalloc.
Checkstack output:
Before: udf_get_filename: 600
After: udf_get_filename: 136
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Allocate strings with kmalloc.
Checkstack output:
Before: udf_process_sequence: 712
After: udf_process_sequence: 200
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (58 commits)
SUNRPC: Ensure IPV6_V6ONLY is set on the socket before binding to a port
NSM: Fix unaligned accesses in nsm_init_private()
NFS: Simplify logic to compare socket addresses in client.c
NFS: Start PF_INET6 callback listener only if IPv6 support is available
lockd: Start PF_INET6 listener only if IPv6 support is available
SUNRPC: Remove CONFIG_SUNRPC_REGISTER_V4
SUNRPC: rpcb_register() should handle errors silently
SUNRPC: Simplify kernel RPC service registration
SUNRPC: Simplify svc_unregister()
SUNRPC: Allow callers to pass rpcb_v4_register a NULL address
SUNRPC: rpcbind actually interprets r_owner string
SUNRPC: Clean up address type casts in rpcb_v4_register()
SUNRPC: Don't return EPROTONOSUPPORT in svc_register()'s helpers
SUNRPC: Use IPv4 loopback for registering AF_INET6 kernel RPC services
SUNRPC: Set IPV6ONLY flag on PF_INET6 RPC listener sockets
NFS: Revert creation of IPv6 listeners for lockd and NFSv4 callbacks
SUNRPC: Remove @family argument from svc_create() and svc_create_pooled()
SUNRPC: Change svc_create_xprt() to take a @family argument
SUNRPC: svc_setup_socket() gets protocol family from socket
SUNRPC: Pass a family argument to svc_register()
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
ext4: Regularize mount options
ext4: fix locking typo in mballoc which could cause soft lockup hangs
ext4: fix typo which causes a memory leak on error path
jbd2: Update locking coments
ext4: Rename pa_linear to pa_type
ext4: add checks of block references for non-extent inodes
ext4: Check for an valid i_mode when reading the inode from disk
ext4: Use WRITE_SYNC for commits which are caused by fsync()
ext4: Add auto_da_alloc mount option
ext4: Use struct flex_groups to calculate get_orlov_stats()
ext4: Use atomic_t's in struct flex_groups
ext4: remove /proc tuning knobs
ext4: Add sysfs support
ext4: Track lifetime disk writes
ext4: Fix discard of inode prealloc space with delayed allocation.
ext4: Automatically allocate delay allocated blocks on rename
ext4: Automatically allocate delay allocated blocks on close
ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
ext4: Simplify delalloc code by removing mpage_da_writepages()
ext4: Save stack space by removing fake buffer heads
...
This fixes unaligned accesses in nsm_init_private() when
creating nlm_reboot keys.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: try to free metadata pages when we free btree blocks
Btrfs: add extra flushing for renames and truncates
Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
Btrfs: optimize fsyncs on old files
Btrfs: tree logging unlink/rename fixes
Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
Btrfs: limit balancing work while flushing delayed refs
Btrfs: readahead checksums during btrfs_finish_ordered_io
Btrfs: leave btree locks spinning more often
Btrfs: Only let very young transactions grow during commit
Btrfs: Check for a blocking lock before taking the spin
Btrfs: reduce stack in cow_file_range
Btrfs: reduce stalls during transaction commit
Btrfs: process the delayed reference queue in clusters
Btrfs: try to cleanup delayed refs while freeing extents
Btrfs: reduce stack usage in some crucial tree balancing functions
Btrfs: do extent allocation and reference count updates in the background
Btrfs: don't preallocate metadata blocks during btrfs_search_slot
A deadlock can occur when user space uses a signal (autofs version 4 uses
SIGCHLD for this) to effect expire completion.
The order of events is:
Expire process completes, but before being able to send SIGCHLD to it's parent
...
Another process walks onto a different mount point and drops the directory
inode semaphore prior to sending the request to the daemon as it must ...
A third process does an lstat on on the expired mount point causing it to wait
on expire completion (unfortunately) holding the directory semaphore.
The mount request then arrives at the daemon which does an lstat and,
deadlock.
For some time I was concerned about releasing the directory semaphore around
the expire wait in autofs4_lookup as well as for the mount call back. I
finally realized that the last round of changes in this function made the
expiring dentry and the lookup dentry separate and distinct so the check and
possible wait can be done anywhere prior to the mount call back. This patch
moves the check to just before the mount call back and inside the directory
inode mutex release.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A significant portion of the autofs_dev_ioctl_expire() and
autofs4_expire_multi() functions is duplicated code. This patch cleans that
up.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12843
"I use ramfs instead of tmpfs for /tmp because I don't use swap on my
laptop. Some apps need 1777 mode for /tmp directory, but ramfs does not
support 'mode=' mount option."
Reported-by: Avan Anishchuk <matimatik@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce keyed event wakeups inside the eventfd code.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the events hint now sent by some devices, to avoid unnecessary wakeups
for events that are of no interest for the caller. This code handles both
devices that are sending keyed events, and the ones that are not (and
event the ones that sometimes send events, and sometimes don't).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
People started using eventfd in a semaphore-like way where before they
were using pipes.
That is, counter-based resource access. Where a "wait()" returns
immediately by decrementing the counter by one, if counter is greater than
zero. Otherwise will wait. And where a "post(count)" will add count to
the counter releasing the appropriate amount of waiters. If eventfd the
"post" (write) part is fine, while the "wait" (read) does not dequeue 1,
but the whole counter value.
The problem with eventfd is that a read() on the fd returns and wipes the
whole counter, making the use of it as semaphore a little bit more
cumbersome. You can do a read() followed by a write() of COUNTER-1, but
IMO it's pretty easy and cheap to make this work w/out extra steps. This
patch introduces a new eventfd flag that tells eventfd to only dequeue 1
from the counter, allowing simple read/write to make it behave like a
semaphore. Simple test here:
http://www.xmailserver.org/eventfd-sem.c
To be back-compatible with earlier kernels, userspace applications should
probe for the availability of this feature via
#ifdef EFD_SEMAPHORE
fd = eventfd2 (CNT, EFD_SEMAPHORE);
if (fd == -1 && errno == EINVAL)
<fallback>
#else
<fallback>
#endif
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <linux-api@vger.kernel.org>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eventpoll.c uses void * in one place for no obvious reason; change it to
use the real type instead.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_modify() doesn't need to set event.data from within the ep->lock
spinlock as the comment suggests. The only place event.data is used is
ep_send_events_proc(), and this is protected by ep->mtx instead of
ep->lock. Also update the comment for mutex_lock() at the top of
ep_scan_ready_list(), which mentions epoll_ctl(EPOLL_CTL_DEL) but not
epoll_ctl(EPOLL_CTL_MOD).
ep_modify() can also use spin_lock_irq() instead of spin_lock_irqsave().
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xchg in ep_unregister_pollwait() is unnecessary because it is protected by
either epmutex or ep->mtx (the same protection as ep_remove()).
If xchg was necessary, it would be insufficient to protect against
problems: if multiple concurrent calls to ep_unregister_pollwait() were
possible then a second caller that returns without doing anything because
nwait == 0 could return before the waitqueues are removed by the first
caller, which looks like it could lead to problematic races with
ep_poll_callback().
So remove xchg and add comments about the locking.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If epoll_wait returns -EFAULT, the event that was being returned when the
fault was encountered will be forgotten. This is not a big deal since
EFAULT will happen only if a buggy userspace program passes in a bad
address, in which case what happens later usually doesn't matter.
However, it is easy to remember the event for later, and this patch makes
a simple change to do that.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ep_call_nested() (formerly ep_poll_safewake()) uses "current" (without
dereferencing it) to detect callback recursion, but it may be called from
irq context where the use of current is generally discouraged. It would
be better to use get_cpu() and put_cpu() to detect the callback recursion.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove debugging code from epoll. There's no need for it to be included
into mainline code.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug inside the epoll's f_op->poll() code, that returns POLLIN even
though there are no actual ready monitored fds. The bug shows up if you
add an epoll fd inside another fd container (poll, select, epoll).
The problem is that callback-based wake ups used by epoll does not carry
(patches will follow, to fix this) any information about the events that
actually happened. So the callback code, since it can't call the file*
->poll() inside the callback, chains the file* into a ready-list.
So, suppose you added an fd with EPOLLOUT only, and some data shows up on
the fd, the file* mapped by the fd will be added into the ready-list (via
wakeup callback). During normal epoll_wait() use, this condition is
sorted out at the time we're actually able to call the file*'s
f_op->poll().
Inside the old epoll's f_op->poll() though, only a quick check
!list_empty(ready-list) was performed, and this could have led to
reporting POLLIN even though no ready fds would show up at a following
epoll_wait(). In order to correctly report the ready status for an epoll
fd, the ready-list must be checked to see if any really available fd+event
would be ready in a following epoll_wait().
Operation (calling f_op->poll() from inside f_op->poll()) that, like wake
ups, must be handled with care because of the fact that epoll fds can be
added to other epoll fds.
Test code:
/*
* epoll_test by Davide Libenzi (Simple code to test epoll internals)
* Copyright (C) 2008 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <davidel@xmailserver.org>
*
*/
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <limits.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#define EPWAIT_TIMEO (1 * 1000)
#ifndef POLLRDHUP
#define POLLRDHUP 0x2000
#endif
#define EPOLL_MAX_CHAIN 100L
#define EPOLL_TF_LOOP (1 << 0)
struct epoll_test_cfg {
long size;
long flags;
};
static int xepoll_create(int n) {
int epfd;
if ((epfd = epoll_create(n)) == -1) {
perror("epoll_create");
exit(2);
}
return epfd;
}
static void xepoll_ctl(int epfd, int cmd, int fd, struct epoll_event *evt) {
if (epoll_ctl(epfd, cmd, fd, evt) < 0) {
perror("epoll_ctl");
exit(3);
}
}
static void xpipe(int *fds) {
if (pipe(fds)) {
perror("pipe");
exit(4);
}
}
static pid_t xfork(void) {
pid_t pid;
if ((pid = fork()) == (pid_t) -1) {
perror("pipe");
exit(5);
}
return pid;
}
static int run_forked_proc(int (*proc)(void *), void *data) {
int status;
pid_t pid;
if ((pid = xfork()) == 0)
exit((*proc)(data));
if (waitpid(pid, &status, 0) != pid) {
perror("waitpid");
return -1;
}
return WIFEXITED(status) ? WEXITSTATUS(status): -2;
}
static int check_events(int fd, int timeo) {
struct pollfd pfd;
fprintf(stdout, "Checking events for fd %d\n", fd);
memset(&pfd, 0, sizeof(pfd));
pfd.fd = fd;
pfd.events = POLLIN | POLLOUT;
if (poll(&pfd, 1, timeo) < 0) {
perror("poll()");
return 0;
}
if (pfd.revents & POLLIN)
fprintf(stdout, "\tPOLLIN\n");
if (pfd.revents & POLLOUT)
fprintf(stdout, "\tPOLLOUT\n");
if (pfd.revents & POLLERR)
fprintf(stdout, "\tPOLLERR\n");
if (pfd.revents & POLLHUP)
fprintf(stdout, "\tPOLLHUP\n");
if (pfd.revents & POLLRDHUP)
fprintf(stdout, "\tPOLLRDHUP\n");
return pfd.revents;
}
static int epoll_test_tty(void *data) {
int epfd, ifd = fileno(stdin), res;
struct epoll_event evt;
if (check_events(ifd, 0) != POLLOUT) {
fprintf(stderr, "Something is cooking on STDIN (%d)\n", ifd);
return 1;
}
epfd = xepoll_create(1);
fprintf(stdout, "Created epoll fd (%d)\n", epfd);
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
xepoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &evt);
if (check_events(epfd, 0) & POLLIN) {
res = epoll_wait(epfd, &evt, 1, 0);
if (res == 0) {
fprintf(stderr, "Epoll fd (%d) is ready when it shouldn't!\n",
epfd);
return 2;
}
}
return 0;
}
static int epoll_wakeup_chain(void *data) {
struct epoll_test_cfg *tcfg = data;
int i, res, epfd, bfd, nfd, pfds[2];
pid_t pid;
struct epoll_event evt;
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
epfd = bfd = xepoll_create(1);
for (i = 0; i < tcfg->size; i++) {
nfd = xepoll_create(1);
xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
bfd = nfd;
}
xpipe(pfds);
if (tcfg->flags & EPOLL_TF_LOOP)
{
xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
/*
* If we're testing for loop, we want that the wakeup
* triggered by the write to the pipe done in the child
* process, triggers a fake event. So we add the pipe
* read size with EPOLLOUT events. This will trigger
* an addition to the ready-list, but no real events
* will be there. The the epoll kernel code will proceed
* in calling f_op->poll() of the epfd, triggering the
* loop we want to test.
*/
evt.events = EPOLLOUT;
}
xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
/*
* The pipe write must come after the poll(2) call inside
* check_events(). This tests the nested wakeup code in
* fs/eventpoll.c:ep_poll_safewake()
* By having the check_events() (hence poll(2)) happens first,
* we have poll wait queue filled up, and the write(2) in the
* child will trigger the wakeup chain.
*/
if ((pid = xfork()) == 0) {
sleep(1);
write(pfds[1], "w", 1);
exit(0);
}
res = check_events(epfd, 2000) & POLLIN;
if (waitpid(pid, NULL, 0) != pid) {
perror("waitpid");
return -1;
}
return res;
}
static int epoll_poll_chain(void *data) {
struct epoll_test_cfg *tcfg = data;
int i, res, epfd, bfd, nfd, pfds[2];
pid_t pid;
struct epoll_event evt;
memset(&evt, 0, sizeof(evt));
evt.events = EPOLLIN;
epfd = bfd = xepoll_create(1);
for (i = 0; i < tcfg->size; i++) {
nfd = xepoll_create(1);
xepoll_ctl(bfd, EPOLL_CTL_ADD, nfd, &evt);
bfd = nfd;
}
xpipe(pfds);
if (tcfg->flags & EPOLL_TF_LOOP)
{
xepoll_ctl(bfd, EPOLL_CTL_ADD, epfd, &evt);
/*
* If we're testing for loop, we want that the wakeup
* triggered by the write to the pipe done in the child
* process, triggers a fake event. So we add the pipe
* read size with EPOLLOUT events. This will trigger
* an addition to the ready-list, but no real events
* will be there. The the epoll kernel code will proceed
* in calling f_op->poll() of the epfd, triggering the
* loop we want to test.
*/
evt.events = EPOLLOUT;
}
xepoll_ctl(bfd, EPOLL_CTL_ADD, pfds[0], &evt);
/*
* The pipe write mush come before the poll(2) call inside
* check_events(). This tests the nested f_op->poll calls code in
* fs/eventpoll.c:ep_eventpoll_poll()
* By having the pipe write(2) happen first, we make the kernel
* epoll code to load the ready lists, and the following poll(2)
* done inside check_events() will test nested poll code in
* ep_eventpoll_poll().
*/
if ((pid = xfork()) == 0) {
write(pfds[1], "w", 1);
exit(0);
}
sleep(1);
res = check_events(epfd, 1000) & POLLIN;
if (waitpid(pid, NULL, 0) != pid) {
perror("waitpid");
return -1;
}
return res;
}
int main(int ac, char **av) {
int error;
struct epoll_test_cfg tcfg;
fprintf(stdout, "\n********** Testing TTY events\n");
error = run_forked_proc(epoll_test_tty, NULL);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing short wakeup chain\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == POLLIN ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = EPOLL_MAX_CHAIN;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing long wakeup chain (HOLD ON)\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing short poll chain\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == POLLIN ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = EPOLL_MAX_CHAIN;
tcfg.flags = 0;
fprintf(stdout, "\n********** Testing long poll chain (HOLD ON)\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = EPOLL_TF_LOOP;
fprintf(stdout, "\n********** Testing loopy wakeup chain (HOLD ON)\n");
error = run_forked_proc(epoll_wakeup_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
tcfg.size = 3;
tcfg.flags = EPOLL_TF_LOOP;
fprintf(stdout, "\n********** Testing loopy poll chain (HOLD ON)\n");
error = run_forked_proc(epoll_poll_chain, &tcfg);
fprintf(stdout, error == 0 ?
"********** OK\n": "********** FAIL (%d)\n", error);
return 0;
}
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The base versions handle constant folding now and are shorter than these
private wrappers, use them directly.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the filesystem freeze operation has been elevated to the VFS, and
is just an ioctl away, some sort of safety net for unintentionally frozen
root filesystems may be in order.
The timeout thaw originally proposed did not get merged, but perhaps
something like this would be useful in emergencies.
For example, freeze /path/to/mountpoint may freeze your root filesystem if
you forgot that you had that unmounted.
I chose 'j' as the last remaining character other than 'h' which is sort
of reserved for help (because help is generated on any unknown character).
I've tested this on a non-root fs with multiple (nested) freezers, as well
as on a system rendered unresponsive due to a frozen root fs.
[randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_to_free_pages() is used for the direct reclaim of up to
SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
be used but this is not passed to try_to_free_pages(). This can lead to
unnecessary reclaim of pages that are unusable by the caller and int the
worst case lead to allocation failure as progress was not been make where
it is needed.
This patch passes the nodemask used for alloc_pages_nodemask() to
try_to_free_pages().
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of open-coding the lru-list-add pagevec batching when expanding a
file mapping from zero, defer to the appropriate page cache function that
also takes care of adding the page to the lru list.
This is cleaner, saves code and reduces the stack footprint by 16 words
worth of pagevec.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.com>
Cc: MinChan Kim <minchan.kim@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix warnings and return values in sysfs bin_page_mkwrite(), fixing
fs/sysfs/bin.c: In function `bin_page_mkwrite':
fs/sysfs/bin.c:250: warning: passing argument 2 of `bb->vm_ops->page_mkwrite' from incompatible pointer type
fs/sysfs/bin.c: At top level:
fs/sysfs/bin.c:280: warning: initialization from incompatible pointer type
Expects to have my [PATCH next] sysfs: fix some bin_vm_ops errors
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Eric W. Biederman" <ebiederm@aristanetworks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_mkwrite is called with neither the page lock nor the ptl held. This
means a page can be concurrently truncated or invalidated out from
underneath it. Callers are supposed to prevent truncate races themselves,
however previously the only thing they can do in case they hit one is to
raise a SIGBUS. A sigbus is wrong for the case that the page has been
invalidated or truncated within i_size (eg. hole punched). Callers may
also have to perform memory allocations in this path, where again, SIGBUS
would be wrong.
The previous patch ("mm: page_mkwrite change prototype to match fault")
made it possible to properly specify errors. Convert the generic buffer.c
code and btrfs to return sane error values (in the case of page removed
from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
without doing anything, and the fault will be retried properly).
This fixes core code, and converts btrfs as a template/example. All other
filesystems defining their own page_mkwrite should be fixed in a similar
manner.
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow non root users with sufficient mlock rlimits to be able to allocate
hugetlb backed shm for now. Deprecate this though. This is being
deprecated because the mlock based rlimit checks for SHM_HUGETLB is not
consistent with mmap based huge page allocations.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix hugetlb subsystem so that non root users belonging to
hugetlb_shm_group can actually allocate hugetlb backed shm.
Currently non root users cannot even map one large page using SHM_HUGETLB
when they belong to the gid in /proc/sys/vm/hugetlb_shm_group. This is
because allocation size is verified against RLIMIT_MEMLOCK resource limit
even if the user belongs to hugetlb_shm_group.
This patch
1. Fixes hugetlb subsystem so that users with CAP_IPC_LOCK and users
belonging to hugetlb_shm_group don't need to be restricted with
RLIMIT_MEMLOCK resource limits
2. This patch also disables mlock based rlimit checking (which will
be reinstated and marked deprecated in a subsequent patch).
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a helper function account_page_dirtied(). Use that from two
callsites. reiser4 adds a function which adds a third callsite.
Signed-off-by: Edward Shishkin<edward.shishkin@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct tty_operations::proc_fops took it's place and there is one less
create_proc_read_entry() user now!
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Used for gradual switch of TTY drivers from using ->read_proc which helps
with gradual switch from ->read_proc for the whole tree.
As side effect, fix possible race condition when ->data initialized after
PDE is hooked into proc tree.
->proc_fops takes precedence over ->read_proc.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 29a814d2ee (vfs: add hooks for
ext4's delayed allocation support) exported the following functions
mpage_bio_submit()
__mpage_writepage()
for the benefit of ext4's delayed allocation support. Since commit
a1d6cc563b (ext4: Rework the
ext4_da_writepages() function), these functions are not used by the
ext4 driver anymore. However, the now unnecessary exports still
remain, and this patch removes those. Moreover, these two functions
can become static again.
The issue was spotted by namespacecheck.
Signed-off-by: Dmitri Vorobiev <dmitri.vorobiev@movial.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... since we don't tell anyone which descriptor does the file get.
We used to, but only in case of ELF binary with a.out loader and
that stuff has been gone for a while.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't pull it in sched.h; very few files actually need it and those
can include directly. sched.h itself only needs forward declaration
of struct fs_struct;
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* all changes of current->fs are done under task_lock and write_lock of
old fs->lock
* refcount is not atomic anymore (same protection)
* its decrements are done when removing reference from current; at the
same time we decide whether to free it.
* put_fs_struct() is gone
* new field - ->in_exec. Set by check_unsafe_exec() if we are trying to do
execve() and only subthreads share fs_struct. Cleared when finishing exec
(success and failure alike). Makes CLONE_FS fail with -EAGAIN if set.
* check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
is in progress.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pure code move; two new helper functions for nfsd and daemonize
(unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
the same code as used to be in callers). unshare_fs_struct()
exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
copy_fs_struct() and exit_fs() don't need exports anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Not because execve races with _that_ are serious - we really
need a situation when final drop of fs_struct refcount is
done by something that used to have it as current->fs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
COW means we cycle though blocks fairly quickly, and once we
free an extent on disk, it doesn't make much sense to keep the pages around.
This commit tries to immediately free the page when we free the extent,
which lowers our memory footprint significantly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.
This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.
If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.
Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).
For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).
For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Added some documentation in exofs.txt, as well as a BUGS file.
For further reading, operation instructions, example scripts
and up to date infomation and code please see:
http://open-osd.org
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch ties all operation vectors into a file system superblock
and registers the exofs file_system_type at module's load time.
* The file system control block (AKA on-disk superblock) resides in
an object with a special ID (defined in common.h).
Information included in the file system control block is used to
fill the in-memory superblock structure at mount time. This object
is created before the file system is used by mkexofs.c It contains
information such as:
- The file system's magic number
- The next inode number to be allocated
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
implementation of directory and inode operations.
* A directory is treated as a file, and essentially contains a list
of <file name, inode #> pairs for files that are found in that
directory. The object IDs correspond to the files' inode numbers
and are allocated using a 64bit incrementing global counter.
* Each file's control block (AKA on-disk inode) is stored in its
object's attributes. This applies to both regular files and other
types (directories, device files, symlinks, etc.).
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
OK Now we start to read and write from osd-objects. We try to
collect at most contiguous pages as possible in a single write/read.
The first page index is the object's offset.
TODO:
In 64-bit a single bio can carry at most 128 pages.
Add support of chaining multiple bios
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
implementation of the file_operations and inode_operations for
regular data files.
Most file_operations are generic vfs implementations except:
- exofs_truncate will truncate the OSD object as well
- Generic file_fsync is not good for none_bd devices so open code it
- The default for .flush in Linux is todo nothing so call exofs_fsync
on the file.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch includes osd infrastructure that will be used later by
the file system.
Also the declarations of constants, on disk structures,
and prototypes.
And the Kbuild+Kconfig files needed to build the exofs module.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
UBIFS did not recovery in a situation in which it could
have. The relevant function assumed there could not be
more nodes in an eraseblock after a corrupted node, but
in fact the last (NAND) page written might contain anything.
The correct approach is to check for empty space (0xFF bytes)
from then on.
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h
Remove include/raid/md.h as its only remaining use was to #include
other files.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc:
Revert "proc: revert /proc/uptime to ->read_proc hook"
proc 2/2: remove struct proc_dir_entry::owner
proc 1/2: do PDE usecounting even for ->read_proc, ->write_proc
proc: fix sparse warnings in pagemap_read()
proc: move fs/proc/inode-alloc.txt comment into a source file
Setting ->owner as done currently (pde->owner = THIS_MODULE) is racy
as correctly noted at bug #12454. Someone can lookup entry with NULL
->owner, thus not pinning enything, and release it later resulting
in module refcount underflow.
We can keep ->owner and supply it at registration time like ->proc_fops
and ->data.
But this leaves ->owner as easy-manipulative field (just one C assignment)
and somebody will forget to unpin previous/pin current module when
switching ->owner. ->proc_fops is declared as "const" which should give
some thoughts.
->read_proc/->write_proc were just fixed to not require ->owner for
protection.
rmmod'ed directories will be empty and return "." and ".." -- no harm.
And directories with tricky enough readdir and lookup shouldn't be modular.
We definitely don't want such modular code.
Removing ->owner will also make PDE smaller.
So, let's nuke it.
Kudos to Jeff Layton for reminding about this, let's say, oversight.
http://bugzilla.kernel.org/show_bug.cgi?id=12454
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
struct proc_dir_entry::owner is going to be removed. Now it's only necessary
to protect PDEs which are using ->read_proc, ->write_proc hooks.
However, ->owner assignments are racy and make it very easy for someone to switch
->owner on live PDE (as some subsystems do) without fixing refcounts and so on.
http://bugzilla.kernel.org/show_bug.cgi?id=12454
So, ->owner is on death row.
Proxy file operations exist already (proc_file_operations), just bump usecount
when necessary.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
fs/proc/task_mmu.c:696:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:696:9: warning: incorrect type in assignment (different address spaces)
fs/proc/task_mmu.c:696:9: expected unsigned long long [noderef] [usertype] <asn:1>*out
fs/proc/task_mmu.c:696:9: got unsigned long long [usertype] *<noident>
fs/proc/task_mmu.c:697:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:697:9: warning: incorrect type in assignment (different address spaces)
fs/proc/task_mmu.c:697:9: expected unsigned long long [noderef] [usertype] <asn:1>*end
fs/proc/task_mmu.c:697:9: got unsigned long long [usertype] *<noident>
fs/proc/task_mmu.c:723:12: warning: cast removes address space of expression
fs/proc/task_mmu.c:723:26: error: subtraction of different types can't work (different address spaces)
fs/proc/task_mmu.c:725:24: error: subtraction of different types can't work (different address spaces)
Signed-off-by: Milind Arun Choudhary <milindchoudhary@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
so that people will realize that it exists and can update it as needed.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
This patch renames n_, c_, etc variables to something more sane. This
is the sixth in a series of patches to rip out some of the awful
variable naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a simple s/p_._//g to the reiserfs code. This is the
fifth in a series of patches to rip out some of the awful variable
naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a simple s/p_s_tb/tb/g to the reiserfs code. This is the
fourth in a series of patches to rip out some of the awful variable
naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a simple s/p_s_inode/inode/g to the reiserfs code. This
is the third in a series of patches to rip out some of the awful
variable naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a simple s/p_s_bh/bh/g to the reiserfs code. This is the
second in a series of patches to rip out some of the awful variable
naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a simple s/p_s_sb/sb/g to the reiserfs code. This is the
first in a series of patches to rip out some of the awful variable
naming in reiserfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch strips trailing whitespace from the reiserfs code.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch cleans up some redundancies in the reiserfs tree path code.
decrement_bcount() is essentially the same function as brelse(), so we use
that instead.
decrement_counters_in_path() is exactly the same function as pathrelse(), so
we kill that and use pathrelse() instead.
There's also a bit of cleanup that makes the code a bit more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first in a series of patches to make balance_leaf() not
quite so insane.
This patch factors out the open coded initializations of buffer_info
structures and defines a few initializers for the 4 cases they're used.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some time ago, some changes were made to make security inode attributes
be atomically written during inode creation. ReiserFS fell behind in
this area, but with the reworking of the xattr code, it's now fairly
easy to add.
The following patch adds the ability for security attributes to be added
automatically during inode creation.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current reiserfs xattr implementation open codes reiserfs_readdir
and frees the path before calling the filldir function. Typically, the
filldir function is something that modifies the file system, such as a
chown or an inode deletion that also require reading of an inode
associated with each direntry. Since the file system is modified, the
path retained becomes invalid for the next run. In addition, it runs
backwards in attempt to minimize activity.
This is clearly suboptimal from a code cleanliness perspective as well
as performance-wise.
This patch implements a generic reiserfs_for_each_xattr that uses the
generic readdir and a specific filldir routine that simply populates an
array of dentries and then performs a specific operation on them. When
all files have been operated on, it then calls the operation on the
directory itself.
The result is a noticable code reduction and better performance.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deadlocks are possible in the xattr code between the journal lock and the
xattr sems.
This patch implements journalling for xattr operations. The benefit is
twofold:
* It gets rid of the deadlock possibility by always ensuring that xattr
write operations are initiated inside a transaction.
* It corrects the problem where xattr backing files aren't considered any
differently than normal files, despite the fact they are metadata.
I discussed the added journal load with Chris Mason, and we decided that
since xattrs (versus other journal activity) is fairly rare, the introduction
of larger transactions to support journaled xattrs wouldn't be too big a deal.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christoph Hellwig had asked me quite some time ago to port the reiserfs
xattrs to the generic xattr interface.
This patch replaces the reiserfs-specific xattr handling code with the
generic struct xattr_handler.
However, since reiserfs doesn't split the prefix and name when accessing
xattrs, it can't leverage generic_{set,get,list,remove}xattr without
needlessly reconstructing the name on the back end.
Update 7/26/07: Added missing dput() to deletion path.
Update 8/30/07: Added missing mark_inode_dirty when i_mode is used to
represent an ACL and no previous ACL existed.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the changes to xattr root locking, the i_has_xattr_dir flag
is no longer needed. This patch removes it.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per-inode locking can be made more fine-grained to surround just the
interaction with the filesystem itself. This really only applies to
protecting reads during a write, since concurrent writes are barred with
inode->i_mutex at the vfs level.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the switch to using inode->i_mutex locking during lookups/creation
in the xattr root, the per-super xattr lock is no longer needed.
This patch removes it.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The xattr file open/lookup code is needlessly complex. We can use
vfs-level operations to perform the same work, and also simplify the
locking constraints. The locking advantages will be exploited in future
patches.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current reiserfs xattr implementation will not clean up old xattr
files if files are deleted when REISERFS_FS_XATTR is unset. This
results in inaccessible lost files, wasting space.
This patch compiles in basic xattr knowledge, such as how to delete them
and change ownership for quota tracking. If the file system has never
used xattrs, then the operation is quite fast: it returns immediately
when it sees there is no .reiserfs_priv directory.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a number of helper functions for marking a reiserfs inode
private that were leftover from reiserfs did its own thing wrt to
private inodes. S_PRIVATE has been in the kernel for some time, so this
patch removes the helpers and uses IS_PRIVATE instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Early in the reiserfs xattr development, there was a plan to use
hardlinks to save disk space for identical xattrs. That code never
materialized and isn't going to, so this patch removes the detection
code.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes reiserfs_get_page to take an offset rather than an
index since no callers calculate the index differently.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the xinode and mapping variables from
reiserfs_xattr_{get,set}.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes many paths that are currently using warnings to handle
the error.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although reiserfs can currently handle severe errors such as journal failure,
it cannot handle less severe errors like metadata i/o failure. The following
patch adds a reiserfs_error() function akin to the one in ext3.
Subsequent patches will use this new error handler to handle errors more
gracefully in general.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch kills off reiserfs_journal_abort as it is never called, and
combines __reiserfs_journal_abort_{soft,hard} into one function called
reiserfs_abort_journal, which performs the same work. It is silent
as opposed to the old version, since the message was always issued
after a regular 'abort' message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ReiserFS panics can be somewhat inconsistent.
In some cases:
* a unique identifier may be associated with it
* the function name may be included
* the device may be printed separately
This patch aims to make warnings more consistent. reiserfs_warning() prints
the device name, so printing it a second time is not required. The function
name for a warning is always helpful in debugging, so it is now automatically
inserted into the output. Hans has stated that every warning should have
a unique identifier. Some cases lack them, others really shouldn't have them.
reiserfs_warning() now expects an id associated with each message. In the
rare case where one isn't needed, "" will suffice.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The formatting of the error buffer is race prone. It uses static buffers
for both formatting and output. While overwriting the error buffer
can product garbled output, overwriting the format buffer with incompatible
% directives can cause crashes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vsprintf will consume varargs on its own. Skipping them manually
results in garbage in the error buffer, or Oopses in the case of
pointers.
This patch removes the advancement and fixes a number of bugs where
crashes were observed as side effects of a regular error report.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ReiserFS warnings can be somewhat inconsistent.
In some cases:
* a unique identifier may be associated with it
* the function name may be included
* the device may be printed separately
This patch aims to make warnings more consistent. reiserfs_warning() prints
the device name, so printing it a second time is not required. The function
name for a warning is always helpful in debugging, so it is now automatically
inserted into the output. Hans has stated that every warning should have
a unique identifier. Some cases lack them, others really shouldn't have them.
reiserfs_warning() now expects an id associated with each message. In the
rare case where one isn't needed, "" will suffice.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several places, reiserfs_warning is used when there is no warning, just
a notice. This patch changes some of them to indicate that the message
is merely informational.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The output format between a warning/error/panic/info/etc changes with
which one is used.
The following patch makes the messages more internally consistent, but also
more consistent with other Linux filesystems.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes leaf_paste_entries more consistent with respect to the
other leaf operations. Using buffer_info instead of buffer_head
directly allows us to get a superblock pointer for use in error
handling.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes up the reiserfs code such that transaction ids are
always unsigned ints. In places they can currently be signed ints or
unsigned longs.
The former just causes an annoying clm-2200 warning and may join a
transaction when it should wait.
The latter is just for correctness since the disk format uses a 32-bit
transaction id. There aren't any runtime problems that result from it
not wrapping at the correct location since the value is truncated
correctly even on big endian systems. The 0 value might make it to
disk, but the mount-time checks will bump it to 10 itself.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following patch adds the fields for tracking mount counts and last
fsck timestamps to the superblock. It also increments the mount count
on every read-write mount.
Reiserfsprogs 3.6.21 added support for these fields.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This bug was found with smatch (http://repo.or.cz/w/smatch.git/). If
we return directly the inode->i_mutex lock doesn't get released.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Lockdep gripes if file->f_lock is taken in a no-IRQ situation, since that
is not always the case. We don't really want to disable IRQs for every
acquisition of f_lock; instead, just move it outside of fasync_lock.
Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The uuid table handling should not be part of a semi-generic uuid library
but in the XFS code using it, so move those bits to xfs_mount.c and
refactor the whole glob to make it a proper abstraction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Remove the allocation of struct nfsd4_compound_state.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
With the upcoming v3 inodes the default attroffset needs to be calculated
for each specific inode, so we can't cache it in the superblock anymore.
Also replace the assert for wrong inode sizes with a proper error check
also included in non-debug builds. Note that the ENOSYS return for
that might seem odd, but that error is returned by xfs_mount_validate_sb
for all theoretically valid but not supported filesystem geometries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Hi.
I introduced "is_partially_uptodate" aops for XFS.
A page can have multiple buffers and even if a page is not uptodate,
some buffers can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read
after random write workloads can be optimized and we can get performance
improvement.
I did a performance test using the sysbench.
#sysbench --num-threads=4 --max-requests=100000 --test=fileio --file-num=1 \
--file-block-size=8K --file-total-size=1G --file-test-mode=rndrw \
--file-fsync-freq=0 --file-rw-ratio=0.5 run
-2.6.29-rc6
Test execution summary:
total time: 123.8645s
total number of events: 100000
total time taken by event execution: 442.4994
per-request statistics:
min: 0.0000s
avg: 0.0044s
max: 0.3387s
approx. 95 percentile: 0.0118s
-2.6.29-rc6-patched
Test execution summary:
total time: 108.0757s
total number of events: 100000
total time taken by event execution: 417.7505
per-request statistics:
min: 0.0000s
avg: 0.0042s
max: 0.3217s
approx. 95 percentile: 0.0118s
arch: ia64
pagesize: 16k
blocksize: 4k
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
With the upcoming v3 inodes the inode data/attr area size needs to be
calculated for each specific inode, so we can't cache it in the superblock
anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
The ino64 mount option adds a fixed offset to 32bit inode numbers
to bring them into the 64bit range. There's no need for this kind
of debug tool given that it's easy to produce real 64bit inode numbers
for testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
People continue to complain about this for weird reasons, but there's
really no point in keeping this typedef for a couple of users anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
check_unsafe_exec() also notes whether the fs_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
use of get_fs_struct(), which also raises that sharing count.
This might occasionally cause a setuid program not to change euid,
in the same way as happened with files->count (check_unsafe_exec
also looks at sighand->count, but /proc doesn't raise that one).
We'd prefer exec not to unshare fs_struct: so fix this in procfs,
replacing get_fs_struct() by get_fs_path(), which does path_get
while still holding task_lock, instead of raising fs->count.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
___
fs/proc/base.c | 50 +++++++++++++++--------------------------------
1 file changed, 16 insertions(+), 34 deletions(-)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.
Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.
There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.
[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]
Reported-by: Joe Malicki <jmalicki@metacarta.com>
Narrowed-down-by: Michael Itz <mitz@metacarta.com>
Tested-by: Joe Malicki <jmalicki@metacarta.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2.6.26's commit fd8328be87
"sanitize handling of shared descriptor tables in failing execve()"
moved the unshare_files() from flush_old_exec() and several binfmts
to the head of do_execve(); but forgot to make the same change to
compat_do_execve(), leaving a CLONE_FILES files_struct shared across
exec from a 32-bit process on a 64-bit kernel.
It's arguable whether the files_struct really ought to be unshared
across exec; but 2.6.1 made that so to stop the loading binary's fd
leaking into other threads, and a 32-bit process on a 64-bit kernel
ought to behave in the same way as 32 on 32 and 64 on 64.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Callback requests from IPv4 servers are now always guaranteed to be
AF_INET, and never mapped IPv4 AF_INET6 addresses. Both
nfs_match_client() and nfs_find_client() can now share the same
address comparison logic, so fold them together.
We can also dispense with of most of the conditional compilation
in here.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Apparently a lot of people need to disable IPv6 completely on their
distributor-built systems, which have CONFIG_IPV6_MODULE enabled at
build time.
They do this by blacklisting the ipv6.ko module. This causes the
creation of the NFSv4 callback service listener to fail if
CONFIG_IPV6_MODULE is set, but the module cannot be loaded.
Now that the kernel's PF_INET6 RPC listeners are completely separate
from PF_INET listeners, we can always start PF_INET. Then the NFS
client can try to start a PF_INET6 listener, but it isn't required
to be available.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Apparently a lot of people need to disable IPv6 completely on their
distributor-built systems, which have CONFIG_IPV6_MODULE enabled at
build time.
They do this by blacklisting the ipv6.ko module. This causes the
creation of the lockd service listener to fail if CONFIG_IPV6_MODULE
is set, but the module cannot be loaded.
Now that the kernel's PF_INET6 RPC listeners are completely separate
from PF_INET listeners, we can always start PF_INET. Then lockd can
try to start PF_INET6, but it isn't required to be available.
Note this has the added benefit that NLM callbacks from AF_INET6
servers will never come from AF_INET remotes. We no longer have to
worry about matching mapped IPv4 addresses to AF_INET when comparing
addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're about to convert over to using separate PF_INET and PF_INET6
listeners, instead of a single PF_INET6 listener that also receives
AF_INET requests and maps them to AF_INET6.
Clear the way by removing the logic in lockd and the NFSv4 callback
server that creates an AF_INET6 service listener.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Since an RPC service listener's protocol family is specified now via
svc_create_xprt(), it no longer needs to be passed to svc_create() or
svc_create_pooled(). Remove that argument from the synopsis of those
functions, and remove the sv_family field from the svc_serv struct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sv_family field is going away. Pass a protocol family argument to
svc_create_xprt() instead of extracting the family from the passed-in
svc_serv struct.
Again, as this is a listener socket and not an address, we make this
new argument an "int" protocol family, instead of an "sa_family_t."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make sure port value read from user space by write_ports is valid before
passing it to svc_find_xprt(). If it wasn't, the writer would get ENOENT
instead of EINVAL.
Noticed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add support for using the mount options "barrier" and "nobarrier", and
"auto_da_alloc" and "noauto_da_alloc", which is more consistent than
"barrier=<0|1>" or "auto_da_alloc=<0|1>". Most other ext3/ext4 mount
options use the foo/nofoo naming convention. We allow the old forms
of these mount options for backwards compatibility.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
When doing synchronous writes because wbc->sync_mode is set to
WBC_SYNC_ALL, send the write request using WRITE_SYNC, so that we
don't unduly block system calls such as fsync().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
With big installation current 25 maximum number of
supported ACL entries is not enough any more.
Increase the limit to 100.
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits)
ext2: Zero our b_size in ext2_quota_read()
trivial: fix typos/grammar errors in fs/Kconfig
quota: Coding style fixes
quota: Remove superfluous inlines
quota: Remove uppercase aliases for quota functions.
nfsd: Use lowercase names of quota functions
jfs: Use lowercase names of quota functions
udf: Use lowercase names of quota functions
ufs: Use lowercase names of quota functions
reiserfs: Use lowercase names of quota functions
ext4: Use lowercase names of quota functions
ext3: Use lowercase names of quota functions
ext2: Use lowercase names of quota functions
ramfs: Remove quota call
vfs: Use lowercase names of quota functions
quota: Remove dqbuf_t and other cleanups
quota: Remove NODQUOT macro
quota: Make global quota locks cacheline aligned
quota: Move quota files into separate directory
ext4: quota reservation for delayed allocation
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: fix length calculation in compat code
dlm: ignore cancel on granted lock
dlm: clear defunct cancel state
dlm: replace idr with hash table for connections
dlm: comment typo fixes
dlm: use ipv6_addr_copy
dlm: Change rwlock which is only used in write mode to a spinlock
Update information about locking in JBD2 revoke code. Inconsistency in
comments found by Lin Tan <tammy000@gmail.com>.
CC: Lin Tan <tammy000@gmail.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Impact: code cleanup
This patch rename pa_linear to pa_type and add MB_INODE_PA
and MB_GROUP_PA to indicate inode and group prealloc space.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Check block references in the inode and indorect blocks for non-extent
inodes to make sure they are valid, and flag an error if they are
invalid.
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
To be on the safe side, it should be less fragile to exclude I_NEW inodes
from inode list scans by default (unless there is an important reason to
have them).
Normally they will get excluded (eg. by zero refcount or writecount etc),
however it is a bit fragile for list walkers to know exactly what parts of
the inode state is set up and valid to test when in I_NEW. So along these
lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
checks with them too -- this shouldn't be a problem should it?)
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_pts_mount() (including the get_sb_nodev()), shares a lot of code
with init_pts_mount(). The only difference between them is the 'test-super'
function passed into sget().
Move all common code into devpts_get_sb() and remove the new_pts_mount() and
init_pts_mount() functions,
Changelog[v3]:
[Serge Hallyn]: Remove unnecessary printk()s
Changelog[v2]:
(Christoph Hellwig): Merge code in 'do_pts_mount()' into devpts_get_sb()
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With mknod_ptmx() moved to devpts_get_sb(), init_pts_mount() becomes
a wrapper around get_init_pts_sb(). Remove get_init_pts_sb() and
fold code into init_pts_mount().
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>