Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new freeze
protection. We also protect several ioctl entry points which were missing the
protection. Finally, we add freeze protection to the journaling mechanism so
that iput() of unlinked inode cannot modify a frozen filesystem.
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.
CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).
CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are several entry points which dirty pages in a filesystem. mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().
->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is unexpected to block reading of frozen filesystem because of atime update.
Also handling blocking on frozen filesystem because of atime update would make
locking more complex than it already is. So just skip atime update when
filesystem is frozen like we skip it when filesystem is remounted read-only.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.
Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.
This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one of
internal filesystem purposes (used e.g. to track running transactions). Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.
Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.
[AV: microoptimize sb_start_write(); we want it fast in normal case]
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
which can adapt equally well to fast/slow devices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQF0/wAAoJECvKgwp+S8JaszsP/16EO5F5mUCOFgncVRp+8R9U
BxuKJ61j2R9ckHA+ngMEg72W5vJQds64cjywZnz6HMr0/+3tXUf4QBbU4/4sCeai
0lpK8MCKgp5KHHCxgO8zyoSaboankUgoDcbSmGJREV1WXoR8VWXsO9gXqiiH9XOe
e8ADjds/YdxkQbOYDRgZKvLzwWS61K9Kwq5/56GASh2uflw7rkJZ38xqvGbo3YiQ
IJJwOUYfJjFadIewYARQmkZZyWeAmtY0ADh15Z8pJt+iY4PgcDDlaWagwUH2Oaoi
vhTFO4KnCjhSpc872et21g/jN/VrcQqzuUF/LUE9rW5irXeDZVCDrCQOuHQ+3Uo5
YuV3rpNABW/LU8AvtIwt9hUunaKUnrXaSluoL9LzH2VpH++JljQeR4yZ62Q5rpRs
z4Bow25p7tlbcIJWzueMPdOFUr5s3P6XfQHoLVRWqN94eJ+Z1DPOgBlOain6cCUN
oivPh4FxZfUscNmX8/7cHaTNEpGzJ0FzLPMNFnGQ2zwG0gk41fnMdWb19aVIE5GL
/Q96TB24k/v+o1lbxGmyPi0L+Aq+NknvNm+p/YJHAAQrIpu5t2hPvka8/m7Nj7tu
c3rM75RKiZkEI+3U6Ws1DhhQPtVcfNIVlNYQMGzlIndtK6T0ByueQ4eN5Z7ltjSE
pS89rv2hyBw7yCaTU6ui
=gZrM
-----END PGP SIGNATURE-----
Merge tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback updates from Wu Fengguang:
"Use time based periods to age the writeback proportions, which can
adapt equally well to fast/slow devices."
Fix up trivial conflict in comment in fs/sync.c
* tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Fix some comment errors
block: Convert BDI proportion calculations to flexible proportions
lib: Fix possible deadlock in flexible proportion code
lib: Proportions with flexible period
Features include:
- More preparatory patches for modularising NFSv2/v3/v4.
Split out the various NFSv2/v3/v4-specific code into separate
files
- More preparation for the NFSv4 migration code
- Ensure that OPEN(O_CREATE) observes the pNFS mds threshold parameters
- pNFS fast failover when the data servers are down
- Various cleanups and debugging patches
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQFwYRAAoJEGcL54qWCgDy6YAQAIZuUPFCQbuzWev4L/XA0uhg
0xjr9YBmg57UOs8e7FhS0azJip+D/kmF2Rx5gCMX0QO/IwaEVoms4BS8zjFOr3yL
cPIyY5ErF+WJ25f3Mz8k0wOHTzrOvMdq4/7TQf5hztsrGYFid78sSb3O9ZO9bgb4
QxqhgA0Z4S2vIqeObJAF9BT5UOT/cXfVKV0iTd52ZQVA+ZhqJ2SDlNFsoxnVOweV
qzKOi0FZCPsjH+Z4a/LLdieHG5AYRPhOScBdG7gTFAdkysI8DDtSNCmJ68dMHYRw
OHtyt/X4k9TImfwauV3Eww1sw3NkDswX+dv3GOSDTlMvVBrm0WXWj2zoHZbOB+0Q
ETI9Zpolvd9pADtyfu9c+kCYIR+NdB4lGKd15iZfdEy/CZ0CtbLAbn5Szo2YcwNR
iVjswR0jsbklD+mZjlMeZoG4JX2d9ZRqLs20POz7QQBmdIQ8jb9/SwVj9zGvX0w0
NyVUHTFlGaNwkpo7oeRoWi1xjZWE6OzfoncV/46DOJWb08OKZ4fR4ugMYJWGi7Xx
I4nQrIiHtInqL+sF5Y3wlZ48dufLuIhAcHh7klxgud4GRj7t8j+a99pTrRmpHvvC
4K2Yu69VYcIA7N2RsSLohIllGPFmMwpdar7yERdD0So8/fz/zf7wn0dFFYY11+bL
YNVVGhvNxccMU5EU9YR4
=Lc59
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Features include:
- More preparatory patches for modularising NFSv2/v3/v4. Split out
the various NFSv2/v3/v4-specific code into separate files
- More preparation for the NFSv4 migration code
- Ensure that OPEN(O_CREATE) observes the pNFS mds threshold
parameters
- pNFS fast failover when the data servers are down
- Various cleanups and debugging patches"
* tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
nfs: fix fl_type tests in NFSv4 code
NFS: fix pnfs regression with directio writes
NFS: fix pnfs regression with directio reads
sunrpc: clnt: Add missing braces
nfs: fix stub return type warnings
NFS: exit_nfs_v4() shouldn't be an __exit function
SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors
NFS: Split out NFS v4 client functions
NFS: Split out the NFS v4 filesystem types
NFS: Create a single nfs_clone_super() function
NFS: Split out NFS v4 server creating code
NFS: Initialize the NFS v4 client from init_nfs_v4()
NFS: Move the v4 getroot code to nfs4getroot.c
NFS: Split out NFS v4 file operations
NFS: Initialize v4 sysctls from nfs_init_v4()
NFS: Create an init_nfs_v4() function
NFS: Split out NFS v4 inode operations
NFS: Split out NFS v3 inode operations
NFS: Split out NFS v2 inode operations
NFS: Clean up nfs4_proc_setclientid() and friends
...
There are two structures in which a count of snapshots are
maintained:
struct ceph_snap_context {
...
u32 num_snaps;
...
}
and
struct ceph_snap_realm {
...
u32 num_prior_parent_snaps; /* had prior to parent_since */
...
u32 num_snaps;
...
}
These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned. Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.
So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We re-run the loop but we don't re-set the attrs pointer back to NULL.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
When we detect a mds session reset, close the old ceph_connection before
reopening it. This ensures we clean up the old socket properly and keep
the ceph_connection state correct.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Merge Andrew's first set of patches:
"Non-MM patches:
- lots of misc bits
- tree-wide have_clk() cleanups
- quite a lot of printk tweaks. I draw your attention to "printk:
convert the format for KERN_<LEVEL> to a 2 byte pattern" which
looks a bit scary. But afaict it's solid.
- backlight updates
- lib/ feature work (notably the addition and use of memweight())
- checkpatch updates
- rtc updates
- nilfs updates
- fatfs updates (partial, still waiting for acks)
- kdump, proc, fork, IPC, sysctl, taskstats, pps, etc
- new fault-injection feature work"
* Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
drivers/misc/lkdtm.c: fix missing allocation failure check
lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table()
fault-injection: add tool to run command with failslab or fail_page_alloc
fault-injection: add selftests for cpu and memory hotplug
powerpc: pSeries reconfig notifier error injection module
memory: memory notifier error injection module
PM: PM notifier error injection module
cpu: rewrite cpu-notifier-error-inject module
fault-injection: notifier error injection
c/r: fcntl: add F_GETOWNER_UIDS option
resource: make sure requested range is included in the root range
include/linux/aio.h: cpp->C conversions
fs: cachefiles: add support for large files in filesystem caching
pps: return PTR_ERR on error in device_create
taskstats: check nla_reserve() return
sysctl: suppress kmemleak messages
ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION
ipc: compat: use signed size_t types for msgsnd and msgrcv
ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC
ipc: add COMPAT_SHMLBA support
...
When we restore file descriptors we would like them to look exactly as
they were at dumping time.
With help of fcntl it's almost possible, the missing snippet is file
owners UIDs.
To be able to read their values the F_GETOWNER_UIDS is introduced.
This option is valid iif CONFIG_CHECKPOINT_RESTORE is turned on, otherwise
returning -EINVAL.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__mem_open() which is called by both /proc/<pid>/environ and
/proc/<pid>/mem ->open() handlers will allow the use of negative offsets.
/proc/<pid>/mem has negative offsets but not /proc/<pid>/environ.
Clean this by moving the 'force FMODE_UNSIGNED_OFFSET flag' to mem_open()
to allow negative offsets only on /proc/<pid>/mem.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the following offset and environment address range check in
environ_read() of /proc/<pid>/environ is buggy:
int this_len = mm->env_end - (mm->env_start + src);
if (this_len <= 0)
break;
Large or negative offsets on /proc/<pid>/environ converted to 'unsigned
long' may pass this check since '(mm->env_start + src)' can overflow and
'this_len' will be positive.
This can turn /proc/<pid>/environ to act like /proc/<pid>/mem since
(mm->env_start + src) will point and read from another VMA.
There are two fixes here plus some code cleaning:
1) Fix the overflow by checking if the offset that was converted to
unsigned long will always point to the [mm->env_start, mm->env_end]
address range.
2) Remove the truncation that was made to the result of the check,
storing the result in 'int this_len' will alter its value and we can
not depend on it.
For kernels that have commit b409e578d ("proc: clean up
/proc/<pid>/environ handling") which adds the appropriate ptrace check and
saves the 'mm' at ->open() time, this is not a security issue.
This patch is taken from the grsecurity patch since it was just made
available.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 898b374af6 ("exec: replace call_usermodehelper_pipe with use
of umh init function and resolve limit"), the core limits recursive
check value was changed from 0 to 1, but the corresponding comments were
not updated.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nearly identical shortname parsing is performed in fat_search_long() and
__fat_readdir(). Extract this code into a function that may be called by
both.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify code by providing accessor functions for the directory entry
start cluster fields.
Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use -ENOMEM return value instead of -EINVAL when kzalloc() fails.
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An fs-thaw ioctl causes deadlock with a chcp or mkcp -s command:
chcp D ffff88013870f3d0 0 1325 1324 0x00000004
...
Call Trace:
nilfs_transaction_begin+0x11c/0x1a0 [nilfs2]
wake_up_bit+0x20/0x20
copy_from_user+0x18/0x30 [nilfs2]
nilfs_ioctl_change_cpmode+0x7d/0xcf [nilfs2]
nilfs_ioctl+0x252/0x61a [nilfs2]
do_page_fault+0x311/0x34c
get_unmapped_area+0x132/0x14e
do_vfs_ioctl+0x44b/0x490
__set_task_blocked+0x5a/0x61
vm_mmap_pgoff+0x76/0x87
__set_current_blocked+0x30/0x4a
sys_ioctl+0x4b/0x6f
system_call_fastpath+0x16/0x1b
thaw D ffff88013870d890 0 1352 1351 0x00000004
...
Call Trace:
rwsem_down_failed_common+0xdb/0x10f
call_rwsem_down_write_failed+0x13/0x20
down_write+0x25/0x27
thaw_super+0x13/0x9e
do_vfs_ioctl+0x1f5/0x490
vm_mmap_pgoff+0x76/0x87
sys_ioctl+0x4b/0x6f
filp_close+0x64/0x6c
system_call_fastpath+0x16/0x1b
where the thaw ioctl deadlocked at thaw_super() when called while chcp was
waiting at nilfs_transaction_begin() called from
nilfs_ioctl_change_cpmode(). This deadlock is 100% reproducible.
This is because nilfs_ioctl_change_cpmode() first locks sb->s_umount in
read mode and then waits for unfreezing in nilfs_transaction_begin(),
whereas thaw_super() locks sb->s_umount in write mode. The locking of
sb->s_umount here was intended to make snapshot mounts and the downgrade
of snapshots to checkpoints exclusive.
This fixes the deadlock issue by replacing the sb->s_umount usage in
nilfs_ioctl_change_cpmode() with a dedicated mutex which protects snapshot
mounts.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The checkpoint deletion ioctl (rmcp ioctl) has potential for breaking
snapshot because it is not fully exclusive with checkpoint mode change
ioctl (chcp ioctl).
The rmcp ioctl first tests if the specified checkpoint is a snapshot or
not within nilfs_cpfile_delete_checkpoint function, and then calls
nilfs_cpfile_delete_checkpoints function to actually invalidate the
checkpoint only if it's not a snapshot. However, the checkpoint can be
changed into a snapshot by the chcp ioctl between these two operations.
In that case, calling nilfs_cpfile_delete_checkpoints() wrongly
invalidates the snapshot, which leads to snapshot list corruption and
snapshot count mismatch.
This fixes the issue by changing nilfs_cpfile_delete_checkpoints() so
that it reconfirms the target checkpoints are snapshot or not.
This second check is exclusive with the chcp operation since it is
protected by an existing semaphore.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->delete_inode(), ->write_super_lockfs(), ->unlockfs() are gone so remove
references to them in the NTFS code. Noticed while cleaning up the
fsfreeze mess.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On minix2 and minix3 usually max_size is 7fffffff and the check in
question prohibits creation of last block spanning right before 7fffffff,
due to downward rounding during the division. Fix it by using
multiplication instead.
[akpm@linux-foundation.org: fix up code layout, use local `sb']
Signed-off-by: Vladimir Serbinenko <phcoder@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert ext4_count_free() to use memweight() instead of table lookup
based counting clear bits implementation. This change only affects the
code segments enabled by EXT4FS_DEBUG.
Note that this memweight() call can't be replaced with a single
bitmap_weight() call, although the pointer to the memory area is aligned
to long-word boundary. Because the size of the memory area may not be a
multiple of BITS_PER_LONG, then it returns wrong value on big-endian
architecture.
This also includes the following change.
- Remove unnecessary map == NULL check in ext4_count_free() which
always takes non-null pointer as the memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert ext3_count_free() to use memweight() instead of table lookup
based counting clear bits implementation. This change only affects the
code segments enabled by EXT3FS_DEBUG.
Note that this memweight() call can't be replaced with a single
bitmap_weight() call, although the pointer to the memory area is aligned
to long-word boundary. Because the size of the memory area may not be a
multiple of BITS_PER_LONG, then it returns wrong value on big-endian
architecture.
This also includes the following changes.
- Remove unnecessary map == NULL check in ext3_count_free() which
always takes non-null pointer as the memory area.
- Fix printk format warning that only reveals with EXT3FS_DEBUG.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert ext2_count_free() to use memweight() instead of table lookup
based counting clear bits implementation. This change only affects the
code segments enabled by EXT2FS_DEBUG.
Note that this memweight() call can't be replaced with a single
bitmap_weight() call, although the pointer to the memory area is aligned
to long-word boundary. Because the size of the memory area may not be a
multiple of BITS_PER_LONG, then it returns wrong value on big-endian
architecture.
This also includes the following changes.
- Remove unnecessary map == NULL check in ext2_count_free() which
always takes non-null pointer as the memory area.
- Fix printk format warning that only reveals with EXT2FS_DEBUG.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use memweight to count the total number of bits set in memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use memweight() to count the total number of bits set in memory area.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use memweight() to count the total number of bits clear in memory area.
Note that this memweight() call can't be replaced with a single
bitmap_weight() call, although the pointer to the memory area is aligned
to long-word boundary. Because the size of the memory area may not be a
multiple of BITS_PER_LONG, then it returns wrong value on big-endian
architecture.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Anders Larsen <al@alarsen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the generic printk_get_level() to search a message for a kern_level.
Add __printf to verify format and arguments. Fix a few messages that
had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks
to shrink the object size a bit when not using printk.
[akpm@linux-foundation.org: whitespace tweak]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When suid_dumpable=2, detect unsafe core_pattern settings and warn when
they are seen.
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the suid_dumpable sysctl is set to "2", and there is no core dump
pipe defined in the core_pattern sysctl, a local user can cause core files
to be written to root-writable directories, potentially with
user-controlled content.
This means an admin can unknowningly reintroduce a variation of
CVE-2006-2451, allowing local users to gain root privileges.
$ cat /proc/sys/fs/suid_dumpable
2
$ cat /proc/sys/kernel/core_pattern
core
$ ulimit -c unlimited
$ cd /
$ ls -l core
ls: cannot access core: No such file or directory
$ touch core
touch: cannot touch `core': Permission denied
$ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
$ pid=$!
$ sleep 1
$ kill -SEGV $pid
$ ls -l core
-rw------- 1 root kees 458752 Jun 21 11:35 core
$ sudo strings core | grep evil
OHAI=evil-string-here
While cron has been fixed to abort reading a file when there is any
parse error, there are still other sensitive directories that will read
any file present and skip unparsable lines.
Instead of introducing a suid_dumpable=3 mode and breaking all users of
mode 2, this only disables the unsafe portion of mode 2 (writing to disk
via relative path). Most users of mode 2 (e.g. Chrome OS) already use
a core dump pipe handler, so this change will not break them. For the
situations where a pipe handler is not defined but mode 2 is still
active, crash dumps will only be written to fully qualified paths. If a
relative path is defined (e.g. the default "core" pattern), dump
attempts will trigger a printk yelling about the lack of a fully
qualified path.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allocation can be as large as 64k.
- Add __GFP_NOWARN so the falied kmalloc() is silent
- Fall back to vmalloc() if the kmalloc() failed
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->delete_inode(), ->write_super_lockfs(), ->unlockfs() are gone so remove
refereces to them in the NTFS code. Remove unnecessary comments about
unimplemented methods while at it (suggested by Christoph Hellwig).
Noticed while cleaning up the fsfreeze mess.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is simply cleanup that will keep things more closely synced with the
userland code.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
This patch exports symbols needed by the v4 module. In addition, I also
switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or
CONFIG_NFS_V4_MODULE are set.
The module (nfs4.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch exports symbols and moves over the final structures needed by
the v3 module. In addition, I also switch over to using IS_ENABLED() to
check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set.
The module (nfs3.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The module (nfs2.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v2.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Somehow I missed this in my previous patch series, but these functions
are only needed by the v4 code and should be moved to a v4-only file. I
wasn't exactly sure where I should put these functions, so I moved them
into nfs4super.c where I could make them static.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I can set all variables in the nfs_fill_super() function, allowing me to
remove the nfs4_fill_super() function.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
v2 and v4 don't use it, so I create two new nfs_rpc_ops functions to
initialize the ACL client only when we are using v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I'm already looking up the nfs subversion in nfs_fs_mount(), so I have
easy access to rpc_ops that used to be difficult to reach. This allows
me to set up a different mount path for NFS v2/3 and NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I can now share this code with the v2 and v3 code by using the NFS
subversion structure.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds in the code to track multiple versions of the NFS
protocol. I created default structures for v2, v3 and v4 so that each
version can continue to work while I convert them into kernel modules.
I also removed the const parameter from the rpc_version array so that I
can change it at runtime.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a number of bugs in the NFS idmapper code:
(1) Only registered key types can be passed to the core keys code, so
register the legacy idmapper key type.
This is a requirement because the unregister function cleans up keys
belonging to that key type so that there aren't dangling pointers to the
module left behind - including the key->type pointer.
(2) Rename the legacy key type. You can't have two key types with the same
name, and (1) would otherwise require that.
(3) complete_request_key() must be called in the error path of
nfs_idmap_legacy_upcall().
(4) There is one idmap struct for each nfs_client struct. This means that
idmap->idmap_key_cons is shared without the use of a lock. This is a
problem because key_instantiate_and_link() - as called indirectly by
idmap_pipe_downcall() - releases anyone waiting for the key to be
instantiated.
What happens is that idmap_pipe_downcall() running in the rpc.idmapd
thread, releases the NFS filesystem in whatever thread that is running in
to continue. This may then make another idmapper call, overwriting
idmap_key_cons before idmap_pipe_downcall() gets the chance to call
complete_request_key().
I *think* that reading idmap_key_cons only once, before
key_instantiate_and_link() is called, and then caching the result in a
variable is sufficient.
Bug (4) is the cause of:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [< (null)>] (null)
PGD 0
Oops: 0010 [#1] SMP
CPU 1
Modules linked in: ppdev parport_pc lp parport ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack nfs fscache xt_CHECKSUM auth_rpcgss iptable_mangle nfs_acl bridge stp llc lockd be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_realtek snd_usb_audio snd_hda_intel snd_hda_codec snd_seq snd_pcm snd_hwdep snd_usbmidi_lib snd_rawmidi snd_timer uvcvideo videobuf2_core videodev media videobuf2_vmalloc snd_seq_device videobuf2_memops e1000e vhost_net iTCO_wdt joydev coretemp snd soundcore macvtap macvlan i2c_i801 snd_page_alloc tun iTCO_vendor_support microcode kvm_intel kvm sunrpc hid_logitech_dj usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Pid: 1229, comm: rpc.idmapd Not tainted 3.4.2-1.fc16.x86_64 #1 Gateway DX4710-UB801A/G33M05G1
RIP: 0010:[<0000000000000000>] [< (null)>] (null)
RSP: 0018:ffff8801a3645d40 EFLAGS: 00010246
RAX: ffff880077707e30 RBX: ffff880077707f50 RCX: ffff8801a18ccd80
RDX: 0000000000000006 RSI: ffff8801a3645e75 RDI: ffff880077707f50
RBP: ffff8801a3645d88 R08: ffff8801a430f9c0 R09: ffff8801a3645db0
R10: 000000000000000a R11: 0000000000000246 R12: ffff8801a18ccd80
R13: ffff8801a3645e75 R14: ffff8801a430f9c0 R15: 0000000000000006
FS: 00007fb6fb51a700(0000) GS:ffff8801afc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001a49b0000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rpc.idmapd (pid: 1229, threadinfo ffff8801a3644000, task ffff8801a3bf9710)
Stack:
ffffffff81260878 ffff8801a3645db0 ffff8801a3645db0 ffff880077707a90
ffff880077707f50 ffff8801a18ccd80 0000000000000006 ffff8801a3645e75
ffff8801a430f9c0 ffff8801a3645dd8 ffffffff81260983 ffff8801a3645de8
Call Trace:
[<ffffffff81260878>] ? __key_instantiate_and_link+0x58/0x100
[<ffffffff81260983>] key_instantiate_and_link+0x63/0xa0
[<ffffffffa057062b>] idmap_pipe_downcall+0x1cb/0x1e0 [nfs]
[<ffffffffa0107f57>] rpc_pipe_write+0x67/0x90 [sunrpc]
[<ffffffff8117f833>] vfs_write+0xb3/0x180
[<ffffffff8117fb5a>] sys_write+0x4a/0x90
[<ffffffff81600329>] system_call_fastpath+0x16/0x1b
Code: Bad RIP value.
RIP [< (null)>] (null)
RSP <ffff8801a3645d40>
CR2: 0000000000000000
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.4]
We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:
PID: 2507 TASK: ffff88103691ab40 CPU: 14 COMMAND: "rpciod/14"
#0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
#1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
#2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
#3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
#4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
#5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca
rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.
Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Current block layout driver read/write code assumes page
aligned IO in many places. Add a checker to validate the assumption.
Otherwise there would be data corruption like when application does
open(O_WRONLY) and page unaliged write.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
fl_type is not a bitmap.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 57208fa7e5 "NFS: Create an write_pageio_init() function"
did not modify the calls in direct.c, preventing direct io from
using pnfs. This reintroduces that capability.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 1abb50886a "NFS: Create an read_pageio_init() function"
did not modify the call in direct.c, preventing direct io from
using pnfs. This reintroduces that capability.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix numerous repeated warnings by making the stub function
void instead of non-void:
fs/nfs/nfs4_fs.h: In function 'nfs4_unregister_sysctl':
fs/nfs/nfs4_fs.h:385:1: warning: no return statement in function returning non-void
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.
CC: linux-nfs@vger.kernel.org
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
outside of i_mutex as in other places.
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (essentially has a trylock semantics) but when the
function starts to handle also frozen filesystems, it will get a full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.
One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex. Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.
[AV: scratch the previous, mnt_want_write() has been moved to kern_path_create()
by now]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: v9fs-developer@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
CC: Sage Weil <sage@newdream.net>
CC: ceph-devel@vger.kernel.org
Acked-by: Sage Weil <sage@newdream.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The write ref to vfsmount taken in lookup_open()/atomic_open() is going to
be dropped; we take the one to stay in dentry_open(). Just grab the temporary
in caller if it looks like we are going to need it (create/truncate/writable open)
and pass (by value) "has it succeeded" flag. Instead of doing mnt_want_write()
inside, check that flag and treat "false" as "mnt_want_write() has just failed".
mnt_want_write() is cheap and the things get considerably simpler and more robust
that way - we get it and drop it in the same function, to start with, rather
than passing a "has something in the guts of really scary functions taken it"
back to caller.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Numerous cleanups and several bug fixes. Here are some highlights:
* Discontiguous directory buffer support
* Inode allocator refactoring
* Removal of the IO lock in inode reclaim
* Implementation of .update_time
* Fix for handling of EOF in xfs_vm_writepage
* Fix for races in xfsaild, and idle mode is re-enabled
* Fix for a crash in xfs_buf completion handlers on unmount.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQFtCvAAoJENaLyazVq6ZOIuEQAINJXb4SK9oBrdwGmq+Vsqf2
Eh4OmzZmdnSPrxfFGmqvyL9DdUBvGBuidwOcVLMAXGtzbxE9USK9NuKC5zN/hJip
8tIyv/8bqZ0aD4RJlHGN5zKFoQh/9Tag+JsaaqWstO8Ir1tA/5p04hDAz492btfT
49SvnV64sJ1fi7pmaJblMWMMtlWJjD6iOldaHwnKBQ3LKmcgy9sD9DY5HiGOTr1j
ecKtucX7B8Q9oFLKHaKEwTYZRRYDNuTbqZmI6hlEcA5hT280jotsGA4q/aXx/gHS
lZuBaqVtNFT5WCKm+j/et76tmTfIh0CSbo64ZfgSOESy2BkEVXHg5XJ1gDvPdV+L
6eBlUx3jaiNyFVHxVzFhzwKC/XdaITCd/ixFEogRDmoppDXencTCibLJXHNXxupN
BCAyTLCxEJIE9WCeOMmwHA0450bMY4or13NGep57pIvG8GomtdG1WncTRIo84KV5
0W5ocaUTGP7ROsr+KF8U9C7H866OHzVFijA+vvcTy8GtsT/xOCFxuJrqPVb+kgD7
mIKaoK7iH6Kufu433TzsLEcUkF36gq/7NytPKjQhURLpZhxkHG3rq6LC0HXp6uuZ
QgX5Y5Gl7SwDovIrndXmQXRnGrzvqHLguZl65+rB1CKggjemkLSdSLhryoNVjLU2
iB7/hvzOUdYFMRRz2mLc
=2wkC
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.6-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs update from Ben Myers:
"Numerous cleanups and several bug fixes. Here are some highlights:
- Discontiguous directory buffer support
- Inode allocator refactoring
- Removal of the IO lock in inode reclaim
- Implementation of .update_time
- Fix for handling of EOF in xfs_vm_writepage
- Fix for races in xfsaild, and idle mode is re-enabled
- Fix for a crash in xfs_buf completion handlers on unmount."
Fix up trivial conflicts in fs/xfs/{xfs_buf.c,xfs_log.c,xfs_log_priv.h}
due to duplicate patches that had already been merged for 3.5.
* tag 'for-linus-v3.6-rc1' of git://oss.sgi.com/xfs/xfs: (44 commits)
xfs: wait for the write the superblock on unmount
xfs: re-enable xfsaild idle mode and fix associated races
xfs: remove iolock lock classes
xfs: avoid the iolock in xfs_free_eofblocks for evicted inodes
xfs: do not take the iolock in xfs_inactive
xfs: remove xfs_inactive_attrs
xfs: clean up xfs_inactive
xfs: do not read the AGI buffer in xfs_dialloc until nessecary
xfs: refactor xfs_ialloc_ag_select
xfs: add a short cut to xfs_dialloc for the non-NULL agbp case
xfs: remove the alloc_done argument to xfs_dialloc
xfs: split xfs_dialloc
xfs: remove xfs_ialloc_find_free
Prefix IO_XX flags with XFS_IO_XX to avoid namespace colision.
xfs: remove xfs_inotobp
xfs: merge xfs_itobp into xfs_imap_to_bp
xfs: handle EOF correctly in xfs_vm_writepage
xfs: implement ->update_time
xfs: fix comment typo of struct xfs_da_blkinfo.
xfs: do not call xfs_bdstrat_cb in xfs_buf_iodone_callbacks
...
d_parent is never NULL, and IS_ROOT() is the proper way to check for a
(non-self-referential) parent.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
O_EXCL without O_CREAT has different semantics; it's "fail if already opened",
not "fail if already exists". commit 71574865 broke that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
v2: Add the xfs_buf_lock to xfs_quiesce_attr().
Add explaination why xfs_buf_lock() is used to wait for write.
xfs_wait_buftarg() does not wait for the completion of the write of the
uncached superblock. This write can race with the shutdown of the log
and causes a panic if the write does not win the race.
During the log write, xfsaild_push() will lock the buffer and set the
XBF_ASYNC flag. Because the XBF_FLAG is set, complete() is not performed
on the buffer's iowait entry, we cannot call xfs_buf_iowait() to wait
for the write to complete. The buffer's lock is held until the write is
complete, so we can block on a xfs_buf_lock() request to be notified
that the write is complete.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfsaild idle mode logic currently leads to a couple hangs:
1.) If xfsaild is rescheduled in during an incremental scan
(i.e., tout != 0) and the target has been updated since
the previous run, we can hit the new target and go into
idle mode with a still populated ail.
2.) A wake up is only issued when the target is pushed forward.
The wake up can race with xfsaild if it is currently in the
process of entering idle mode, causing future wake up
events to be lost.
These hangs have been reproduced and verified as fixed by
running xfstests 273 in a loop on a slightly modified upstream
kernel. The kernel is modified to re-enable idle mode as
previously implemented (when count == 0) and with a revert of
commit 670ce93f, which includes performance improvements that
make this harder to reproduce.
The solution, the algorithm for which has been outlined by
Dave Chinner, is to modify xfsaild to enter idle mode only when
the ail is empty and the push target has not been moved forward
since the last push.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Content-Disposition: inline; filename=xfs-remove-iolock-classes
Now that we never take the iolock during inode reclaim we don't need
to play games with lock classes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Same rational as the last patch - these inodes are not reachable, so
don't bother with locking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
An inode that enters xfs_inactive has been removed from all global
lists but the inode hash, and can't be recycled in xfs_iget before
it has been marked reclaimable. Thus taking the iolock in here
is not nessecary at all, and given the amount of lockdep false
positives it has triggered already I'd rather remove the locking.
The only change outside of xfs_inactive is relaxing an assert in
xfs_itruncate_extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Remove this helper as the code flow is a lot more obvious when it gets
merged into its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The code to reserve log space and join the inode to the transaction is
common for all cases, so don't duplicate it. Also remove the trivial
xfs_inactive_symlink_local helper which can simply be opencode now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Refactor the AG selection loop in xfs_dialloc to operate on the in-memory
perag data as much as possible. We only read the AGI buffer once we have
selected an AG to allocate inodes now instead of for every AG considered.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Loop over the in-core perag structures and prefer using pagi_freecount over
going out to the AGI buffer where possible.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In this case we already have selected an AG and know it has free space
beause the buffer lock never got released. Jump directly into xfs_dialloc_ag
and short cut the AG selection loop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We can simplify check the IO_agbp pointer for being non-NULL instead of
passing another argument through two layers of function calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Move the actual allocation once we have selected an allocation group into a
separate helper, and make xfs_dialloc a wrapper around it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It's used both for client and server hosts; we can't do nlmclnt_release_host()
on failure exits, since the host might need nlmsvc_release_host(), with BUG_ON()
for calling the wrong one. Makes life simpler for callers, actually...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
1996 Aug, Zygo Blaxell
http://marc.info/?l=bugtraq&m=87602167419830&w=2
1996 Oct, Andrew Tridgell
http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
1997 Dec, Albert D Cahalan
http://lkml.org/lkml/1997/12/16/4
2005 Feb, Lorenzo Hernández García-Hierro
http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
2010 May, Kees Cook
https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
- Violates POSIX.
- POSIX didn't consider this situation and it's not useful to follow
a broken specification at the cost of security.
- Might break unknown applications that use this feature.
- Applications that break because of the change are easy to spot and
fix. Applications that are vulnerable to symlink ToCToU by not having
the change aren't. Additionally, no applications have yet been found
that rely on this behavior.
- Applications should just use mkstemp() or O_CREATE|O_EXCL.
- True, but applications are not perfect, and new software is written
all the time that makes these mistakes; blocking this flaw at the
kernel is a single solution to the entire class of vulnerability.
- This should live in the core VFS.
- This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
- This should live in an LSM.
- This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.
Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ->lookup() never gets hit with . or ..
* dentry it gets is unhashed, so unless we had gone and hashed it ourselves, there's
no need to d_drop() the sucker.
* wrong name printed in one of the printks (NULL, in fact)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now. Makes more sense, POSIX allows, etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that applying umask can't affect their results. While
that affects errno in cases like
mknod("/no_such_directory/a", 030000)
yielding -EINVAL (due to impossible mode_t) instead of
-ENOENT (due to inexistent directory), IMO that makes a lot
more sense, POSIX allows to return either and any software
that relies on getting -ENOENT instead of -EINVAL in that
case deserves everything it gets.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
greatest note is a speed up for parallel, non-allocating DIO writes,
since we no longer take the i_mutex lock in that case. For bug fixes,
we fix an incorrect overhead calculation which caused slightly
incorrect results for df(1) and statfs(2). We also fixed bugs in the
metadata checksum feature.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQE1SzAAoJENNvdpvBGATwzOMQAKxVkaTqkMc+cUahLLUkFdGQ
xnmigHufVuOvv+B8l1p6vbYx+qOftztqXF1WC41j8mTfEDs19jKXv3LI57TJ+bIW
a/Kp1CjMicRs/HGhFPNWp1D+N8J95MTFP6Y9biXSmBBvefg2NSwxpg48yZtjUy1/
Zl0414AqMvTJyqKKOUa++oyl3XlawnzDZ+6a7QPKsrAaDOU5977W4y2tZkNFk84d
qRjTfaiX13aVe7XupgQHe/jtk40Sj38pyGVGiAGlHOZhZtYKE6MzB8OreGiMTy8z
rmJz/CQUsQ8+sbKYhAcDru+bMrKghbO0u78XRz9CY+YpVFF/Xs2QiXoV0ZOGkIm6
eokq7jEV0K+LtDEmM3PkmUPgXfYw5damTv8qWuBUFd4wtVE5x/DmK8AJVMidCAUN
GkVR+rEbbEi7RCwsuac/aKB8baVQCTiJ5tfNTgWh9zll+9GZSk+U71Pp0KdcJGiS
nxitAZ+20hZN2CQctlmaGbCPTPYCWU4hQ3IuMdTlQTQAs8S0y1FtylTRsXcC1eVR
i1hBS/dVw5PVCaqoX79zYrByUymgX0ZaYY6seRT6+U9xPGDCSNJ9mjXZL3fttnOC
d5gsx/pbMIAv52G5Hj6DfsXR2JFmmxsaIzsLtRvKi9q89d84XaZfbUsHYjn4Neym
5lTKaSQHU71cKCxrStHC
=VAVB
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The usual collection of bug fixes and optimizations. Perhaps of
greatest note is a speed up for parallel, non-allocating DIO writes,
since we no longer take the i_mutex lock in that case.
For bug fixes, we fix an incorrect overhead calculation which caused
slightly incorrect results for df(1) and statfs(2). We also fixed
bugs in the metadata checksum feature."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: undo ext4_calc_metadata_amount if we fail to claim space
ext4: don't let i_reserved_meta_blocks go negative
ext4: fix hole punch failure when depth is greater than 0
ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
ext4: weed out ext4_write_super
ext4: remove unnecessary superblock dirtying
ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
ext4: remove useless marking of superblock dirty
ext4: fix ext4 mismerge back in January
ext4: remove dynamic array size in ext4_chksum()
ext4: remove unused variable in ext4_update_super()
ext4: make quota as first class supported feature
ext4: don't take the i_mutex lock when doing DIO overwrites
ext4: add a new nolock flag in ext4_map_blocks
ext4: split ext4_file_write into buffered IO and direct IO
ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()
ext4: fix out-of-date comments in extents.c
ext4: use s_csum_seed instead of i_csum_seed for xattr block
ext4: use proper csum calculation in ext4_rename
ext4: fix overhead calculation used by ext4_statfs()
...
NFSd's boot_time represents grace period start point in time.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Passed network namespace replaced hard-coded init_net
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>