Upon memory pressure, kswapd calls xfs_vm_writepage() from
shrink_page_list(). This can result in delayed allocation occurring
and that gets deferred to the the allocation workqueue.
The allocation then runs outside kswapd context, which means if it
needs memory (and it does to demand page metadata from disk) it can
block in shrink_inactive_list() waiting for IO congestion. These
blocking waits are normally avoiding in kswapd context, so under
memory pressure writeback from kswapd can be arbitrarily delayed by
memory reclaim.
To avoid this, pass the kswapd context to the allocation being done
by the workqueue, so that memory reclaim understands correctly that
the work is being done for kswapd and therefore it is not blocked
and does not delay memory reclaim.
To avoid issues with int->char conversion of flag fields (as noticed
in v1 of this patch) convert the flag fields in the struct
xfs_bmalloca to bool types. pahole indicates these variables are
still single byte variables, so no extra space is consumed by this
change.
cc: <stable@vger.kernel.org>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
perf tools like 'perf report' can aggregate samples by comm strings,
which generally works. However, there are other potential use-cases.
For example, to pair up 'calls' with 'returns' accurately (from branch
events like Intel BTS) it is necessary to identify whether the process
has exec'd. Although a comm event is generated when an 'exec' happens
it is also generated whenever the comm string is changed on a whim
(e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
to differentiate one case from the other.
In order to determine whether the kernel supports the new flag, a
selection bit named 'exec' is added to struct perf_event_attr. The
bit does nothing but will cause perf_event_open() to fail if the bit
is set on kernels that do not have it defined.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_comm() assumes that set_task_comm() is only called on
exec(), and in particular that its only called on current.
Neither are true, as Dave reported a WARN triggered by set_task_comm()
being called on !current.
Separate the exec() hook from the comm hook.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140521153219.GH5226@laptop.programming.kicks-ass.net
[ Build fix. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's carried in state->args->geo, so there's no need to duplicate it
and use more stack space than necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
As it's only ever called from contexts where the xfs_da_args is
present and contains all the information needed inside the args
structure.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Rather than using the superblock value obtained through the
xfs_mount.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We don't pass the xfs_da_args or the geometry all the way down to
the directory buffer logging code, hence we have to use
mp->m_dir_geo here. Fix this to use the geometry passed via the
xfs_da_args, and convert all the directory logging functions for
consistency.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are many places in the directory code were we don't pass the
args into and so have to extract the geometry direct from the mount
structure. Push the args or the geometry into these leaf functions
so that we don't need to grab it from the struct xfs_mount.
This, in turn, brings use to the point where directory geometry is
no longer a property of the struct xfs_mount; it is not a global
property anymore, and hence we can start to consider per-directory
configuration of physical geometries.
Start by converting the xfs_dir_isblock/leaf code - pass in the
xfs_da_args and convert the readdir code to use xfs_da_args like
the rest of the directory code to pass information around.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
They are just simple wrappers around xfs_dir2_byte_to_db(), and
we've already removed one usage earlier in the patch set. Kill
the rest before we start removing the xfs_mount from conversion
functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Because they aren't actually part of the on-disk format, and so
shouldn't be in xfs_da_format.h.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The directory code has a dependency on the struct xfs_mount to
supply the directory block geometry. Block size, block log size,
and other parameters are pre-caclulated in the struct xfs_mount or
access directly from the superblock embedded in the struct
xfs_mount.
Extract all of this geometry information out of the struct xfs_mount
and superblock and place it into a new struct xfs_da_geometry
defined by the directory code. Allocate and initialise it at mount
time, and attach it to the struct xfs_mount so it canbe passed back
into the directory code appropriately rather than using the struct
xfs_mount.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We recently modified the client/MDS protocol to include a timestamp in the
client request. This allows ctime updates to follow the client's clock
in most cases, which avoids subtle problems when clocks are out of sync
and timestamps are updated sometimes by the MDS clock (for most requests)
and sometimes by the client clock (for cap writeback).
Signed-off-by: Sage Weil <sage@inktank.com>
If the return value of ceph_osdc_readpages() is not negative,
it is certainly greater than or equal to zero.
Remove the useless condition judgment and redundant braces.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
handle following sequence of events:
- mds0 exports an inode to mds1. client receives the cap import
message from mds1. caps from mds0 are removed while handling
the cap import message.
- mds1 exports an inode to mds0. client receives the cap export
message from mds1. handle_cap_export() adds placeholder caps
for mds0
- client receives the first cap export message (for exporting
inode from mds0 to mds1)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Move the code that update the i_fragtree into a separate function.
Also add simple probabilistic test to decide whether the i_fragtree
should be updated
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
cap import messages are processed by both handle_cap_import() and
handle_cap_grant(). These two functions are not executed in the same
atomic context, so they can races with cap release.
The fix is make handle_cap_import() not release the i_ceph_lock when
it returns. Let handle_cap_grant() release the lock after it finishes
its job.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handle cap import/export.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Cap message and request reply from non-auth MDS may carry stale
information (corresponding locks are in LOCK states) even they
have the newest inode version. So client should update inode fields
according to issued caps.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
cap grant/revoke message from non-auth MDS can update inode's size
and truncate_seq/truncate_size. (the message arrives before auth
MDS's cap trunc message)
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
posix_acl_xattr_set() already does the check, and it's the only
way to feed in an ACL from userspace.
So the check here is useless, remove it.
Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
The addition of lockdep code to write_seqcount_begin/end has lead to
a bunch of false positive claims of ABBA deadlocks with the so_lock
spinlock. Audits show that this simply cannot happen because the
read side code does not spin while holding so_lock.
Cc: <stable@vger.kernel.org> # 3.13.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull x86 cdso updates from Peter Anvin:
"Vdso cleanups and improvements largely from Andy Lutomirski. This
makes the vdso a lot less ''special''"
* 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso, build: Make LE access macros clearer, host-safe
x86/vdso, build: Fix cross-compilation from big-endian architectures
x86/vdso, build: When vdso2c fails, unlink the output
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
x86, mm: Improve _install_special_mapping and fix x86 vdso naming
mm, fs: Add vm_ops->name as an alternative to arch_vma_name
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
x86, vdso: Move the 32-bit vdso special pages after the text
x86, vdso: Reimplement vdso.so preparation in build-time C
x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
x86, vdso: Clean up 32-bit vs 64-bit vdso params
x86, mm: Ensure correct alignment of the fixmap
autofs_dev_ioctl_init is only called by __init init_autofs4_fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
old_reloc() is only used in this file, make it static.
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All KERN_DEBUG callsites being under #ifdef DEBUG we can safely convert
everything to pr_debug without changing current behaviour.
Remove #ifdef DEBUG around pr_debugs only (suggested by Joe Perches)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also uniformize function arguments.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert all except KERN_DEBUG
(pr_debug doesn't work the same as printk(KERN_DEBUG and requires
special check)
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixing 2 typo in function comments.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...like other filesystems.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We remove the call to grab_super_passive in call to super_cache_count.
This becomes a scalability bottleneck as multiple threads are trying to do
memory reclamation, e.g. when we are doing large amount of file read and
page cache is under pressure. The cached objects quickly got reclaimed
down to 0 and we are aborting the cache_scan() reclaim. But counting
creates a log jam acquiring the sb_lock.
We are holding the shrinker_rwsem which ensures the safety of call to
list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
unregistered now before ->kill_sb() so the operation is safe when we are
doing unmount.
The impact will depend heavily on the machine and the workload but for a
small machine using postmark tuned to use 4xRAM size the results were
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)
ffsb running in a configuration that is meant to simulate a mail server showed
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series is aimed at regressions noticed during reclaim activity. The
first two patches are shrinker patches that were posted ages ago but never
merged for reasons that are unclear to me. I'm posting them again to see
if there was a reason they were dropped or if they just got lost. Dave?
Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
retest the vm scalability test cases on a larger machine? Hugh, does this
work for you on the memcg test cases?
Based on ext4, I get the following results but unfortunately my larger
test machines are all unavailable so this is based on a relatively small
machine.
postmark
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)
ffsb (mail server simulator)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)
dd of a large file
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)
stutter (times mmap latency during large amounts of IO)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)
This patch (of 3):
We will like to unregister the sb shrinker before ->kill_sb(). This will
allow cached objects to be counted without call to grab_super_passive() to
update ref count on sb. We want to avoid locking during memory
reclamation especially when we are skipping the memory reclaim when we are
out of cached objects.
This is safe because grab_super_passive does a try-lock on the
sb->s_umount now, and so if we are in the unmount process, it won't ever
block. That means what used to be a deadlock and races we were avoiding
by using grab_super_passive() is now:
shrinker umount
down_read(shrinker_rwsem)
down_write(sb->s_umount)
shrinker_unregister
down_write(shrinker_rwsem)
<blocks>
grab_super_passive(sb)
down_read_trylock(sb->s_umount)
<fails>
<shrinker aborts>
....
<shrinkers finish running>
up_read(shrinker_rwsem)
<unblocks>
<removes shrinker>
up_write(shrinker_rwsem)
->kill_sb()
....
So it is safe to deregister the shrinker before ->kill_sb().
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix checkpatch warning:
WARNING: kfree(NULL) is safe this check is probably not required
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...like other filesystems.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlbfs_i_mmap_mutex_key is only used in inode.c
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Discarding buffers uses a bunch of atomic operations when discarding
buffers because ...... I can't think of a reason. Use a cmpxchg loop to
clear all the necessary flags. In most (all?) cases this will be a single
atomic operations.
[akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A block device driver may choose to provide a rw_page operation. These
will be called when the filesystem is attempting to do page sized I/O to
page cache pages (ie not for direct I/O). This does preclude I/Os that
are larger than page size, so this may only be a performance gain for
some devices.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_endio() takes care of updating all the appropriate page flags once
I/O has finished to a page. Switch to using mapping_set_error() instead
of setting AS_EIO directly; this will handle thin-provisioned devices
correctly.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__mpage_writepage() is over 200 lines long, has 20 local variables, four
goto labels and could desperately use simplification. Splitting
clean_buffers() into a helper function improves matters a little,
removing 20+ lines from it.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last in-tree caller of block_write_full_page_endio() was removed in
January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
block_write_full_page() as the only caller of
block_write_full_page_endio(), so inline block_write_full_page_endio()
into block_write_full_page().
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_refs_write() is called earlier than clear_soft_dirty() and it is
more natural to clear VM_SOFTDIRTY (which belongs to VMA entry but not
PTEs) that early instead of clearing it a way deeper inside call chain.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Description by Jan Kara:
"A lot of older filesystems don't properly flush volatile disk caches
on fsync(2) which can lead to loss of fsynced data after power failure.
This patch makes generic_file_fsync() issue proper cache flush to fix the
problem. Sysadmin can use /sys/devices/.../cache_type to tell the system
it should not send the cache flush."
[akpm@linux-foundation.org: nuke ifdef]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function parameters comment fixing.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
v9fs_sysfs_init is only called by __init init_v9fs
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_recovery_ctxt.received is unused.
ocfs2_should_refresh_lock_res() can only return 0 or 1, so the error
handling code in ocfs2_super_lock() is unneeded.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ocfs2 cluster size may be 1MB, which has 20 bits. When resize, the
input new clusters is mostly the number of clusters in a group
descriptor(32256).
Since the input clusters is defined as type int, so it will overflow
when shift left 20 bits and then lead to incorrect global bitmap i_size.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Parameters new_clusters and first_new_cluster are not used in
ocfs2_update_last_group_and_inode, so remove them.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found a race situation when dlm recovery and node joining occurs
simultaneously if the network state is bad.
N1 N4
start joining dlm and send
query join to all live nodes
set joining node to N1, return OK
send query join to other
live nodes and it may take
a while
call dlm_send_join_assert()
to send assert join message
when N2 is down, so keep
trying to send message to N2
until find N2 is down
send assert join message to
N3, but connection is down
with N3, so it may take a
while
become the recovery master for N2
and send begin reco message to other
nodes in domain map but no N1
connection with N3 is rebuild,
then send assert join to N4
call dlm_assert_joined_handler(),
add N1 to domain_map
dlm recovery done, send finalize message
to nodes in domain map, including N1
receiving finalize message,
trigger the BUG() because
recovery master mismatch.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 75f82eaa50 ("ocfs2: fix NULL pointer dereference when
dismount and ocfs2rec simultaneously") because it may cause a umount
hang while shutting down the truncate log.
fix NULL pointer dereference when dismount and ocfs2rec simultaneously
The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
-> free osb->recovery_map
-> ocfs2_truncate_shutdown
-> lock global bitmap inode
-> ocfs2_wait_for_recovery
-> check whether osb->recovery_map->rm_used is zero
Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.
To prevent NULL pointer dereference while getting sys_root_inode, we use
a osb_tl_disable flag to disable schedule osb_truncate_log_wq after
truncate log shutdown.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs_info_foo() and ocfs2_get_request_ptr functions are only used in ioctl.c
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found there is a conversion deadlock when the owner of lockres
happened to crash before send DLM_PROXY_AST_MSG for a downconverting
lock. The situation is as follows:
Node1 Node2 Node3
the owner of lockresA
lock_1 granted at EX mode
and call ocfs2_cluster_unlock
to decrease ex_holders.
converting lock_3 from
NL to EX
send DLM_PROXY_AST_MSG
to Node1, asking Node 1
to downconvert.
receiving DLM_PROXY_AST_MSG,
thread ocfs2dc send
DLM_CONVERT_LOCK_MSG
to Node2 to downconvert
lock_1(EX->NL).
lock_1 can be granted and
put it into pending_asts
list, return DLM_NORMAL.
then something happened
and Node2 crashed.
received DLM_NORMAL, waiting
for DLM_PROXY_AST_MSG.
selected as the recovery
master, receving migrate
lock from Node1, queue
lock_1 to the tail of
converting list.
After dlm recovery, converting list in the master of lockresA(Node3)
will be: converting list head <-> lock_3(NL->EX) <->lock_1(EX<->NL).
Requested mode of lock_3 is not compatible with the granted mode of
lock_1, so it can not be granted. and lock_1 can not downconvert
because covnerting queue is strictly FIFO. So a deadlock is created.
We think function dlm_process_recovery_data() should queue_ast for
lock_1 or alter the order of lock_1 and lock_3, so dlm_thread can
process lock_1 first. And if there are multiple downconverting locks,
they must convert form PR to NL, so no need to sort them.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once JBD2_ABORT is set, ocfs2_commit_cache will fail in
ocfs2_commit_thread. Then it will get into a loop with mass logs. This
will meaninglessly consume a larger number of resource and may lead to
the system hanging. So limit printk in this case.
[akpm@linux-foundation.org: document the msleep]
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two standard techniques for dereferencing structures pointed
to by void *: cast to the right type each time they're used, or assign
to local variables of the right type.
But there's no need to do *both*.
Signed-off-by: George Spelvin <linux@horizon.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace strncpy(size 63) by defined value.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add pr_fmt based on module name.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix function parameter documentation
EXPORT_SYMBOLS moved after corresponding functions
Small coding style and checkpatch warning fixes
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the last pr_warning callsite in fs branch
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without this patch fanotify_init does not validate the value passed in
event_f_flags.
When a fanotify event is read from the fanotify file descriptor a new
file descriptor is created where file.f_flags = event_f_flags.
Internal and external open flags are stored together in field f_flags of
struct file. Hence, an application might create file descriptors with
internal flags like FMODE_EXEC, FMODE_NOCMTIME set.
Jan Kara and Eric Paris both aggreed that this is a bug and the value of
event_f_flags should be checked:
https://lkml.org/lkml/2014/4/29/522https://lkml.org/lkml/2014/4/29/539
This updated patch version considers the comments by Michael Kerrisk in
https://lkml.org/lkml/2014/5/4/10
With the patch the value of event_f_flags is checked.
When specifying an invalid value error EINVAL is returned.
Internal flags are disallowed.
File creation flags are disallowed:
O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TRUNC, and O_TTY_INIT.
Flags which do not make sense with fanotify are disallowed:
__O_TMPFILE, O_PATH, FASYNC, and O_DIRECT.
This leaves us with the following allowed values:
O_RDONLY, O_WRONLY, O_RDWR are basic functionality. The are stored in the
bits given by O_ACCMODE.
O_APPEND is working as expected. The value might be useful in a logging
application which appends the current status each time the log is opened.
O_LARGEFILE is needed for files exceeding 4GB on 32bit systems.
O_NONBLOCK may be useful when monitoring slow devices like tapes.
O_NDELAY is equal to O_NONBLOCK except for platform parisc.
To avoid code breaking on parisc either both flags should be
allowed or none. The patch allows both.
__O_SYNC and O_DSYNC may be used to avoid data loss on power disruption.
O_NOATIME may be useful to reduce disk activity.
O_CLOEXEC may be useful, if separate processes shall be used to scan files.
Once this patch is accepted, the fanotify_init.2 manpage has to be updated.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If fanotify_mark is called with illegal value of arguments flags and
marks it usually returns EINVAL.
When fanotify_mark is called with FAN_MARK_FLUSH the argument flags is
not checked for irrelevant flags like FAN_MARK_IGNORED_MASK.
The patch removes this inconsistency.
If an irrelevant flag is set error EINVAL is returned.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not initialize private_destroy_list twice. list_replace_init()
already takes care of initializing private_destroy_list. We don't need
to initialize it with LIST_HEAD() beforehand.
Signed-off-by: David Cohen <david.a.cohen@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before the patch, read creates FAN_ACCESS_PERM and FAN_ACCESS events,
readdir creates only FAN_ACCESS_PERM events.
This is inconsistent.
After the patch, readdir creates FAN_ACCESS_PERM and FAN_ACCESS events.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally from Tvrtko Ursulin (https://lkml.org/lkml/2011/1/12/112)
Avoid having to provide a fake/invalid fd and path when flushing marks
Currently for a group to flush marks it has set it needs to provide a
fake or invalid (but resolvable) file descriptor and path when calling
fanotify_mark. This patch pulls the flush handling a bit up so file
descriptor and path are completely ignored when flushing.
I reworked the patch to be applicable again (the signature of
fanotify_mark has changed since Tvrtko's work).
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace seq_printf where possible + coalesce formats from 2 existing
seq_puts
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't want the stateid to be found in the hash table before the delegation
is granted.
Currently this is protected by the client_mutex, but we want to break that
up and this is a necessary step toward that goal.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
...as the name is a bit more descriptive and we've started using it for
other purposes.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The memset of resp in svc_process_common should ensure that these are
already zeroed by the time they get here.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the NFS4_OPEN_CLAIM_PREVIOUS case, we should only mark it confirmed
if the nfs4_check_open_reclaim check succeeds.
In the NFS4_OPEN_CLAIM_DELEG_PREV_FH and NFS4_OPEN_CLAIM_DELEGATE_PREV
cases, I see no point in declaring the openowner confirmed when the
operation is going to fail anyway, and doing so might allow the client
to game things such that it wouldn't need to confirm a subsequent open
with the same owner.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes a bug in the handling of the fi_delegations list.
nfs4_setlease does not hold the recall_lock when adding to it. The
client_mutex is held, which prevents against concurrent list changes,
but nfsd_break_deleg_cb does not hold while walking it. New delegations
could theoretically creep onto the list while we're walking it there.
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It is probably also the first one without a single patch from me. That
is down to a combination of factors, and I have some things in the works
that are not quite ready yet, that I hope to put in next time around.
Returning to what is here this time... we have 3 patches which fix
various warnings. Two are bug fixes (for quotas and also a
rare recovery race condition). The final patch, from Ben Marzinski,
is an important change in the freeze code which has been in
progress for some time. This removes the need to take and drop the
transaction lock for every single transaction, when the only time it
was used, was at file system freeze time. Ben's patch integrates the
freeze operation into the journal flush code as an alternative with
lower overheads and also lands up resolving some difficult to fix races
at the same time.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQIcBAABAgAGBQJTjawdAAoJEMrg3m4a/8jS+fEP/19pd9norrPgt+INKeWF3Nlj
0cCVyivBjjEOQseiokbn6AlO9sIBETCMsd3v/ke8haleR8J6F1K8OvRY2LV76vZT
SKTae4Lts7Pzbf8JF9wSi3mpr3zhtQ47v6DvRYEylc68HcwM4EybaSKWEX3By2zd
Xmhlv+v7V+PRYthaMalOXjhzuYA4Sv2BgdUGG9xKtIfzjvhHhzws/xBcr9UrotX2
oPjq08X9HY1TNuWN8tTs4P7BrOx8QCb7ZJzT2A9girFyVXNiduGTd11mzCguvHVQ
/Ove3/7Cg3fABMg/3Ub2dpARqYxJRV25FlTV8RrOWj0BMhndWAbzMt1KPexk4FAE
a3KCMBbo8WZbjRmOB4tmfknxDCdeUDAIlm1mwDPFJ1/0Vv8rkove1+xWHDvOPWD3
219GLiUe7PyVowBW4FQhW+CTjArqz3TWB+R/US18rXcwDS9s/vEIDVVwNYlrxRmK
pztGMr25UoFhbvMe3jtu5xRwQM5oZfQtlYdL09+0BgkgPmuOtEwzuopa7g5MBAze
Xq7h+oN8M4AtJs/msBF3di+fgOhUyoJmj129xgoZeCxbe80nA0ge0hnb93vQJHmE
uHe4zV26ChGjUtxUwf77xOZfCEWKsp1ORJkFN+2SMcpUIlfNNumBv/UhrVRN55AO
CneZaFLboYhxqc28K+Ms
=x9iK
-----END PGP SIGNATURE-----
Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw into next
Pull gfs2 updates from Steven Whitehouse:
"This must be about the smallest merge window patch set ever for GFS2.
It is probably also the first one without a single patch from me.
That is down to a combination of factors, and I have some things in
the works that are not quite ready yet, that I hope to put in next
time around.
Returning to what is here this time... we have 3 patches which fix
various warnings. Two are bug fixes (for quotas and also a rare
recovery race condition). The final patch, from Ben Marzinski, is an
important change in the freeze code which has been in progress for
some time. This removes the need to take and drop the transaction
lock for every single transaction, when the only time it was used, was
at file system freeze time. Ben's patch integrates the freeze
operation into the journal flush code as an alternative with lower
overheads and also lands up resolving some difficult to fix races at
the same time"
* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Prevent recovery before the local journal is set
GFS2: fs/gfs2/file.c: kernel-doc warning fixes
GFS2: fs/gfs2/bmap.c: kernel-doc warning fixes
GFS2: remove transaction glock
GFS2: lops.c: replace 0 by NULL for pointers
GFS2: quotas not being refreshed in gfs2_adjust_quota
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTjlWzAAoJEAAOaEEZVoIV4TAP/0EMmOBZLwIxNdj2Tfpx5dko
qD+pvC/0udIKbPXeUgb+u84zR37NPBsNFH0cpsheTlmP1rLykNaguCPMru3ngu9o
M3oGg8X6jzkknrWvU0NDVtXCIpHSNgKDv3KKAJwRaJHOkLKIzmUugnuNE5WnGfBa
ZpZ3UAOe6GCu5RSKPhkmqLV+wrq0dm2NjkoIu+zavK29n3ggnXOT4BOb5OuuyinA
pKGOC3irGRERXbcNAqS3LU5wPdA2dQjdZaw38XPsmrQhZlBdVpToRFUoCli+RMLS
zmYm2eYQztKkzq8LqOchJAozLgfzhIGvIR54Q/H/gTdbB6kGCpKENAZ+UY4smkJH
en6GldjyOIPF7g509bxhuq3Gs68gI5Jwqikgd0pP8U76qrYDS5KBBA0UTSSyJfEc
Xn6xn+n4qrQDQt2p/IN8LEYBNL1VAuHLT50Q1ZhbI6hSrbaByDUcY+ikqg6hvTyw
xLb1IrtZD9bB4YgYOa/wRk+d1VAdWTgwlRZz4nnP+PRaTqaQ63lOhIF619+sL1xy
UVDNrT+LikmFjk5wW0f3o+Xjplbd0FD50Ybhzc03zkdu3Q/MbfGICjYZIhCULDTa
6Dhg7xogwGCsN/fhm0+QXGfQp61mXzbpFM4femwrjsELFM+/VEyO9zE46Kh6luz8
a1syHZ9UsObXD1nBSO0o
=lgS1
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.16' of git://git.samba.org/jlayton/linux into next
Pull file locking changes from Jeff Layton:
"Pretty quiet on the file-locking related front this cycle. Just some
small cleanups and the addition of some tracepoints in the lease
handling code"
* tag 'locks-v3.16' of git://git.samba.org/jlayton/linux:
locks: add some tracepoints in the lease handling code
fs/locks.c: replace seq_printf by seq_puts
locks: ensure that fl_owner is always initialized properly in flock and lease codepaths
If data are overwritten through dio, previous f2fs doesn't remain the fsync mark
due to no additional node writes.
Note that this patch should resolve the xfstests:311.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs's cp has one page which consists of struct f2fs_checkpoint and
version bitmap of sit and nat. To support lots of segments, we need more
blocks for sit bitmap. So let's arrange sit bitmap as following:
+-----------------+------------+
| f2fs_checkpoint | sit bitmap |
| + nat bitmap | |
+-----------------+------------+
0 4k N blocks
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: simple code change for readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously we allocate pages with no mapping in ra_sum_pages(), so we may
encounter a crash in event trace of f2fs_submit_page_mbio where we access
mapping data of the page.
We'd better allocate pages in bd_inode mapping and invalidate these pages after
we restore data from pages. It could avoid crash in above scenario.
Changes from V1
o remove redundant code in ra_sum_pages() suggested by Jaegeuk Kim.
Call Trace:
[<f1031630>] ? ftrace_raw_event_f2fs_write_checkpoint+0x80/0x80 [f2fs]
[<f10377bb>] f2fs_submit_page_mbio+0x1cb/0x200 [f2fs]
[<f103c5da>] restore_node_summary+0x13a/0x280 [f2fs]
[<f103e22d>] build_curseg+0x2bd/0x620 [f2fs]
[<f104043b>] build_segment_manager+0x1cb/0x920 [f2fs]
[<f1032c85>] f2fs_fill_super+0x535/0x8e0 [f2fs]
[<c115b66a>] mount_bdev+0x16a/0x1a0
[<f102f63f>] f2fs_mount+0x1f/0x30 [f2fs]
[<c115c096>] mount_fs+0x36/0x170
[<c1173635>] vfs_kern_mount+0x55/0xe0
[<c1175388>] do_mount+0x1e8/0x900
[<c1175d72>] SyS_mount+0x82/0xc0
[<c16059cc>] sysenter_do_call+0x12/0x22
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When large directory feathure is enable, We have one case which could cause
overflow in dir_buckets() as following:
special case: level + dir_level >= 32 and level < MAX_DIR_HASH_DEPTH / 2.
Here we define MAX_DIR_BUCKETS to limit the return value when the condition
could trigger potential overflow.
Changes from V1
o modify description of calculation in f2fs.txt suggested by Changman Lee.
Suggested-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull scheduler updates from Ingo Molnar:
"The main scheduling related changes in this cycle were:
- various sched/numa updates, for better performance
- tree wide cleanup of open coded nice levels
- nohz fix related to rq->nr_running use
- cpuidle changes and continued consolidation to improve the
kernel/sched/idle.c high level idle scheduling logic. As part of
this effort I pulled cpuidle driver changes from Rafael as well.
- standardized idle polling amongst architectures
- continued work on preparing better power/energy aware scheduling
- sched/rt updates
- misc fixlets and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
sched/numa: Decay ->wakee_flips instead of zeroing
sched/numa: Update migrate_improves/degrades_locality()
sched/numa: Allow task switch if load imbalance improves
sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
sched: Initialize rq->age_stamp on processor start
sched, nohz: Change rq->nr_running to always use wrappers
sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
sched: Use clamp() and clamp_val() to make sys_nice() more readable
sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
sched/numa: Fix initialization of sched_domain_topology for NUMA
sched: Call select_idle_sibling() when not affine_sd
sched: Simplify return logic in sched_read_attr()
sched: Simplify return logic in sched_copy_attr()
sched: Fix exec_start/task_hot on migrated tasks
arm64: Remove TIF_POLLING_NRFLAG
metag: Remove TIF_POLLING_NRFLAG
sched/idle: Make cpuidle_idle_call() void
sched/idle: Reflow cpuidle_idle_call()
sched/idle: Delay clearing the polling bit
...
This patch replaces obsolete simple_str functions by kstr
use kstrtouint for
-uid_t ( __kernel_uid32_t )
-gid_t ( __kernel_gid32_t )
-jfs_sb_info->umask
-jfs_sb_info->minblks_trim
(all unsigned int)
newLVSize is s64 -> use kstrtol
Current parse_options behaviour stays the same ie it doesn't return kstr
rc but just 0 if function failed (parse_options callsites
return -EINVAL when there's anything wrong).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Static values are automatically initialized to NULL
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the "big" pull request for 3.16-rc1.
Not a lot of changes here, some kernfs work, a revert of a very old
driver core change that ended up cauing some memory leaks on driver
probe error paths, and other minor things.
As was pointed out earlier today, one commit here,
26fc9cd200 (kernfs: move the last
knowledge of sysfs out from kernfs) is also needed in your 3.15-final
branch as well. If you could cherry-pick it there, it would be most
appreciated by Andy Lutomirski to prevent a regression there.
All of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONV9YACgkQMUfUDdst+yn0sQCfWWYg1oVXyu6f0uJjYbVBFkpD
UHgAoJxxfwTZJq/xYrnk6+RqUowIsUlh
=ojAS
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into next
Pull driver core / kernfs changes from Greg KH:
"Here is the "big" pull request for 3.16-rc1.
Not a lot of changes here, some kernfs work, a revert of a very old
driver core change that ended up cauing some memory leaks on driver
probe error paths, and other minor things.
As was pointed out earlier today, one commit here, 26fc9cd200
("kernfs: move the last knowledge of sysfs out from kernfs") is also
needed in your 3.15-final branch as well. If you could cherry-pick it
there, it would be most appreciated by Andy Lutomirski to prevent a
regression there.
All of these have been in linux-next for a while"
* tag 'driver-core-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
crypto/nx/nx-842: dev_set_drvdata can no longer fail
kernfs: move the last knowledge of sysfs out from kernfs
sysfs: fix attribute_group bin file path on removal
sysfs.h: don't return a void-valued expression in sysfs_remove_file
init.h: Update initcall_sync variants to fix build errors
driver core: Inline dev_set/get_drvdata
driver core: dev_get_drvdata: Don't check for NULL dev
driver core: dev_set_drvdata returns void
driver core: dev_set_drvdata can no longer fail
driver core: Move driver_data back to struct device
lib/devres.c: fix checkpatch warnings
lib/devres.c: use dev in devm_request_and_ioremap
kobject: Make support for uevent_helper optional.
kernfs: make kernfs_notify() trigger inotify events too
kernfs: implement kernfs_root->supers list
This patch adds a new ubifs_assert() in ubifs_tnc_close() to check
if there are any leaks of per-filesystem @clean_zn_cnt. This new
assert inspects whether the return value of ubifs_destroy_tnc_subtree()
is equal to @clean_zn_cnt or not while umount.
Artem: a minor amendment
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This patch uses a completion to prevent dlm's recovery process from
referencing and trying to recover a journal before a journal has been
opened.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull block core updates from Jens Axboe:
"It's a big(ish) round this time, lots of development effort has gone
into blk-mq in the last 3 months. Generally we're heading to where
3.16 will be a feature complete and performant blk-mq. scsi-mq is
progressing nicely and will hopefully be in 3.17. A nvme port is in
progress, and the Micron pci-e flash driver, mtip32xx, is converted
and will be sent in with the driver pull request for 3.16.
This pull request contains:
- Lots of prep and support patches for scsi-mq have been integrated.
All from Christoph.
- API and code cleanups for blk-mq from Christoph.
- Lots of good corner case and error handling cleanup fixes for
blk-mq from Ming Lei.
- A flew of blk-mq updates from me:
* Provide strict mappings so that the driver can rely on the CPU
to queue mapping. This enables optimizations in the driver.
* Provided a bitmap tagging instead of percpu_ida, which never
really worked well for blk-mq. percpu_ida relies on the fact
that we have a lot more tags available than we really need, it
fails miserably for cases where we exhaust (or are close to
exhausting) the tag space.
* Provide sane support for shared tag maps, as utilized by scsi-mq
* Various fixes for IO timeouts.
* API cleanups, and lots of perf tweaks and optimizations.
- Remove 'buffer' from struct request. This is ancient code, from
when requests were always virtually mapped. Kill it, to reclaim
some space in struct request. From me.
- Remove 'magic' from blk_plug. Since we store these on the stack
and since we've never caught any actual bugs with this, lets just
get rid of it. From me.
- Only call part_in_flight() once for IO completion, as includes two
atomic reads. Hopefully we'll get a better implementation soon, as
the part IO stats are now one of the more expensive parts of doing
IO on blk-mq. From me.
- File migration of block code from {mm,fs}/ to block/. This
includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me,
from a discussion on lkml.
That should describe the meat of the pull request. Also has various
little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong,
Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam
Bradshaw"
* 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits)
blk-mq: push IPI or local end_io decision to __blk_mq_complete_request()
blk-mq: remember to start timeout handler for direct queue
block: ensure that the timer is always added
blk-mq: blk_mq_unregister_hctx() can be static
blk-mq: make the sysfs mq/ layout reflect current mappings
blk-mq: blk_mq_tag_to_rq should handle flush request
block: remove dead code in scsi_ioctl:blk_verify_command
blk-mq: request initialization optimizations
block: add queue flag for disabling SG merging
block: remove 'magic' from struct blk_plug
blk-mq: remove alloc_hctx and free_hctx methods
blk-mq: add file comments and update copyright notices
blk-mq: remove blk_mq_alloc_request_pinned
blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request
blk-mq: remove blk_mq_wait_for_tags
blk-mq: initialize request in __blk_mq_alloc_request
blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request
blk-mq: add helper to insert requests from irq context
blk-mq: remove stale comment for blk_mq_complete_request()
blk-mq: allow non-softirq completions
...
When attempting to mount a non-ubifs formatted volume, lots of error
messages (including a stack dump) are thrown to the kernel log even if
the MS_SILENT mount flag is set.
Fix this by introducing adding an additional state-variable in
struct ubifs_info and suppress error messages in ubifs_read_node if
MS_SILENT is set.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This commit tries to fix a bug that we can't read symlink properly with
inline data feature when the length of symlink is greater than 60 bytes
but less than extra space.
The key issue is in ext4_inode_is_fast_symlink() that it doesn't check
whether or not an inode has inline data. When the user creates a new
symlink, an inode will be allocated with MAY_INLINE_DATA flag. Then
symlink will be stored in ->i_block and extended attribute space. In
the mean time, this inode is with inline data flag. After remounting
it, ext4_inode_is_fast_symlink() function thinks that this inode is a
fast symlink so that the data in ->i_block is copied to the user, and
the data in extra space is trimmed. In fact this inode should be as a
normal symlink.
The following script can hit this bug.
#!/bin/bash
cd ${MNT}
filename=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
rm -rf test
mkdir test
cd test
echo "hello" >$filename
ln -s $filename symlinkfile
cd
sudo umount /mnt/sda1
sudo mount -t ext4 /dev/sda1 /mnt/sda1
readlink /mnt/sda1/test/symlinkfile
After applying this patch, it will break the assumption in e2fsck
because the original implementation doesn't want to support symlink
with inline data.
Reported-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Reported-by: Ian Nartowicz <claws@nartowicz.co.uk>
Cc: Ian Nartowicz <claws@nartowicz.co.uk>
Cc: Tao Ma <tm@tao.ma>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch should resolve the following recursive lock.
[<ffffffff8135a9c3>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffffa01749dc>] f2fs_setxattr+0x5c/0xa0 [f2fs]
[<ffffffffa0174c99>] __f2fs_set_acl+0x1b9/0x340 [f2fs]
[<ffffffffa017515a>] f2fs_init_acl+0x4a/0xcb [f2fs]
[<ffffffffa0159abe>] __f2fs_add_link+0x26e/0x780 [f2fs]
[<ffffffffa015d4d8>] f2fs_mkdir+0xb8/0x150 [f2fs]
[<ffffffff811cebd7>] vfs_mkdir+0xb7/0x160
[<ffffffff811cf89b>] SyS_mkdir+0xab/0xe0
[<ffffffff817244bf>] tracesys+0xe1/0xe6
[<ffffffffffffffff>] 0xffffffffffffffff
The call path indicates:
- f2fs_add_link
: down_write(&fi->i_sem);
- init_inode_metadata
- f2fs_init_acl
- __f2fs_set_acl
- f2fs_setxattr
: down_write(&fi->i_sem);
Here we should not call f2fs_setxattr, but __f2fs_setxattr.
But __f2fs_setxattr is a static function in xattr.c, so that I found the other
generic approach to use f2fs_setxattr.
In f2fs_setxattr, the page pointer is only given from init_inode_metadata.
So, this patch adds this condition to avoid this in f2fs_setxattr.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
v2: add a __break_lease tracepoint for non-blocking case
Recently, I needed these to help track down a softlockup when recalling a
delegation, but they might be helpful in other situations as well.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Replace seq_printf where possible
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Currently, the fl_owner isn't set for flock locks. Some filesystems use
byte-range locks to simulate flock locks and there is a common idiom in
those that does:
fl->fl_owner = (fl_owner_t)filp;
fl->fl_start = 0;
fl->fl_end = OFFSET_MAX;
Since flock locks are generally "owned" by the open file description,
move this into the common flock lock setup code. The fl_start and fl_end
fields are already set appropriately, so remove the unneeded setting of
that in flock ops in those filesystems as well.
Finally, the lease code also sets the fl_owner as if they were owned by
the process and not the open file description. This is incorrect as
leases have the same ownership semantics as flock locks. Set them the
same way. The lease code doesn't actually use the fl_owner value for
anything, so this is more for consistency's sake than a bugfix.
Reported-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (Staging portion)
Acked-by: J. Bruce Fields <bfields@fieldses.org>
I hit the same assert failed as Dolev Raviv reported in Kernel v3.10
shows like this:
[ 9641.164028] UBIFS assert failed in shrink_tnc at 131 (pid 13297)
[ 9641.234078] CPU: 1 PID: 13297 Comm: mmap.test Tainted: G O 3.10.40 #1
[ 9641.234116] [<c0011a6c>] (unwind_backtrace+0x0/0x12c) from [<c000d0b0>] (show_stack+0x20/0x24)
[ 9641.234137] [<c000d0b0>] (show_stack+0x20/0x24) from [<c0311134>] (dump_stack+0x20/0x28)
[ 9641.234188] [<c0311134>] (dump_stack+0x20/0x28) from [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs])
[ 9641.234265] [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs]) from [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs])
[ 9641.234307] [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs]) from [<c00cdad8>] (shrink_slab+0x1d4/0x2f8)
[ 9641.234327] [<c00cdad8>] (shrink_slab+0x1d4/0x2f8) from [<c00d03d0>] (do_try_to_free_pages+0x300/0x544)
[ 9641.234344] [<c00d03d0>] (do_try_to_free_pages+0x300/0x544) from [<c00d0a44>] (try_to_free_pages+0x2d0/0x398)
[ 9641.234363] [<c00d0a44>] (try_to_free_pages+0x2d0/0x398) from [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8)
[ 9641.234382] [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8) from [<c00f62d8>] (new_slab+0x78/0x238)
[ 9641.234400] [<c00f62d8>] (new_slab+0x78/0x238) from [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c)
[ 9641.234419] [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c) from [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188)
[ 9641.234459] [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188) from [<bf227908>] (do_readpage+0x168/0x468 [ubifs])
[ 9641.234553] [<bf227908>] (do_readpage+0x168/0x468 [ubifs]) from [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs])
[ 9641.234606] [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs]) from [<c00c17c0>] (filemap_fault+0x304/0x418)
[ 9641.234638] [<c00c17c0>] (filemap_fault+0x304/0x418) from [<c00de694>] (__do_fault+0xd4/0x530)
[ 9641.234665] [<c00de694>] (__do_fault+0xd4/0x530) from [<c00e10c0>] (handle_pte_fault+0x480/0xf54)
[ 9641.234690] [<c00e10c0>] (handle_pte_fault+0x480/0xf54) from [<c00e2bf8>] (handle_mm_fault+0x140/0x184)
[ 9641.234716] [<c00e2bf8>] (handle_mm_fault+0x140/0x184) from [<c0316688>] (do_page_fault+0x150/0x3ac)
[ 9641.234737] [<c0316688>] (do_page_fault+0x150/0x3ac) from [<c000842c>] (do_DataAbort+0x3c/0xa0)
[ 9641.234759] [<c000842c>] (do_DataAbort+0x3c/0xa0) from [<c0314e38>] (__dabt_usr+0x38/0x40)
After analyzing the code, I found a condition that may cause this failed
in correct operations. Thus, I think this assertion is wrong and should be
removed.
Suppose there are two clean znodes and one dirty znode in TNC. So the
per-filesystem atomic_t @clean_zn_cnt is (2). If commit start, dirty_znode
is set to COW_ZNODE in get_znodes_to_commit() in case of potentially ops
on this znode. We clear COW bit and DIRTY bit in write_index() without
@tnc_mutex locked. We don't increase @clean_zn_cnt in this place. As the
comments in write_index() shows, if another process hold @tnc_mutex and
dirty this znode after we clean it, @clean_zn_cnt would be decreased to (1).
We will increase @clean_zn_cnt to (2) with @tnc_mutex locked in
free_obsolete_znodes() to keep it right.
If shrink_tnc() performs between decrease and increase, it will release
other 2 clean znodes it holds and found @clean_zn_cnt is less than zero
(1 - 2 = -1), then hit the assertion. Because free_obsolete_znodes() will
soon correct @clean_zn_cnt and no harm to fs in this case, I think this
assertion could be removed.
2 clean zondes and 1 dirty znode, @clean_zn_cnt == 2
Thread A (commit) Thread B (write or others) Thread C (shrinker)
->write_index
->clear_bit(DIRTY_NODE)
->clear_bit(COW_ZNODE)
@clean_zn_cnt == 2
->mutex_locked(&tnc_mutex)
->dirty_cow_znode
->!ubifs_zn_cow(znode)
->!test_and_set_bit(DIRTY_NODE)
->atomic_dec(&clean_zn_cnt)
->mutex_unlocked(&tnc_mutex)
@clean_zn_cnt == 1
->mutex_locked(&tnc_mutex)
->shrink_tnc
->destroy_tnc_subtree
->atomic_sub(&clean_zn_cnt, 2)
->ubifs_assert <- hit
->mutex_unlocked(&tnc_mutex)
@clean_zn_cnt == -1
->mutex_lock(&tnc_mutex)
->free_obsolete_znodes
->atomic_inc(&clean_zn_cnt)
->mutux_unlock(&tnc_mutex)
@clean_zn_cnt == 0 (correct after shrink)
Signed-off-by: hujianyang <hujianyang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
v9fs_fid_xattr_set is supposed to return 0 on success.
This corrects the behaviour introduced in commit
bdd5c28dcb
"9p: fix return value in case in v9fs_fid_xattr_set()"
(The function returns a negative error on error, as expected)
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
lock_parent() very much on purpose does nested locking of dentries, and
is careful to maintain the right order (lock parent first). But because
it didn't annotate the nested locking order, lockdep thought it might be
a deadlock on d_lock, and complained.
Add the proper annotation for the inner locking of the child dentry to
make lockdep happy.
Introduced by commit 046b961b45 ("shrink_dentry_list(): take parent's
->d_lock earlier").
Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The laundromat uses two variables to calculate when it should next run,
but one is completely ignored at the end of the run. Merge the two and
rename the variable to be more descriptive of what it does.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
sparse says:
CHECK fs/nfsd/nfs4xdr.c
fs/nfsd/nfs4xdr.c:2043:1: warning: symbol 'nfsd4_encode_fattr' was not declared. Should it be static?
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When debugging, rpc prints messages from dprintk(KERN_WARNING ...)
with "^A4" prefixed,
[ 2780.339988] ^A4nfsd: connect from unprivileged port: 127.0.0.1, port=35316
Trond tells,
> dprintk != printk. We have NEVER supported dprintk(KERN_WARNING...)
This patch removes using of dprintk with KERN_WARNING.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Note nobody's ever noticed because the typical client probably never
requests FILES_AVAIL without also requesting something else on the list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
ex_nflavors can't be negative number, just defined by uint32_t.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No need for a typedef wrapper for svc_export or svc_client, remove them.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 49b28684fd ("nfsd: Remove deprecated nfsctl system call and
related code") removed the only use of ipv6_addr_set_v4mapped(), so
net/ipv6.h is unneeded now.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit 8f6c5ffc89 ("kernel/groups.c: remove return value of
set_groups") removed the last use of "ret".
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After commit 4c1e1b34d5 ("nfsd: Store ex_anon_uid and ex_anon_gid as
kuids and kgids") using kuid/kgid for ex_anon_uid/ex_anon_gid,
user_namespace.h is not needed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If fsloc_parse() failed at kzalloc(), fs/nfsd/export.c
411
412 fsloc->locations = kzalloc(fsloc->locations_count
413 * sizeof(struct nfsd4_fs_location), GFP_KERNEL);
414 if (!fsloc->locations)
415 return -ENOMEM;
svc_export_parse() will call nfsd4_fslocs_free() with fsloc->locations = NULL,
so that, "kfree(fsloc->locations[i].path);" will cause a crash.
If fsloc_parse() failed after that, fsloc_parse() will call nfsd4_fslocs_free(),
and svc_export_parse() will call it again, so that, a double free is caused.
This patch checks the fsloc->locations, and set to NULL after it be freed.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
RPC_MAX_AUTH_SIZE is scattered around several places. Better to set it
once in the auth code, where this kind of estimate should be made. And
while we're at it we can leave it zero when we're not using krb5i or
krb5p.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
And switch a couple other functions from the encode(&p,...) convention
to the p = encode(p,...) convention mostly used elsewhere.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
encode_getattr, for example, can return nfserr_resource to indicate it
ran out of buffer space. That's not a legal error in the 4.1 case.
And in the 4.1 case, if we ran out of buffer space, we should have
exceeded a session limit too.
(Note in 1bc49d83c3 "nfsd4: fix
nfs4err_resource in 4.1 case" we originally tried fixing this error
return before fixing the problem that we could error out while we still
had lots of available space. The result was to trade one illegal error
for another in those cases. We decided that was helpful, so reverted
the change in fc208d026b, and are only
reinstating it now that we've elimited almost all of those cases.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I'm not sure why a client would want to stuff multiple reads in a
single compound rpc, but it's legal for them to do it, and we should
really support it.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The splice and readv cases are actually quite different--for example the
former case ignores the array of vectors we build up for the latter.
It is probably clearer to separate the two cases entirely.
There's some code duplication between the split out encoders, but this
is only temporary and will be fixed by a later patch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We currently allow only one read per compound, with operations before
and after whose responses will require no more than about a page to
encode.
While we don't expect clients to violate those limits any time soon,
this limitation isn't really condoned by the spec, so to future proof
the server we should lift the limitation.
At the same time we'd like to continue to support zero-copy reads.
Supporting multiple zero-copy-reads per compound would require a new
data structure to replace struct xdr_buf, which can represent only one
set of included pages.
So for now we plan to modify encode_read() to support either zero-copy
or non-zero-copy reads, and use some heuristics at the start of the
compound processing to decide whether a zero-copy read will work.
This will allow us to support more exotic compounds without introducing
a performance regression in the normal case.
Later patches handle those "exotic compounds", this one just makes sure
zero-copy is turned off in those cases.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We plan to use this estimate to decide whether or not to allow zero-copy
reads. Currently we're assuming all getattr's are a page, which can be
both too small (ACLs e.g. may be arbitrarily long) and too large (after
an upcoming read patch this will unnecessarily prevent zero copy reads
in any read compound also containing a getattr).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There's no advantage to this zero-copy-style readlink encoding, and it
unnecessarily limits the kinds of compounds we can handle. (In practice
I can't see why a client would want e.g. multiple readlink calls in a
comound, but it's probably a spec violation for us not to handle it.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
As long as we're here, let's enforce the protocol's limit on the number
of directory entries to return in a readdir.
I don't think anyone's ever noticed our lack of enforcement, but maybe
there's more of a chance they will now that we allow larger readdirs.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently we limit readdir results to a single page. This can result in
a performance regression compared to NFSv3 when reading large
directories.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we know the limits the session places on the size of the rpc, we
can also use that information to release any unnecessary reserved reply
buffer space.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We can simplify session limit enforcement by restricting the xdr buflen
to the session size.
Also fix a preexisting bug: we should really have been taking into
account the auth-required space when comparing against session limits,
which are limits on the size of the entire rpc reply, including any krb5
overhead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We don't necessarily want to assume that the buflen is the same
as the number of bytes available in the pages. We may have some reason
to set it to something less (for example, later patches will use a
smaller buflen to enforce session limits).
So, calculate the buflen relative to the previous buflen instead of
recalculating it from scratch.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It will turn out to be useful to have a more accurate estimate of reply
size; so, piggyback on the existing op reply-size estimators.
Also move nfsd4_max_reply to nfs4proc.c to get easier access to struct
nfsd4_operation and friends. (Thanks to Christoph Hellwig for pointing
out that simplification.)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
I ran into this corner case in testing: in theory clients can provide
state owners up to 1024 bytes long. In the sessions case there might be
a risk of this pushing us over the DRC slot size.
The conflicting owner isn't really that important, so let's humor a
client that provides a small maxresponsize_cached by allowing ourselves
to return without the conflicting owner instead of outright failing the
operation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Limits on maxresp_sz mean that we only ever need to replay rpc's that
are contained entirely in the head.
The one exception is very small zero-copy reads. That's an odd corner
case as clients wouldn't normally ask those to be cached.
in any case, this seems a little more robust.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
After this we can handle for example getattr of very large ACLs.
Read, readdir, readlink are still special cases with their own limits.
Also we can't handle a new operation starting close to the end of a
page.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Now that all op encoders can handle running out of space, we no longer
need to check the remaining size for every operation; only nonidempotent
operations need that check, and that can be done by
nfsd4_check_resp_size.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Once we've included page-cache pages in the encoding it's difficult to
remove them and restart encoding. (xdr_truncate_encode doesn't handle
that case.) So, make sure we'll have adequate space to finish the
operation first.
For now COMPOUND_SLACK_SPACE checks should prevent this case happening,
but we want to remove those checks.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We've tried to prevent running out of space with COMPOUND_SLACK_SPACE
and special checking in those operations (getattr) whose result can vary
enormously.
However:
- COMPOUND_SLACK_SPACE may be difficult to maintain as we add
more protocol.
- BUG_ON or page faulting on failure seems overly fragile.
- Especially in the 4.1 case, we prefer not to fail compounds
just because the returned result came *close* to session
limits. (Though perfect enforcement here may be difficult.)
- I'd prefer encoding to be uniform for all encoders instead of
having special exceptions for encoders containing, for
example, attributes.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Normally xdr encoding proceeds in a single pass from start of a buffer
to end, but sometimes we have to write a few bytes to an earlier
position.
Use write_bytes_to_xdr_buf for these cases rather than saving a pointer
to write to. We plan to rewrite xdr_reserve_space to handle encoding
across page boundaries using a scratch buffer, and don't want to risk
writing to a pointer that was contained in a scratch buffer.
Also it will no longer be safe to calculate lengths by subtracting two
pointers, so use xdr_buf offsets instead.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
We have the same problem with ->d_lock order in the inner loop, where
we are dropping references to ancestors. Same solution, basically -
instead of using dentry_kill() we use lock_parent() (introduced in the
previous commit) to get that lock in a safe way, recheck ->d_count
(in case if lock_parent() has ended up dropping and retaking ->d_lock
and somebody managed to grab a reference during that window), trylock
the inode->i_lock and use __dentry_kill() to do the rest.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one. d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.
Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order. We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU. And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.
That deals with a half of the problem - killing dentries on the
shrink list itself. Another one (dropping their parents) is
in the next commit.
locking parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one. Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order. Fortunately, once the D1 is locked, we can check if
D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough. In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The object and block layouts already exist in their own
subdirectories. This patch completes the set!
Note that as a layout denotes nfs4 already, I stripped
that prefix out of the file names.
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Those flags are obsolete and checking them can incorrectly cause
remount operations to fail.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Place the call to resend the failed GETATTR under the error handler so that
when appropriate, the GETATTR is retried more than once.
The server can fail the GETATTR op in the OPEN compound with a recoverable
error such as NFS4ERR_DELAY. In the case of an O_EXCL open, the server has
created the file, so a retrans of the OPEN call will fail with NFS4ERR_EXIST.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We cannot allow nfs_page_group_lock to use TASK_KILLABLE here, since
the loop would cause a busy wait if somebody kills the task.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Handle the case where nfs_create_request() returns an error.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_read_completion relied on the fact that there was a 1:1 mapping
of page to nfs_request, but this has now changed.
Regions not covered by a request have already been zeroed elsewhere.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the new pg_test interface to adjust requests to fit in the current
stripe / segment.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Remove alignment checks that would revert to MDS and change pg_test
to return the max ammount left in the segment (or other pg_test call)
up to size of passed request, or 0 if no space is left.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Support direct requests that span multiple pnfs data servers by
comparing nfs_pgio_header->verf to a cached verf in pnfs_commit_bucket.
Continue to use dreq->verf if the MDS is used / non-pNFS.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since the ability to split pages into subpage requests has been added,
nfs_pgio_header->rpc_list only ever has one pgio data.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Use the newly added support for multiple requests per page for
rsize/wsize < PAGE_SIZE, instead of having multiple read / write
data structures per pageio header.
This allows us to get rid of nfs_pgio_multi.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now that pg_test can change the size of the request (by returning a non-zero
size smaller than the request), pg_test functions that call other
pg_test functions must return the minimum of the result - or 0 if any fail.
Also clean up the logic of some pg_test functions so that all checks are
for contitions where coalescing is not possible.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Change how nfs_mark_uptodate checks to see if writes cover a whole page.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations that modify state for a whole page must be syncronized across
all requests within a page group. In the write path, this is calling
end_page_writeback and removing the head request from an inode.
Both of these operations should not be called until all requests
in a page group have reached the point where they would call them.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations that modify state for a whole page must be syncronized across
all requests within a page group. In the read path, this is calling
unlock_page and SetPageUptodate. Both of these functions should not be
called until all requests in a page group have reached the point where
they would call them.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add "page groups" - a circular list of nfs requests (struct nfs_page)
that all reference the same page. This gives nfs read and write paths
the ability to account for sub-page regions independently. This
somewhat follows the design of struct buffer_head's sub-page
accounting.
Only "head" requests are ever added/removed from the inode list in
the buffered write path. "head" and "sub" requests are treated the
same through the read path and the rest of the write/commit path.
Requests are given an extra reference across the life of the list.
Page groups are never rejoined after being split. If the read/write
request fails and the client falls back to another path (ie revert
to MDS in PNFS case), the already split requests are pushed through
the recoalescing code again, which may split them further and then
coalesce them into properly sized requests on the wire. Fragmentation
shouldn't be a problem with the current design, because we flush all
requests in page group when a non-contiguous request is added, so
the only time resplitting should occur is on a resend of a read or
write.
This patch lays the groundwork for sub-page splitting, but does not
actually do any splitting. For now all page groups have one request
as pg_test functions don't yet split pages. There are several related
patches that are needed support multiple requests per page group.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Call nfs_can_coalesce_requests for every request, even the first one.
This is needed for future patches to give pg_test a way to inform
add_request to reduce the size of the request.
Now @prev can be null in nfs_can_coalesce_requests and pg_test functions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This is a step toward allowing pg_test to inform the the
coalescing code to reduce the size of requests so they may fit in
whatever scheme the pg_test callback wants to define.
For now, just return the size of the request if there is space, or 0
if there is not. This shouldn't change any behavior as it acts
the same as when the pg_test functions returned bool.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
@inode is passed but not used.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point the read and write structures look identical, so combine
them into something shared by both.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
What we have here is two functions that look identical. Let's share
some more code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Once again, these two functions look identical in the read and write
case. Time to combine them together!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Most of this code is the same for both the read and write paths, so
combine everything and use the rw_ops when necessary.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Result will be massaged to saner shape in the next commits. It is
ugly, no questions - the point of that one is to be a provably
equivalent transformation (and it might be worth splitting a bit
more).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Check for NULL before using the acl in the access type switch
statement. This seems to be consistent with what is done in the JFFS
and ext4 filesystems and with the behaviour of JFS in the 3.13 kernel.
The bug seemed to be introduced in commit 2cc6a5a0.
The bug results in a kernel Oops, NULL dereference could not be handled
when accessing a JFS filesystem. The rdiff-backup process seemed to
trigger the bug. See also reported bug #75341:
https://bugzilla.kernel.org/show_bug.cgi?id=75341
Signed-off-by: William Burrow <wbkernel@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
These functions are almost identical on both the read and write side.
FLUSH_COND_STABLE will never be set for the read path, so leaving it in
the generic code won't hurt anything.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the read and write versions of this function look
identical so both should use the same function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Write adds a little bit of code dealing with flush flags, but since
"how" will always be 0 when reading we can share the code.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths set up this struct in exactly the same way, so
create a single shared struct.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Combining these functions will let me make a single nfs_rw_common_ops
struct (see the next patch).
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths do exactly the same thing for the rpc_prepare
rpc_op. This patch combines them together into a single function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I create a new struct nfs_rw_ops to decide the differences between reads
and writes. This struct will be set when initializing a new
nfs_pgio_descriptor, and then passed on to the nfs_rw_header when a new
header is allocated.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
These functions are identical for the read and write paths so they can
be combined.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The header had a pointer to the verifier that was set from the old write
data struct. We don't need to keep the pointer around now that we have
shared structures.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The only difference is the write verifier field, but we can keep that
for a little bit longer.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At this point, the only difference between nfs_read_data and
nfs_write_data is the write verifier.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reads and writes have very similar results. This patch combines the two
structs together with comments to show where the differing fields are
used.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reads and writes have very similar arguments. This patch combines them
together and documents the few fields used only by write.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_read based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The write_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to chose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_write based on the presence of a layout
driver, and a new force_mds flag to the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
"fdatasync() is similar to fsync(), but does not flush modified metadata
unless that metadata is needed in order to allow a subsequent data
retrieval to be correctly handled."
We absolutely need to commit the layouts to be able to retrieve the data
in case either the client, the server or the storage subsystem go down.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
xdr_reserve_space should now be calculating the length correctly as we
go, so there's no longer any need to fix it up here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a cosmetic change for now; no change in behavior.
Note we're just depending on xdr_reserve_space to do the bounds checking
for us, we're not really depending on its adjustment of iovec or xdr_buf
lengths yet, as those are fixed up by as necessary after the fact by
read-link operations and by nfs4svc_encode_compoundres. However we do
have to update xdr->iov on read-like operations to prevent
xdr_reserve_space from messing with the already-fixed-up length of the
the head.
When the attribute encoding fails partway through we have to undo the
length adjustments made so far. We do it manually for now, but later
patches will add an xdr_truncate_encode() helper to handle cases like
this.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
It can happen only when dentry_kill() is called with unlock_on_failure
equal to 0 - other callers had dentry pinned until the moment they've
got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().
IOW, only one of three call sites of dentry_kill() might end up reaching
that code. Just move it there.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The debugging check which verifies that we never write outside of the file
length was incorrect, since it was multiplying file length by the page size,
instead of dividing. Fix this.
Spotted-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Commit 6130f5315e "switch vmsplice_to_user() to copy_page_to_iter()" in
v3.15-rc1 broke vmsplice(2).
This patch fixes two bugs:
- count is not initialized to a proper value, which resulted in no data
being copied
- if rw_copy_check_uvector() returns negative then the iov might be leaked.
Tested OK.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cody Schafer already fixed binary file creation for attribute groups, see [1].
This patch makes the appropriate changes for binary file removal
of attribute groups.
[1]: http://lkml.org/lkml/2014/2/27/832
Signed-off-by: Robert ABEL <rabel@cit-ec.uni-bielefeld.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.
Cc: stable@vger.kernel.org
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tail of a page straddling inode size must be zeroed when being written
out due to POSIX requirement that modifications of mmaped page beyond
inode size must not be written to the file. ext4_bio_write_page() did
this only for blocks fully beyond inode size but didn't properly zero
blocks partially beyond inode size. Fix this.
The problem has been uncovered by mmap_11-4 test in openposix test suite
(part of LTP).
Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c
Fixes: bd2d0210cf
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove local variable "stored" from ext4_readdir(...). This variable
gets initialized but is never used inside the function.
Signed-off-by: Giedrius Rekasius <giedrius.rekasius@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
xfstests generic/091 is failing when mounting ext4 with data=journal.
I think that this regression is same problem that occurred prior to collapse
range issue. So ZERO RANGE also need to call ext4_force_commit as
collapse range.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This post-encoding check should be taking into account the need to
encode at least an out-of-space error to the following op (if any).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd4_check_resp_size() returns an error then we should really be
truncating the reply here, otherwise we may leave extra garbage at the
end of the rpc reply.
Also add a warning to catch any cases where our reply-size estimates may
be wrong in the case of a non-idempotent operation.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If UBIFS_DEBUG is defined an additional assertion of the ui_lock
spinlock in do_writepage cannot compile because the ui pointer has not
been previously declared.
Fix this by declaring and initializing the ui pointer in case
UBIFS_DEBUG is defined.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Function ubifs_read_one_lp will not set @lp and returns
an error when ubifs_read_one_lp failed. We should not
perform ubifs_dump_lprop in this case because @lp is not
initialized as we wanted.
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
we avoid taking global s_orphan_lock in some cases and hold it for
shorter time in other cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use sbi pointer consistently in ext4_orphan_del() instead of opencoding
it sometimes. Also ext4_orphan_add() uses EXT4_SB(sb) often so create
sbi variable for it as well and use it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Pull AFS fixes and cleanups from David Howells:
"Here are some patches to the AFS filesystem:
1) Fix problems in the clean-up parts of the cache manager service
handler.
2) Split afs_end_call() introduced in (1) and replace some identical
code elsewhere with a call to the first half of the split function.
3) Fix an error introduced in the workqueue PREPARE_WORK() elimination
commits.
4) Clean up argument passing to functions called from the workqueue as
there's now an insulating layer between them and the workqueue.
This is possible from (3)"
* 'afs' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
AFS: Pass an afs_call* to call->async_workfn() instead of a work_struct*
AFS: Fix kafs module unloading
AFS: Part of afs_end_call() is identical to code elsewhere, so split it
AFS: Fix cache manager service handlers
Pull two nfsd bugfixes from Bruce Fields:
"Just two bugfixes, one for a merge-window-introduced ACL regression,
the other for a longer-standing v4 state bug"
* 'for-3.15' of git://linux-nfs.org/~bfields/linux:
nfsd4: warn on finding lockowner without stateid's
nfsd4: remove lockowner when removing lock stateid
nfsd4: fix corruption on setting an ACL.
In dlm_init, if create dlm_lockname_cache failed in
dlm_init_master_caches, it will destroy dlm_lockres_cache which created
before twice. And this will cause system die when loading modules.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently if the nfs-level part of a reply would be too large, we'll
return an error to the client. But if the nfs-level part fits and
leaves no room for krb5p or krb5i stuff, then we just drop the request
entirely.
That's no good. Instead, reserve some slack space at the end of the
buffer and make sure we fail outright if we'd come close.
The slack space here is a massive overstimate of what's required, we
should probably try for a tighter limit at some point.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Just change the nfsd4_encode_getattr api. Not changing any code or
adding any new functionality yet.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a mechanical transformation with no change in behavior.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently a non-idempotent op reply may be cached if it fails in the
proc code but not if it fails at xdr decoding. I doubt there are any
xdr-decoding-time errors that would make this a problem in practice, so
this probably isn't a serious bug.
The space estimates should also take into account space required for
encoding of error returns. Again, not a practical problem, though it
would become one after future patches which will tighten the space
estimates.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The client is actually asking for 2532 bytes. I suspect that's a
mistake. But maybe we can allow some more. In theory lock needs more
if it might return a maximum-length lockowner in the denied case.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
OP_MODIFIES_SOMETHING flags operations that we should be careful not to
initiate without being sure we have the buffer space to encode a reply.
None of these ops fall into that category.
We could probably remove a few more, but this isn't a very important
problem at least for ops whose reply size is easy to estimate.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
call->async_workfn() can take an afs_call* arg rather than a work_struct* as
the functions assigned there are now called from afs_async_workfn() which has
to call container_of() anyway.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Reviewed-by: Tejun Heo <tj@kernel.org>
At present, it is not possible to successfully unload the kafs module if there
are outstanding async outgoing calls (those made with afs_make_call()). This
appears to be due to the changes introduced by:
commit 059499453a
Author: Tejun Heo <tj@kernel.org>
Date: Fri Mar 7 10:24:50 2014 -0500
Subject: afs: don't use PREPARE_WORK
which didn't go far enough. The problem is due to:
(1) The aforementioned commit introduced a separate handler function pointer
in the call, call->async_workfn, in addition to the original workqueue
item, call->async_work, for asynchronous operations because workqueues
subsystem cannot handle the workqueue item pointer being changed whilst
the item is queued or being processed.
(2) afs_async_workfn() was introduced in that commit to be the callback for
call->async_work. Its sole purpose is to run whatever call->async_workfn
points to.
(3) call->async_workfn is only used from afs_async_workfn(), which is only
set on async_work by afs_collect_incoming_call() - ie. for incoming
calls.
(4) call->async_workfn is *not* set by afs_make_call() when outgoing calls are
made, and call->async_work is set afs_process_async_call() - and not
afs_async_workfn().
(5) afs_process_async_call() now changes call->async_workfn rather than
call->async_work to point to afs_delete_async_call() to clean up, but this
is only effective for incoming calls because call->async_work does not
point to afs_async_workfn() for outgoing calls.
(6) Because, for incoming calls, call->async_work remains pointing to
afs_process_async_call() this results in an infinite loop.
Instead, make the workqueue uniformly vector through call->async_workfn, via
afs_async_workfn() and simply initialise call->async_workfn to point to
afs_process_async_call() in afs_make_call().
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Split afs_end_call() into two pieces, one of which is identical to code in
afs_process_async_call(). Replace the latter with a call to the first part of
afs_end_call().
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Signed-off-by: David Howells <dhowells@redhat.com>