In ->atomic_open(inode, dentry, file, opened) calling finish_no_open(file, NULL)
is equivalent to dget(dentry); return finish_no_open(file, dentry);
No need to open-code that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dentry is always hashed and negative, inode - non-error, non-NULL and
non-directory. In such conditions d_splice_alias() is equivalent to
"d_instantiate(dentry, inode) and return NULL", which simplifies the
downstream code and is consistent with the "have to create a new object"
case.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Even when security labels are disabled we support at least the same
attributes as v4.1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The locked page should be released before returning the function.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The functions iput() and put_pid() test whether their argument is NULL
and then return immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jan Kara <jack@suse.cz>
We no longer need mpx.h in exec.c. This will obviously also
break the build for non-x86 builds. We get the MPX includes that
we need from mmu_context.h now.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-popualated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
MPX-enabled applications using large swaths of memory can
potentially have large numbers of bounds tables in process
address space to save bounds information. These tables can take
up huge swaths of memory (as much as 80% of the memory on the
system) even if we clean them up aggressively. In the worst-case
scenario, the tables can be 4x the size of the data structure
being tracked. IOW, a 1-page structure can require 4 bounds-table
pages.
Being this huge, our expectation is that folks using MPX are
going to be keen on figuring out how much memory is being
dedicated to it. So we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting
merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
both of those things. With this flag, MPX bounds-table VMAs can
be distinguished from other VMAs, and userspace can also walk
/proc/$pid/smaps to get memory usage for MPX.
In addition to this flag, we also introduce a special ->vm_ops
specific to MPX VMAs (see the patch "add MPX specific mmap
interface"), but currently different ->vm_ops do not by
themselves prevent VMA merging, so we still need this flag.
We understand that VM_ flags are scarce and are open to other
options.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current gfs2 freezing code is considerably more complicated than it
should be because it doesn't use the vfs freezing code on any node except
the one that begins the freeze. This is because it needs to acquire a
cluster glock before calling the vfs code to prevent a deadlock, and
without the new freeze_super and thaw_super hooks, that was impossible. To
deal with the issue, gfs2 had to do some hacky locking tricks to make sure
that a frozen node couldn't be holding on a lock it needed to do the
unfreeze ioctl.
This patch makes use of the new hooks to simply the gfs2 locking code. Now,
all the nodes in the cluster freeze and thaw in exactly the same way. Every
node in the cluster caches the freeze glock in the shared state. The new
freeze_super hook allows the freezing node to grab this freeze glock in
the exclusive state without first calling the vfs freeze_super function.
All the nodes in the cluster see this lock change, and call the vfs
freeze_super function. The vfs locking code guarantees that the nodes can't
get stuck holding the glocks necessary to unfreeze the system. To
unfreeze, the freezing node uses the new thaw_super hook to drop the freeze
glock. Again, all the nodes notice this, reacquire the glock in shared mode
and call the vfs thaw_super function.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently, freezing a filesystem involves calling freeze_super, which locks
sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
hard for gfs2 (and potentially other cluster filesystems) to use the vfs
freezing code to do freezes on all the cluster nodes.
In order to communicate that a freeze has been requested, and to make sure
that only one node is trying to freeze at a time, gfs2 uses a glock
(sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
this lock before calling freeze_super. This means that two nodes can
attempt to freeze the filesystem by both calling freeze_super, acquiring
the sb->s_umount lock, and then attempting to grab the cluster glock
sd_freeze_gl. Only one will succeed, and the other will be stuck in
freeze_super, making it impossible to finish freezing the node.
To solve this problem, this patch adds the freeze_super and thaw_super
hooks. If a filesystem implements these hooks, they are called instead of
the vfs freeze_super and thaw_super functions. This means that every
filesystem that implements these hooks must call the vfs freeze_super and
thaw_super functions itself within the hook function to make use of the vfs
freezing code.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* Another attempt at moving x86 to libstub taking advantage of the
__pure attribute - Ard Biesheuvel
* Add EFI runtime services section to ptdump - Mathias Krause
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUZnQGAAoJEC84WcCNIz1VfyUP/1MCVt4vepl7+0JzdUP/eVPs
CwPM6gBOvgx1PviWrtvSU8UyjtYDqZx7jnCvyvbmlgixAqqIoFm80x5sd9DfyJBj
vSrmavXaJgQomJN3N+fvaIpGJXp8NQmeNT87++UMb6VE5nYvx7suDcwfTqOaxcYt
yDwKatTXTvQxDLlGgtymp2UhgVKBICs9WVo8weevB5LPmpt4TFCi1GDSimJkfg+0
JTvkKF+QxPmVqgwY7bgdFFcfhsYCux5VgtbD4DJKS3LgfLJLAMKPOt83DbeOSmIa
8zqtlF3eMwHgecKrMLSfIZH3XIl2gsIdPdvT6iBQkwwGZuSzG93JgwJ90HYiKoDm
yNlffnhmdgI2RXO97UZJpzqGor+eNc0auuS4485PcE8NtZ1tbo20A/OCpfIzK8j8
Vk7sfZxaaHKF5PUtRe6vo3myRlUCofMIuSEWSF8d709R6AEEuia6RZ3Y45EJPROn
fKOiLsf7Og1Mk43Iy2lb7kFT766OsUnZZHU/xiIZj/v94HPWFWoFPtxPC0IURvPx
24oiJxCnyXWGtoyn+SSprl+NAPuPsxVFYriTwaq2RBuoY0NAdy7NIXKe2HTp/WI0
oSTtYkCRcwiHv0aSrg+yQHmwH7y7m39S3yIS4t5LXenn2G4ObUUjhcAQdF2Ft0Pr
MT/l+stTt390cpfpcQbE
=he2O
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi
Pull EFI updates for v3.19 from Matt Fleming:
- Support module unload for efivarfs - Mathias Krause
- Another attempt at moving x86 to libstub taking advantage of the
__pure attribute - Ard Biesheuvel
- Add EFI runtime services section to ptdump - Mathias Krause
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Highlights include:
- Stable patches to fix NFSv4.x delegation reclaim error paths
- Fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features
- Fix a use-after-free problem with pNFS block layouts
- Fix a memory leak in the pNFS files O_DIRECT code
- Replace an intrusive and Oops-prone performance fix in the NFSv4 atomic
open code with a safer one-line version and revert the two original patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUZol9AAoJEGcL54qWCgDyNQQQALnngvpPR51BoO/iTz9ruXol
fGZy0SRIlTUKm1ArsQsQ+HGbV5K0hgP3Tg+z2AtEEZ8u/2Fi2Bqdl6+eNY12tKHd
uUctDdM5TXLrETAn1UULrnd2eX1cvPMBfOlXlAdNHHsGEgC7w7YQ+rzGwnls+HDy
LYXzY7Y3jYGdTMaRgZc5YRdtd8JBpCxciRvPEQLDIobwP0JnZC1afTLe1XInqB2I
TZ4NTHT+DEWA+Ou1P2deL7+RuJNEAeWWBvULJy76n4BqKvN/HNedOO5HyBYXrwSd
3UX3wbx9CWRxN1F0EqNKxjxZ/597JwqBeNoTDRcofLsqumUfAOtlbym1EahcD3Ls
pykopNfgUhGuhxolStmuHdS6CnyQPERpR5lFZcDp7XtcwSq4FcwD8DRzLJMZW5dg
N1lkfFlwQN3rqdk/NEHL+IxS41Hlk4HXjMoP6MNbRtqzIN6tW9tvC4MtAWd1aYxO
YuUW281pbWxXQ731s0kTIrMUdQ9vGSRBMcbnO9rL3o+xkh8y5SPVkx9lhdhJN0UD
VbQ5Ws/xZ54bD1PfyYb+Yx659lI8MSFOsDuMuLmDtfYnVicHwCA3H63StvQ3ihf/
q0gu8Iex9YbNNjf7IfYGuWPmPn3gwPBoURPC0bcZvMPdY6DXodU6Oj4BRTQ5VCie
9N0pt2wp2eRjaSzD7r5A
=8YN6
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- stable patches to fix NFSv4.x delegation reclaim error paths
- fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2
features
- fix a use-after-free problem with pNFS block layouts
- fix a memory leak in the pNFS files O_DIRECT code
- replace an intrusive and Oops-prone performance fix in the NFSv4
atomic open code with a safer one-line version and revert the two
original patches"
* tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor
NFS: Don't try to reclaim delegation open state if recovery failed
NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked
NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE
NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
NFS: SEEK is an NFS v4.2 feature
nfs: Fix use of uninitialized variable in nfs_getattr()
nfs: Remove bogus assignment
nfs: remove spurious WARN_ON_ONCE in write path
pnfs/blocklayout: serialize GETDEVICEINFO calls
nfs: fix pnfs direct write memory leak
Revert "NFS: nfs4_do_open should add negative results to the dcache."
Revert "NFS: remove BUG possibility in nfs4_open_and_get_state"
NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
gfs2_fallocate() wasn't updating ctime and mtime when modifying the
inode. Add a call to file_update_time() to do that.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This addresses an issue caught by fsx where the inode size was not being
updated to the expected value after fallocate(2) with mode 0.
The problem was caused by the offset and len parameters being converted
to multiples of the file system's block size, so i_size would be rounded
up to the nearest block size multiple instead of the requested size.
This replaces the per-chunk i_size updates with a single i_size_write on
successful completion of the operation. With this patch gfs2 gets
through a complete run of fsx.
For clarity, the check for (error == 0) following the loop is removed as
all failures before that point jump to out_* labels or return.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
gfs2_fallocate wasn't checking inode_newsize_ok nor get_write_access.
Split out the context setup and inode locking pieces into a separate
function to make it more clear and add these missing calls.
inode_newsize_ok is called conditional on FALLOC_FL_KEEP_SIZE as there
is no need to enforce a file size limit if it isn't going to change.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add IIO include files
kernel/panic.c: update comments for print_tainted
mem-hotplug: reset node present pages when hot-adding a new pgdat
mem-hotplug: reset node managed pages when hot-adding a new pgdat
mm/debug-pagealloc: correct freepage accounting and order resetting
fanotify: fix notification of groups with inode & mount marks
mm, compaction: prevent infinite loop in compact_zone
mm: alloc_contig_range: demote pages busy message from warn to info
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
mm/page_alloc: restrict max order of merging on isolated pageblock
mm/page_alloc: move freepage counting logic to __free_one_page()
mm/page_alloc: add freepage on isolate pageblock to correct buddy list
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
mm/compaction: skip the range until proper target pageblock is met
zram: avoid kunmap_atomic() of a NULL pointer
fsnotify() needs to merge inode and mount marks lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).
Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted
ignore masks not being used in some cases.
Fix the problem by always using the same comparison function when
sorting / merging the mark lists.
Thanks to Heinrich Schuchardt for improvements of my patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bits one
to 16 bits.
Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Any attempt to call nfs_remove_bad_delegation() while a delegation is being
returned is currently a no-op. This means that we can end up looping
forever in nfs_end_delegation_return() if something causes the delegation
to be revoked.
This patch adds a mechanism whereby the state recovery code can communicate
to the delegation return code that the delegation is no longer valid and
that it should not be used when reclaiming state.
It also changes the return value for nfs4_handle_delegation_recall_error()
to ensure that nfs_end_delegation_return() does not reattempt the lock
reclaim before state recovery is done.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch removes the assumption made previously, that we only need to
check the delegation stateid when it matches the stateid on a cached
open.
If we believe that we hold a delegation for this file, then we must assume
that its stateid may have been revoked or expired too. If we don't test it
then our state recovery process may end up caching open/lock state in a
situation where it should not.
We therefore rename the function nfs41_clear_delegation_stateid as
nfs41_check_delegation_stateid, and change it to always run through the
delegation stateid test and recovery process as outlined in RFC5661.
http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling
SEEK over v4.1. This is wrong, and can make servers crash.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Tested-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Variable 'err' needn't be initialized when nfs_getattr() uses it to
check whether it should call generic_fillattr() or not. That can result
in spurious error returns. Initialize 'err' properly.
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Commit 3a6fd1f004 (pnfs/blocklayout: remove read-modify-write handling
in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index
in variable initialization. AFAICS it's just a typo so remove it.
Spotted by Coverity (id 1248711).
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This WARN_ON_ONCE was supposed to catch reference counting bugs, but can
trigger in inappropriate situations.
This was reproducible using NFSv2 on an architecture with 64K pages -- we
verified that it was not a reference counting bug and the warning was
safe to ignore.
Reported-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets.
So we need to take care to free them when freeing dreq.
Ideally this needs to be done inside layout driver where ds_cinfo.buckets
are allocated. But buckets are attached to dreq and reused across LD IO iterations.
So I feel it's OK to free them in the generic layer.
Cc: stable@vger.kernel.org [v3.4+]
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is no need to keep the module loaded when it serves no function in
case the EFI runtime services are disabled. Return an error in this case
so loading the module will fail.
Also supply a module_exit function to allow unloading the module.
Last, but not least, set the owner of the file_system_type struct.
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Matthew Garrett <matthew.garrett@nebula.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
If i_size becomes large outside of MAX_INLINE_DATA, we shoud convert the inode.
Otherwise, we can make some dirty pages during the truncation, and those pages
will be written through f2fs_write_data_page.
At that moment, the inode has still inline_data, so that it tries to write non-
zero pages into inline_data area.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The scenario is like this.
One trhead triggers:
f2fs_write_data_pages
lock_page
f2fs_write_data_page
f2fs_lock_op <- wait
The other thread triggers:
f2fs_truncate
truncate_blocks
f2fs_lock_op
truncate_partial_data_page
lock_page <- wait for locking the page
This patch resolves this bug by relocating truncate_partial_data_page.
This function is just to truncate user data page and not related to FS
consistency as well.
And, we don't need to call truncate_inline_data. Rather than that,
f2fs_write_data_page will finally update inline_data later.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The # of inline_data inode is decreased only when it has inline_data.
After clearing the flag, we can't decreased the number.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It needs to write node pages if checkpoint is not doing in order to avoid
memory pressure.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
All filesystems using VFS quotas are now converted to use their private
i_dquot fields. Remove the i_dquot field from generic inode structure.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
CC: jfs-discussion@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs,
reiserfs) so it is beneficial to move this array to fs-private part of
the inode. We cannot just pass quota pointers from filesystems to quota
functions because during quotaon and quotaoff we have to traverse list
of all inodes and manipulate i_dquot pointers for each inode. So we
provide a function which generic quota code can use to get pointer to
the i_dquot array from the filesystem.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user, group, and project quotas. Tell VFS about it.
CC: xfs@oss.sgi.com
CC: Dave Chinner <david@fromorbit.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We support user and group quotas. Tell vfs about it.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
CC: cluster-devel@redhat.com
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all filesystems supporting VFS quota support user and group
quotas. With introduction of project quotas this is going to change so
make sure filesystem isn't called for quota type it doesn't support by
introduction of a bitmask determining which quota types each filesystem
supports.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Pull btrfs fix from Chris Mason:
"It's a one liner for an error cleanup path that leads to crashes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
This update fixes:
- incorrect warnings about i_mutex locking in
pagecache_isize_extended() and updates comments to match expected
locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJUXTquAAoJEK3oKUf0dfodzPQP+wVWm3suRpvfpOeljlwcCB1w
r1MjJjEgKRmlRCwTe4IYSSS2ZBSz3f5qTunde1PEAcUyyf2gO5b/gV2WCaVDWIpV
0/1RDaXIRplTvY/i5UAOtqOSUpNwvWz7PCmCAR7RCHFfTyBBbFRlRdg1GPCrBAzv
rf/kRm9C6fRHHwLojwNCLEzA0MAdjKVG05A+Xv3MnWcd9fRxtNryQnhTIUJoDCVl
5keebmizquoc88WoQxDX29j2Ce+yjMPj27YhB9Z09mfmFvbLHT46UP2jI5Ty+DX5
rJikXA5Jv6wtig9wsbsenK7fsy1CyAAbmxS8vjyHsdCSAMpN98HkndsSadF8nH5U
sEh43OJjJS5LedVNqfV+LtI9ZD9+fGpAETQFVI8TpefFBX2aYw3fomWO0AUbRuMi
s4f6iz2sDvgDa+oE38XqKif6CqFsBX0QQSuCDiPaMi79vy2VLE/I5SU7QAj42+BW
sEyGVVcdJDsDpkUe6SicIpbNNwhXCR4GYc4jI4QYjDdcK6rrliCnsI8hHk5pPeKk
Qvt6ERiP5dLvp9f6KEzvPAkdxBmDKOtHKMSEMJzBBDtxFJXStJNuYZVtMYb8JwJq
nV0WKSoXR0xT/IeMw8336HF4GO04VCHjY3QnNh0qNdHYtfXJerZmpzVhae7dJmZZ
nLFimb3q2TlfgEvkPqmi
=8S0c
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs fixes from Dave Chinner:
"This update fixes a warning in the new pagecache_isize_extended() and
updates some related comments, another fix for zero-range
misbehaviour, and an unforntuately large set of fixes for regressions
in the bulkstat code.
The bulkstat fixes are large but necessary. I wouldn't normally push
such a rework for a -rcX update, but right now xfsdump can silently
create incomplete dumps on 3.17 and it's possible that even xfsrestore
won't notice that the dumps were incomplete. Hence we need to get
this update into 3.17-stable kernels ASAP.
In more detail, the refactoring work I committed in 3.17 has exposed a
major hole in our QA coverage. With both xfsdump (the major user of
bulkstat) and xfsrestore silently ignoring missing files in the
dump/restore process, incomplete dumps were going unnoticed if they
were being triggered. Many of the dump/restore filesets were so small
that they didn't evenhave a chance of triggering the loop iteration
bugs we introduced in 3.17, so we didn't exercise the code
sufficiently, either.
We have already taken steps to improve QA coverage in xfstests to
avoid this happening again, and I've done a lot of manual verification
of dump/restore on very large data sets (tens of millions of inodes)
of the past week to verify this patch set results in bulkstat behaving
the same way as it does on 3.16.
Unfortunately, the fixes are not exactly simple - in tracking down the
problem historic API warts were discovered (e.g xfsdump has been
working around a 20 year old bug in the bulkstat API for the past 10
years) and so that complicated the process of diagnosing and fixing
the problems. i.e. we had to fix bugs in the code as well as
discover and re-introduce the userspace visible API bugs that we
unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.
Summary:
- incorrect warnings about i_mutex locking in pagecache_isize_extended()
and updates comments to match expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regression in the bulkstat code introduced in
3.17"
* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track bulkstat progress by agino
xfs: bulkstat error handling is broken
xfs: bulkstat main loop logic is a mess
xfs: bulkstat chunk-formatter has issues
xfs: bulkstat chunk formatting cursor is broken
xfs: bulkstat btree walk doesn't terminate
mm: Fix comment before truncate_setsize()
xfs: rework zero range to prevent invalid i_size updates
mm: Remove false WARN_ON from pagecache_isize_extended()
xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs: bulkstat doesn't release AGI buffer on error
This patch adds to control the memory footprint used by ino entries.
This will conduct best effort, not strictly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The bulkstat main loop progress is tracked by the "lastino"
variable, which is a full 64 bit inode. However, the loop actually
works on agno/agino pairs, and so there's a significant disconnect
between the rest of the loop and the main cursor. Convert this to
use the agino, and pass the agino into the chunk formatting function
and convert it too.
This gets rid of the inconsistency in the loop processing, and
finally makes it simple for us to skip inodes at any point in the
loop simply by incrementing the agino cursor.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The error propagation is a horror - xfs_bulkstat() returns
a rval variable which is only set if there are formatter errors. Any
sort of btree walk error or corruption will cause the bulkstat walk
to terminate but will not pass an error back to userspace. Worse
is the fact that formatter errors will also be ignored if any inodes
were correctly formatted into the user buffer.
Hence bulkstat can fail badly yet still report success to userspace.
This causes significant issues with xfsdump not dumping everything
in the filesystem yet reporting success. It's not until a restore
fails that there is any indication that the dump was bad and tha
bulkstat failed. This patch now triggers xfsdump to fail with
bulkstat errors rather than silently missing files in the dump.
This now causes bulkstat to fail when the lastino cookie does not
fall inside an existing inode chunk. The pre-3.17 code tolerated
that error by allowing the code to move to the next inode chunk
as the agino target is guaranteed to fall into the next btree
record.
With the fixes up to this point in the series, xfsdump now passes on
the troublesome filesystem image that exposes all these bugs.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There are a bunch of variables tha tare more wildy scoped than they
need to be, obfuscated user buffer checks and tortured "next inode"
tracking. This all needs cleaning up to expose the real issues that
need fixing.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The loop construct has issues:
- clustidx is completely unused, so remove it.
- the loop tries to be smart by terminating when the
"freecount" tells it that all inodes are free. Just drop
it as in most cases we have to scan all inodes in the
chunk anyway.
- move the "user buffer left" condition check to the only
point where we consume space int eh user buffer.
- move the initialisation of agino out of the loop, leaving
just a simple loop control logic using the clusteridx.
Also, double handling of the user buffer variables leads to problems
tracking the current state - use the cursor variables directly
rather than keeping local copies and then having to update the
cursor before returning.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The xfs_bulkstat_agichunk formatting cursor takes buffer values from
the main loop and passes them via the structure to the chunk
formatter, and the writes the changed values back into the main loop
local variables. Unfortunately, this complex dance is full of corner
cases that aren't handled correctly.
The biggest problem is that it is double handling the information in
both the main loop and the chunk formatting function, leading to
inconsistent updates and endless loops where progress is not made.
To fix this, push the struct xfs_bulkstat_agichunk outwards to be
the primary holder of user buffer information. this removes the
double handling in the main loop.
Also, pass the last inode processed by the chunk formatter as a
separate parameter as it purely an output variable and is not
related to the user buffer consumption cursor.
Finally, the chunk formatting code is not shared by anyone, so make
it local to xfs_itable.c.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The bulkstat code has several different ways of detecting the end of
an AG when doing a walk. They are not consistently detected, and the
code that checks for the end of AG conditions is not consistently
coded. Hence the are conditions where the walk code can get stuck in
an endless loop making no progress and not triggering any
termination conditions.
Convert all the "tmp/i" status return codes from btree operations
to a common name (stat) and apply end-of-ag detection to these
operations consistently.
cc: <stable@vger.kernel.org> # 3.17
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
https://bugzilla.kernel.org/show_bug.cgi?id=86831
Markus reported that when shutting down mysqld (with AIO support,
on a ext3 formatted Harddrive) leads to a negative number of dirty pages
(underrun to the counter). The negative number results in a drastic reduction
of the write performance because the page cache is not used, because the kernel
thinks it is still 2 ^ 32 dirty pages open.
Add a warn trace in __dec_zone_state will catch this easily:
static inline void __dec_zone_state(struct zone *zone, enum
zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
+ WARN_ON_ONCE(item == NR_FILE_DIRTY &&
atomic_long_read(&zone->vm_stat[item]) < 0);
atomic_long_dec(&vm_stat[item]);
}
[ 21.341632] ------------[ cut here ]------------
[ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[ 21.355296] Modules linked in: wutbox_cp sata_mv
[ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[ 21.366793] Workqueue: events free_ioctx
[ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
(show_stack+0x20/0x24)
[ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
(dump_stack+0x24/0x28)
[ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
(warn_slowpath_common+0x84/0x9c)
[ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
(warn_slowpath_null+0x2c/0x34)
[ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
(cancel_dirty_page+0x164/0x224)
[ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
(truncate_inode_page+0x8c/0x158)
[ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
(truncate_inode_pages_range+0x11c/0x53c)
[ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
[<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
(truncate_setsize+0x5c/0x74)
[ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
(put_aio_ring_file.isra.14+0x34/0x90)
[ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
[<c013b424>] (aio_free_ring+0x20/0xcc)
[ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
(free_ioctx+0x24/0x44)
[ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
(process_one_work+0x134/0x47c)
[ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
(worker_thread+0x130/0x414)
[ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
(kthread+0xd4/0xec)
[ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
(ret_from_fork+0x14/0x20)
[ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
(bypasses the VFS dirty pages increment) when init, and aio fs uses
*default_backing_dev_info* as the backing dev, which does not disable
the dirty pages accounting capability.
So truncating aio ring file will contribute to accounting dirty pages (VFS
dirty pages decrement), then error occurs.
The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty. But thinking more, we have
already pinned pages via elevating the page's refcount, which can already
achieve the goal, so the SetPageDirty seems unnecessary.
In order to fix the issue, using the __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).
With the above change, the dirty pages accounting can work well. But as we
known, aio fs is an anonymous one, which should never cause any real write-back,
we can ignore the dirty pages (write back) accounting by disabling the dirty
pages (write back) accounting capability. So we introduce an aio private
backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
replace the default one.
Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()"
by me ;-/
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The seq_printf() will soon just return void, and seq_has_overflowed()
should be used instead to see if the seq can no longer accept input.
As the return value of debugfs_print_regs32() has no users and
the seq_file descriptor should be checked with seq_has_overflowed()
instead of return values of functions, it is better to just have
debugfs_print_regs32() also return void.
Link: http://lkml.kernel.org/p/2634b19eb1c04a9d31148c1fe6f1f3819be95349.1412031505.git.joe@perches.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joe Perches <joe@perches.com>
[ original change only updated seq_printf() return, added return of
void to debugfs_print_regs32() as well ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_printf functions shouldn't really check the return value.
Checking seq_has_overflowed() occasionally is used instead.
Update vfs documentation.
Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com
Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ did a few clean ups ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the kernel.dmesg_restrict restriction is in place, only users with
CAP_SYSLOG should be able to access crash dumps (like: attacker is
trying to exploit a bug, watchdog reboots, attacker can happily read
crash dumps and logs).
This puts the restriction on console-* types as well as sensitive
information could have been leaked there.
Other log types are unaffected.
Signed-off-by: Sebastian Schmidt <yath@yath.de>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
pstore compression/decompression was added during 3.12.
The ramoops driver prepends a "====timestamp.timestamp-C|D\n"
header to the compressed record before handing it over to pstore
driver which doesn't know about the header. In pstore_decompress(),
the pstore driver reads the first "==" as a zlib header, so the
decompression always fails. For example, this causes the driver
to write /dev/pstore/dmesg-ramoops-0.enc.z instead of
/dev/pstore/dmesg-ramoops-0.
This patch makes the ramoops driver remove the header before
pstore decompression.
Signed-off-by: Ben Zhang <benzh@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be
rebuilt. We did list_del() on the cursor, which results in an Oops on the
poisoned pointer in ovl_seek_cursor().
Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If the OPEN rpc call to the server fails with an ENOENT call, nfs_atomic_open
will create a negative dentry for that file, however it currently fails
to call nfs_set_verifier(), thus causing the dentry to be immediately
revalidated on the next call to nfs_lookup_revalidate() instead of following
the usual lookup caching rules.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a system wants to reduce the booting time as a top priority, now we can
use a mount option, -o fastboot.
With this option, f2fs conducts a little bit slow write_checkpoint, but
it can avoid the node page reads during the next mount time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__submit_merged_bio f2fs_write_end_io f2fs_write_end_io
wait_io = X wait_io = x
complete(X) complete(X)
wait_io = NULL
wait_for_completion()
free(X)
spin_lock(X)
kernel panic
In order to avoid this, this patch removes the wait_io facility.
Instead, we can use wait_on_all_pages_writeback(sbi) to wait for end_ios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is a chance to make a huge sized discard command, we don't need
to split it out, since each blkdev_issue_discard should wait one at a
time.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch simplifies the inline_data usage with the following rule.
1. inline_data is set during the file creation.
2. If new data is requested to be written ranges out of inline_data,
f2fs converts that inode permanently.
3. There is no cases which converts non-inline_data inode to inline_data.
4. The inline_data flag should be changed under inode page lock.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them. But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.
This bug came from commit 0678b6185
btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()
Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
cc: stable@vger.kernel.org # 3.3 or newer
JK: Added VFS: prefix to the message when changing it to make it more
standard.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There's no point in using test_and_clear_bit_le() when we don't use the
return value of the function. Just use clear_bit_le() instead.
Coverity-id: 1016434
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_write_begin() doesn't initialize the 'dn' variable if the inode has
inline data. However it uses its contents to decide whether it should
just zero out the page or load data to it. Thus if we are unlucky we can
zero out page contents instead of loading inline data into a page.
CC: stable@vger.kernel.org
CC: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rename f2fs_set/clear_bit to f2fs_test_and_set/clear_bit, which mean
set/clear bit and return the old value, for better readability.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Set raw_super default to NULL to avoid the possibly used
uninitialized warning, though we may never hit it in fact.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce f2fs_change_bit to simplify the change bit logic in
function set_to_next_nat{sit}.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use clear_inode_flag to replace the redundant cond_clear_inode_flag.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the unneeded argument 'type' from __get_victim, use
NO_CHECK_TYPE directly when calling v_ops->get_victim().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user specifies too low end sector for trimming, f2fs_trim_fs() will
use uninitialized value as a number of trimmed blocks and returns it to
userspace. Initialize number of trimmed blocks early to avoid the
problem.
Coverity-id: 1248809
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch declares f2fs_convert_inline_dir as a static function, which was
reported by kbuild test robot.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_dentry_ptr structure for the use of a function
parameter in inline_dentry operations.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a core function, f2fs_fill_dentries, to remove
redundant code in f2fs_readdir and f2fs_read_inline_dir.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, init_inode_metadata does not hold any parent directory's inode
page. So, f2fs_init_acl can grab its parent inode page without any problem.
But, when we use inline_dentry, that page is grabbed during f2fs_add_link,
so that we can fall into deadlock condition like below.
INFO: task mknod:11006 blocked for more than 120 seconds.
Tainted: G OE 3.17.0-rc1+ #13
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mknod D ffff88003fc94580 0 11006 11004 0x00000000
ffff880007717b10 0000000000000002 ffff88003c323220 ffff880007717fd8
0000000000014580 0000000000014580 ffff88003daecb30 ffff88003c323220
ffff88003fc94e80 ffff88003ffbb4e8 ffff880007717ba0 0000000000000002
Call Trace:
[<ffffffff8173dc40>] ? bit_wait+0x50/0x50
[<ffffffff8173d4cd>] io_schedule+0x9d/0x130
[<ffffffff8173dc6c>] bit_wait_io+0x2c/0x50
[<ffffffff8173da3b>] __wait_on_bit_lock+0x4b/0xb0
[<ffffffff811640a7>] __lock_page+0x67/0x70
[<ffffffff810acf50>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811652cc>] pagecache_get_page+0x14c/0x1e0
[<ffffffffa029afa9>] get_node_page+0x59/0x130 [f2fs]
[<ffffffffa02a63ad>] read_all_xattrs+0x24d/0x430 [f2fs]
[<ffffffffa02a6ca2>] f2fs_getxattr+0x52/0xe0 [f2fs]
[<ffffffffa02a7481>] f2fs_get_acl+0x41/0x2d0 [f2fs]
[<ffffffff8122d847>] get_acl+0x47/0x70
[<ffffffff8122db5a>] posix_acl_create+0x5a/0x150
[<ffffffffa02a7759>] f2fs_init_acl+0x29/0xcb [f2fs]
[<ffffffffa0286a8d>] init_inode_metadata+0x5d/0x340 [f2fs]
[<ffffffffa029253a>] f2fs_add_inline_entry+0x12a/0x2e0 [f2fs]
[<ffffffffa0286ea5>] __f2fs_add_link+0x45/0x4a0 [f2fs]
[<ffffffffa028b5b6>] ? f2fs_new_inode+0x146/0x220 [f2fs]
[<ffffffffa028b816>] f2fs_mknod+0x86/0xf0 [f2fs]
[<ffffffff811e3ec1>] vfs_mknod+0xe1/0x160
[<ffffffff811e4b26>] SyS_mknod+0x1f6/0x200
[<ffffffff81741d7f>] tracesys+0xe1/0xe6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add inline dir functions into normal dir ops' function to handle inline ops.
Besides, we enable inline dir mode when a new dir inode is created if
inline_data option is on.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Adds Functions to implement inline dir init/lookup/insert/delete/convert ops.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove needless reserved area copy, pointed by Dan Carpenter]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch exports some dir operations for inline dir, additionally introduces
f2fs_drop_nlink from f2fs_delete_entry for reusing by inline dir function.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch defines macro/inline dentry structure, and adds some helpers for
inline dir infrastructure.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The sceanrio is like this.
inline_data i_size page write_begin/vm_page_mkwrite
X 30 dirty_page
X 30 write to #4096 position
X 30 get_dnode_of_data wait for get_dnode_of_data
O 30 write inline_data
O 30 get_dnode_of_data
O 30 reserve data block
..
In this case, we have #0 = NEW_ADDR and inline_data as well.
We should not allow this condition for further access.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's consider the following scenario.
blkaddr[0] inline_data i_size i_blocks writepage truncate
NEW X 4096 2 dirty page #0
NEW X 0 change i_size
NEW X 0 2 f2fs_write_inline_data
NEW X 0 2 get_dnode_of_data
NEW X 0 2 truncate_data_blocks_range
NULL O 0 1 memcpy(inline_data)
NULL O 0 1 f2fs_put_dnode
NULL O 0 1 f2fs_truncate
NULL O 0 1 get_dnode_of_data
NULL O 0 1 *invalid block addr*
This patch adds checking inline_data flag during f2fs_truncate not to refer
corrupted block indices.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When trying to write inline_data, we should truncate any data block allocated
and pointed by the inode block.
We should consider the data index is not 0.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
... by not hitting rename_retry for reasons other than rename having
happened. In other words, do _not_ restart when finding that
between unlocking the child and locking the parent the former got
into __dentry_kill(). Skip the killed siblings instead...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
No need to store gfs2_dir_check result and test it before returning.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Pull VFS fixes from Al Viro:
"A bunch of assorted fixes, most of them followups to overlayfs merge"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ovl: initialize ->is_cursor
Return short read or 0 at end of a raw device, not EIO
isofs: don't bother with ->d_op for normal case
isofs_cmp(): we'll never see a dentry for . or ..
overlayfs: fix lockdep misannotation
ovl: fix check for cursor
overlayfs: barriers for opening upper-layer directory
rcu: Provide counterpart to rcu_dereference() for non-RCU situations
staging: android: logger: Fix log corruption regression
Pull btrfs fixes from Chris Mason:
"Filipe is nailing down some problems with our skinny extent variation,
and Dave's patch fixes endian problems in the new super block checks"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
Btrfs: properly clean up btrfs_end_io_wq_cache
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
btrfs: use macro accessors in superblock validation checks
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUVAF/AAoJENNvdpvBGATwEbAQALNiAIChEyJTnQDkAQc2wqqn
dv8NQmFr5aefc63A/+n/yJJGrQZtKs0ceh29ty5ksYLFXzUdc2ctFg6vBmllQfbz
PQawAk2gOkF8zfVuqiQU7X+wTBpGmGXTa8HY+WJTtk0pBfhl+p0PDCYsWXMwZJ1D
tAZpxJ4AmPc7A4hApWOvce6r7Xg24vZk/8UA93Tif9AkeY6VoN272Hx5b/UGmBHY
RCEgpowuiIY38bghtLh5+T0J98/EQNof46cEHgGI9nIDZeXRzgvDojE5bLI0/IS/
K07MjYlm/WFWsLFkgNJkTiqEXgnji9BNYRF1xxUjMMBAR4+fnFLw9kXXgcETrPCx
U7lHOhs8M2FK40cWhUDz/tukvL4S4lQwPEeqBPlRE8J5/twRyXHeZDp4F7LOobwq
mk6AajSJlP+05XwXOuCx7Hcf9uxjw/IpqhBS5IZxy8Nn3T2guPlY9wMhYU1RYFws
54FeE76SJ8EDgjVK/txj7rgh11GggWsjsdXvftSElM2DsKsqYEOKAvDzvwmbm7eV
dsFOlRB6B/X4UpiAC2MiPJynYg9TJ7LkVBzDZeZ/fbm7JhTqChSJDzapqdrmNPIY
SQqwLmFXnHqaw6HNitZ5Bs+fD6nfvKqy85NeImxE3lhLWDuiTt77Y3o80IW30TgN
5bnuXq8Rkukrxs/VDvPq
=kI6P
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"A set of miscellaneous ext4 bug fixes for 3.18"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: make ext4_ext_convert_to_initialized() return proper number of blocks
ext4: bail early when clearing inode journal flag fails
ext4: bail out from make_indexed_dir() on first error
jbd2: use a better hash function for the revoke table
ext4: prevent bugon on race between write/fcntl
ext4: remove extent status procfs files if journal load fails
ext4: disallow changing journal_csum option during remount
ext4: enable journal checksum when metadata checksum feature enabled
ext4: fix oops when loading block bitmap failed
ext4: fix overflow when updating superblock backups after resize
Pull quota and ext3 fixes from Jan Kara.
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs, jbd: use a more generic hash function
quota: Properly return errors from dquot_writeback_dquots()
ext3: Don't check quota format when there are no quota files
Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device. Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.
The raw driver needs the same end of device handling as was added for normal
block devices. Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:
Note that this mode applies only to future accesses of the newly
created file; the open() call that creates a read-only file
may well return a read/write file descriptor.
The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.
O_TMPFILE, however, behaves differently:
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
assert(fd == -1);
assert(errno == EACCES);
int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
assert(fd > 0);
For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:
if (*opened & FILE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
will_truncate = false;
acc_mode = MAY_OPEN;
path_to_nameidata(path, nd);
goto finish_open_created;
}
But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().
This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().
A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523
The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.
On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.
But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.
Signed-off-by: Eric Rannaud <e@nanocritical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.
Coverity-id: 1226848
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.
Coverity-id: 989044
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).
Coverity-id: 741300
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior. Use the generic
64-bit hash function instead.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted. Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().
Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.
Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
The return values of seq_printf/puts/putc are frequently misused.
Start down a path to remove all the return value uses of these
functions.
Move the seq_overflow() to a global inlined function called
seq_has_overflowed() that can be used by the users of seq_file() calls.
Update the documentation to not show return types for seq_printf
et al. Add a description of seq_has_overflowed().
Link: http://lkml.kernel.org/p/848ac7e3d1c31cddf638a8526fa3c59fa6fdeb8a.1412031505.git.joe@perches.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ Reworked the original patch from Joe ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge misc fixes from Andrew Morton:
"21 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
mm/balloon_compaction: fix deflation when compaction is disabled
sh: fix sh770x SCIF memory regions
zram: avoid NULL pointer access in concurrent situation
mm/slab_common: don't check for duplicate cache names
ocfs2: fix d_splice_alias() return code checking
mm: rmap: split out page_remove_file_rmap()
mm: memcontrol: fix missed end-writeback page accounting
mm: page-writeback: inline account_page_dirtied() into single caller
lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
drivers/rtc/rtc-bq32k.c: fix register value
memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
kernel/kmod: fix use-after-free of the sub_info structure
drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
mm, thp: fix collapsing of hugepages on madvise
drivers: of: add return value to of_reserved_mem_device_init()
mm: free compound page with correct order
gcov: add ARM64 to GCOV_PROFILE_ALL
fsnotify: next_i is freed during fsnotify_unmount_inodes.
mm/compaction.c: avoid premature range skip in isolate_migratepages_range
...
The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. E.g., it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.
The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:
$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
...
$ xfs_io -d -c "pread 128k 128k" <file>
<BUG>
If the entire delalloc extent happens to not have page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.
Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is not efficient nor ideal as we
writeback dirty data over the range and remove existing extents rather
than convert to unwrittern. The former writeback, however, is currently
the only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has lead to inconsistencies due to racing with writeback.
This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In
case of specific fs corruption that could result in xfs_bulkstat()
entering an infinite loop because we would be looping over the same
chunk over and over again. Fix the problem by checking the return value
and terminating the loop properly.
Coverity-id: 1231338
cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jie Liu <jeff.u.liu@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
Currently the code checks not for ERR_PTR and will cuase an oops in
ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
spin_unlock(&inode->i_lock);
continue;
}
As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.
Multiple crash dumps showed:
The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself. As this is not the value of list upon entry to the
function, the kernel never exits the loop.
To help narrow down problem, the call to list_del_init in
inode_sb_list_del was changed to list_del. This poisons the pointers in
the i_sb_list and causes a kernel to panic if it transverse a freed
inode.
Subsequent stress testing paniced in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.
We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.
The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget. However, the code doesn't do the
__iget call on next_i
if i_count == 0 or
if i_state & (I_FREEING | I_WILL_FREE)
The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list. This makes the handling of next_i more closely match the
handling of the variable "inode."
The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.
During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate. Advance next_i in those cases where
__iget is not done.
Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ken Helias <kenhelias@firemail.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current kernel. This contains:
- Two error handling fixes from Jan Kara. One for null_blk on
failure to add a device, and the other for the block/scsi_ioctl
SCSI_IOCTL_SEND_COMMAND fixing up the error jump point.
- A commit added in the merge window for the bio integrity bits
unfortunately disabled merging for all requests if
CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that
integrity checking wont disallow merges when not enabled.
- A fix from Ming Lei for merging and generating too many segments.
This caused a BUG in virtio_blk.
- Two error handling printk() fixups from Robert Elliott, improving
the information given when we rate limit.
- Error handling fixup on elevator_init() failure from Sudip
Mukherjee.
- A fix from Tony Battersby, fixing up a memory leak in the
scatterlist handling with scsi-mq"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined
lib/scatterlist: fix memory leak with scsi-mq
block: fix wrong error return in elevator_init()
scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
null_blk: Cleanup error recovery in null_add_dev()
blk-merge: recaculate segment if it isn't less than max segments
fs: clarify rate limit suppressed buffer I/O errors
fs: merge I/O error prints into one line
In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:
touch /mnt/a/empty102/x
unlink /mnt/a/empty102/x
rmdir /mnt/a/empty102
It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ovl_cache_entry.name is now an array not a pointer, so it makes no sense
test for it being NULL.
Detected by coverity.
From: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 68bf861107 ("overlayfs: make ovl_cache_entry->name an array instead of
+pointer")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
make sure that
a) all stores done by opening struct file don't leak past storing
the reference in od->upperfile
b) the lockless side has read dependency barrier
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The recent refactoring of the bulkstat code left a small landmine in
the code. If a inobt read fails, then the tree walk is aborted and
returns without releasing the AGI buffer or freeing the cursor. This
can lead to a subsequent bulkstat call hanging trying to grab the
AGI buffer again.
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:
1) We search in the extent tree for the skinny extent, which returns > 0
(not found);
2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;
3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;
4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;
5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and completed
after we released out path in step 3) before doing the second search.
Fix this by removing completely the path release and re-search logic. This is
safe, because if we seach for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed as well.
This race has been present since the introduction of skinny metadata (change
3173a18f70).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull two nfsd fixes from Bruce Fields:
"One regression from the 3.16 xdr rewrite, one an older bug exposed by
a separate bug in the client's new SEEK code"
* 'for-3.18' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix crash on unknown operation number
nfsd4: fix response size estimation for OP_SEQUENCE
inotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).
Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20140924082242.254858080@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module. This
fixes the problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.
Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The initial patch c926093ec5 (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
no sense having it a pointer - all instances have it pointing to
local variable in the same stack frame
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_splice_alias() callers expect it to either stash the inode reference
into a new alias, or drop the inode reference. That makes it possible
to just return d_splice_alias() result from ->lookup() instance, without
any extra housekeeping required.
Unfortunately, that should include the failure exits. If d_splice_alias()
returns an error, it leaves the dentry it has been given negative and
thus it *must* drop the inode reference. Easily fixed, but it goes way
back and will need backporting.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a simple read-only counter to super_block that indicates how deep this
is in the stack of filesystems. Previously ecryptfs was the only stackable
filesystem and it explicitly disallowed multiple layers of itself.
Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.
To limit the kernel stack usage we must limit the depth of the
filesystem stack. Initially the limit is set to 2.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This is useful because of the stacking nature of overlayfs. Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.
AV: even failing ovl_parse_opt() could've done some kstrdup()
AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add support for statfs to the overlayfs filesystem. As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs. There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.
Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.
This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.
The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.
The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.
Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.
Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).
Usage:
mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay
The following cotributions have been folded into this patch:
Neil Brown <neilb@suse.de>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read
Felix Fietkau <nbd@openwrt.org>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts
Erez Zadok <ezk@fsl.cs.sunysb.edu>
- fix cleanup after WARN_ON
Sedat Dilek <sedat.dilek@googlemail.com>
- fix up permission to confirm to new API
Robin Dong <hao.bigrat@gmail.com>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link
Andy Whitcroft <apw@canonical.com>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode
AV:
- ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
- ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
lack of check for udir being the parent of upper, dropping and regaining
the lock on udir (which would require _another_ check for parent being
right).
- bogus d_drop() in copyup and rename [fix from your mail]
- copyup/remove and copyup/rename races [fix from your mail]
- ovl_dir_fsync() leaving ERR_PTR() in ->realfile
- ovl_entry_free() is pointless - it's just a kfree_rcu()
- fold ovl_do_lookup() into ovl_lookup()
- manually assigning ->d_op is wrong. Just use ->s_d_op.
[patches picked from Miklos]:
* copyup/remove and copyup/rename races
* bogus d_drop() in copyup and rename
Also thanks to the following people for testing and reporting bugs:
Jordi Pujol <jordipujolp@gmail.com>
Andy Whitcroft <apw@canonical.com>
Michal Suchanek <hramrach@centrum.cz>
Felix Fietkau <nbd@openwrt.org>
Erez Zadok <ezk@fsl.cs.sunysb.edu>
Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add whiteout support to ext4_rename(). A whiteout inode (chrdev/0,0) is
created before the rename takes place. The whiteout inode is added to the
old entry instead of deleting it.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This adds a new RENAME_WHITEOUT flag. This flag makes rename() create a
whiteout of source. The whiteout creation is atomic relative to the
rename.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Whiteout isn't actually a new file type, but is represented as a char
device (Linus's idea) with 0/0 device number.
This has several advantages compared to introducing a new whiteout file
type:
- no userspace API changes (e.g. trivial to make backups of upper layer
filesystem, without losing whiteouts)
- no fs image format changes (you can boot an old kernel/fsck without
whiteout support and things won't break)
- implementation is trivial
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
It's already duplicated in btrfs and about to be used in overlayfs too.
Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Unknown operation numbers are caught in nfsd4_decode_compound() which
sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The
error causes the main loop in nfsd4_proc_compound() to skip most
processing. But nfsd4_proc_compound also peeks ahead at the next
operation in one case and doesn't take similar precautions there.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
While the hash function used by the revoke hashtable is good somewhere else,
it's not really good here.
The default hash shift (8) means that one third of the hashing function
gets lost (and is undefined anyways (8 - 12 = negative shift)):
"(block << (hash_shift - 12))) & (table->hash_size - 1)"
Instead, just use the kernel's generic hash function that gets used everywhere
else.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Due to a switched left and right side of an assignment,
dquot_writeback_dquots() never returned error. This could result in
errors during quota writeback to not be reported to userspace properly.
Fix it.
CC: stable@vger.kernel.org
Coverity-id: 1226884
Signed-off-by: Jan Kara <jack@suse.cz>
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
When quiet_error applies rate limiting to buffer_io_error calls, what the
they apply to is unclear because the name is so generic, particularly
if the messages are interleaved with others:
[ 1936.063572] quiet_error: 664293 callbacks suppressed
[ 1936.065297] Buffer I/O error on dev sdr, logical block 257429952, lost async page write
[ 1936.067814] Buffer I/O error on dev sdr, logical block 257429953, lost async page write
Also, the function uses printk_ratelimit(), although printk.h includes a
comment advising "Please don't use... Instead use printk_ratelimited()."
Change buffer_io_error to check the BH_Quiet bit itself, drop the
printk_ratelimit call, and print using printk_ratelimited.
This makes the messages look like:
[ 387.208839] buffer_io_error: 676394 callbacks suppressed
[ 387.210693] Buffer I/O error on dev sdr, logical block 211291776, lost async page write
[ 387.213432] Buffer I/O error on dev sdr, logical block 211291777, lost async page write
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
buffer.c uses two printk calls to print these messages:
[67353.422338] Buffer I/O error on device sdr, logical block 212868488
[67353.422338] lost page write due to I/O error on sdr
In a busy system, they may be interleaved with other prints,
losing the context for the second message. Merge them into
one line with one printk call so the prints are atomic.
Also, differentiate between async page writes, sync page writes, and
async page reads.
Also, shorten "device" to "dev" to match the block layer prints:
[67353.467906] blk_update_request: critical target error, dev sdr, sector
1707107328
Also, use %llu rather than %Lu.
Resulting prints look like:
[ 1356.437006] blk_update_request: critical target error, dev sdr, sector 1719693992
[ 1361.383522] quiet_error: 659876 callbacks suppressed
[ 1361.385816] Buffer I/O error on dev sdr, logical block 256902912, lost async page write
[ 1361.385819] Buffer I/O error on dev sdr, logical block 256903644, lost async page write
Signed-off-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Webb Scales <webbnh@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We added this new estimator function but forgot to hook it up. The
effect is that NFSv4.1 (and greater) won't do zero-copy reads.
The estimate was also wrong by 8 bytes.
Fixes: ccae70a9ee "nfsd4: estimate sequence response size"
Cc: stable@vger.kernel.org
Reported-by: Chuck Lever <chucklever@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
optimizations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
9g4GKAXAyh15PeBTJ5K/
=vWic
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A large number of cleanups and bug fixes, with some (minor) journal
optimizations"
[ This got sent to me before -rc1, but was stuck in my spam folder. - Linus ]
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
ext4: check s_chksum_driver when looking for bg csum presence
ext4: move error report out of atomic context in ext4_init_block_bitmap()
ext4: Replace open coded mdata csum feature to helper function
ext4: delete useless comments about ext4_move_extents
ext4: fix reservation overflow in ext4_da_write_begin
ext4: add ext4_iget_normal() which is to be used for dir tree lookups
ext4: don't orphan or truncate the boot loader inode
ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
ext4: optimize block allocation on grow indepth
ext4: get rid of code duplication
ext4: fix over-defensive complaint after journal abort
ext4: fix return value of ext4_do_update_inode
ext4: fix mmap data corruption when blocksize < pagesize
vfs: fix data corruption when blocksize < pagesize for mmaped data
ext4: fold ext4_nojournal_sops into ext4_sops
ext4: support freezing ext2 (nojournal) file systems
ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
ext4: don't check quota format when there are no quota files
jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
jbd2: avoid pointless scanning of checkpoint lists
...
Pull cifs/smb3 updates from Steve French:
"Improved SMB3 support (symlink and device emulation, and remapping by
default the 7 reserved posix characters) and a workaround for cifs
mounts to Mac (working around a commonly encountered Mac server bug)"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] Remove obsolete comment
Check minimum response length on query_network_interface
Workaround Mac server problem
Remap reserved posix characters by default (part 3/3)
Allow conversion of characters in Mac remap range (part 2)
Allow conversion of characters in Mac remap range. Part 1
mfsymlinks support for SMB2.1/SMB3. Part 2 query symlink
Add mfsymlinks support for SMB2.1/SMB3. Part 1 create symlink
Allow mknod and mkfifo on SMB2/SMB3 mounts
add defines for two new file attributes
This includes a single commit fixing a missing endian conversion.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUQT+qAAoJEDgbc8f8gGmqT1wQAMjq4O0wh0DNI08SrHHszwnI
e5btCmlMfNIvSfKmNcwstEJH5R86qi+8Q7fAKPj73XXlb1HLYjhQltuuysRSlmZE
Ll+eqnOWVJPxlFjyRDKzO7HxejMsqEE29gtmzw9iL43F2M1acmCMgrVLz8VJGHH2
21hK9eo9L51kLVXK2+rSexdJJNDVjuy9N25J1d/r9/9UEIp+gGhD5ZMpBlDmw1md
VJgbL/yv99mjicjVnd5nZE7fhcWFPPV5IyeyUQS0Pg92cAYZUNBBAzta3yU4ira4
tJU4V2k3zoK40TXqRnGNFgFNvm8cfzNUX+9HtWyWmr6VySfRVXFhSDcHa2/NQva9
4T1vVO5Dg/4+ELIdtaGMEv+B5KCv9/hWjmhnlLBIJktJdH+1NmRXjZAg6gM51XRB
ZwQIf929lELeqGTdfJyO4bR/NPcjrEnEpQFBBAuypko06ZJhXLJZqtKD/3o5lCjX
VOgPkuNtylPARdUo/m1EffF4Rm1LREcbMHDMUInaVt6qAeHRqq8QkvBhBQ7VzFX1
Y1dwGWh8m9pA3q48j2GIT1rWbl5Fue8f34RMnn4psRVM0rnQ8WHh3/p5wsSsXlib
BCX5ez5O2aF7pe4Lqj3NkGmyB6PvICRbU2ZEe2uPbvMZnoq0NouO4asCLSeycu52
fYlz9mZ3e+Y+sv2JN3vG
=eEi+
-----END PGP SIGNATURE-----
Merge tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fix from David Teigland:
"This includes a single commit fixing a missing endian conversion"
* tag 'dlm-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix missing endian conversion of rcom_status flags