Commit Graph

44603 Commits

Author SHA1 Message Date
Al Viro d375570fa8 romfs, squashfs: switch to ->iterate_shared()
don't need to lock directory in ->llseek(), either

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:41:15 -04:00
Al Viro c51da20c48 more trivial ->iterate_shared conversions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:41:14 -04:00
Al Viro 8cb0d2c1c7 kernfs: no point locking directory around that generic_file_llseek()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:41:13 -04:00
Al Viro a01b3007ff configfs_readdir(): make safe under shared lock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:41:13 -04:00
Al Viro 884be17535 nfs: per-name sillyunlink exclusion
use d_alloc_parallel() for sillyunlink/lookup exclusion and
explicit rwsem (nfs_rmdir() being a writer and nfs_call_unlink() -
a reader) for rmdir/sillyunlink one.

That ought to make lookup/readdir/!O_CREAT atomic_open really
parallel on NFS.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-09 11:39:45 -04:00
Adam Borowski 8eb0dfdbda btrfs: fix int32 overflow in shrink_delalloc().
UBSAN: Undefined behaviour in fs/btrfs/extent-tree.c:4623:21
signed integer overflow:
10808 * 262144 cannot be represented in type 'int [8]'

If 8192<=items<16384, we request a writeback of an insane number of pages
which is benign (everything will be written).  But if items>=16384, the
space reservation won't be enough.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-09 11:51:19 +02:00
Al Viro 99d825822e get_rock_ridge_filename(): handle malformed NM entries
Payloads of NM entries are not supposed to contain NUL.  When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries).  We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254.  So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected.  And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we are stop once we'd encountered 32 CEs, but you can get about 8Kb
easily.  And that's what will be passed to readdir callback as the
name length.  8Kb __copy_to_user() from a buffer allocated by
__get_free_page()

Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-07 22:52:39 -04:00
Jaegeuk Kim 0080c50764 f2fs: do not preallocate block unaligned to 4KB
Previously f2fs_preallocate_blocks() tries to allocate unaligned blocks.
In f2fs_write_begin(), however, prepare_write_begin() does not skip its
allocation due to (len != 4KB).
So, it needs locking node page twice unexpectedly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:44:57 -07:00
Jaegeuk Kim 79344efb93 f2fs: read node blocks ahead when truncating blocks
This patch enables reading node blocks in advance when truncating large
data blocks.

 > time rm $MNT/testfile (500GB) after drop_cachees
Before : 9.422 s
After  : 4.821 s

Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:44:56 -07:00
Jaegeuk Kim e12dd7bd87 f2fs: fallocate data blocks in single locked node page
This patch is to improve the expand_inode speed in fallocate by allocating
data blocks as many as possible in single locked node page.

In SSD,
 # time fallocate -l 500G $MNT/testfile

Before : 1m 33.410 s
After  : 24.758 s

Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:44:55 -07:00
Chao Yu f61cce5b81 f2fs: fix inode cache leak
When testing f2fs with inline_dentry option, generic/342 reports:
VFS: Busy inodes after unmount of dm-0. Self-destruct in 5 seconds.  Have a nice day...

After rmmod f2fs module, kenrel shows following dmesg:
 =============================================================================
 BUG f2fs_inode_cache (Tainted: G           O   ): Objects remaining in f2fs_inode_cache on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 Disabling lock debugging due to kernel taint
 INFO: Slab 0xf51ca0e0 objects=22 used=1 fp=0xd1e6fc60 flags=0x40004080
 CPU: 3 PID: 7455 Comm: rmmod Tainted: G    B      O    4.6.0-rc4+ #16
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
  00000086 00000086 d062fe18 c13a83a0 f51ca0e0 d062fe38 d062fea4 c11c7276
  c1981040 f51ca0e0 00000016 00000001 d1e6fc60 40004080 656a624f 20737463
  616d6572 6e696e69 6e692067 66326620 6e695f73 5f65646f 68636163 6e6f2065
 Call Trace:
  [<c13a83a0>] dump_stack+0x5f/0x8f
  [<c11c7276>] slab_err+0x76/0x80
  [<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0
  [<c11cbfc0>] ? __kmem_cache_shutdown+0x100/0x2f0
  [<c11cbfe5>] __kmem_cache_shutdown+0x125/0x2f0
  [<c1198a38>] kmem_cache_destroy+0x158/0x1f0
  [<c176b43d>] ? mutex_unlock+0xd/0x10
  [<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
  [<c10f596c>] SyS_delete_module+0x16c/0x1d0
  [<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0
  [<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20
  [<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210
  [<c10ad50b>] ? trace_hardirqs_off+0xb/0x10
  [<c1001b81>] do_fast_syscall_32+0xa1/0x1c0
  [<c176d888>] sysenter_past_esp+0x45/0x74
 INFO: Object 0xd1e6d9e0 @offset=6624
 kmem_cache_destroy f2fs_inode_cache: Slab cache still has objects
 CPU: 3 PID: 7455 Comm: rmmod Tainted: G    B      O    4.6.0-rc4+ #16
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
  00000286 00000286 d062fef4 c13a83a0 f174b000 d062ff14 d062ff28 c1198ac7
  c197fe18 f3c5b980 d062ff20 000d04f2 d062ff0c d062ff0c d062ff14 d062ff14
  f8f20dc0 fffffff5 d062e000 d062ff30 f8f15aa3 d062ff7c c10f596c 73663266
 Call Trace:
  [<c13a83a0>] dump_stack+0x5f/0x8f
  [<c1198ac7>] kmem_cache_destroy+0x1e7/0x1f0
  [<f8f15aa3>] exit_f2fs_fs+0x4b/0x5a8 [f2fs]
  [<c10f596c>] SyS_delete_module+0x16c/0x1d0
  [<c1001b10>] ? do_fast_syscall_32+0x30/0x1c0
  [<c13c59bf>] ? __this_cpu_preempt_check+0xf/0x20
  [<c10afa7d>] ? trace_hardirqs_on_caller+0xdd/0x210
  [<c10ad50b>] ? trace_hardirqs_off+0xb/0x10
  [<c1001b81>] do_fast_syscall_32+0xa1/0x1c0
  [<c176d888>] sysenter_past_esp+0x45/0x74

The reason is: in recovery flow, we use delayed iput mechanism for directory
which has recovered dentry block. It means the reference of inode will be
held until last dirty dentry page being writebacked.

But when we mount f2fs with inline_dentry option, during recovery, dirent
may only be recovered into dir inode page rather than dentry page, so there
are no chance for us to release inode reference in ->writepage when
writebacking last dentry page.

We can call paired iget/iput explicityly for inline_dentry case, but for
non-inline_dentry case, iput will call writeback_single_inode to write all
data pages synchronously, but during recovery, ->writepages of f2fs skips
writing all pages, result in losing dirent.

This patch fixes this issue by obsoleting old mechanism, and introduce a
new dir_list to hold all directory inodes which has recovered datas until
finishing recovery.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:44:54 -07:00
Jaegeuk Kim b5a7aef1ef fscrypto/f2fs: allow fs-specific key prefix for fs encryption
This patch allows fscrypto to handle a second key prefix given by filesystem.
The main reason is to provide backward compatibility, since previously f2fs
used "f2fs:" as a crypto prefix instead of "fscrypt:".
Later, ext4 should also provide key_prefix() to give "ext4:".

One concern decribed by Ted would be kinda double check overhead of prefixes.
In x86, for example, validate_user_key consumes 8 ms after boot-up, which turns
out derive_key_aes() consumed most of the time to load specific crypto module.
After such the cold miss, it shows almost zero latencies, which treats as a
negligible overhead.
Note that request_key() detects wrong prefix in prior to derive_key_aes() even.

Cc: Ted Tso <tytso@mit.edu>
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:33 -07:00
Chao Yu 09210c973a f2fs: avoid panic when truncating to max filesize
The following panic occurs when truncating inode which has inline
xattr to max filesize.

[<ffffffffa013d3be>] get_dnode_of_data+0x4e/0x580 [f2fs]
[<ffffffffa013aca1>] ? read_node_page+0x51/0x90 [f2fs]
[<ffffffffa013ad99>] ? get_node_page.part.34+0xb9/0x170 [f2fs]
[<ffffffffa01235b1>] truncate_blocks+0x131/0x3f0 [f2fs]
[<ffffffffa01238e3>] f2fs_truncate+0x73/0x100 [f2fs]
[<ffffffffa01239d2>] f2fs_setattr+0x62/0x2a0 [f2fs]
[<ffffffff811a72c8>] notify_change+0x158/0x300
[<ffffffff8118a42b>] do_truncate+0x6b/0xa0
[<ffffffff8118e539>] ? __sb_start_write+0x49/0x100
[<ffffffff8118a798>] do_sys_ftruncate.constprop.12+0x118/0x170
[<ffffffff8118a82e>] SyS_ftruncate+0xe/0x10
[<ffffffff8169efcf>] tracesys+0xe1/0xe6
[<ffffffffa0139ae0>] get_node_path+0x210/0x220 [f2fs]
 <ffff880206a89ce8>
--[ end trace 5fea664dfbcc6625 ]---

The reason is truncate_blocks tries to truncate all node and data blocks
start from specified block offset with value of (max filesize / block
size), but actually, our valid max block offset is (max filesize / block
size) - 1, so f2fs detects such invalid block offset with BUG_ON in
truncation path.

This patch lets f2fs skip truncating data which is exceeding max
filesize.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:33 -07:00
Chao Yu 43473f9645 f2fs: fix incorrect mapping in ->bmap
Currently, generic_block_bmap is used in f2fs_bmap, its semantics is when
the mapping is been found, return position of target physical block,
otherwise return zero.

But, previously, when there is no mapping info for specified logical block,
f2fs_bmap will map target physical block to a uninitialized variable, which
should be wrong. Fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:32 -07:00
Jaegeuk Kim fb58ae2206 f2fs: remove an obsolete variable
This patch removes an obsolete variable used in add_free_nid.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:31 -07:00
Jaegeuk Kim 29234b1d6d f2fs: don't worry about inode leak in evict_inode
Even if an inode failed to release its blocks, it should be kept in an orphan
inode list, so it will be released later.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:30 -07:00
Chao Yu f51b4ce6c1 f2fs: shrink size of struct seg_entry
Restructure struct seg_entry to eliminate holes in it, after that,
in 32-bits machine, it reduces size from 32 bytes to 24 bytes; in
64-bits machine, it reduces size from 56 bytes to 40 bytes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:29 -07:00
Chao Yu bd933d4fae f2fs: reuse get_extent_info
Reuse get_extent_info for readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:29 -07:00
Chao Yu e3bc808ca8 f2fs: remove unneeded memset when updating xattr
Each of fields in struct f2fs_xattr_entry will be assigned later,
so previously we don't need to memset the struct.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:28 -07:00
Chao Yu ae8d1db34f f2fs: remove unneeded readahead in find_fsync_dnodes
In find_fsync_dnodes, get_tmp_page will read dnode page synchronously,
previously, ra_meta_page did the same work, which is redundant, remove
it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:27 -07:00
Jaegeuk Kim 4c0c294934 f2fs: retry to truncate blocks in -ENOMEM case
This patch modifies to retry truncating node blocks in -ENOMEM case.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:26 -07:00
Jaegeuk Kim 74ef924167 f2fs: fix leak of orphan inode objects
When unmounting filesystem, we should release all the ino entries.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:25 -07:00
Jaegeuk Kim 221149c00e f2fs: revisit error handling flows
This patch fixes a couple of bugs regarding to orphan inodes when handling
errors.

This tries to
 - call alloc_nid_done with add_orphan_inode in handle_failed_inode
 - let truncate blocks in f2fs_evict_inode
 - not make a bad inode due to i_mode change

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:25 -07:00
Jaegeuk Kim cb78942b82 f2fs: inject ENOSPC failures
This patch injects ENOSPC failures.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:24 -07:00
Jaegeuk Kim c41f3cc3ae f2fs: inject page allocation failures
This patch adds page allocation failures.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:23 -07:00
Jaegeuk Kim 2c63fead9e f2fs: inject kmalloc failure
This patch injects kmalloc failure given a fault injection rate.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:22 -07:00
Jaegeuk Kim 73faec4d99 f2fs: add mount option to select fault injection ratio
This patch adds a mount option to select fault ratio.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:22 -07:00
Jaegeuk Kim 300e129c15 f2fs: use f2fs_grab_cache_page instead of grab_cache_page
This patch converts grab_cache_page to f2fs_grab_cache_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:21 -07:00
Jaegeuk Kim 0414b004a8 f2fs: introduce f2fs_kmalloc to wrap kmalloc
This patch adds f2fs_kmalloc.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:20 -07:00
Jaegeuk Kim f00d6fa727 f2fs: add proc entry to show valid block bitmap
This patch adds a new proc entry to show segment information in more detail.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:19 -07:00
Jaegeuk Kim b7a15f3dbe f2fs: introduce macros for proc entries
This adds macros to be used multiple proc entries.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-07 10:32:18 -07:00
Peter Jones 6c5450ef66 efivarfs: Make efivarfs_file_ioctl() static
There are no callers except through the file_operations struct below
this, so it should be static like everything else here.

Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1462570771-13324-6-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:06:13 +02:00
Julia Lawall 1cfd63166c efi: Merge boolean flag arguments
The parameters atomic and duplicates of efivar_init always have opposite
values.  Drop the parameter atomic, replace the uses of !atomic with
duplicates, and update the call sites accordingly.

The code using duplicates is slightly reorganized with an 'else', to avoid
duplicating the lock code.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Saurabh Sengar <saurabh.truth@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vaishali Thakkar <vaishali.thakkar@oracle.com>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1462570771-13324-5-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-07 07:06:13 +02:00
Bob Peterson 68cd4ce2ca GFS2: Refactor gfs2_remove_from_journal
This patch makes two simple changes to function gfs2_remove_from_journal.
First, it removes the parameter that specifies the transaction.
Since it's always passed in as current->journal_info, we might as well
set that in the function rather than passing it in. Second, it changes
the meta parameter to use an enum to make the code more clear.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2016-05-06 11:27:27 -05:00
Zygo Blaxell 2f3165ecf1 btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes
During a mount, we start the cleaner kthread first because the transaction
kthread wants to wake up the cleaner kthread.  We start the transaction
kthread next because everything in btrfs wants transactions.  We do reloc
recovery in the thread that was doing the original mount call once the
transaction kthread is running.  This means that the cleaner kthread
could already be running when reloc recovery happens (e.g. if a snapshot
delete was started before a crash).

Relocation does not play well with the cleaner kthread, so a mutex was
added in commit 5f3164813b "Btrfs: fix
race between balance recovery and root deletion" to prevent both from
being active at the same time.

If the cleaner kthread is already holding the mutex by the time we get
to btrfs_recover_relocation, the mount will be blocked until at least
one deleted subvolume is cleaned (possibly more if the mount process
doesn't get the lock right away).  During this time (which could be an
arbitrarily long time on a large/slow filesystem), the mount process is
stuck and the filesystem is unnecessarily inaccessible.

Fix this by locking cleaner_mutex before we start cleaner_kthread, and
unlocking the mutex after mount no longer requires it.  This ensures
that the mounting process will not be blocked by the cleaner kthread.
The cleaner kthread is already prepared for mutex contention and will
just go to sleep until the mutex is available.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 58d7bbf81f btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 7ab19625a9 btrfs: add write protection to SET_FEATURES ioctl
Perform the want_write check if we get far enough to do any writes.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain 48b3b9d401 btrfs: fix lock dep warning move scratch super outside of chunk_mutex
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.

To reproduce:
  (fresh boot)
  mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
  mount /dev/sdc /btrfs
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  (get devmgt from https://github.com/asj/devmgt.git)
  devmgt detach /dev/sde
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  sync
  btrfs replace start -Brf 3 /dev/sdf /btrfs <--
  devmgt attach host7

======================================================
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---------------------------------------------------

btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0

but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]

which lock already depends on the new lock.

Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
Possible unsafe locking scenario:
CPU0				CPU1
----				----
lock(&fs_info->chunk_mutex);
				lock(&fs_devs->device_list_mutex);
				lock(&fs_info->chunk_mutex);
lock(sb_writers);

*** DEADLOCK ***

-> #0 (sb_writers){.+.+.+}:
[<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
[<ffffffff810e707e>] lock_acquire+0xbe/0x210
[<ffffffff810df49a>] percpu_down_read+0x4a/0xa0
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
[<ffffffff81265534>] mnt_want_write+0x24/0x50
[<ffffffff812508a2>] path_openat+0x952/0x1190
[<ffffffff81252451>] do_filp_open+0x91/0x100
[<ffffffff8123f5cc>] file_open_name+0xfc/0x140
[<ffffffff8123f643>] filp_open+0x33/0x60
[<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
[<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Ashish Samant 2473114981 btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()
pagev array in scrub_block{} is of size SCRUB_MAX_PAGES_PER_BLOCK.
page_index should be checked with the same to trigger BUG_ON().

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Josef Bacik e042d1ec44 Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Liu Bo 3d8da67817 Btrfs: fix divide error upon chunk's stripe_len
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
 __btrfs_map_block().

This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba ee17fc8005 btrfs: sysfs: protect reading label by lock
If the label setting ioctl races with sysfs label handler, we could get
mixed result in the output, part old part new. We should either get the
old or new label. The chances to hit this race are low.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 66ac9fe7ba btrfs: add check to sysfs handler of label
Add a sanity check for the fs_info as we will dereference it, similar to
what the 'store features' handler does.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba ee6111386a btrfs: add read-only check to sysfs handler of features
We don't want to trigger the change on a read-only filesystem, similar
to what the label handler does.

Signed-off-by: David Sterba <dsterba@suse.cz>
2016-05-06 15:22:49 +02:00
David Sterba e6c11f9a46 btrfs: reuse existing variable in scrub_stripe, reduce stack usage
The key variable occupies 17 bytes, the key_start is used once, we can
simply reuse existing 'key' for that purpose. As the key is not a simple
type, compiler doest not do it on itself.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 49a3c4d9b6 btrfs: use dynamic allocation for root item in create_subvol
The size of root item is more than 400 bytes, which is quite a lot of
stack space. As we do IO from inside the subvolume ioctls, we should
keep the stack usage low in case the filesystem is on top of other
layers (NFS, device mapper, iscsi, etc).

Reviewed-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 153519559a btrfs: clone: use vmalloc only as fallback for nodesize bufer
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 2f91306a37 btrfs: send: use vmalloc only as fallback for clone_sources_tmp
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba c03d01f340 btrfs: send: use vmalloc only as fallback for clone_roots
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba e55d1153db btrfs: send: use temporary variable to store allocation size
We're going to use the argument multiple times later.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba eb5b75fe2e btrfs: send: use vmalloc only as fallback for read_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 6ff48ce06b btrfs: send: use vmalloc only as fallback for send_buf
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain 779bf3fefa btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch  + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.

Reported issue:

[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234]  Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234]        CPU0                    CPU1
[ 5375.719235]        ----                    ----
[ 5375.719236]   lock(&fs_devs->device_list_mutex);
[ 5375.719238]                                lock(namespace_sem);
[ 5375.719239]                                lock(&fs_devs->device_list_mutex);
[ 5375.719241]   lock(sb_writers);
[ 5375.719241]
[ 5375.719241]  *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Dan Carpenter f5ecec3ce2 btrfs: send: silence an integer overflow warning
The "sizeof(*arg->clone_sources) * arg->clone_sources_count" expression
can overflow.  It causes several static checker warnings.  It's all
under CAP_SYS_ADMIN so it's not that serious but lets silence the
warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt 41b34accb2 btrfs: avoid overflowing f_bfree
Since mixed block groups accounting isn't byte-accurate and f_bree is an
unsigned integer, it could overflow. Avoid this.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Suggested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Luis de Bethencourt ae02d1bd07 btrfs: fix mixed block count of available space
Metadata for mixed block is already accounted in total data and should not
be counted as part of the free metadata space.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=114281
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Austin S. Hemmelgarn 88be159c90 btrfs: allow balancing to dup with multi-device
Currently, we don't allow the user to try and rebalance to a dup profile
on a multi-device filesystem.  In most cases, this is a perfectly sensible
restriction as raid1 uses the same amount of space and provides better
protection.

However, when reshaping a multi-device filesystem down to a single device
filesystem, this requires the user to convert metadata and system chunks
to single profile before deleting devices, and then convert again to dup,
which leaves a period of time where metadata integrity is reduced.

This patch removes the single-device-only restriction from converting to
dup profile to remove this potential data integrity reduction.

Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
David Sterba 2355ac8495 btrfs: ioctl: reorder exclusive op check in RM_DEV
Move the op exclusivity check before the other code (same as in
ADD_DEV).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:58:00 +02:00
David Sterba 58409edd2d btrfs: kill unused writepage_io_hook callback
It seems to be long time unused, since 2008 and
6885f308b5 ("Btrfs: Misc 2.6.25 updates").

Propagating the removal touches some code but has no functional effect.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 14:57:57 +02:00
Nicolai Stange 816cd71b0c ext4: remove unmeetable inconsisteny check from ext4_find_extent()
ext4_find_extent(), stripped down to the parts relevant to this patch,
reads as

  ppos = 0;
  i = depth;
  while (i) {
    --i;
    ++ppos;
    if (unlikely(ppos > depth)) {
      ...
      ret = -EFSCORRUPTED;
      goto err;
    }
  }

Due to the loop's bounds, the condition ppos > depth can never be met.

Remove this dead code.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 22:43:04 -04:00
Jens Axboe 32157de29a ext4: remove unnecessary bio get/put
ext4_io_submit() used to check for EOPNOTSUPP after bio submission,
which is why it had to get an extra reference to the bio before
submitting it. But since we no longer touch the bio after submission,
get rid of the redundant get/put of the bio. If we do get the extra
reference, we enter the slower path of having to flag this bio as now
having external references.

Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 22:09:49 -04:00
Mathias Krause 8148a73c99 proc: prevent accessing /proc/<PID>/environ until it's ready
If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written, as env_start will already be
set but env_end will still be zero, making the range calculation
underflow, allowing to read beyond the end of what has been written.

Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero.  It is, apparently, intentionally set last in create_*_tables().

This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.

The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.

Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Pax Team <pageexec@freemail.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Nicolai Stange 935244cd54 ext4: silence UBSAN in ext4_mb_init()
Currently, in ext4_mb_init(), there's a loop like the following:

  do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
  } while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
  shift exponent 4294967295 is too large for 32-bit type 'int'
  [...]
  Call Trace:
   [<ffffffff818c4d25>] dump_stack+0xbc/0x117
   [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
   [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
   [<ffffffff814293c7>] ? create_cache+0x57/0x1f0
   [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
   [<ffffffff821c2168>] ? mutex_lock+0x38/0x60
   [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
   [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
   [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
   [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
   [...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 19:46:19 -04:00
James Morris 0250abcd72 Merge tag 'keys-next-20160505' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into next 2016-05-06 09:29:00 +10:00
Nicolai Stange b5cb316cdf ext4: address UBSAN warning in mb_find_order_for_block()
Currently, in mb_find_order_for_block(), there's a loop like the following:

  while (order <= e4b->bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
  }

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
  shift exponent -1 is negative
  [...]
  Call Trace:
   [<ffffffff818c4d35>] dump_stack+0xbc/0x117
   [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
   [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
   [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
   [...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 17:38:03 -04:00
Linus Torvalds c5e0666c5a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns fix from Eric Biederman:
 "This contains just a single fix for a nasty oops"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  propogate_mnt: Handle the first propogated copy being a slave
2016-05-05 08:41:57 -07:00
Jan Kara 74177f55b7 ext4: fix oops on corrupted filesystem
When filesystem is corrupted in the right way, it can happen
ext4_mark_iloc_dirty() in ext4_orphan_add() returns error and we
subsequently remove inode from the in-memory orphan list. However this
deletion is done with list_del(&EXT4_I(inode)->i_orphan) and thus we
leave i_orphan list_head with a stale content. Later we can look at this
content causing list corruption, oops, or other issues. The reported
trace looked like:

WARNING: CPU: 0 PID: 46 at lib/list_debug.c:53 __list_del_entry+0x6b/0x100()
list_del corruption, 0000000061c1d6e0->next is LIST_POISON1
0000000000100100)
CPU: 0 PID: 46 Comm: ext4.exe Not tainted 4.1.0-rc4+ #250
Stack:
 60462947 62219960 602ede24 62219960
 602ede24 603ca293 622198f0 602f02eb
 62219950 6002c12c 62219900 601b4d6b
Call Trace:
 [<6005769c>] ? vprintk_emit+0x2dc/0x5c0
 [<602ede24>] ? printk+0x0/0x94
 [<600190bc>] show_stack+0xdc/0x1a0
 [<602ede24>] ? printk+0x0/0x94
 [<602ede24>] ? printk+0x0/0x94
 [<602f02eb>] dump_stack+0x2a/0x2c
 [<6002c12c>] warn_slowpath_common+0x9c/0xf0
 [<601b4d6b>] ? __list_del_entry+0x6b/0x100
 [<6002c254>] warn_slowpath_fmt+0x94/0xa0
 [<602f4d09>] ? __mutex_lock_slowpath+0x239/0x3a0
 [<6002c1c0>] ? warn_slowpath_fmt+0x0/0xa0
 [<60023ebf>] ? set_signals+0x3f/0x50
 [<600a205a>] ? kmem_cache_free+0x10a/0x180
 [<602f4e88>] ? mutex_lock+0x18/0x30
 [<601b4d6b>] __list_del_entry+0x6b/0x100
 [<601177ec>] ext4_orphan_del+0x22c/0x2f0
 [<6012f27c>] ? __ext4_journal_start_sb+0x2c/0xa0
 [<6010b973>] ? ext4_truncate+0x383/0x390
 [<6010bc8b>] ext4_write_begin+0x30b/0x4b0
 [<6001bb50>] ? copy_from_user+0x0/0xb0
 [<601aa840>] ? iov_iter_fault_in_readable+0xa0/0xc0
 [<60072c4f>] generic_perform_write+0xaf/0x1e0
 [<600c4166>] ? file_update_time+0x46/0x110
 [<60072f0f>] __generic_file_write_iter+0x18f/0x1b0
 [<6010030f>] ext4_file_write_iter+0x15f/0x470
 [<60094e10>] ? unlink_file_vma+0x0/0x70
 [<6009b180>] ? unlink_anon_vmas+0x0/0x260
 [<6008f169>] ? free_pgtables+0xb9/0x100
 [<600a6030>] __vfs_write+0xb0/0x130
 [<600a61d5>] vfs_write+0xa5/0x170
 [<600a63d6>] SyS_write+0x56/0xe0
 [<6029fcb0>] ? __libc_waitpid+0x0/0xa0
 [<6001b698>] handle_syscall+0x68/0x90
 [<6002633d>] userspace+0x4fd/0x600
 [<6002274f>] ? save_registers+0x1f/0x40
 [<60028bd7>] ? arch_prctl+0x177/0x1b0
 [<60017bd5>] fork_handler+0x85/0x90

Fix the problem by using list_del_init() as we always should with
i_orphan list.

CC: stable@vger.kernel.org
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 11:10:15 -04:00
Eric W. Biederman 5ec0811d30 propogate_mnt: Handle the first propogated copy being a slave
When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>]  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38  EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS:  00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
>  ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
>  ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
>  0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
>  [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
>  [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
>  [<ffffffff811f1ec3>] graft_tree+0x63/0x70
>  [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
>  [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
>  [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
>  [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
>  [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
>  [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
>  RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

Cc: stable@vger.kernel.org
fixes: f2ebb3a921 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-05-05 09:54:45 -05:00
Seth Forshee ff0bc08454 ext4: fix check of dqget() return value in ext4_ioctl_setproject()
A failed call to dqget() returns an ERR_PTR() and not null. Fix
the check in ext4_ioctl_setproject() to handle this correctly.

Fixes: 9b7365fc1c ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-05-05 10:52:38 -04:00
Al Viro 6a480a7842 ecryptfs: fix handling of directory opening
First of all, trying to open them r/w is idiocy; it's guaranteed to fail.
Moreover, assigning ->f_pos and assuming that everything will work is
blatantly broken - try that with e.g. tmpfs as underlying layer and watch
the fireworks.  There may be a non-trivial amount of state associated with
current IO position, well beyond the numeric offset.  Using the single
struct file associated with underlying inode is really not a good idea;
we ought to open one for each ecryptfs directory struct file.

Additionally, file_operations both for directories and non-directories are
full of pointless methods; non-directories should *not* have ->iterate(),
directories should not have ->flush(), ->fasync() and ->splice_read().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-04 14:04:13 -04:00
Anand Jain 88acff64c6 btrfs: cleanup assigning next active device with a check
Creates helper fucntion as needed by the device delete
and replace operations. Also now it checks if the next
device being assigned is an active device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 10:41:08 +02:00
Anand Jain 8ed01abe7d btrfs: s_bdev is not null after missing replace
Yauhen reported in the ML that s_bdev is null at mount, and
s_bdev gets updated to some device when missing device is
replaced, as because bdev is null for missing device, things
gets matched up. Fix this by checking if s_bdev is set. I
didn't want to completely remove updating s_bdev because
the future multi device support at vfs layer may need it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 09:52:44 +02:00
David S. Miller cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Linus Torvalds 610603a520 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "Fix a regression and update the MAINTAINERS entry for fuse"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: update mailing list in MAINTAINERS
  fuse: Fix return value from fuse_get_user_pages()
2016-05-03 14:23:58 -07:00
Chao Yu 3f8ab27085 f2fs: factor out fsync inode entry operations
Factor out fsync inode entry operations into {add,del}_fsync_inode.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:07 -07:00
Chao Yu c81ced05b2 f2fs: fix to clear page private flag
Commit 28bc106b23 ("f2fs: support revoking atomic written pages")
forgot to clear page private flag correctly, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:06 -07:00
Chao Yu 23dc974eed f2fs: fix to clear private data in page
Private data in page should be removed during ->releasepage or
->invalidatepage, otherwise garbage data would be remained in that page.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-05-03 11:20:05 -07:00
Al Viro 9ac3d3e846 nfs: switch to ->iterate_shared()
aside of the usual care about seeding dcache from readdir, we need
to be careful about the pagecache evictions here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:53 -04:00
Al Viro 9cf843e3f4 lookup_open(): lock the parent shared unless O_CREAT is given
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:17 -04:00
Al Viro 6fbd07146d lookup_open(): put the dentry fed to ->lookup() or ->atomic_open() into in-lookup hash
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:16 -04:00
Al Viro 12fa5e2404 lookup_open(): expand the call of real_lookup()
... and lose the duplicate IS_DEADDIR() - we'd already checked that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:16 -04:00
Al Viro 384f26e28f atomic_open(): reorder and clean up a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:15 -04:00
Al Viro 1643b43fbd lookup_open(): lift the "fallback to !O_CREAT" logics from atomic_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:15 -04:00
Al Viro b3d58eaffb atomic_open(): be paranoid about may_open() return value
It should never return positives; however, with Linux S&M crowd
involved, no bogosity is impossible.  Results would be unpleasant...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:14 -04:00
Al Viro 0fb1ea0933 atomic_open(): delay open_to_namei_flags() until the method call
nobody else needs that transformation.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:14 -04:00
Al Viro fe9ec8291f do_last(): take fput() on error after opening to out:
make it conditional on *opened & FILE_OPENED; in addition to getting
rid of exit_fput: thing, it simplifies atomic_open() cleanup on
may_open() failure.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:13 -04:00
Al Viro 47f9dbd387 do_last(): get rid of duplicate ELOOP check
may_open() will catch it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:13 -04:00
Al Viro 55db2fd936 atomic_open(): massage the create_error logics a bit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:12 -04:00
Al Viro 9d0728e16e atomic_open(): consolidate "overridden ENOENT" in open-yourself cases
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:12 -04:00
Al Viro 5249e411b4 atomic_open(): don't bother with EEXIST check - it's done in do_last()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:51:11 -04:00
Al Viro df889b3631 Merge branch 'for-linus' into work.lookups 2016-05-02 19:49:46 -04:00
Al Viro ce8644fcad lookup_open(): expand the call of vfs_create()
Lift IS_DEADDIR handling up into the part common with atomic_open(),
remove it from the latter.  Collapse permission checks into the
call of may_o_create(), getting it closer to atomic_open() case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:33 -04:00
Al Viro 6ac087099e path_openat(): take O_PATH handling out of do_last()
do_last() and lookup_open() simpler that way and so does O_PATH
itself.  As it bloody well should: we find what the pathname
resolves to, same way as in stat() et.al. and associate it with
FMODE_PATH struct file.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:33 -04:00
Al Viro 3b0a3c1ac1 simple local filesystems: switch to ->iterate_shared()
no changes needed (XFS isn't simple, but it has the same parallelism
in the interesting parts exercised from CXFS).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:32 -04:00
Al Viro 4e82901cd6 dcache_{readdir,dir_lseek}() users: switch to ->iterate_shared
no need to lock directory in dcache_dir_lseek(), while we are
at it - per-struct file exclusion is enough.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:32 -04:00
Al Viro 3125d2650c cifs: switch to ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:31 -04:00
Al Viro d9b3dbdcfd fuse: switch to ->iterate_shared()
Switch dcache pre-seeding on readdir to d_alloc_parallel();
nothing else is needed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:31 -04:00
Al Viro f50752eaa0 switch all procfs directories ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Al Viro 76aab3ab61 proc_sys_fill_cache(): switch to d_alloc_parallel()
make it usable with directory locked shared

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Al Viro 3781764b5c proc_fill_cache(): switch to d_alloc_parallel()
... making it usable with directory locked shared

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:29 -04:00
Al Viro 6192269444 introduce a parallel variant of ->iterate()
New method: ->iterate_shared().  Same arguments as in ->iterate(),
called with the directory locked only shared.  Once all filesystems
switch, the old one will be gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:29 -04:00
Al Viro 63b6df1413 give readdir(2)/getdents(2)/etc. uniform exclusion with lseek()
same as read() on regular files has, and for the same reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Al Viro 9902af79c0 parallel lookups: actual switch to rwsem
ta-da!

The main issue is the lack of down_write_killable(), so the places
like readdir.c switched to plain inode_lock(); once killable
variants of rwsem primitives appear, that'll be dealt with.

lockdep side also might need more work

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:28 -04:00
Al Viro d9171b9345 parallel lookups machinery, part 4 (and last)
If we *do* run into an in-lookup match, we need to wait for it to
cease being in-lookup.  Fortunately, we do have unused space in
in-lookup dentries - d_lru is never looked at until it stops being
in-lookup.

So we can stash a pointer to wait_queue_head from stack frame of
the caller of ->lookup().  Some precautions are needed while
waiting, but it's not that hard - we do hold a reference to dentry
we are waiting for, so it can't go away.  If it's found to be
in-lookup the wait_queue_head is still alive and will remain so
at least while ->d_lock is held.  Moreover, the condition we
are waiting for becomes true at the same point where everything
on that wq gets woken up, so we can just add ourselves to the
queue once.

d_alloc_parallel() gets a pointer to wait_queue_head_t from its
caller; lookup_slow() adjusted, d_add_ci() taught to use
d_alloc_parallel() if the dentry passed to it happens to be
in-lookup one (i.e. if it's been called from the parallel lookup).

That's pretty much it - all that remains is to switch ->i_mutex
to rwsem and have lookup_slow() take it shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:27 -04:00
Al Viro 94bdd655ca parallel lookups machinery, part 3
We will need to be able to check if there is an in-lookup
dentry with matching parent/name.  Right now it's impossible,
but as soon as start locking directories shared such beasts
will appear.

Add a secondary hash for locating those.  Hash chains go through
the same space where d_alias will be once it's not in-lookup anymore.
Search is done under the same bitlock we use for modifications -
with the primary hash we can rely on d_rehash() into the wrong
chain being the worst that could happen, but here the pointers are
buggered once it's removed from the chain.  On the other hand,
the chains are not going to be long and normally we'll end up
adding to the chain anyway.  That allows us to avoid bothering with
->d_lock when doing the comparisons - everything is stable until
removed from chain.

New helper: d_alloc_parallel().  Right now it allocates, verifies
that no hashed and in-lookup matches exist and adds to in-lookup
hash.

Returns ERR_PTR() for error, hashed match (in the unlikely case it's
been found) or new dentry.  In-lookup matches trigger BUG() for
now; that will change in the next commit when we introduce waiting
for ongoing lookup to finish.  Note that in-lookup matches won't be
possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().

Again, these commits are separated only for making it easier to
review.  All this machinery will start doing something useful only
when we go for shared locking; it's just that the combination is
too large for my taste.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:27 -04:00
Al Viro 84e710da2a parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not.  And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
	* we verify that there's no hashed match
	* existing in-lookup match gets hashed by another process
	* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup.  Then the above would turn into
	* read the counter
	* do dcache lookup
	* if no matches found, check for in-lookup matches
	* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and updating logics; the readers will be
added in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:26 -04:00
Al Viro 85c7f81041 beginning of transition to parallel lookups - marking in-lookup dentries
marked as such when (would be) parallel lookup is about to pass them
to actual ->lookup(); unmarked when
	* __d_add() is about to make it hashed, positive or not.
	* __d_move() (from d_splice_alias(), directly or via
__d_unalias()) puts a preexisting dentry in its place
	* in caller of ->lookup() if it has escaped all of the
above.  Bug (WARN_ON, actually) if it reaches the final dput()
or d_instantiate() while still marked such.

As the result, we are guaranteed that for as long as the flag is
set, dentry will
	* remain negative unhashed with positive refcount
	* never have its ->d_alias looked at
	* never have its ->d_lru looked at
	* never have its ->d_parent and ->d_name changed

Right now we have at most one such for any given parent directory.
With parallel lookups that restriction will weaken to
	* only exist when parent is locked shared
	* at most one with given (parent,name) pair (comparison of
names is according to ->d_compare())
	* only exist when there's no hashed dentry with the same
(parent,name)

Transition will take the next several commits; unfortunately, we'll
only be able to switch to rwsem at the end of this series.  The
reason for not making it a single patch is to simplify review.

New primitives: d_in_lookup() (a predicate checking if dentry is in
the in-lookup state) and d_lookup_done() (tells the system that
we are done with lookup and if it's still marked as in-lookup, it
should cease to be such).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:51 -04:00
Al Viro 0568d705b0 __d_add(): don't drop/regain ->d_lock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:26 -04:00
Al Viro 1936386ea9 lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:26 -04:00
Al Viro d2caaa0a77 nfs: missing wakeup in nfs_unblock_sillyrename()
will be needed as soon as lookups are not serialized by ->i_mutex

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:25 -04:00
Al Viro be5b82dbfe make ext2_get_page() and friends work without external serialization
Right now ext2_get_page() (and its analogues in a bunch of other filesystems)
relies upon the directory being locked - the way it sets and tests Checked and
Error bits would be racy without that.  Switch to a slightly different scheme,
_not_ setting Checked in case of failure.  That way the logics becomes
	if Checked => OK
	else if Error => fail
	else if !validate => fail
	else => OK
with validation setting Checked or Error on success and failure resp. and
returning which one had happened.  Equivalent to the current logics, but unlike
the current logics not sensitive to the order of set_bit, test_bit getting
reordered by CPU, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:25 -04:00
Al Viro b9e1d435fd ovl_lookup_real(): use lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:24 -04:00
Al Viro 383d4e8ab0 reconnect_one(): use lookup_one_len_unlocked()
... and explain the non-obvious logics in case when lookup yields
a different dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:24 -04:00
Al Viro 1ae1f3f647 reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
... and have it use inode_lock()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:23 -04:00
Al Viro 5ecfcb265f orangefs: don't open-code inode_lock/inode_unlock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:23 -04:00
Al Viro 7b9743eb89 ocfs2: don't open-code inode_lock/inode_unlock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:22 -04:00
Al Viro 48f35b7b73 configfs_detach_prep(): make sure that wait_mutex won't go away
grab a reference to dentry we'd got the sucker from, and return
that dentry via *wait, rather than just returning the address of
->i_mutex.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:22 -04:00
Al Viro 779b839133 kernfs: use lookup_one_len_unlocked()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:21 -04:00
Al Viro b96809173e security_d_instantiate(): move to the point prior to attaching dentry to inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:47:02 -04:00
Al Viro 84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Linus Torvalds 33656a1f2e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
 "A fix of a regression in UDF that got introduced in 4.6-rc1 by one of
  the charset encoding fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix conversion of 'dstring' fields to UTF8
2016-05-02 09:59:57 -07:00
Serge Hallyn e99ed4de17 kernfs_path_from_node_locked: don't overwrite nlen
We've calculated @len to be the bytes we need for '/..' entries from
@kn_from to the common ancestor, and calculated @nlen to be the extra
bytes we need to get from the common ancestor to @kn_to.  We use them
as such at the end.  But in the loop copying the actual entries, we
overwrite @nlen.  Use a temporary variable for that instead.

Without this, the return length, when the buffer is large enough, is
wrong.  (When the buffer is NULL or too small, the returned value is
correct. The buffer contents are also correct.)

Interestingly, no callers of this function are affected by this as of
yet.  However the upcoming cgroup_show_path() will be.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
2016-05-02 12:36:00 -04:00
Anand Jain ad8403df05 btrfs: pass the right error code to the btrfs_std_error
Also drop the newline from the message.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-02 17:50:13 +02:00
Bob Peterson 8381e60227 GFS2: Remove allocation parms from gfs2_rbm_find
Struct gfs2_alloc_parms ap is never referenced in function
gfs2_rbm_find, so this patch removes it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-05-02 09:49:22 -05:00
Abhi Das 80f4781d2c gfs2: use inode_lock/unlock instead of accessing i_mutex directly
i_mutex has been replaced by i_rwsem and directly accessing the
non-existent i_mutex breaks the kernel build.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-05-02 07:07:01 -05:00
Christoph Hellwig 24368aad47 nfsd: use RWF_SYNC
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig e864f39569 fs: add RWF_DSYNC aand RWF_SYNC
This is the per-I/O equivalent of O_DSYNC and O_SYNC, and very useful for
all kinds of file servers and storage targets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig 6aa657c852 ceph: use generic_write_sync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig 716b9bc0cb direct-io: remove the offset argument to dio_complete
It has to be identical to ki_pos of the iocb, so use that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig 13712713ca xfs: eliminate the pos variable in xfs_file_dio_aio_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig 1af5bb491f filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Mimi Zohar 05d1a717ec ima: add support for creating files using the mknodat syscall
Commit 3034a14 "ima: pass 'opened' flag to identify newly created files"
stopped identifying empty files as new files.  However new empty files
can be created using the mknodat syscall.  On systems with IMA-appraisal
enabled, these empty files are not labeled with security.ima extended
attributes properly, preventing them from subsequently being opened in
order to write the file data contents.  This patch defines a new hook
named ima_post_path_mknod() to mark these empty files, created using
mknodat, as new in order to allow the file data contents to be written.

In addition, files with security.ima xattrs containing a file signature
are considered "immutable" and can not be modified.  The file contents
need to be written, before signing the file.  This patch relaxes this
requirement for new files, allowing the file signature to be written
before the file contents.

Changelog:
- defer identifying files with signatures stored as security.ima
  (based on Dmitry Rozhkov's comments)
- removing tests (eg. dentry, dentry->d_inode, inode->i_size == 0)
  (based on Al's review)

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Al Viro <<viro@zeniv.linux.org.uk>
Tested-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>
2016-05-01 09:23:52 -04:00
Dmitry Kasatkin 39d637af5a vfs: forbid write access when reading a file into memory
This patch is based on top of the "vfs: support for a common kernel file
loader" patch set.  In general when the kernel is reading a file into
memory it does not want anything else writing to it.

The kernel currently only forbids write access to a file being executed.
This patch extends this locking to files being read by the kernel.

Changelog:
- moved function to kernel_read_file() - Mimi
- updated patch description - Mimi

Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@huawei.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
2016-05-01 09:23:51 -04:00
Al Viro 10c64cea04 atomic_open(): fix the handling of create_error
* if we have a hashed negative dentry and either CREAT|EXCL on
r/o filesystem, or CREAT|TRUNC on r/o filesystem, or CREAT|EXCL
with failing may_o_create(), we should fail with EROFS or the
error may_o_create() has returned, but not ENOENT.  Which is what
the current code ends up returning.

* if we have CREAT|TRUNC hitting a regular file on a read-only
filesystem, we can't fail with EROFS here.  At the very least,
not until we'd done follow_managed() - we might have a writable
file (or a device, for that matter) bound on top of that one.
Moreover, the code downstream will see that O_TRUNC and attempt
to grab the write access (*after* following possible mount), so
if we really should fail with EROFS, it will happen.  No need
to do that inside atomic_open().

The real logics is much simpler than what the current code is
trying to do - if we decided to go for simple lookup, ended
up with a negative dentry *and* had create_error set, fail with
create_error.  No matter whether we'd got that negative dentry
from lookup_real() or had found it in dcache.

Cc: stable@vger.kernel.org # v3.6+
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-30 16:40:52 -04:00
Chris Wilson e4234a1fc3 kernfs: Move faulting copy_user operations outside of the mutex
A fault in a user provided buffer may lead anywhere, and lockdep warns
that we have a potential deadlock between the mm->mmap_sem and the
kernfs file mutex:

[   82.811702] ======================================================
[   82.811705] [ INFO: possible circular locking dependency detected ]
[   82.811709] 4.5.0-rc4-gfxbench+ #1 Not tainted
[   82.811711] -------------------------------------------------------
[   82.811714] kms_setmode/5859 is trying to acquire lock:
[   82.811717]  (&dev->struct_mutex){+.+.+.}, at: [<ffffffff8150d9c1>] drm_gem_mmap+0x1a1/0x270
[   82.811731]
but task is already holding lock:
[   82.811734]  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[   82.811745]
which lock already depends on the new lock.

[   82.811749]
the existing dependency chain (in reverse order) is:
[   82.811752]
-> #3 (&mm->mmap_sem){++++++}:
[   82.811761]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811766]        [<ffffffff8118bc65>] __might_fault+0x75/0xa0
[   82.811771]        [<ffffffff8124da4a>] kernfs_fop_write+0x8a/0x180
[   82.811787]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[   82.811792]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[   82.811797]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[   82.811801]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811807]
-> #2 (s_active#6){++++.+}:
[   82.811814]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811819]        [<ffffffff8124c070>] __kernfs_remove+0x210/0x2f0
[   82.811823]        [<ffffffff8124d040>] kernfs_remove_by_name_ns+0x40/0xa0
[   82.811828]        [<ffffffff8124e9e0>] sysfs_remove_file_ns+0x10/0x20
[   82.811832]        [<ffffffff815318d4>] device_del+0x124/0x250
[   82.811837]        [<ffffffff81531a19>] device_unregister+0x19/0x60
[   82.811841]        [<ffffffff8153c051>] cpu_cache_sysfs_exit+0x51/0xb0
[   82.811846]        [<ffffffff8153c628>] cacheinfo_cpu_callback+0x38/0x70
[   82.811851]        [<ffffffff8109ae89>] notifier_call_chain+0x39/0xa0
[   82.811856]        [<ffffffff8109aef9>] __raw_notifier_call_chain+0x9/0x10
[   82.811860]        [<ffffffff810786de>] cpu_notify+0x1e/0x40
[   82.811865]        [<ffffffff81078779>] cpu_notify_nofail+0x9/0x20
[   82.811869]        [<ffffffff81078ac3>] _cpu_down+0x233/0x340
[   82.811874]        [<ffffffff81079019>] disable_nonboot_cpus+0xc9/0x350
[   82.811878]        [<ffffffff810d2e11>] suspend_devices_and_enter+0x5a1/0xb50
[   82.811883]        [<ffffffff810d3903>] pm_suspend+0x543/0x8d0
[   82.811888]        [<ffffffff810d1b77>] state_store+0x77/0xe0
[   82.811892]        [<ffffffff813fa68f>] kobj_attr_store+0xf/0x20
[   82.811897]        [<ffffffff8124e740>] sysfs_kf_write+0x40/0x50
[   82.811902]        [<ffffffff8124dafc>] kernfs_fop_write+0x13c/0x180
[   82.811906]        [<ffffffff811d1023>] __vfs_write+0x23/0xe0
[   82.811910]        [<ffffffff811d1d74>] vfs_write+0xa4/0x190
[   82.811914]        [<ffffffff811d2c14>] SyS_write+0x44/0xb0
[   82.811918]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.811923]
-> #1 (cpu_hotplug.lock){+.+.+.}:
[   82.811929]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.811933]        [<ffffffff817b6f72>] mutex_lock_nested+0x62/0x3b0
[   82.811940]        [<ffffffff810784c1>] get_online_cpus+0x61/0x80
[   82.811944]        [<ffffffff811170eb>] stop_machine+0x1b/0xe0
[   82.811949]        [<ffffffffa0178edd>] gen8_ggtt_insert_entries__BKL+0x2d/0x30 [i915]
[   82.812009]        [<ffffffffa017d3a6>] ggtt_bind_vma+0x46/0x70 [i915]
[   82.812045]        [<ffffffffa017eb70>] i915_vma_bind+0x140/0x290 [i915]
[   82.812081]        [<ffffffffa01862b9>] i915_gem_object_do_pin+0x899/0xb00 [i915]
[   82.812117]        [<ffffffffa0186555>] i915_gem_object_pin+0x35/0x40 [i915]
[   82.812154]        [<ffffffffa019a23e>] intel_init_pipe_control+0xbe/0x210 [i915]
[   82.812192]        [<ffffffffa0197312>] intel_logical_rings_init+0xe2/0xde0 [i915]
[   82.812232]        [<ffffffffa0186fe3>] i915_gem_init+0xf3/0x130 [i915]
[   82.812278]        [<ffffffffa02097ed>] i915_driver_load+0xf2d/0x1770 [i915]
[   82.812318]        [<ffffffff81512474>] drm_dev_register+0xa4/0xb0
[   82.812323]        [<ffffffff8151467e>] drm_get_pci_dev+0xce/0x1e0
[   82.812328]        [<ffffffffa01472cf>] i915_pci_probe+0x2f/0x50 [i915]
[   82.812360]        [<ffffffff8143f907>] pci_device_probe+0x87/0xf0
[   82.812366]        [<ffffffff81535f89>] driver_probe_device+0x229/0x450
[   82.812371]        [<ffffffff81536233>] __driver_attach+0x83/0x90
[   82.812375]        [<ffffffff81533c61>] bus_for_each_dev+0x61/0xa0
[   82.812380]        [<ffffffff81535879>] driver_attach+0x19/0x20
[   82.812384]        [<ffffffff8153535f>] bus_add_driver+0x1ef/0x290
[   82.812388]        [<ffffffff81536e9b>] driver_register+0x5b/0xe0
[   82.812393]        [<ffffffff8143e83b>] __pci_register_driver+0x5b/0x60
[   82.812398]        [<ffffffff81514866>] drm_pci_init+0xd6/0x100
[   82.812402]        [<ffffffffa027c094>] 0xffffffffa027c094
[   82.812406]        [<ffffffff810003de>] do_one_initcall+0xae/0x1d0
[   82.812412]        [<ffffffff811595a0>] do_init_module+0x5b/0x1cb
[   82.812417]        [<ffffffff81106160>] load_module+0x1c20/0x2480
[   82.812422]        [<ffffffff81106bae>] SyS_finit_module+0x7e/0xa0
[   82.812428]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812433]
-> #0 (&dev->struct_mutex){+.+.+.}:
[   82.812439]        [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[   82.812443]        [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.812456]        [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[   82.812460]        [<ffffffff81196a14>] mmap_region+0x334/0x580
[   82.812466]        [<ffffffff81196fc4>] do_mmap+0x364/0x410
[   82.812470]        [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[   82.812474]        [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[   82.812479]        [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[   82.812484]        [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73
[   82.812489]
other info that might help us debug this:

[   82.812493] Chain exists of:
  &dev->struct_mutex --> s_active#6 --> &mm->mmap_sem

[   82.812502]  Possible unsafe locking scenario:

[   82.812506]        CPU0                    CPU1
[   82.812508]        ----                    ----
[   82.812510]   lock(&mm->mmap_sem);
[   82.812514]                                lock(s_active#6);
[   82.812519]                                lock(&mm->mmap_sem);
[   82.812522]   lock(&dev->struct_mutex);
[   82.812526]
 *** DEADLOCK ***

[   82.812531] 1 lock held by kms_setmode/5859:
[   82.812533]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117b364>] vm_mmap_pgoff+0x44/0xa0
[   82.812541]
stack backtrace:
[   82.812547] CPU: 0 PID: 5859 Comm: kms_setmode Not tainted 4.5.0-rc4-gfxbench+ #1
[   82.812550] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0040.2015.0814.1353 08/14/2015
[   82.812553]  0000000000000000 ffff880079407bf0 ffffffff813f8505 ffffffff825fb270
[   82.812560]  ffffffff825c4190 ffff880079407c30 ffffffff810c84ac ffff880079407c90
[   82.812566]  ffff8800797ed328 ffff8800797ecb00 0000000000000001 ffff8800797ed350
[   82.812573] Call Trace:
[   82.812578]  [<ffffffff813f8505>] dump_stack+0x67/0x92
[   82.812582]  [<ffffffff810c84ac>] print_circular_bug+0x1fc/0x310
[   82.812586]  [<ffffffff810cbe59>] __lock_acquire+0x1fc9/0x20f0
[   82.812590]  [<ffffffff810cc883>] lock_acquire+0xc3/0x1d0
[   82.812594]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[   82.812599]  [<ffffffff8150d9e7>] drm_gem_mmap+0x1c7/0x270
[   82.812603]  [<ffffffff8150d9c1>] ? drm_gem_mmap+0x1a1/0x270
[   82.812608]  [<ffffffff81196a14>] mmap_region+0x334/0x580
[   82.812612]  [<ffffffff81196fc4>] do_mmap+0x364/0x410
[   82.812616]  [<ffffffff8117b38d>] vm_mmap_pgoff+0x6d/0xa0
[   82.812629]  [<ffffffff811950f4>] SyS_mmap_pgoff+0x184/0x220
[   82.812633]  [<ffffffff8100a0fd>] SyS_mmap+0x1d/0x20
[   82.812637]  [<ffffffff817bb81b>] entry_SYSCALL_64_fastpath+0x16/0x73

Highly unlikely though this scenario is, we can avoid the issue entirely
by moving the copy operation from out under the kernfs_get_active()
tracking by assigning the preallocated buffer its own mutex. The
temporary buffer allocation doesn't require mutex locking as it is
entirely local.

The locked section was extended by the addition of the preallocated buf
to speed up md user operations in

commit 2b75869bba
Author: NeilBrown <neilb@suse.de>
Date:   Mon Oct 13 16:41:28 2014 +1100

    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-30 10:05:05 -07:00
Theodore Ts'o 7827a7f6eb ext4: clean up error handling when orphan list is corrupted
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted.  If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-30 00:49:54 -04:00
Theodore Ts'o c9eb13a910 ext4: fix hang when processing corrupted orphaned inode list
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly).  Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.

This can be reproduced via:

   mke2fs -t ext4 /tmp/foo.img 100
   debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
   mount -o loop /tmp/foo.img /mnt

(But don't do this if you are using an unpatched kernel if you care
about the system staying functional.  :-)

This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1].  (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)

[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf

Cc: stable@vger.kernel.org
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-30 00:48:54 -04:00
Linus Torvalds 1d003af2ef Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  Documentation/sysctl/vm.txt: update numa_zonelist_order description
  lib/stackdepot.c: allow the stack trace hash to be zero
  rapidio: fix potential NULL pointer dereference
  mm/memory-failure: fix race with compound page split/merge
  ocfs2/dlm: return zero if deref_done message is successfully handled
  Ananth has moved
  kcov: don't profile branches in kcov
  kcov: don't trace the code coverage code
  mm: wake kcompactd before kswapd's short sleep
  .mailmap: add Frank Rowand
  mm/hwpoison: fix wrong num_poisoned_pages accounting
  mm: call swap_slot_free_notify() with page lock held
  mm: vmscan: reclaim highmem zone if buffer_heads is over limit
  numa: fix /proc/<pid>/numa_maps for THP
  mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
  mailmap: fix Krzysztof Kozlowski's misspelled name
  thp: keep huge zero page pinned until tlb flush
  mm: exclude HugeTLB pages from THP page_mapped() logic
  kexec: export OFFSET(page.compound_head) to find out compound tail page
  kexec: update VMCOREINFO for compound_order/dtor
2016-04-29 11:21:22 -07:00
xuejiufei b73413647e ocfs2/dlm: return zero if deref_done message is successfully handled
dlm_deref_lockres_done_handler() should return zero if the message is
successfully handled.

Fixes: 60d663cb52 ("ocfs2/dlm: add DEREF_DONE message").
Signed-off-by: xuejiufei <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Gerald Schaefer 28093f9f34 numa: fix /proc/<pid>/numa_maps for THP
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>	[4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28 19:34:04 -07:00
Linus Torvalds 6fa9bffbcc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
 "There is a lifecycle fix in the auth code, a fix for a narrow race
  condition on map, and a helpful message in the log when there is a
  feature mismatch (which happens frequently now that the default
  server-side options have changed)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: report unsupported features to syslog
  rbd: fix rbd map vs notify races
  libceph: make authorizer destruction independent of ceph_auth_client
2016-04-28 18:59:24 -07:00
Jeff Mahoney db6711600e btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly
to userspace.  In order to decode them, full type information is required.

Create a new header, btrfs_tree to contain these since most users won't
need them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney 33ca913349 btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args
struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney 04cd01dffb btrfs: uapi/linux/btrfs.h migration, move balance flags
The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags
and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE
ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney 18db9ac644 btrfs: uapi/linux/btrfs.h migration, move feature flags
The compat/compat_ro/incompat feature flags are used by the feature set/get
ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney 83288b60bf btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which
fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Jeff Mahoney d4ae133b2d btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 11:06:41 +02:00
Anand Jain b5255456c5 btrfs: refactor btrfs_dev_replace_start for reuse
A refactor patch, and avoids user input verification in the
btrfs_dev_replace_start(), and so this function can be reused.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain fc23c246d7 btrfs: use fs_info directly
Local variable fs_info, contains root->fs_info, use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba 735654ea91 btrfs: rename flags for vol args v2
Rename BTRFS_DEVICE_BY_ID so it's more descriptive that we specify the
device by id, it'll be part of the public API. The mask of supported
flags is also renamed, only for internal use.

The error code for unknown flags is EOPNOTSUPP, fixed.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba 5c5c0df05d btrfs: rename btrfs_find_device_by_user_input
For clarity how we are going to find the device, let's call it a device
specifier, devspec for short. Also rename the arguments that are a
leftover from previous function purpose.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba 418775a22b btrfs: use existing device constraints table btrfs_raid_array
We should avoid duplicating the device constraints, let's use the
btrfs_raid_array in btrfs_check_raid_min_devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba 621292bae6 btrfs: introduce raid-type to error-code table, for minimum device constraint
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba 3cc31a0d5b btrfs: pass number of devices to btrfs_check_raid_min_devices
Before this patch, btrfs_check_raid_min_devices would do an off-by-one
check of the constraints and not the miminmum check, as its name
suggests. This is not a problem if the only caller is device remove, but
would be confusing for others.

Add an argument with the exact number and let the caller(s) decide if
this needs any adjustments, like when device replace is running.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba f47ab2588e btrfs: rename __check_raid_min_devices
Underscores are for special functions, use the full prefix for better
stacktrace recognition.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 02feae3c55 btrfs: optimize check for stale device
Optimize check for stale device to only be checked when there is device
added or changed. If there is no update to the device, there is no need
to call btrfs_free_stale_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 6b526ed70c btrfs: introduce device delete by devid
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
btrfs_ioctl_vol_args_v2 to carry devid as an user argument.

The patch won't delete the old ioctl interface and so kernel remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del  3  /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
[ adjust messages, s/disk/device/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 42b6742715 btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()
With the previous patches now the btrfs_scratch_superblocks() is ready to
be used in btrfs_rm_device() so use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain b3d1b1532f btrfs: enhance btrfs_find_device_by_user_input() to check device path
The operation of device replace and device delete follows same steps upto
some depth with in btrfs kernel, however they don't share codes. This
enhancement will help replace and delete to share codes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 24fc572fe4 btrfs: make use of btrfs_find_device_by_user_input()
btrfs_rm_device() has a section of the code which can be replaced
btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 24e0474b59 btrfs: create helper btrfs_find_device_by_user_input()
The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c, so that
delete device can use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain bd45ffbcb1 btrfs: clean up and optimize __check_raid_min_device()
__check_raid_min_device() which was pealed from btrfs_rm_device()
maintianed its original code to show the block move. This patch cleans up
__check_raid_min_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain f1fa7f2642 btrfs: create helper function __check_raid_min_devices()
move a section of btrfs_rm_device() code to check for min number of the
devices into the function __check_raid_min_devices()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain 6cf86a006b btrfs: create a helper function to read the disk super
A part of code from btrfs_scan_one_device() is moved to a new function
btrfs_read_disk_super(), so that former function looks cleaner. (In this
process it also moves the code which ensures null terminating label). So
this creates easy opportunity to merge various duplicate codes on read
disk super. Earlier attempt to merge duplicate codes highlighted that
there were some issues for which there are duplicate codes (to read disk
super), however it was not clear what was the issue. So until we figure
that out, its better to keep them in a separate functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:04 +02:00
Jan Kara 2a28900be2 udf: Export superblock magic to userspace
Currently UDF superblock magic doesn't appear in any userspace header
files and thus userspace apps have hard time checking for this fs. Let's
export the magic to userspace as with any other filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-28 10:43:31 +02:00
Liu Bo cf25ce518e Btrfs: do not create empty block group if we have allocated data
Now we force to create empty block group to keep data profile alive,
however, in the below example, we eventually get an empty block group
while we're trying to get more space for other types (metadata/system),

- Before,
block group "A": size=2G, used=1.2G
block group "B": size=2G, used=512M

- After "btrfs balance start -dusage=50 mount_point",
block group "A": size=2G, used=(1.2+0.5)G
block group "C": size=2G, used=0

Since there is no data in block group C, it won't be deleted
automatically and we have to get the unused 2G until the next mount.

Balance itself just moves data and doesn't remove data, so it's safe
to not create such a empty block group if we already have data
 allocated in other block groups.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Chandan Rajendra a2af23b7d7 Btrfs: __btrfs_buffered_write: Pass valid file offset when releasing delalloc space
The delalloc reserved space is calculated in terms of number of bytes
used by an integral number of blocks. This is done by rounding down the
value of 'pos' to the nearest multiple of sectorsize.

The file offset value held by 'pos' variable may not be aligned to
sectorsize and hence when passing it as an argument to
btrfs_delalloc_release_space(), we may end up releasing larger delalloc
space than we originally had reserved.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo 894b36e35a Btrfs: cleanup error handling in extent_write_cached_pages
Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Liu Bo a91326679f Btrfs: make mapping->writeback_index point to the last written page
If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.

In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.

For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.

To avoid it, we just need to continue writeback from the last written page.

This also updates btrfs to behave like what write_cache_pages() does, ie, bail
 out immediately if there is an error in writepage().

<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>

Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Luke Dashjr 4c63c2454e btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl
32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can
be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr
fail.

Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Cc: stable@vger.kernel.org
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:40:27 +02:00
Luis de Bethencourt 180e4d4700 btrfs: fix typos in comments
Correct a typo in the chunk_mutex name to make it grepable.

Since it is better to fix several typos at once, fixing the 2 more in the
same file.

Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Geert Uytterhoeven 6719afdcf8 Btrfs: Refactor btrfs_lock_cluster() to kill compiler warning
fs/btrfs/extent-tree.c: In function ‘btrfs_lock_cluster’:
fs/btrfs/extent-tree.c:6399: warning: ‘used_bg’ may be used uninitialized in this function

  - Replace "again: ... goto again;" by standard C "while (1) { ... }",
  - Move block not processed during the first iteration of the loop to the
    end of the loop, which allows to kill the "locked" variable,

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-and-Tested-by: Miao Xie <miaox@cn.fujitsu.com>
[ the compilation warning has been fixed by other patch, now we want to
  clean up the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain 0713d90c75 btrfs: remove save_error_info()
Actually save_error_info() sets the FS state to error and nothing else.
Further the word save doesn't induce caffeine when compared to the word
set in what actually it does.

So to make it better understandable move save_error_info() code to its
only consumer itself.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Satoru Takeuchi 13f48dc909 btrfs: Simplify conditions about compress while mapping btrfs flags to inode flags
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain c5f4ccb2f7 btrfs: move error handling code together in ctree.h
Looks like we added the incompatible defines in between the error
handling defines in the file ctree.h. Now group them back.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain 2351d743f6 btrfs: remove unused function btrfs_assert()
Apparently looks like ASSERT does the same intended job,
as intended btrfs_assert().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Anand Jain 34d9700702 btrfs: rename btrfs_std_error to btrfs_handle_fs_error
btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Yunlong Song fe216c7a0f f2fs: fix to return 0 if err == -ENOENT in f2fs_readdir
Commit 57b62d29ad ("f2fs: fix to report
error in f2fs_readdir") causes f2fs_readdir to return -ENOENT when
get_lock_data_page returns -ENOENT. However, the original logic is to
continue when get_lock_data_page returns -ENOENT, but it forgets to
reset err to 0.

This will cause getdents64 incorretly return -ENOENT when lastdirent is
NULL in getdents64. This will lead to a wrong return value for syscall
caller.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:43 -07:00
Chao Yu da011cc0da f2fs: move node pages only in victim section during GC
For foreground GC, we cache node blocks in victim section and set them
dirty, then we call sync_node_pages to flush these node pages, but
meanwhile, those node pages which does not locate in victim section
will be flushed together, so more bandwidth and continuous free space
would be occupied.

So for this condition, it's better to leave those unrelated node page
in cache for further write hit, and let CP or VM to flush them afterward.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:42 -07:00
Chao Yu a4a13f582c f2fs: be aware of invalid filename length
The filename length in dirent of may become zero-sized after random junk
data injection, once encounter such dirent, find_target_dentry or
f2fs_add_inline_entries will run into an infinite loop. So let f2fs being
aware of that to avoid deadloop.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-27 14:10:42 -07:00
David S. Miller c0cc53162a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.

In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:43:10 -04:00
Jakub Wilk 8d2ae1cbe8 ext4: remove trailing \n from ext4_warning/ext4_error calls
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-27 01:11:21 -04:00
Linus Torvalds 8ead9dd547 devpts: more pty driver interface cleanups
This is more prep-work for the upcoming pty changes.  Still just code
cleanup with no actual semantic changes.

This removes a bunch pointless complexity by just having the slave pty
side remember the dentry associated with the devpts slave rather than
the inode.  That allows us to remove all the "look up the dentry" code
for when we want to remove it again.

Together with moving the tty pointer from "inode->i_private" to
"dentry->d_fsdata" and getting rid of pointless inode locking, this
removes about 30 lines of code.  Not only is the end result smaller,
it's simpler and easier to understand.

The old code, for example, depended on the d_find_alias() to not just
find the dentry, but also to check that it is still hashed, which in
turn validated the tty pointer in the inode.

That is a _very_ roundabout way to say "invalidate the cached tty
pointer when the dentry is removed".

The new code just does

	dentry->d_fsdata = NULL;

in devpts_pty_kill() instead, invalidating the tty pointer rather more
directly and obviously.  Don't do something complex and subtle when the
obvious straightforward approach will do.

The rest of the patch (ie apart from code deletion and the above tty
pointer clearing) is just switching the calling convention to pass the
dentry or file pointer around instead of the inode.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-26 15:47:32 -07:00
Jaegeuk Kim 6bfc49197e f2fs: issue cache flush on direct IO
Under direct IO path with O_(D)SYNC, it needs to set proper APPEND or UPDATE
flags, so taht f2fs_sync_file can make its data safe.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:25:05 -07:00
Jaegeuk Kim 608514deba f2fs: set fsync mark only for the last dnode
In order to give atomic writes, we should consider power failure during
sync_node_pages in fsync.
So, this patch marks fsync flag only in the last dnode block.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:59 -07:00
Jaegeuk Kim c267ec1526 f2fs: report unwritten status in fsync_node_pages
The fsync_node_pages should return pass or failure so that user could know
fsync is completed or not.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:54 -07:00
Jaegeuk Kim 5268137564 f2fs: split sync_node_pages with fsync_node_pages
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:48 -07:00
Jaegeuk Kim e6e5f5610d f2fs: avoid writing 0'th page in volatile writes
The first page of volatile writes usually contains a sort of header information
which will be used for recovery.
(e.g., journal header of sqlite)

If this is written without other journal data, user needs to handle the stale
journal information.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:39 -07:00
Jaegeuk Kim eca76e783c f2fs: avoid needless lock for node pages when fsyncing a file
When fsync is called, sync_node_pages finds a proper direct node pages to flush.
But, it locks unrelated direct node pages together unnecessarily.

Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-26 14:24:30 -07:00
Nicolas Dichtel 3c6f3714d6 fs/quota: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Alden Tondettar a47241cdee udf: Prevent stack overflow on corrupted filesystem mount
Presently, a corrupted or malicious UDF filesystem containing a very large
number (or cycle) of Logical Volume Integrity Descriptor extent
indirections may trigger a stack overflow and kernel panic in
udf_load_logicalvolint() on mount.

Replace the unnecessary recursion in udf_load_logicalvolint() with
simple iteration. Set an arbitrary limit of 1000 indirections (which would
have almost certainly overflowed the stack without this fix), and treat
such cases as if there were no LVID.

Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-26 08:25:07 +02:00
Daeho Jeong c8585c6fca ext4: fix races between changing inode journal mode and ext4_writepages
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:22:35 -04:00
Daeho Jeong 4c54659269 ext4: handle unwritten or delalloc buffers before enabling data journaling
We already allocate delalloc blocks before changing the inode mode into
"per-file data journal" mode to prevent delalloc blocks from remaining
not allocated, but another issue concerned with "BH_Unwritten" status
still exists. For example, by fallocate(), several buffers' status
change into "BH_Unwritten", but these buffers cannot be processed by
ext4_alloc_da_blocks(). So, they still remain in unwritten status after
per-file data journaling is enabled and they cannot be changed into
written status any more and, if they are journaled and eventually
checkpointed, these unwritten buffer will cause a kernel panic by the
below BUG_ON() function of submit_bh_wbc() when they are submitted
during checkpointing.

static int submit_bh_wbc(int rw, struct buffer_head *bh,...
{
        ...
        BUG_ON(buffer_unwritten(bh));

Moreover, when "dioread_nolock" option is enabled, the status of a
buffer is changed into "BH_Unwritten" after write_begin() completes and
the "BH_Unwritten" status will be cleared after I/O is done. Therefore,
if a buffer's status is changed into unwrutten but the buffer's I/O is
not submitted and completed, it can cause the same problem after
enabling per-file data journaling. You can easily generate this bug by
executing the following command.

./kvm-xfstests -C 10000 -m nodelalloc,dioread_nolock generic/269

To resolve these problems and define a boundary between the previous
mode and per-file data journaling mode, we need to flush and wait all
the I/O of buffers of a file before enabling per-file data journaling
of the file.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:21:00 -04:00
Theodore Ts'o 7b8081912d ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
The function jbd2_journal_extend() takes as its argument the number of
new credits to be added to the handle.  We weren't taking into account
the currently unused handle credits; worse, we would try to extend the
handle by N credits when it had N credits available.

In the case where jbd2_journal_extend() fails because the transaction
is too large, when jbd2_journal_restart() gets called, the N credits
owned by the handle gets returned to the transaction, and the
transaction commit is asynchronously requested, and then
start_this_handle() will be able to successfully attach the handle to
the current transaction since the required credits are now available.

This is mostly harmless, but since ext4_ext_truncate_extend_restart()
returns EAGAIN, the truncate machinery will once again try to call
ext4_ext_truncate_extend_restart(), which will do the above sequence
over and over again until the transaction has committed.

This was found while I was debugging a lockup in caused by running
xfstests generic/074 in the data=journal case.  I'm still not sure why
we ended up looping forever, which suggests there may still be another
bug hiding in the transaction accounting machinery, but this commit
prevents us from looping in the first place.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-25 23:13:17 -04:00
Ilya Dryomov 6c1ea260f8 libceph: make authorizer destruction independent of ceph_auth_client
Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
    [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
    [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
    [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
    [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
    [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
    [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang <alan.zhang@linux.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-04-25 20:54:13 +02:00
Andrew Gabbasov c26f6c6157 udf: Fix conversion of 'dstring' fields to UTF8
Commit 9293fcfbc1
("udf: Remove struct ustr as non-needed intermediate storage"),
while getting rid of 'struct ustr', does not take any special care
of 'dstring' fields and effectively use fixed field length instead
of actual string length, encoded in the last byte of the field.

Also, commit 484a10f493
("udf: Merge linux specific translation into CS0 conversion function")
introduced checking of the length of the string being converted,
requiring proper alignment to number of bytes constituing each
character.

The UDF volume identifier is represented as a 32-bytes 'dstring',
and needs to be converted from CS0 to UTF8, while mounting UDF
filesystem. The changes in mentioned commits can in some cases
lead to incorrect handling of volume identifier:
- if the actual string in 'dstring' is of maximal length and
does not have zero bytes separating it from dstring encoded
length in last byte, that last byte may be included in conversion,
thus making incorrect resulting string;
- if the identifier is encoded with 2-bytes characters (compression
code is 16), the length of 31 bytes (32 bytes of field length minus
1 byte of compression code), taken as the string length, is reported
as an incorrect (unaligned) length, and the conversion fails, which
in its turn leads to volume mounting failure.

This patch introduces handling of 'dstring' encoded length field
in udf_CS0toUTF8 function, that is used in all and only cases
when 'dstring' fields are converted. Currently these cases are
processing of Volume Identifier and Volume Set Identifier fields.
The function is also renamed to udf_dstrCS0toUTF8 to distinctly
indicate that it handles 'dstring' input.

Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-04-25 15:18:50 +02:00
Ashish Samant 2c932d4c91 fuse: Fix return value from fuse_get_user_pages()
fuse_get_user_pages() should return error or 0. Otherwise fuse_direct_io
read will not return 0 to indicate that read has completed.

Fixes: 742f992708 ("fuse: return patrial success from fuse_direct_io()")
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-04-25 13:01:04 +02:00
Jan Kara ee0876bc69 ext4: do not ask jbd2 to write data for delalloc buffers
Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:08 -04:00
Jan Kara 41617e1a8d jbd2: add support for avoiding data writes during transaction commits
Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.

Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:07 -04:00
Jan Kara 3957ef53a5 ext4: remove EXT4_STATE_ORDERED_MODE
This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:05 -04:00
Jan Kara 06bd3c36a7 ext4: fix data exposure after a crash
Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a6 removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a6
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-24 00:56:03 -04:00
Theodore Ts'o 1f60fbe727 ext4: allow readdir()'s of large empty directories to be interrupted
If a directory has a large number of empty blocks, iterating over all
of them can take a long time, leading to scheduler warnings and users
getting irritated when they can't kill a process in the middle of one
of these long-running readdir operations.  Fix this by adding checks to
ext4_readdir() and ext4_htree_fill_tree().

This was reverted earlier due to a typo in the original commit where I
experimented with using signal_pending() instead of
fatal_signal_pending().  The test was in the wrong place if we were
going to return signal_pending() since we would end up returning
duplicant entries.  See 9f2394c9be for a more detailed explanation.

Added fix as suggested by Linus to check for signal_pending() in
in the filldir() functions.

Reported-by: Benjamin LaHaise <bcrl@kvack.org>
Google-Bug-Id: 27880676
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-23 22:50:07 -04:00
David S. Miller 1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Yan, Zheng b971e94e8f ceph: kill __ceph_removexattr()
when removing a xattr, generic_removexattr() calls __ceph_setxattr()
with NULL value and XATTR_REPLACE flag. __ceph_removexattr() is not
used any more.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher 2cdeb1e472 ceph: Switch to generic xattr handlers
Add a catch-all xattr handler at the end of ceph_xattr_handlers.  Check
for valid attribute names there, and remove those checks from
__ceph_{get,set,remove}xattr instead.  No "system.*" xattrs need to be
handled by the catch-all handler anymore.

The set xattr handler is called with a NULL value to indicate that the
attribute should be removed; __ceph_setxattr already handles that case
correctly (ceph_set_acl could already calling __ceph_setxattr with a NULL
value).

Move the check for snapshots from ceph_{set,remove}xattr into
__ceph_{set,remove}xattr.  With that, ceph_{get,set,remove}xattr can be
replaced with the generic iops.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher a26feccaba ceph: Get rid of d_find_alias in ceph_set_acl
Create a variant of ceph_setattr that takes an inode instead of a
dentry.  Change __ceph_setxattr (and also __ceph_removexattr) to take an
inode instead of a dentry.  Use those in ceph_set_acl so that we no
longer need a dentry there.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:41:30 -04:00
Andreas Gruenbacher a9ae008f40 cifs: Switch to generic xattr handlers
Use xattr handlers for resolving attribute names.  The amount of setup
code required on cifs is nontrivial, so use the same get and set
functions for all handlers, with switch statements for the different
types of attributes in them.

The set_EA handler can handle NULL values, so we don't need a separate
removexattr function anymore.  Remove the cifs_dbg statements related to
xattr name resolution; they don't add much.  Don't build xattr.o when
CONFIG_CIFS_XATTR is not defined.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher 534bb0c7bd cifs: Fix removexattr for os2.* xattrs
If cifs_removexattr finds a "user." or "os2." xattr name prefix, it
skips 5 bytes, one byte too many for "os2.".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher 45987e006c cifs: Check for equality with ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT
The two values ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT are meant to be
enumerations, not bits in a bit mask.  Use '==' instead of '&' to check
for these values.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Andreas Gruenbacher d9a1548921 cifs: Fix xattr name checks
Use strcmp(str, name) instead of strncmp(str, name, strlen(name)) for
checking if str and name are the same (as opposed to name being a prefix
of str) in the gexattr and setxattr inode operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-23 15:33:03 -04:00
Herbert Xu e81f3340bb eCryptfs: Do not allocate hash tfm in NORECLAIM context
You cannot allocate crypto tfm objects in NORECLAIM or NOFS contexts.
The ecryptfs code currently does exactly that for the MD5 tfm.

This patch fixes it by preallocating the MD5 tfm in a safe context.

The MD5 tfm is also reentrant so this patch removes the superfluous
cs_hash_tfm_mutex.

Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-04-20 17:50:01 +08:00
Daniel DeFreez 9c7fe83530 GFS2: Add calls to gfs2_holder_uninit in two error handlers
This patch fixes two locations that do not call gfs2_holder_uninit
if gfs2_glock_nq returns an error.

Signed-off-by: Daniel DeFreez <dcdefreez@ucdavis.edu>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-19 19:59:11 -04:00
Linus Torvalds 9a0e3eea25 Merge branch 'ptmx-cleanup'
Merge the ptmx internal interface cleanup branch.

This doesn't change semantics, but it should be a sane basis for
eventually getting the multi-instance devpts code into some sane shape
where we can get rid of the kernel config option.  Which we can
hopefully get done next merge window..

* ptmx-cleanup:
  devpts: clean up interface to pty drivers
2016-04-19 16:36:18 -07:00
Linus Torvalds 67245ff332 devpts: clean up interface to pty drivers
This gets rid of the horrible notion of having that

    struct inode *ptmx_inode

be the linchpin of the interface between the pty code and devpts.

By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
and we will have a much saner way forward.  In particular, this will
allow us to associate with any particular devpts instance at open-time,
and not be artificially tied to one particular ptmx inode.

The patch itself is actually fairly straightforward, and apart from some
locking and return path cleanups it's pretty mechanical:

 - the interfaces that devpts exposes all take "struct pts_fs_info *"
   instead of "struct inode *ptmx_inode" now.

   NOTE! The "struct pts_fs_info" thing is a completely opaque structure
   as far as the pty driver is concerned: it's still declared entirely
   internally to devpts. So the pty code can't actually access it in any
   way, just pass it as a "cookie" to the devpts code.

 - the "look up the pts fs info" is now a single clear operation, that
   also does the reference count increment on the pts superblock.

   So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
   ref" operation (devpts_get_ref(inode)), along with a "put ref" op
   (devpts_put_ref()).

 - the pty master "tty->driver_data" field now contains the pts_fs_info,
   not the ptmx inode.

 - because we don't care about the ptmx inode any more as some kind of
   base index, the ref counting can now drop the inode games - it just
   gets the ref on the superblock.

 - the pts_fs_info now has a back-pointer to the super_block. That's so
   that we can easily look up the information we actually need. Although
   quite often, the pts fs info was actually all we wanted, and not having
   to look it up based on some magical inode makes things more
   straightforward.

In particular, now that "devpts_get_ref(inode)" operation should really
be the *only* place we need to look up what devpts instance we're
associated with, and we do it exactly once, at ptmx_open() time.

The other side of this is that one ptmx node could now be associated
with multiple different devpts instances - you could have a single
/dev/ptmx node, and then have multiple mount namespaces with their own
instances of devpts mounted on /dev/pts/.  And that's all perfectly sane
in a model where we just look up the pts instance at open time.

This will eventually allow us to get rid of our odd single-vs-multiple
pts instance model, but this patch in itself changes no semantics, only
an internal binding model.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Aurelien Jarno <aurelien@aurel32.net>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jann Horn <jann@thejh.net>
Cc: Greg KH <greg@kroah.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-18 13:43:02 -07:00
Greg Kroah-Hartman 5614e77258 Merge 4.6-rc4 into driver-core-next
We want those fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-19 04:28:28 +09:00
Masanari Iida c19ca6cb4c treewide: Fix typos in printk
This patch fix spelling typos found in printk
within various part of the kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:23:24 +02:00
Jiri Kosina 9938b04472 Merge branch 'master' into for-next
Sync with Linus' tree so that patches against newer codebase can be applied.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:18:55 +02:00
Masanari Iida bd7ced9881 Doc: treewide : Fix typos in DocBook/filesystem.xml
This patch fix spelling typos found in DocBook/filesystem.xml.
It is because the file was generated from comments in code,
I have to fix the comments in codes, instead of xml file.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:13:05 +02:00
Linus Torvalds e1e22b27ec "driver core" fixes for 4.6-rc4
Here are 3 small fixes 4.6-rc4.  Two fix up some lz4 issues with big
 endian systems, and the remaining one resolves a minor debugfs issue
 that was reported.
 
 All have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlcS3ucACgkQMUfUDdst+ynw1QCbBGgbY7Xt08whNFcAP81z1Q5X
 fmsAn1fyrhfsxe+JnybzswsTOjFw99Xd
 =YX6h
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull misc fixes from Greg KH:
 "Here are three small fixes for 4.6-rc4.

  Two fix up some lz4 issues with big endian systems, and the remaining
  one resolves a minor debugfs issue that was reported.

  All have been in linux-next with no reported issues"

* tag 'driver-core-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  lib: lz4: cleanup unaligned access efficiency detection
  lib: lz4: fixed zram with lz4 on big endian machines
  debugfs: Make automount point inodes permanently empty
2016-04-16 20:53:50 -07:00
Jaegeuk Kim c27753d675 f2fs: flush dirty pages before starting atomic writes
If somebody wrote some data before atomic writes, we should flush them in order
to handle atomic data in a right period.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim 63c52d7878 f2fs: don't invalidate atomic page if successful
If we committed atomic write successfully, we don't need to invalidate pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim 58457f1c35 f2fs: give -E2BIG for no space in xattr
This patch returns -E2BIG if there is no space to add an xattr entry.
This should fix generic/026 in xfstests as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:50:25 -07:00
Jaegeuk Kim 4da7bf5a43 f2fs: remove redundant condition check
This patch resolves the redundant condition check reported by David.

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim 26dc3d4424 f2fs: unset atomic/volatile flag in f2fs_release_file
The atomic/volatile operation should be done in pair of start and commit
ioctl.
For example, if a killed process remains open-ended atomic operation, we should
drop its flag as well as its atomic data. Otherwise, if sqlite initiates another
operation which doesn't require atomic writes, it will lose every data, since
f2fs still treats with them as atomic writes; nobody will trigger its commit.

Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim de5307e46d f2fs: fix dropping inmemory pages in a wrong time
When one reader closes its file while the other writer is doing atomic writes,
f2fs_release_file drops atomic data resulting in an empty commit.
This patch fixes this wrong commit problem by checking openess of the file.

 Process0                       Process1
 				open file
 start atomic write
 write data
 read data
				close file
				f2fs_release_file()
				clear atomic data
 commit atomic write

Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim ff37355886 f2fs: add BUG_ON to avoid unnecessary flow
This patch adds BUG_ON instead of retrying loop.
In the case of node pages, we already got this inode page, but unlocked it.
By the fact that we don't truncate any node pages in operations, the page's
mapping should be unchangeable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim 4a6de50d54 f2fs: use PGP_LOCK to check its truncation
Previously, after trylock_page is succeeded, it doesn't check its mapping.
In order to fix that, we can just give PGP_LOCK to pagecache_get_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Chao Yu 675f10bde6 f2fs: fix to convert inline directory correctly
With below serials, we will lose parts of dirents:

1) mount f2fs with inline_dentry option
2) echo 1 > /sys/fs/f2fs/sdX/dir_level
3) mkdir dir
4) touch 180 files named [1-180] in dir
5) touch 181 in dir
6) echo 3 > /proc/sys/vm/drop_caches
7) ll dir

ls: cannot access 2: No such file or directory
ls: cannot access 4: No such file or directory
ls: cannot access 5: No such file or directory
ls: cannot access 6: No such file or directory
ls: cannot access 8: No such file or directory
ls: cannot access 9: No such file or directory
...
total 360
drwxr-xr-x 2 root root 4096 Feb 19 15:12 ./
drwxr-xr-x 3 root root 4096 Feb 19 15:11 ../
-rw-r--r-- 1 root root    0 Feb 19 15:12 1
-rw-r--r-- 1 root root    0 Feb 19 15:12 10
-rw-r--r-- 1 root root    0 Feb 19 15:12 100
-????????? ? ?    ?       ?            ? 101
-????????? ? ?    ?       ?            ? 102
-????????? ? ?    ?       ?            ? 103
...

The reason is: when doing the inline dir conversion, we didn't consider
that directory has hierarchical hash structure which can be configured
through sysfs interface 'dir_level'.

By default, dir_level of directory inode is 0, it means we have one bucket
in hash table located in first level, all dirents will be hashed in this
bucket, so it has no problem for us to do the duplication simply between
inline dentry page and converted normal dentry page.

However, if we configured dir_level with the value N (greater than 0), it
will expand the bucket number of first level hash table by 2^N - 1, it
hashs dirents into different buckets according their hash value, if we
still move all dirents to first bucket, it makes incorrent locating for
inline dirents, the result is, although we can iterate all dirents through
->readdir, we can't stat some of them in ->lookup which based on hash
table searching.

This patch fixes this issue by rehashing dirents into correct position
when converting inline directory.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim 8c11a53fc2 f2fs: show current mount status
This patch remains the current mount status to f2fs status info.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim faa0e55bba f2fs: treat as a normal umount when remounting ro
When user remounts f2fs as read-only, we can mark the checkpoint as umount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim 6781eabba1 f2fs: give -EINVAL for norecovery and rw mount
Once detecting something to recover, f2fs should stop mounting, given norecovery
and rw mount options.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim df728b0f69 f2fs: recover superblock at RW remounts
This patch adds a sbi flag, SBI_NEED_SB_WRITE, which indicates it needs to
recover superblock when (re)mounting as RW. This is set only when f2fs is
mounted as RO.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Jaegeuk Kim f2353d7bd4 f2fs: give RO message when recovering superblock
When one of superblocks is missing, f2fs recovers it with the valid one.
But, even if f2fs is mounted as RO, we'd better notify that too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-15 08:49:47 -07:00
Linus Torvalds dfe70581c1 for-linus-4.6-rc4
These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4 crypto,
 which have not yet been fully propagated as follows.
  - use of dget_parent and file_dentry to avoid crashes
  - disallow RCU-mode lookup in d_invalidate
  - disallow -ENOMEM in the core data encryption path
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXDbkhAAoJEEAUqH6CSFDS6yMP/1L+1PcmTJKloe0wCft57fL0
 hXRXI57DeNwCxc9fKmaBnAn9uv96qc7fWq1oMqT0oYPTf8qGUaIc0bALxx/IFCwh
 Ul1QOzlGJmomPq4kGleOEPmDGBGUwDYgH55rEadhmevbOMX5IGQOSDGu9qxZ+LMA
 orYTYTHX5tjeR1UAxzjMkwa0WRyYJGRsqqycgIl8f/kIg06GIpxOx1IIuFQOBv1N
 BJlkKbbiX/snWKXXVCNlFzVAZOWjpgzfvCGB0Fge4HZlMZL876jKIj+4dVmQ5xiI
 K6i5NXk1U13hhQEFzK4McicHBwlMnMxeMbdR8wm1vQ/4UfQFZoQAeClAxz5/ZyPr
 lHi+VamyTbMfRwijp49hNQ8SKjMKPd70yFd2d2csQAwVra2t4fmXqieh6AEppkJb
 EDeFAK8Y+PGis6XVnAJ1wkXW7OSwzfJUVAmj3oFsZ7Onny+mm++f+i7a7I3RKUib
 SH23Imv1OQaoGcAFVSCYe/JP9dcfKUwzSFFYOiQneqdNTNN8rzoPYYhQ4hNCKv6/
 zHSEI2EcpGipmSTh4dROPJvNf42QG8Sw4lEuyGsXy251t7wxBrc7e43BzXtjyKsl
 /2d5hgAQbR9HeU3Oz41Dq/LEjReEAcdJDKMETP9/qrzs7qDJgtI0XoUkVSn3NyOR
 i3JSEix2JljR97DzUIW/
 =ML1+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs/fscrypto fixes from Jaegeuk Kim:
 "In addition to f2fs/fscrypto fixes, I've added one patch which
  prevents RCU mode lookup in d_revalidate, as Al mentioned.

  These patches fix f2fs and fscrypto based on -rc3 bug fixes in ext4
  crypto, which have not yet been fully propagated as follows.

   - use of dget_parent and file_dentry to avoid crashes
   - disallow RCU-mode lookup in d_invalidate
   - disallow -ENOMEM in the core data encryption path"

* tag 'for-linus-4.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  ext4/fscrypto: avoid RCU lookup in d_revalidate
  fscrypto: don't let data integrity writebacks fail with ENOMEM
  f2fs: use dget_parent and file_dentry in f2fs_file_open
  fscrypto: use dget_parent() in fscrypt_d_revalidate()
2016-04-14 18:22:42 -07:00
Linus Torvalds 34dbbcdbf6 Make file credentials available to the seqfile interfaces
A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.

The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.

So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones.  Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.

It turns out that seq_file _does_ save away the user_ns information of
the file, though.  Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-14 12:56:09 -07:00
Bob Peterson e97321fa09 GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid
Function gfs2_inode_lookup was dereferencing the inode, and after,
it checks for the value being NULL. We need to check that first.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2016-04-14 09:52:50 -04:00
Hannes Frederic Sowa fafc4e1ea1 sock: tigthen lockdep checks for sock_owned_by_user
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.

Cc: linux-cifs@vger.kernel.org
Cc: linux-bluetooth@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 22:37:20 -04:00
Jaegeuk Kim 03a8bb0e53 ext4/fscrypto: avoid RCU lookup in d_revalidate
As Al pointed, d_revalidate should return RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d6662 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable <stable@vger.kernel.org>
2016-04-12 20:01:35 -07:00
Seth Forshee 87243deb88 debugfs: Make automount point inodes permanently empty
Starting with 4.1 the tracing subsystem has its own filesystem
which is automounted in the tracing subdirectory of debugfs.
Prior to this debugfs could be bind mounted in a cloned mount
namespace, but if tracefs has been mounted under debugfs this
now fails because there is a locked child mount. This creates
a regression for container software which bind mounts debugfs
to satisfy the assumption of some userspace software.

In other pseudo filesystems such as proc and sysfs we're already
creating mountpoints like this in such a way that no dirents can
be created in the directories, allowing them to be exceptions to
some MNT_LOCKED tests. In fact we're already do this for the
tracefs mountpoint in sysfs.

Do the same in debugfs_create_automount(), since the intention
here is clearly to create a mountpoint. This fixes the regression,
as locked child mounts on permanently empty directories do not
cause a bind mount to fail.

Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 15:01:53 -07:00
Nicolai Stange c4a74f63df debugfs: unproxify files created through debugfs_create_u32_array()
The struct file_operations u32_array_fops associated with files created
through debugfs_create_u32_array() has been lifetime aware already:
everything needed for subsequent operation is copied to a ->f_private
buffer at file opening time in u32_array_open(). Now, ->open() is always
protected against file removal issues by the debugfs core.

There is no need for the debugfs core to wrap the u32_array_fops
with a file lifetime managing proxy.

Make debugfs_create_u32_array() create its files in non-proxying operation
mode by means of debugfs_create_file_unsafe().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange 83b711cbf4 debugfs: unproxify files created through debugfs_create_blob()
Currently, the struct file_operations fops_blob associated with files
created through the debugfs_create_blob() helpers are not file
lifetime aware.

Thus, a lifetime managing proxy is created around fops_blob each time such
a file is opened which is an unnecessary waste of resources.

Implement file lifetime management for the fops_bool file_operations.
Namely, make read_file_blob() safe gainst file removals by means of
debugfs_use_file_start() and debugfs_use_file_finish().

Make debugfs_create_blob() create its files in non-proxying operation mode
by means of debugfs_create_file_unsafe().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange 4d45f7974c debugfs: unproxify files created through debugfs_create_bool()
Currently, the struct file_operations fops_bool associated with files
created through the debugfs_create_bool() helpers are not file
lifetime aware.

Thus, a lifetime managing proxy is created around fops_bool each time such
a file is opened which is an unnecessary waste of resources.

Implement file lifetime management for the fops_bool file_operations.
Namely, make debugfs_read_file_bool() and debugfs_write_file_bool() safe
against file removals by means of debugfs_use_file_start() and
debugfs_use_file_finish().

Make debugfs_create_bool() create its files in non-proxying operation mode
through debugfs_create_mode_unsafe().

Finally, purge debugfs_create_mode() as debugfs_create_bool() had been its
last user.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange 4909f16810 debugfs: unproxify integer attribute files
Currently, the struct file_operations associated with the integer attribute
style files created through the debugfs_create_*() helpers are not file
lifetime aware as they are defined by means of DEFINE_SIMPLE_ATTRIBUTE().

Thus, a lifetime managing proxy is created around the original fops each
time such a file is opened which is an unnecessary waste of resources.

Migrate all usages of DEFINE_SIMPLE_ATTRIBUTE() within debugfs itself
to DEFINE_DEBUGFS_ATTRIBUTE() in order to implement file lifetime managing
within the struct file_operations thus defined.

Introduce the debugfs_create_mode_unsafe() helper, analogous to
debugfs_create_mode(), but distinct in that it creates the files in
non-proxying operation mode through debugfs_create_file_unsafe().

Feed all struct file_operations migrated to DEFINE_DEBUGFS_ATTRIBUTE()
into debugfs_create_mode_unsafe() instead of former debugfs_create_mode().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange c646880814 debugfs: add support for self-protecting attribute file fops
In order to protect them against file removal issues, debugfs_create_file()
creates a lifetime managing proxy around each struct file_operations
handed in.

In cases where this struct file_operations is able to manage file lifetime
by itself already, the proxy created by debugfs is a waste of resources.

The most common class of struct file_operations given to debugfs are those
defined by means of the DEFINE_SIMPLE_ATTRIBUTE() macro.

Introduce a DEFINE_DEBUGFS_ATTRIBUTE() macro to allow any
struct file_operations of this class to be easily made file lifetime aware
and thus, to be operated unproxied.

Specifically, introduce debugfs_attr_read() and debugfs_attr_write()
which wrap simple_attr_read() and simple_attr_write() under the protection
of a debugfs_use_file_start()/debugfs_use_file_finish() pair.

Make DEFINE_DEBUGFS_ATTRIBUTE() set the defined struct file_operations'
->read() and ->write() members to these wrappers.

Export debugfs_create_file_unsafe() in order to allow debugfs users to
create their files in non-proxying operation mode.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange 49d200deaa debugfs: prevent access to removed files' private data
Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
still be attempted to access associated private file data through
previously opened struct file objects. If that data has been freed by
the caller of debugfs_remove*() in the meanwhile, the reading/writing
process would either encounter a fault or, if the memory address in
question has been reassigned again, unrelated data structures could get
overwritten.

However, since debugfs files are seldomly removed, usually from module
exit handlers only, the impact is very low.

Currently, there are ~1000 call sites of debugfs_create_file() spread
throughout the whole tree and touching all of those struct file_operations
in order to make them file removal aware by means of checking the result of
debugfs_use_file_start() from within their methods is unfeasible.

Instead, wrap the struct file_operations by a lifetime managing proxy at
file open:
- In debugfs_create_file(), the original fops handed in has got stashed
  away in ->d_fsdata already.
- In debugfs_create_file(), install a proxy file_operations factory,
  debugfs_full_proxy_file_operations, at ->i_fop.

This proxy factory has got an ->open() method only. It carries out some
lifetime checks and if successful, dynamically allocates and sets up a new
struct file_operations proxy at ->f_op. Afterwards, it forwards to the
->open() of the original struct file_operations in ->d_fsdata, if any.

The dynamically set up proxy at ->f_op has got a lifetime managing wrapper
set for each of the methods defined in the original struct file_operations
in ->d_fsdata.

Its ->release()er frees the proxy again and forwards to the original
->release(), if any.

In order not to mislead the VFS layer, it is strictly necessary to leave
those fields blank in the proxy that have been NULL in the original
struct file_operations also, i.e. aren't supported. This is why there is a
need for dynamically allocated proxies. The choice made not to allocate a
proxy instance for every dentry at file creation, but for every
struct file object instantiated thereof is justified by the expected usage
pattern of debugfs, namely that in general very few files get opened more
than once at a time.

The wrapper methods set in the struct file_operations implement lifetime
managing by means of the SRCU protection facilities already in place for
debugfs:
They set up a SRCU read side critical section and check whether the dentry
is still alive by means of debugfs_use_file_start(). If so, they forward
the call to the original struct file_operation stored in ->d_fsdata, still
under the protection of the SRCU read side critical section.
This SRCU read side critical section prevents any pending debugfs_remove()
and friends to return to their callers. Since a file's private data must
only be freed after the return of debugfs_remove(), the ongoing proxied
call is guarded against any file removal race.

If, on the other hand, the initial call to debugfs_use_file_start() detects
that the dentry is dead, the wrapper simply returns -EIO and does not
forward the call. Note that the ->poll() wrapper is special in that its
signature does not allow for the return of arbitrary -EXXX values and thus,
POLLHUP is returned here.

In order not to pollute debugfs with wrapper definitions that aren't ever
needed, I chose not to define a wrapper for every struct file_operations
method possible. Instead, a wrapper is defined only for the subset of
methods which are actually set by any debugfs users.
Currently, these are:

  ->llseek()
  ->read()
  ->write()
  ->unlocked_ioctl()
  ->poll()

The ->release() wrapper is special in that it does not protect the original
->release() in any way from dead files in order not to leak resources.
Thus, any ->release() handed to debugfs must implement file lifetime
management manually, if needed.
For only 33 out of a total of 434 releasers handed in to debugfs, it could
not be verified immediately whether they access data structures that might
have been freed upon a debugfs_remove() return in the meanwhile.

Export debugfs_use_file_start() and debugfs_use_file_finish() in order to
allow any ->release() to manually implement file lifetime management.

For a set of common cases of struct file_operations implemented by the
debugfs_core itself, future patches will incorporate file lifetime
management directly within those in order to allow for their unproxied
operation. Rename the original, non-proxying "debugfs_create_file()" to
"debugfs_create_file_unsafe()" and keep it for future internal use by
debugfs itself. Factor out code common to both into the new
__debugfs_create_file().

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Nicolai Stange 9fd4dcece4 debugfs: prevent access to possibly dead file_operations at file open
Nothing prevents a dentry found by path lookup before a return of
__debugfs_remove() to actually get opened after that return. Now, after
the return of __debugfs_remove(), there are no guarantees whatsoever
regarding the memory the corresponding inode's file_operations object
had been kept in.

Since __debugfs_remove() is seldomly invoked, usually from module exit
handlers only, the race is hard to trigger and the impact is very low.

A discussion of the problem outlined above as well as a suggested
solution can be found in the (sub-)thread rooted at

  http://lkml.kernel.org/g/20130401203445.GA20862@ZenIV.linux.org.uk
  ("Yet another pipe related oops.")

Basically, Greg KH suggests to introduce an intermediate fops and
Al Viro points out that a pointer to the original ones may be stored in
->d_fsdata.

Follow this line of reasoning:
- Add SRCU as a reverse dependency of DEBUG_FS.
- Introduce a srcu_struct object for the debugfs subsystem.
- In debugfs_create_file(), store a pointer to the original
  file_operations object in ->d_fsdata.
- Make debugfs_remove() and debugfs_remove_recursive() wait for a
  SRCU grace period after the dentry has been delete()'d and before they
  return to their callers.
- Introduce an intermediate file_operations object named
  "debugfs_open_proxy_file_operations". It's ->open() functions checks,
  under the protection of a SRCU read lock, whether the dentry is still
  alive, i.e. has not been d_delete()'d and if so, tries to acquire a
  reference on the owning module.
  On success, it sets the file object's ->f_op to the original
  file_operations and forwards the ongoing open() call to the original
  ->open().
- For clarity, rename the former debugfs_file_operations to
  debugfs_noop_file_operations -- they are in no way canonical.

The choice of SRCU over "normal" RCU is justified by the fact, that the
former may also be used to protect ->i_private data from going away
during the execution of a file's readers and writers which may (and do)
sleep.

Finally, introduce the fs/debugfs/internal.h header containing some
declarations internal to the debugfs implementation.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12 14:14:21 -07:00
Jaegeuk Kim b32e4482aa fscrypto: don't let data integrity writebacks fail with ENOMEM
This patch fixes the issue introduced by the ext4 crypto fix in a same manner.
For F2FS, however, we flush the pending IOs and wait for a while to acquire free
memory.

Fixes: c9af28fdd4 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-04-12 10:25:30 -07:00