linux_old1

Commit Graph

Author	SHA1	Message	Date
David Sterba	a1ee736268	btrfs: do an allocation earlier during snapshot creation We can allocate pending_snapshot earlier and do not have to do cleanup in case of failure. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:20:54 +01:00
David Sterba	4fb72bf2e9	btrfs: use smaller type for btrfs_path locks The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see: * overall size of btrfs_path drops down from 136 to 112 (-24 bytes), * better packing in a slab page +6 objects * the whole structure now fits to 2 cachelines * slight decrease in code size: text data bss dec hex filename 938731 43670 23144 1005545 f57e9 fs/btrfs/btrfs.ko.before 938203 43670 23144 1005017 f55d9 fs/btrfs/btrfs.ko.after (and the generated assembly does not change much) The main purpose is to decrease the size of the structure without affecting performance. The byte access is usually well behaving accross arches, the locks are not accessed frequently and sometimes just compared to zero. Note for further size reduction attempts: the slots could be made u16 but this might generate worse code on some arches (non-byte and non-int access). Also the range of operations on slots is wider compared to locks and the potential performance drop should be evaluated first. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:17 +01:00
David Sterba	7853f15b2a	btrfs: use smaller type for btrfs_path lowest_level The level is 0..7, we can use smaller type. The size of btrfs_path is now 136 bytes from 144, which is +2 objects that fit into a 4k slab. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:17 +01:00
David Sterba	dccabfad20	btrfs: use smaller type for btrfs_path reada The possible values for reada are all positive and bounded, we can later save some bytes by storing it in u8. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:16 +01:00
David Sterba	e4058b54d1	btrfs: cleanup, use enum values for btrfs_path reada Replace the integers by enums for better readability. The value 2 does not have any meaning since `a717531942` "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:15 +01:00
David Sterba	4d4ab6d6bc	btrfs: constify static arrays There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:15 +01:00
David Sterba	20e5506baf	btrfs: constify remaining structs with function pointers * struct extent_io_ops * struct btrfs_free_space_op Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:14 +01:00
David Sterba	28f0779a3f	btrfs tests: replace whole ops structure for free space tests Preparatory work for making btrfs_free_space_op constant. In test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with own version thus preventing constification. We can rework it so we replace the whole structure with the correct function pointers. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 15:01:14 +01:00
Geliang Tang	a7ca42256d	btrfs: use list_for_each_entry* in backref.c Use list_for_each_entry*() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:42:46 +01:00
Geliang Tang	7ae1681e12	btrfs: use list_for_each_entry_safe in free-space-cache.c Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:39:09 +01:00
Geliang Tang	b69f2bef48	btrfs: use list_for_each_entry* in check-integrity.c Use list_for_each_entry() instead of list_for_each() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:38:42 +01:00
Byongho Lee	ee22184b53	Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:38:02 +01:00
David Sterba	7928d672ff	btrfs: cleanup, remove stray return statements Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:30:52 +01:00
Alexandru Moise	352dd9c8d3	btrfs: zero out delayed node upon allocation It's slightly cleaner to zero-out the delayed node upon allocation than to do it by hand in btrfs_init_delayed_node() for a few members Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:30:17 +01:00
Alexandru Moise	575a75d6fa	btrfs: pass proper enum type to start_transaction() Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:30:00 +01:00
Alexandru Moise	9780c4976f	btrfs: switch __btrfs_fs_incompat return type from int to bool Conform to __btrfs_fs_incompat() cast-to-bool (!!) by explicitly returning boolean not int. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:29:20 +01:00
Byongho Lee	e40da0e58a	btrfs: remove unused inode argument from uncompress_inline() The inode argument is never used from the beginning, so remove it. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:29:02 +01:00
David Sterba	100d57025c	btrfs: don't use slab cache for struct btrfs_delalloc_work Although we prefer to use separate caches for various structs, it seems better not to do that for struct btrfs_delalloc_work. Objects of this type are allocated rarely, when transaction commit calls btrfs_start_delalloc_roots, requesting delayed iputs. The objects are temporary (with some IO involved) but still allocated and freed within __start_delalloc_inodes. Memory allocation failure is handled. The slab cache is empty most of the time (observed on several systems), so if we need to allocate a new slab object, the first one has to allocate a full page. In a potential case of low memory conditions this might fail with higher probability compared to using the generic slab caches. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
David Sterba	0de270fa83	btrfs: drop duplicate prefix from scrub workqueues The helper btrfs_alloc_workqueue will add the "btrfs-" prefix. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
David Sterba	93a3d46780	btrfs: verbose error when we find an unexpected item in sys_array Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
David Sterba	f5cdedd73f	btrfs: handle invalid num_stripes in sys_array We can handle the special case of num_stripes == 0 directly inside btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to catch other unhandled cases where we fail to validate external data. A crafted or corrupted image crashes at mount time: BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0 BTRFS info (device loop0): disk space caching is enabled BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()! Kernel panic - not syncing: BUG! CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25 Stack: 637af890 60062489 602aeb2e 604192ba 60387961 00000011 637af8a0 6038a835 637af9c0 6038776b 634ef32b 00000000 Call Trace: [<6001c86d>] show_stack+0xfe/0x15b [<6038a835>] dump_stack+0x2a/0x2c [<6038776b>] panic+0x13e/0x2b3 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff [<601cfbbe>] open_ctree+0x192d/0x27af [<6019c2c1>] btrfs_mount+0x8f5/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<6019bcb0>] btrfs_mount+0x2e4/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<600d710b>] do_mount+0xa35/0xbc9 [<600d7557>] SyS_mount+0x95/0xc8 [<6001e884>] handle_syscall+0x6b/0x8e Reported-by: Jiri Slaby <jslaby@suse.com> Reported-by: Vegard Nossum <vegard.nossum@oracle.com> CC: stable@vger.kernel.org # 3.19+ Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
David Sterba	35b3ad50ba	btrfs: better packing of btrfs_delayed_extent_op btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now and has 8 unused bytes. Reducing the level type to u8 makes it possible to squeeze it to the padding byte after key. The bitfields were switched to bool as there's space to store the full byte without increasing the whole structure, besides that the generated assembly is smaller. struct btrfs_delayed_extent_op { struct btrfs_disk_key key; /* 0 17 / u8 level; / 17 1 / bool update_key; / 18 1 / bool update_flags; / 19 1 / bool is_data; / 20 1 / / XXX 3 bytes hole, try to pack / u64 flags_to_set; / 24 8 / / size: 32, cachelines: 1, members: 6 / / sum members: 29, holes: 1, sum holes: 3 / / last cacheline: 32 bytes */ }; The final size is 32 bytes which gives +26 object per slab page. text data bss dec hex filename 938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before 938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
David Sterba	8089fe62c6	btrfs: put delayed item hook into inode Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode. The inode can be put into the delayed_iputs list more than once and we have to keep the count. This means we can't use the list_splice to process a bunch of inodes because we'd lost track of the count if the inode is put into the delayed iputs again while it's processed. Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
Zhao Lei	c5ca87819d	btrfs: Support convert to -d dup for btrfs-convert Since we will add support for -d dup for non-mixed filesystem, kernel need to support converting to this raid-type. This patch remove limitation of above case. Tested by following script: (combination of dup conversion with fsck): export TEST_DEV='/dev/vdc' export TEST_DIR='/var/ltf/tester/mnt' do_dup_test() { local m_from="$1" local d_from="$2" local m_to="$3" local d_to="$4" echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to" umount "$TEST_DIR" &>/dev/null ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null \|\| return 1 mount "$TEST_DEV" "$TEST_DIR" \|\| return 1 cp -a /sbin/* "$TEST_DIR" [[ "$m_from" != "$m_to" ]] && { ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" \|\| return 1 } [[ "$d_from" != "$d_to" ]] && { local opt=() [[ "$d_to" == single ]] && opt+=("-f") ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" \|\| return 1 } umount "$TEST_DIR" \|\| return 1 ./btrfsck "$TEST_DEV" \|\| return 1 echo return 0 } test_all() { for m_from in single dup; do for d_from in single dup; do for m_to in single dup; do for d_to in single dup; do do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" \|\| return 1 done done done done } test_all Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
Josef Bacik	be7bd73084	Btrfs: igrab inode in writepage We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:58 +01:00
Anand Jain	b2acdddfad	Btrfs: add missing brelse when superblock checksum fails Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:26:53 +01:00
Fan Li	de1475cc53	f2fs: read isize while holding i_mutex in fiemap make sure the isize we read doesn't change during the process. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-01-06 19:15:49 -08:00
Jaegeuk Kim	957efb0c21	Revert "f2fs: check the node block address of newly allocated nid" Original issue is fixed by: f2fs: cover more area with nat_tree_lock This reverts commit `24928634f8`.	2016-01-06 19:15:49 -08:00
Jaegeuk Kim	a51311938e	f2fs: cover more area with nat_tree_lock There was a subtle bug on nat cache management which incurs wrong nid allocation or wrong block addresses when try_to_free_nats is triggered heavily. This patch enlarges the previous coverage of nat_tree_lock to avoid data race. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-01-06 19:15:48 -08:00
Geliang Tang	43584c1d42	jffs2: use to_delayed_work Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Brian Norris <computersforpeace@gmail.com>	2016-01-06 15:16:46 -08:00
Filipe Manana	271dba4521	Btrfs: fix transaction handle leak on failure to create hard link If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2016-01-06 22:52:38 +00:00
Dmitry Monakhov	a1c6f05733	fs: use block_device name vsprintf helper Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-06 13:03:18 -05:00
Dmitry Monakhov	424081f3c8	fs: use gendisk->disk_name where possible gendisk with part==0 is obviously gendisk->disk_name. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-06 12:42:11 -05:00
Mateusz Guzik	ccec5ee302	poll: plug an unused argument to do_poll Number of fds is already known based on passed list. No functional changes. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-06 08:26:52 -05:00
Dave Chinner	4922be51ef	Merge branch 'xfs-dax-fixes-for-4.5' into for-next	2016-01-05 08:08:47 +11:00
Dave Chinner	7eeabbd4b6	Merge branch 'xfs-misc-fixes-for-4.5' into for-next	2016-01-05 08:08:35 +11:00
Brian Foster	609adfc2ed	xfs: debug mode log record crc error injection XFS now uses CRC verification over a limited section of the log to detect torn writes prior to a crash. This is difficult to test directly due to the timing and hardware requirements to cause a short write. Add a mechanism to inject CRC errors into log records to facilitate testing torn write detection during log recovery. This mechanism is dangerous and can result in filesystem corruption. Thus, it is only available in DEBUG mode for testing/development purposes. Set a non-zero value to the following sysfs entry to enable error injection: /sys/fs/xfs/<dev>/log/log_badcrc_factor Once enabled, XFS intentionally writes an invalid CRC to a log record at some random point in the future based on the provided frequency. The filesystem immediately shuts down once the record has been written to the physical log to prevent metadata writeback (e.g., AIL insertion) once the log write completes. This helps reasonably simulate a torn write to the log as the affected record must be safe to discard. The next mount after the intentional shutdown requires log recovery and should detect and recover from the torn write. Note again that this _will_ result in data loss or worse. For testing and development purposes only! Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-05 07:41:16 +11:00
Brian Foster	7088c4136f	xfs: detect and trim torn writes during log recovery Certain types of storage, such as persistent memory, do not provide sector atomicity for writes. This means that if a crash occurs while XFS is writing log records, only part of those records might make it to the storage. This is problematic because log recovery uses the cycle value packed at the top of each log block to locate the head/tail of the log. This can lead to CRC verification failures during log recovery and an unmountable fs for a filesystem that is otherwise consistent. Update log recovery to incorporate log record CRC verification as part of the head/tail discovery process. Once the head is located via the traditional algorithm, run a CRC-only pass over the records up to the head of the log. If CRC verification fails, assume that the records are torn as a matter of policy and trim the head block back to the start of the first bad record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-05 07:40:16 +11:00
Trond Myklebust	942e3d72a6	Merge branch 'pnfs_generic' * pnfs_generic: NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid() NFSv4.1/pNFS: Fix a race in initiate_file_draining() NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn() NFS: Relax requirements in nfs_flush_incompatible NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid NFS: Allow multiple commit requests in flight per file NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs NFSv4: List stateid information in the callback tracepoints NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1 pNFS: If we have to delay the layout callback, mark the layout for return NFSv4.1/pNFS: Add a helper to mark the layout as returned pNFS: Ensure nfs4_layoutget_prepare returns the correct error	2016-01-04 13:19:55 -05:00
Trond Myklebust	506c0d6826	NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 13:07:15 -05:00
Trond Myklebust	e144e5391c	NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:52:53 -05:00
Trond Myklebust	71b39854a5	NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid() Make it more obvious what we're returning... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:41:15 -05:00
Trond Myklebust	4b0934baf9	NFSv4.1/pNFS: Fix a race in initiate_file_draining() Peng Tao points out that the call to pnfs_mark_matching_lsegs_return() could race with pnfs_put_lseg(), in which case the layout segment is cleared, but no layoutreturn will be sent. Fix is to replace the call to pnfs_mark_matching_lsegs_invalid(). Reported-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:36:12 -05:00
Trond Myklebust	10335556c9	NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout Fix a bug whereby if all the layout segments could be immediately freed, the call to pnfs_error_mark_layout_for_return() would never result in a layoutreturn. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:36:11 -05:00
Trond Myklebust	5c97f5de2c	NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode If pnfs_mark_matching_lsegs_return() needs to mark a layout segment for return, then it must also set the return iomode. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:36:11 -05:00
Trond Myklebust	50f563ef5d	NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:36:10 -05:00
Trond Myklebust	ed429d6b93	NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn() A stateid is a structure, pass it as a pointer. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-01-04 12:35:47 -05:00
Al Viro	80f8dccf95	HFS wants 8Kb per-superblock allocation; just use kmalloc() ... rather than play with __get_free_pages() (and figuring out the allocation order, etc.) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:34 -05:00
Al Viro	76e8d7cb71	jfs: microoptimize get_zeroed_page / virt_to_page get_zeroed_page does alloc_page and returns page_address of the result; subsequent virt_to_page will recover the page, but since the caller needs both page and its page_address() anyway, why bother going through that wrapper at all? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:28 -05:00
Al Viro	4e728cf8ff	hpfs: missing endianness annotation Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:29:03 -05:00
Al Viro	62fb4a155f	don't carry MAY_OPEN in op->acc_mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:28:40 -05:00
Al Viro	b40ef8696f	saner calling conventions for copy_mount_options() let it just return NULL, pointer to kernel copy or ERR_PTR(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:28:32 -05:00
Al Viro	bb646cdb12	proc_pid_attr_write(): switch to memdup_user() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:28:00 -05:00
Al Viro	16e5c1fc36	convert a bunch of open-coded instances of memdup_user_nul() A _lot_ of ->write() instances were open-coding it; some are converted to memdup_user_nul(), a lot more remain... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-04 10:26:58 -05:00
Al Viro	7e935c7ca1	Merge branch 'memdup_user_nul' into work.misc	2016-01-04 10:25:34 -05:00
Andrew Gabbasov	bb00c898ad	udf: Check output buffer length when converting name to CS0 If a name contains at least some characters with Unicode values exceeding single byte, the CS0 output should have 2 bytes per character. And if other input characters have single byte Unicode values, then the single input byte is converted to 2 output bytes, and the length of output becomes larger than the length of input. And if the input name is long enough, the output length may exceed the allocated buffer length. All this means that conversion from UTF8 or NLS to CS0 requires checking of output length in order to stop when it exceeds the given output buffer size. [JK: Make code return -ENAMETOOLONG instead of silently truncating the name] CC: stable@vger.kernel.org Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz>	2016-01-04 15:57:49 +01:00
Andrew Gabbasov	ad402b265e	udf: Prevent buffer overrun with multi-byte characters udf_CS0toUTF8 function stops the conversion when the output buffer length reaches UDF_NAME_LEN-2, which is correct maximum name length, but, when checking, it leaves the space for a single byte only, while multi-bytes output characters can take more space, causing buffer overflow. Similar error exists in udf_CS0toNLS function, that restricts the output length to UDF_NAME_LEN, while actual maximum allowed length is UDF_NAME_LEN-2. In these cases the output can override not only the current buffer length field, causing corruption of the name buffer itself, but also following allocation structures, causing kernel crash. Adjust the output length checks in both functions to prevent buffer overruns in case of multi-bytes UTF8 or NLS characters. CC: stable@vger.kernel.org Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz>	2016-01-04 14:30:43 +01:00
Pantelis Antoniou	03607ace80	configfs: implement binary attributes ConfigFS lacked binary attributes up until now. This patch introduces support for binary attributes in a somewhat similar manner of sysfs binary attributes albeit with changes that fit the configfs usage model. Problems that configfs binary attributes fix are everything that requires a binary blob as part of the configuration of a resource, such as bitstream loading for FPGAs, DTBs for dynamically created devices etc. Look at Documentation/filesystems/configfs/configfs.txt for internals and howto use them. This patch is against linux-next as of today that contains Christoph's configfs rework. Signed-off-by: Pantelis Antoniou <pantelis.antoniou@konsulko.com> [hch: folded a fix from Geert Uytterhoeven <geert+renesas@glider.be>] [hch: a few tiny updates based on review feedback] Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-01-04 12:31:46 +01:00
Julia Lawall	d1b98c23f7	quota: constify qtree_fmt_operations structures The qtree_fmt_operations structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Jan Kara <jack@suse.cz>	2016-01-04 10:58:35 +01:00
Arnd Bergmann	4f1b1519f7	udf: avoid uninitialized variable use A new warning has come up from a recent cleanup: fs/udf/inode.c: In function 'udf_setup_indirect_aext': fs/udf/inode.c:1927:28: warning: 'adsize' may be used uninitialized in this function [-Wmaybe-uninitialized] If the alloc_type is neither ICBTAG_FLAG_AD_SHORT nor ICBTAG_FLAG_AD_LONG, the value of adsize is undefined. Currently, callers of these functions make sure alloc_type is one of the two valid ones but for future proofing make sure we handle the case of invalid alloc type as well. This changes the code to return -EIOin that case. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `fcea62babc` ("udf: Factor out code for creating indirect extent") Signed-off-by: Jan Kara <jack@suse.cz>	2016-01-04 10:53:29 +01:00
Chao Yu	e0afc4d6d0	f2fs: introduce max_file_blocks in sbi Introduce max_file_blocks in sbi to store max block index of file in f2fs, it could be used to avoid unneeded calculation of max block index in runtime. Signed-off-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: fix overflow of sbi->max_file_blocks] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-01-03 21:40:04 -08:00
Dave Chinner	a6d7636e8d	xfs: fix recursive splice read locking with DAX Doing a splice read (generic/249) generates a lockdep splat because we recursively lock the inode iolock in this path: SyS_sendfile64 do_sendfile do_splice_direct splice_direct_to_actor do_splice_to xfs_file_splice_read <<<<<< lock here default_file_splice_read vfs_readv do_readv_writev do_iter_readv_writev xfs_file_read_iter <<<<<< then here The issue here is that for DAX inodes we need to avoid the page cache path and hence simply push it into the normal read path. Unfortunately, we can't tell down at xfs_file_read_iter() whether we are being called from the splice path and hence we cannot avoid the locking at this layer. Hence we simply have to drop the inode locking at the higher splice layer for DAX. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:28:25 +11:00
Dave Chinner	3b0fe47805	xfs: Don't use reserved blocks for data blocks with DAX Commit `1ca1915` ("xfs: Don't use unwritten extents for DAX") enabled the DAX allocation call to dip into the reserve pool in case it was converting unwritten extents rather than allocating blocks. This was a direct copy of the unwritten extent conversion code, but had an unintended side effect of allowing normal data block allocation to use the reserve pool. Hence normal block allocation could deplete the reserve pool and prevent unwritten extent conversion at ENOSPC, hence violating fallocate guarantees on preallocated space. Fix it by checking whether the incoming map from __xfs_get_blocks() spans an unwritten extent and only use the reserve pool if the allocation covers an unwritten extent. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:22:45 +11:00
Markus Elfring	a841b64df2	XFS: Use a signed return type for suffix_kstrtoint() The return type "unsigned long" was used by the suffix_kstrtoint() function even though it will eventually return a negative error code. Improve this implementation detail by using the type "int" instead. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:13:21 +11:00
Darrick J. Wong	c5ab131ba0	libxfs: refactor short btree block verification Create xfs_btree_sblock_verify() to verify short-format btree blocks (i.e. the per-AG btrees with 32-bit block pointers) instead of open-coding them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:13:21 +11:00
Darrick J. Wong	96f859d52b	libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct Because struct xfs_agfl is 36 bytes long and has a 64-bit integer inside it, gcc will quietly round the structure size up to the nearest 64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE macro returning incorrect results for v5 filesystems on 64-bit machines (118 items instead of 119). As a result, a 32-bit xfs_repair will see garbage in AGFL item 119 and complain. Therefore, tell gcc not to pad the structure so that the AGFL size calculation is correct. cc: <stable@vger.kernel.org> # 3.10 - 4.4 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:13:21 +11:00
Darrick J. Wong	6d3eb1eca0	libxfs: use a convenience variable instead of open-coding the fork Use a convenience variable instead of open-coding the inode fork. This isn't really needed for now, but will become important when we add the copy-on-write fork later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:12:42 +11:00
Darrick J. Wong	9b434a347c	xfs: fix log ticket type printing Update the log ticket reservation type printing code to reflect all the types of log tickets, to avoid incorrect debug output and avoid running off the end of the array. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:11:42 +11:00
Darrick J. Wong	2e9101da60	libxfs: make xfs_alloc_fix_freelist non-static Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the static designation. xfsprogs already has this; this simply brings the kernel up to date. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:42 +11:00
Alexander Kuleshov	211fe1a4db	xfs: make xfs_buf_ioend_async() static There are no callers of the xfs_buf_ioend_async() function outside of the fs/xfs/xfs_buf.c. So, let's make it static. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:42 +11:00
Masatake YAMATO	ffc671f1ea	xfs: send warning of project quota to userspace via netlink Linux's quota subsystem has an ability to handle project quota. This commit just utilizes the ability from xfs side. dbus-monitor and quota_nld shipped as part of quota-tools can be used for testing. See the patch posting on the XFS list for details on testing. Signed-off-by: Masatake YAMATO <yamato@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:42 +11:00
Eric Sandeen	f1f96c4946	xfs: get mp from bma->ip in xfs_bmap code In my earlier commit `c29aad4` xfs: pass mp to XFS_WANT_CORRUPTED_GOTO I added some local mp variables with code which indicates that mp might be NULL. Coverity doesn't like this now, because the updated per-fs XFS_STATS macros dereference mp. I don't think this is actually a problem; from what I can tell, we cannot get to these functions with a null bma->tp, so my NULL check was probably pointless. Still, it's not super obvious. So switch this code to get mp from the inode on the xfs_bmalloca structure, with no conditional, because the functions are already using bmap->ip directly. Addresses-Coverity-Id: 1339552 Addresses-Coverity-Id: 1339553 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:42 +11:00
Eric Sandeen	233135b763	xfs: print name of verifier if it fails This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:19 +11:00
Jia He	1d4292bfdc	libxfs: Optimize the loop for xfs_bitmap_empty If there is any non zero bit in a long bitmap, it can jump out of the loop and finish the function as soon as possible. Signed-off-by: Jia He <hejianet@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 16:10:19 +11:00
Brian Foster	eed6b462fb	xfs: refactor log record start detection into a new helper As part of the head/tail discovery process, log recovery locates the head block and then reverse seeks to find the start of the last active record in the log. This is non-trivial as the record itself could have wrapped around the end of the physical log. Log recovery torn write detection potentially needs to walk further behind the last record in the log, as multiple log I/Os can be in-flight at one time during a crash event. Therefore, refactor the reverse log record header search mechanism into a new helper that supports the ability to seek past an arbitrary number of log records (or until the tail is hit). Update the head/tail search mechanism to call the new helper, but otherwise there is no change in log recovery behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Brian Foster	6528250b71	xfs: support a crc verification only log record pass Log recovery torn write detection uses CRC verification over a range of the active log to identify torn writes. Since the generic log recovery pass code implements a superset of the functionality required for CRC verification, it can be easily modified to support a CRC verification only pass. Create a new CRC pass type and update the log record processing helper to skip everything beyond CRC verification when in this mode. This pass will be invoked in subsequent patches to implement torn write detection. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Brian Foster	d7f37692e3	xfs: return start block of first bad log record during recovery Each log recovery pass walks from the tail block to the head block and processes records appropriately based on the associated log pass type. There are various failure conditions that can occur through this sequence, such as I/O errors, CRC errors, etc. Log torn write detection will perform CRC verification near the head of the log to detect torn writes and trim torn records from the log appropriately. As it is, xlog_do_recovery_pass() only returns an error code in the event of CRC failure, which isn't enough information to trim the head of the log. Update xlog_do_recovery_pass() to optionally return the start block of the associated record when an error occurs. This patch contains no functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Brian Foster	b94fb2d178	xfs: refactor and open code log record crc check Log record CRC verification currently occurs during active log recovery, immediately before a log record is unpacked. Therefore, the CRC calculation code is buried within the data unpack function. CRC verification pass support only needs to go so far as check the CRC, but this is not easily allowed as the code is currently organized. Since we now have a new log record processing helper, pull the record CRC verification code out from the unpack helper and open-code it at the top of the new process helper. This facilitates the ability to modify how records are processed based on the type of the current pass. This patch contains no functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Brian Foster	9d94901f6e	xfs: refactor log record unpack and data processing xlog_do_recovery_pass() duplicates a couple function calls related to processing log records because the function must handle wrapping around the end of the log if the head is behind the tail. This is implemented as separate loops. CRC verification pass support will modify how records are processed in both of these loops. Rather than continue to duplicate code, factor the calls that process a log record into a new helper and call that helper from both loops. This patch contains no functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Brian Foster	a70f9fe52d	xfs: detect and handle invalid iclog size set by mkfs XFS log records have separate fields for the record size and the iclog size used to write the record. mkfs.xfs zeroes the log and writes an unmount record to generate a clean log for the subsequent mount. The userspace record logging code has a bug where the iclog size (h_size) field of the log record is hardcoded to 32k, even if a log stripe unit is specified. The log record length is correctly extended to the stripe unit. Since the kernel log recovery code uses the h_size field to determine the log buffer size, this means that the kernel can attempt to read/process records larger than the buffer size and overrun the buffer. This has historically not been a problem because the kernel doesn't actually run through log recovery in the clean unmount case. Instead, the kernel detects that a single unmount record exists between the head and tail and pushes the tail forward such that the log is viewed as clean (head == tail). Once CRC verification is enabled, however, all records at the head of the log are verified for CRC errors and thus we are susceptible to overrun problems if the iclog field is not correct. While the core problem must be fixed in userspace, this is historical behavior that must be detected in the kernel to avoid severe side effects such as memory corruption and crashes. Update the log buffer size calculation code to detect this condition, warn the user and resize the log buffer based on the log stripe unit. Return a corruption error in cases where this does not look like a clean filesystem (i.e., the log record header indicates more than one operation). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-01-04 15:55:10 +11:00
Darrick J. Wong	2b3909f8a7	btrfs: use new dedupe data function pointer Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-01 02:36:40 -05:00
Darrick J. Wong	54dbc15172	vfs: hoist the btrfs deduplication ioctl to the vfs Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name more systematic (FIDEDUPERANGE). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-01 02:36:19 -05:00
Darrick J. Wong	d79bdd52d8	vfs: wire up compat ioctl for CLONE/CLONE_RANGE Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-01-01 02:36:02 -05:00
Chao Yu	3a9e6433a3	f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink Add missed CONFIG_F2FS_FS_XATTR for encrypted symlink inode in order to avoid unneeded registry of ->{get,set,remove}xattr. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 18:54:50 -08:00
Jaegeuk Kim	137d09f002	f2fs: introduce zombie list for fast shrinking extent trees This patch removes refcount, and instead, adds zombie_list to shrink directly without radix tree traverse. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 15:39:22 -08:00
Jaegeuk Kim	c00ba55485	f2fs: monitor zombie_tree count This patch adds an entry to show the number of zombie extent_tree. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 15:33:00 -08:00
David S. Miller	c07f30ad68	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2015-12-31 18:20:10 -05:00
Jaegeuk Kim	c46a155bdf	f2fs: use IPU for fdatasync This patch fixes missing IPU condition when fdatasync is called. With this patch, fdatasync is able to avoid additional node writes for recovery. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 13:49:17 -08:00
Jaegeuk Kim	8d4ea29b64	f2fs: write pending bios when cp_error is set When testing ioc_shutdown, put_super is able to be hanged by waiting for writebacking pages as follows. INFO: task umount:2723 blocked for more than 120 seconds. Tainted: G O 4.4.0-rc3+ #8 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D ffff88000859f9d8 0 2723 2110 0x00000000 ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540 ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c Call Trace: [<ffffffff818239f0>] ? bit_wait+0x50/0x50 [<ffffffff8182310c>] schedule+0x3c/0x90 [<ffffffff81827fb9>] schedule_timeout+0x2d9/0x430 [<ffffffff810e0f8f>] ? mark_held_locks+0x6f/0xa0 [<ffffffff8111614d>] ? ktime_get+0x7d/0x140 [<ffffffff818239f0>] ? bit_wait+0x50/0x50 [<ffffffff8106a655>] ? kvm_clock_get_cycles+0x25/0x30 [<ffffffff8111617c>] ? ktime_get+0xac/0x140 [<ffffffff818239f0>] ? bit_wait+0x50/0x50 [<ffffffff81822564>] io_schedule_timeout+0xa4/0x110 [<ffffffff81823a25>] bit_wait_io+0x35/0x50 [<ffffffff818235bd>] __wait_on_bit+0x5d/0x90 [<ffffffff811b9e8b>] wait_on_page_bit+0xcb/0xf0 [<ffffffff810d5f90>] ? autoremove_wake_function+0x40/0x40 [<ffffffff811cf84c>] truncate_inode_pages_range+0x4bc/0x840 [<ffffffff811cfc3d>] truncate_inode_pages_final+0x4d/0x60 [<ffffffffc023ced5>] f2fs_evict_inode+0x75/0x400 [f2fs] [<ffffffff812639bc>] evict+0xbc/0x190 [<ffffffff81263d19>] iput+0x229/0x2c0 [<ffffffffc0241885>] f2fs_put_super+0x105/0x1a0 [f2fs] [<ffffffff8124756a>] generic_shutdown_super+0x6a/0xf0 [<ffffffff812478f7>] kill_block_super+0x27/0x70 [<ffffffffc0241290>] kill_f2fs_super+0x20/0x30 [f2fs] [<ffffffff81247b03>] deactivate_locked_super+0x43/0x70 [<ffffffff81247f4c>] deactivate_super+0x5c/0x60 [<ffffffff81268d2f>] cleanup_mnt+0x3f/0x90 [<ffffffff81268dc2>] __cleanup_mnt+0x12/0x20 [<ffffffff810ac463>] task_work_run+0x73/0xa0 [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0 [<ffffffff81003e7c>] syscall_return_slowpath+0xcc/0xe0 [<ffffffff81829ea2>] int_ret_from_sys_call+0x25/0x9f Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 13:08:02 -08:00
Trond Myklebust	138a2935dc	NFS: Relax requirements in nfs_flush_incompatible If two processes share the same credentials and NFSv4 open stateid, then allow them both to dirty the same page, even if their nfs_open_context differs. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-31 15:55:35 -05:00
Trond Myklebust	b20135d0b2	NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid If the layout segment is invalid, then we should not be adding more write requests to the commit list. Instead, those writes should be replayed after requesting a new layout. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-31 15:55:35 -05:00
Jaegeuk Kim	1f6fa26199	f2fs: remove f2fs_bug_on in terms of max_depth There is no report on this bug_on case, but if malicious attacker changed this field intentionally, we can just reset it as a MAX value. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-31 11:42:46 -08:00
Trond Myklebust	af7cf05793	NFS: Allow multiple commit requests in flight per file Allow synchronous RPC calls to wait for pending RPC calls to finish, but also allow asynchronous ones to just fire off another commit. With this patch, the xfstests generic/074 test completes in 226s instead of 242s Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-31 13:53:48 -05:00
Trond Myklebust	dc602dd706	NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs The flexfiles layout in particular, seems to want to poke around in the O_DIRECT flags when retransmitting. This patch sets up an interface to allow it to call back into O_DIRECT to handle retransmission correctly. It also fixes a potential bug whereby we could change the behaviour of O_DIRECT if an error is already pending. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-31 13:50:42 -05:00
Filipe Manana	9269d12b2d	Btrfs: fix number of transaction units required to create symlink We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-31 18:18:40 +00:00
Filipe Manana	d50866d00f	Btrfs: don't leave dangling dentry if symlink creation failed When we are creating a symlink we might fail with an error after we created its inode and added the corresponding directory indexes to its parent inode. In this case we end up never removing the directory indexes because the inode eviction handler, called for our symlink inode on the final iput(), only removes items associated with the symlink inode and not with the parent inode. Example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt $ touch /mnt/foo $ ln -s /mnt/foo /mnt/bar ln: failed to create symbolic link ‘bar’: Cannot allocate memory $ umount /mnt $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1 checking extents checking free space cache checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref found 131073 bytes used err is 1 total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124305 file data blocks allocated: 262144 referenced 262144 btrfs-progs v4.2.3 So fix this by adding the directory index entries as the very last step of symlink creation. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-31 18:10:56 +00:00
Filipe Manana	a879719b8c	Btrfs: send, don't BUG_ON() when an empty symlink is found When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the later can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-31 18:08:20 +00:00
Jaegeuk Kim	732d56489f	f2fs: fix f2fs_ioc_abort_volatile_write There are two rules to handle aborting volatile or atomic writes. 1. drop atomic writes - we don't need to keep any stale db data. 2. write journal data - we should keep the journal data with fsync for db recovery. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:53:25 -08:00
Chao Yu	4e0d836d5f	f2fs: fix to skip recovering dot dentries in a readonly fs If filesystem is readonly, leave user message info instead of recovering inline dot inode. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:51:28 -08:00
Jaegeuk Kim	ed3d12561a	f2fs: load largest extent all the time Otherwise, we can get mismatched largest extent information. One example is: 1. mount f2fs w/ extent_cache 2. make a small extent 3. umount 4. mount f2fs w/o extent_cache 5. update the largest extent 6. umount 7. mount f2fs w/ extent_cache 8. get the old extent made by #2 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:20 -08:00
Jaegeuk Kim	819d9153d4	f2fs: use i_size_read to get i_size We need to use i_size_read() to get inode->i_size. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:19 -08:00
Jaegeuk Kim	8dc0d6a11e	f2fs: early check broken symlink length in the encrypted case If link is broken, its len is zero, and we don't need to move forward. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:19 -08:00
Chao Yu	e96248bb45	f2fs: clean up f2fs_ioc_write_checkpoint Use f2fs_sync_fs to clean up codes in f2fs_ioc_write_checkpoint. Signed-off-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: remove unused err variable] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:18 -08:00
Yunlei He	179448bfe4	f2fs: add a max block check for get_data_block_bmap This patch adds a max block check for get_data_block_bmap. Trinity test program will send a block number as parameter into ioctl_fibmap, which will be used in get_node_path(), when the block number large than f2fs max blocks, it will trigger kernel bug. Signed-off-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com> [Jaegeuk Kim: fix missing condition, pointed by Chao Yu] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:17 -08:00
Fan Li	9a950d52b7	f2fs: fix bugs and simplify codes of f2fs_fiemap fix bugs: 1. len could be updated incorrectly when start+len is beyond isize. 2. If there is a hole consisting of more than two blocks, it could fail to add FIEMAP_EXTENT_LAST flag for the last extent. 3. If there is an extent beyond isize, when we search extents in a range that ends at isize, it will also return the extent beyond isize, which is outside the range. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:16 -08:00
Chao Yu	6d5a1495ee	f2fs: let user being aware of IO error Sometimes we keep dumb when IO error occur in lower layer device, so user will not receive any error return value for some operation, but actually, the operation did not succeed. This sould be avoided, so this patch reports such kind of error to user. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:15 -08:00
Chao Yu	d53841740f	f2fs: add missing f2fs_balance_fs in __recover_dot_dentries __recover_do_dentries will try to grab free space in storage, so fix to add missing f2fs_balance_fs here. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:14 -08:00
Jaegeuk Kim	06d6f22639	f2fs: declare static function The __f2fs_commit_super is static. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:14 -08:00
Jaegeuk Kim	b4d07a3e1a	f2fs: avoid f2fs_lock_op in f2fs_write_begin If f2fs_write_begin is to update data, we can bypass calling f2fs_lock_op() in order to avoid the checkpoint latency in the write syscall. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:13 -08:00
Jaegeuk Kim	4aa69d5667	f2fs: return early when trying to read null nid If get_node_page() gets zero nid, we can return early without getting a wrong page. For example, get_dnode_of_data() can try to do that. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:12 -08:00
Jaegeuk Kim	2aadac085c	f2fs: introduce prepare_write_begin to clean up This patch adds prepare_write_begin to clean f2fs_write_begin. The major role of this function is to convert any inline_data and allocate or find block address. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:11 -08:00
Chao Yu	fba48a8b14	f2fs: don't convert inline inode when inline_data option is disable If inline_data option is disable, when truncating an inline inode with size which is not exceed maxinum inline size, we should not convert inline inode to regular one to avoid the overhead of synchronizing conversion. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:10 -08:00
Chao Yu	c34f42e2cb	f2fs: report error of do_checkpoint do_checkpoint and write_checkpoint can fail due to reasons like triggering in a readonly fs or encountering IO error of storage device. So it's better to report such error info to user, let user be aware of failure of doing checkpoint. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:09 -08:00
Jaegeuk Kim	2a34076070	f2fs: call f2fs_balance_fs only when node was changed If user tries to update or read data, we don't need to call f2fs_balance_fs which triggers f2fs_gc, which increases unnecessary long latency. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:09 -08:00
Chao Yu	3104af35eb	f2fs: reduce covered region of sbi->cp_rwsem in f2fs_map_blocks Only cover sbi->cp_rwsem on one dnode page's allocation and modification instead of multiple's in f2fs_map_blocks, it can reduce the covered region of cp_rwsem, then we can avoid potential long time delay for concurrent checkpointer. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:08 -08:00
Jaegeuk Kim	93bae099ea	f2fs: record node block allocation in dnode_of_data This patch introduces recording node block allocation in dnode_of_data. This information helps to figure out whether any node block is allocated during specific file operations. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:07 -08:00
Jaegeuk Kim	00623e6bcf	f2fs: avoid unnecessary f2fs_gc for dir operations The f2fs_balance_fs doesn't need to cover f2fs_new_inode or f2fs_find_entry works. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:06 -08:00
Jaegeuk Kim	b9d777b85f	f2fs: check inline_data flag at converting time We can check inode's inline_data flag when calling to convert it. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:14:05 -08:00
Jaegeuk Kim	74fd8d9927	f2fs: speed up shrinking extent tree entries If there is no candidates for shrinking slab entries, we don't need to traverse any trees at all. Reviewed-by: Chao Yu <chao2.yu@samsung.com> [Jaegeuk Kim: fix missing initialization reported by Yunlei He] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-30 10:13:00 -08:00
Al Viro	fceef393a5	switch ->get_link() to delayed_call, kill ->put_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-30 13:01:03 -05:00
Filipe Manana	2bc0bb5fe7	Btrfs: fix race between free space endio workers and space cache writeout While running a stress test I ran into the following trace/transaction abort: [471626.672243] ------------[ cut here ]------------ [471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [471626.675492] BTRFS: Transaction aborted (error -2) [471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix [471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [471626.691901] 0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38 [471626.695009] ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe [471626.697490] ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90 [471626.699201] Call Trace: [471626.699804] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [471626.701049] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [471626.702542] [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [471626.704326] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [471626.705636] [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [471626.707048] [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [471626.708616] [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs] [471626.709950] [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs] [471626.711286] [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31 [471626.712611] [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [471626.715610] [<ffffffff811962a2>] ? SyS_tee+0x226/0x226 [471626.716718] [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22 [471626.717672] [<ffffffff8116fc01>] iterate_supers+0x75/0xc2 [471626.718800] [<ffffffff8119669a>] sys_sync+0x52/0x80 [471626.719990] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [471626.721835] ---[ end trace baf57f43d76693f4 ]--- [471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry This is a very rare situation and it happened due to a race between a free space endio worker and writing the space caches for dirty block groups at a transaction's commit critical section. The steps leading to this are: 1) A task calls btrfs_commit_transaction() and starts the writeout of the space caches for all currently dirty block groups (i.e. it calls btrfs_start_dirty_block_groups()); 2) The previous step starts writeback for space caches; 3) When the writeback finishes it queues jobs for free space endio work queue (fs_info->endio_freespace_worker) that execute btrfs_finish_ordered_io(); 4) The task committing the transaction sets the transaction's state to TRANS_STATE_COMMIT_DOING and shortly after calls btrfs_write_dirty_block_groups(); 5) A free space endio job joins the transaction, through btrfs_join_transaction_nolock(), and updates a free space inode item in the root tree through btrfs_update_inode_fallback(); 6) Updating the free space inode item resulted in COWing one or more nodes/leaves of the root tree, and that resulted in creating a new metadata block group, which gets added to the transaction's list of dirty block groups (this is a very rare case); 7) The free space endio job has not released yet its transaction handle at this point, so the new metadata block group was not yet fully created (didn't go through btrfs_create_pending_block_groups() yet); 8) The transaction commit task sees the new metadata block group in the transaction's list of dirty block groups and processes it. When it attempts to update the block group's block group item in the extent tree, through write_one_cache_group(), it isn't able to find it and aborts the transaction with error -ENOENT - this is because the free space endio job hasn't yet released its transaction handle (which calls btrfs_create_pending_block_groups()) and therefore the block group item was not yet added to the extent tree. Fix this waiting for free space endio jobs if we fail to find a block group item in the extent tree and then retry once updating the block group item. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-30 16:08:13 +00:00
Trond Myklebust	86fb449b07	pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh() Jeff reports seeing an Oops in ff_layout_alloc_lseg. Turns out copy+paste has played cruel tricks on a nested loop. Reported-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-30 10:57:01 -05:00
Chris Mason	511711af91	btrfs: don't run delayed references while we are creating the free space tree This is a short term solution to make sure btrfs_run_delayed_refs() doesn't change the extent tree while we are scanning it to create the free space tree. Longer term we need to synchronize scanning the block groups one by one, similar to what happens during a balance. Signed-off-by: Chris Mason <clm@fb.com>	2015-12-30 07:52:35 -08:00
Chris Mason	b4570aa994	btrfs: fix compiling with CONFIG_BTRFS_DEBUG enabled. Merging in the free space tree deleted a variable needed when CONFIG_BTRFS_DEBUG=y Signed-off-by: Chris Mason <clm@fb.com>	2015-12-30 07:37:26 -08:00
Trond Myklebust	ade14a7df7	NFS: Fix attribute cache revalidation If a NFSv4 client uses the cache_consistency_bitmask in order to request only information about the change attribute, timestamps and size, then it has not revalidated all attributes, and hence the attribute timeout timestamp should not be updated. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-30 10:18:26 -05:00
xuejiufei	cc28d6d80f	ocfs2/dlm: clear migration_pending when migration target goes down We have found a BUG on res->migration_pending when migrating lock resources. The situation is as follows. dlm_mark_lockres_migration res->migration_pending = 1; __dlm_lockres_reserve_ast dlm_lockres_release_ast returns with res->migration_pending remains because other threads reserve asts wait dlm_migration_can_proceed returns 1 >>>>>>> o2hb found that target goes down and remove target from domain_map dlm_migration_can_proceed returns 1 dlm_mark_lockres_migrating returns -ESHOTDOWN with res->migration_pending still remains. When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON with res->migration_pending. So clear migration_pending when target is down. Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-29 17:45:49 -08:00
Junxiao Bi	b5a8bc338e	ocfs2: fix flock panic issue Commit `4f6563677a` ("Move locks API users to locks_lock_inode_wait()") move flock/posix lock indentify code to locks_lock_inode_wait(), but missed to set fl_flags to FL_FLOCK which caused the following kernel panic on 4.4.0_rc5. kernel BUG at fs/locks.c:1895! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2(O) ocfs2_dlmfs(O) ocfs2_stack_o2cb(O) ocfs2_dlm(O) ocfs2_nodemanager(O) ocfs2_stackglue(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront CPU: 0 PID: 20268 Comm: flock_unit_test Tainted: G O 4.4.0-rc5-next-20151217 #1 Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014 task: ffff88007b3672c0 ti: ffff880028b58000 task.ti: ffff880028b58000 RIP: locks_lock_inode_wait+0x2e/0x160 Call Trace: ocfs2_do_flock+0x91/0x160 [ocfs2] ocfs2_flock+0x76/0xd0 [ocfs2] SyS_flock+0x10f/0x1a0 entry_SYSCALL_64_fastpath+0x12/0x71 Code: e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 89 f3 48 81 ec 88 00 00 00 8b 46 40 83 e0 03 83 f8 01 0f 84 ad 00 00 00 83 f8 02 74 04 <0f> 0b eb fe 4c 8d ad 60 ff ff ff 4c 8d 7b 58 e8 0e 8e 73 00 4d RIP locks_lock_inode_wait+0x2e/0x160 RSP <ffff880028b5bce8> ---[ end trace dfca74ec9b5b274c ]--- Fixes: `4f6563677a` ("Move locks API users to locks_lock_inode_wait()") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-29 17:45:49 -08:00
Joseph Qi	5c9ee4cbf2	ocfs2: fix BUG when calculate new backup super When resizing, it firstly extends the last gd. Once it should backup super in the gd, it calculates new backup super and update the corresponding value. But it currently doesn't consider the situation that the backup super is already done. And in this case, it still sets the bit in gd bitmap and then decrease from bg_free_bits_count, which leads to a corrupted gd and trigger the BUG in ocfs2_block_group_set_bits: BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits); So check whether the backup super is done and then do the updates. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-29 17:45:49 -08:00
Al Viro	cd3417c8fc	kill free_page_put_link() all callers are better off with kfree_put_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-29 16:03:53 -05:00
Trond Myklebust	5c5fc09a11	NFS: Ensure we revalidate attributes before using execute_ok() Donald Buczek reports that NFS clients can also report incorrect results for access() due to lack of revalidation of attributes before calling execute_ok(). Looking closely, it seems chdir() is afflicted with the same problem. Fix is to ensure we call nfs_revalidate_inode_rcu() or nfs_revalidate_inode() as appropriate before deciding to trust execute_ok(). Reported-by: Donald Buczek <buczek@molgen.mpg.de> Link: http://lkml.kernel.org/r/1451331530-3748-1-git-send-email-buczek@molgen.mpg.de Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 19:37:05 -05:00
Trond Myklebust	58baac0ac7	Merge branch 'flexfiles' * flexfiles: pNFS/flexfiles: Ensure we record layoutstats even if RPC is terminated early pNFS: Add flag to track if we've called nfs4_ff_layout_stat_io_start_read/write pNFS/flexfiles: Fix a statistics gathering imbalance pNFS/flexfiles: Don't mark the entire layout as failed, when returning it pNFS/flexfiles: Don't prevent flexfiles client from retrying LAYOUTGET pnfs/flexfiles: count io stat in rpc_count_stats callback pnfs/flexfiles: do not mark delay-like status as DS failure NFS41: map NFS4ERR_LAYOUTUNAVAILABLE to ENODATA nfs: only remove page from mapping if launder_page fails nfs: handle request add failure properly nfs: centralize pgio error cleanup nfs: clean up rest of reqs when failing to add one NFS41: pop some layoutget errors to application pNFS/flexfiles: Support server-supplied layoutstats sampling period	2015-12-28 14:49:41 -05:00
Trond Myklebust	e07db907eb	NFSv4: List stateid information in the callback tracepoints The stateid is extremely valuable when debugging. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:05 -05:00
Trond Myklebust	e0d9243048	NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL If the client is promising to return the layout ASAP, then there is no need to return DELAY and have the server retry. Instead default to the normal procedure described in RFC5661. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:05 -05:00
Trond Myklebust	41c9127d6d	NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1 The RFC requires us to check if the server is recalling a stateid that we haven't yet received. If so, tell it to wait. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:04 -05:00
Trond Myklebust	fc7ff36747	pNFS: If we have to delay the layout callback, mark the layout for return If the client needs to delay the layout callback, then speed up the recall process by marking the remaining layout segments to be actively returned by the client. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:04 -05:00
Trond Myklebust	0654cc726f	NFSv4.1/pNFS: Add a helper to mark the layout as returned This ensures that we don't reuse the stateid if a layout return or implied layout return means that we've returned all layout segments Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:04 -05:00
Trond Myklebust	ab7d763e47	pNFS: Ensure nfs4_layoutget_prepare returns the correct error If we're unable to perform the layoutget due to an invalid open stateid or a bulk recall, ensure that we return the error so that the caller can decide on an appropriate action. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:33:03 -05:00
Trond Myklebust	4d0ac22109	pNFS/flexfiles: Ensure we record layoutstats even if RPC is terminated early Currently, we will only record the layoutstats correctly if the RPC call successfully obtains a slot. If we exit before that happens, then we may find ourselves starting the busy timer through the call in ff_layout_(read\|write)_prepare_layoutstats, but never stopping it. The same thing happens if we're doing DA-DS. The fix is to ensure that we catch these cases in the rpc_release() callback. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:41 -05:00
Trond Myklebust	37e9ed22b1	pNFS: Add flag to track if we've called nfs4_ff_layout_stat_io_start_read/write Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:41 -05:00
Trond Myklebust	7eeea16797	pNFS/flexfiles: Fix a statistics gathering imbalance When we replay a failed read, write or commit to the dataserver, we need to ensure that we call ff_layout_read_prepare_v3(), ff_layout_write_prepare_v3 or ff_layout_commit_prepare_v3() so that we reset the statistics. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:40 -05:00
Trond Myklebust	b9fc773ef5	pNFS/flexfiles: Don't mark the entire layout as failed, when returning it In pNFS/flexfiles, we want to return the layout without necessarily marking it as having completely failed. We therefore move the call to pnfs_layout_io_set_failed() out of pnfs_error_mark_layout_for_return(), and then ensura that pNFS/files layout calls it separately. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:40 -05:00
Trond Myklebust	2e5b29f044	pNFS/flexfiles: Don't prevent flexfiles client from retrying LAYOUTGET Fix a bug in which flexfiles clients are falling back to I/O through the MDS even when the FF_FLAGS_NO_IO_THRU_MDS flag is set. The flexfiles client will always report errors through the LAYOUTRETURN and/or LAYOUTERROR mechanisms, so it should normally be safe for it to retry the LAYOUTGET until it fails or succeeds. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:40 -05:00
Peng Tao	141b9b59ed	pnfs/flexfiles: count io stat in rpc_count_stats callback If client ever restarts IO due to some errors, we'll endup mis-counting IO stats if we do the counting in .rpc_done callback. Move it to .rpc_count_stats callback that is only called when releasing RPC. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:39 -05:00
Peng Tao	c22eeb8697	pnfs/flexfiles: do not mark delay-like status as DS failure We just need to delay and retry in these cases. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:39 -05:00
Peng Tao	7c1e6e58e2	NFS41: map NFS4ERR_LAYOUTUNAVAILABLE to ENODATA Instead of mapping it to EIO that is a fatal error and fails application. We'll go inband after getting NFS4ERR_LAYOUTUNAVAILABLE. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:38 -05:00
Peng Tao	d6c843b96e	nfs: only remove page from mapping if launder_page fails Instead of dropping pages when write fails, only do it when we get fatal failure in launder_page write back. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:38 -05:00
Peng Tao	0bcbf039f6	nfs: handle request add failure properly When we fail to queue a read page to IO descriptor, we need to clean it up otherwise it is hanging around preventing nfs module from being removed. When we fail to queue a write page to IO descriptor, we need to clean it up and also save the failure status to open context. Then at file close, we can try to write pages back again and drop the page if it fails to writeback in .launder_page, which will be done in the next patch. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:37 -05:00
Peng Tao	2bff228857	nfs: centralize pgio error cleanup In case we fail during setting things up for read/write IO, set pg_error in IO descriptor and do the cleanup in nfs_pageio_add_request, where we clean up all pages that are still hanging around on the IO descriptor. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:37 -05:00
Peng Tao	c18b96a1b8	nfs: clean up rest of reqs when failing to add one If we fail to set up things before sending anything over wire, we need to clean up the reqs that are still attached to the IO descriptor. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:37 -05:00
Peng Tao	d600ad1f2b	NFS41: pop some layoutget errors to application For ERESTARTSYS/EIO/EROFS/ENOSPC/E2BIG in layoutget, we should just bail out instead of hiding the error and retrying inband IO. Change all the call sites to pop the error all the way up. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:36 -05:00
Trond Myklebust	d0379a5d06	pNFS/flexfiles: Support server-supplied layoutstats sampling period Some servers want to be able to control the frequency with which clients report layoutstats, for instance, in order to monitor QoS for a particular file or set of file. In order to support this, the flexfiles layout allows the server to pass this info as a hint in the layout payload. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 14:32:36 -05:00
Trond Myklebust	494f74a26c	NFS: Flush reclaim writes using FLUSH_COND_STABLE If there are already writes queued up for commit, then don't flush just this page even if it is a reclaim issue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 13:34:59 -05:00
Trond Myklebust	b0ac1bd2bb	NFS: Background flush should not be low priority Background flush is needed in order to satisfy the global page limits. Don't subvert by reducing the priority. This should also address a write starvation issue that was reported by Neil Brown. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 13:30:42 -05:00
Trond Myklebust	1a093ceb05	NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn Since commit `2d8ae84fbc`, nothing is bumping lo->plh_block_lgets in the layoutreturn path, so it should not be touched in nfs4_layoutreturn_release either. Fixes: `2d8ae84fbc` ("NFSv4.1/pnfs: Remove redundant lo->plh_block_lgets...") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 13:23:36 -05:00
Trond Myklebust	762674f86d	NFSv4: Don't perform cached access checks before we've OPENed the file Donald Buczek reports that a nfs4 client incorrectly denies execute access based on outdated file mode (missing 'x' bit). After the mode on the server is 'fixed' (chmod +x) further execution attempts continue to fail, because the nfs ACCESS call updates the access parameter but not the mode parameter or the mode in the inode. The root cause is ultimately that the VFS is calling may_open() before the NFS client has a chance to OPEN the file and hence revalidate the access and attribute caches. Al Viro suggests: >>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know >>> that things will be caught by server anyway? >> >> That can work as long as we're guaranteed that everything that calls >> inode_permission() with MAY_OPEN on a regular file will also follow up >> with a vfs_open() or dentry_open() on success. Is this always the >> case? > > 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since > it doesn't have ->tmpfile() instance anyway) > > 2) in atomic_open(), after the call of ->atomic_open() has succeeded. > > 3) in do_last(), followed on success by vfs_open() > > That's all. All calls of inode_permission() that get MAY_OPEN come from > may_open(), and there's no other callers of that puppy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771 Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 13:22:20 -05:00
Andrew Elble	99ade3c71b	nfs: machine credential support for additional operations Allow LAYOUTRETURN and DELEGRETURN to use machine credentials if the server supports it. Add request for OPEN_DOWNGRADE as the close path also uses that. Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:15 -05:00
Wei Tang	ed47675249	nfs: do not initialise statics to 0 This patch fixes the checkpatch.pl error to nfs4sysctl.c: ERROR: do not initialise statics to 0 Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:15 -05:00
Trond Myklebust	f4848303ce	pNFS: Modify pnfs_update_layout tracepoints to use layout stateid Instead of displaying a layout segment pointer in these tracepoints, let's use the layout stateid, now that Olga gave us a set of tools for displaying them. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:14 -05:00
Trond Myklebust	f2dd436edb	NFSv4: Fix unused variable warnings in nfs4_init_*_client_string() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:14 -05:00
Jeff Layton	9a4bf31d05	nfs: add new tracepoint for pnfs_update_layout pnfs_update_layout is really the "nexus" of layout handling. If it returns NULL then we end up going through the MDS. This patch adds some tracepoints to that function that allow us to determine the cause when we end up going through the MDS unexpectedly. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:14 -05:00
Olga Kornievskaia	9759b0fb1d	Adding tracepoint to cached open Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:14 -05:00
Olga Kornievskaia	48c9579a1a	Adding stateid information to tracepoints Operations to which stateid information is added: close, delegreturn, open, read, setattr, layoutget, layoutcommit, test_stateid, write, lock, locku, lockt Format is "stateid=<seqid>:<crc32 hash stateid.other>", also "openstateid=", "layoutstateid=", and "lockstateid=" for open_file, layoutget, set_lock tracepoints. New function is added to internal.h, nfs_stateid_hash(), to compute the hash trace_nfs4_setattr() is moved from nfs4_do_setattr() to _nfs4_do_setattr() to get access to stateid. trace_nfs4_setattr and trace_nfs4_delegreturn are changed from INODE_EVENT to new event type, INODE_STATEID_EVENT which is same as INODE_EVENT but adds stateid information for locking tracepoints, moved trace_nfs4_set_lock() into _nfs4_do_setlk() to get access to stateid information, and removed trace_nfs4_lock_reclaim(), trace_nfs4_lock_expired() as they call into _nfs4_do_setlk() and both were previously same LOCK_EVENT type. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:14 -05:00
Trond Myklebust	95864c9154	NFS: Allow the combination pNFS and labeled NFS Fix the nfs4_pnfs_open_bitmap so that it also allows for labeled NFS. Signed-off-by: Trond Myklebust <trond,myklebust@primarydata.com>	2015-12-28 09:57:08 -05:00
Peng Tao	68d264cf02	NFS42: handle layoutstats stateid error When server returns layoutstats stateid error, we should invalidate client's layout so that next IO can trigger new layoutget. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:08 -05:00
Andrew Elble	361cad3c89	nfs: Fix race in __update_open_stateid() We've seen this in a packet capture - I've intermixed what I think was going on. The fix here is to grab the so_lock sooner. 1964379 -> #1 open (for write) reply seqid=1 1964393 -> #2 open (for read) reply seqid=2 __nfs4_close(), state->n_wronly-- nfs4_state_set_mode_locked(), changes state->state = [R] state->flags is [RW] state->state is [R], state->n_wronly == 0, state->n_rdonly == 1 1964398 -> #3 open (for write) call -> because close is already running 1964399 -> downgrade (to read) call seqid=2 (close of #1) 1964402 -> #3 open (for write) reply seqid=3 __update_open_stateid() nfs_set_open_stateid_locked(), changes state->flags state->flags is [RW] state->state is [R], state->n_wronly == 0, state->n_rdonly == 1 new sequence number is exposed now via nfs4_stateid_copy() next step would be update_open_stateflags(), pending so_lock 1964403 -> downgrade reply seqid=2, fails with OLD_STATEID (close of #1) nfs4_close_prepare() gets so_lock and recalcs flags -> send close 1964405 -> downgrade (to read) call seqid=3 (close of #1 retry) __update_open_stateid() gets so_lock * update_open_stateflags() updates state->n_wronly. nfs4_state_set_mode_locked() updates state->state state->flags is [RW] state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1 * should have suppressed the preceding nfs4_close_prepare() from sending open_downgrade 1964406 -> write call 1964408 -> downgrade (to read) reply seqid=4 (close of #1 retry) nfs_clear_open_stateid_locked() state->flags is [R] state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1 1964409 -> write reply (fails, openmode) Signed-off-by: Andrew Elble <aweits@rit.edu> Cc: stable@vger,kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:08 -05:00
Andrew Elble	52618df95d	nfs: fix missing assignment in nfs4_sequence_done tracepoint status_flags not set Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-12-28 09:57:08 -05:00
Andreas Gruenbacher	f39814f60a	gfs2: Invalid security labels of inodes when they go invalid When gfs2 releases the glock of an inode, it must invalidate all information cached for that inode, including the page cache and acls. Use the new security_inode_invalidate_secctx hook to also invalidate security labels in that case. These items will be reread from disk when needed after reacquiring the glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com [PM: fixed spelling errors and description line lengths] Signed-off-by: Paul Moore <pmoore@redhat.com>	2015-12-24 11:09:40 -05:00
Chris Mason	140e639f1a	btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc map->num_stripes really can't be zero, but just in case. Signed-off-by: Chris Mason <clm@fb.com>	2015-12-23 13:30:51 -08:00
Chris Mason	f0f76413d3	Merge branch 'freespace-4.5' into for-linus-4.5	2015-12-23 13:29:09 -08:00
Chris Mason	a53fe25769	Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5	2015-12-23 13:28:35 -08:00
Chris Mason	bb9d687618	Merge branch 'dev/simplify-set-bit' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>	2015-12-23 13:17:42 -08:00
Chris Mason	13d5d15d63	Merge branch 'dev/gfp-flags' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5	2015-12-23 13:11:27 -08:00
Chris Mason	afa427cf9d	Merge branch 'cleanup/misc-simplify' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5	2015-12-23 13:10:26 -08:00
Jan Kara	6c37157874	udf: Fix lost indirect extent block When inode ends with empty indirect extent block and we extended that file, udf_do_extend_file() ended up just overwriting pointer to it with another extent and thus effectively leaking the block and also corruptiong length of allocation descriptors. Fix the problem by properly following into next indirect extent when it is present. Signed-off-by: Jan Kara <jack@suse.cz>	2015-12-23 18:05:03 +01:00
Jan Kara	fcea62babc	udf: Factor out code for creating indirect extent Factor out code for creating indirect extent from udf_add_aext(). It was mostly duplicated in two places. Also remove some opencoded versions of udf_write_aext(). Signed-off-by: Jan Kara <jack@suse.cz>	2015-12-23 18:04:52 +01:00
Al Viro	b25472f9b9	new helpers: no_seek_end_llseek{,_size}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-12-23 10:41:31 -05:00
Scott Mayhew	0751ddf77b	lockd: Register callbacks on the inetaddr_chain and inet6addr_chain Register callbacks on inetaddr_chain and inet6addr_chain to trigger cleanup of lockd transport sockets when an ip address is deleted. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-12-23 10:08:16 -05:00
Scott Mayhew	366849966f	nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain Register callbacks on inetaddr_chain and inet6addr_chain to trigger cleanup of nfsd transport sockets when an ip address is deleted. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-12-23 10:08:15 -05:00
J. Bruce Fields	d4f72cb7fa	nfsd: don't base cl_cb_status on stale information The rpc client we use to send callbacks may change occasionally. (In the 4.0 case, the client can use setclientid/setclientid_confirm to update the callback parameters. In the 4.1+ case, sessions and connections can come and go.) The update is done from the callback thread by nfsd4_process_cb_update, which shuts down the old rpc client and then creates a new one. The client shutdown kills any ongoing rpc calls. There won't be any new ones till the new one's created and the callback thread moves on. When an rpc encounters a problem that may suggest callback rpc's aren't working any longer, it normally sets NFSD4_CB_DOWN, so the server can tell the client something's wrong. But if the rpc notices CB_UPDATE is set, then the failure may just be a normal result of shutting down the callback client. Or it could just be a coincidence, but in any case, it means we're runing with the old about-to-be-discarded client, so the failure's not interesting. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-12-23 10:08:14 -05:00
Vegard Nossum	b0918d9f47	udf: limit the maximum number of indirect extents in a row udf_next_aext() just follows extent pointers while extents are marked as indirect. This can loop forever for corrupted filesystem. Limit number the of indirect extents we are willing to follow in a row. [JK: Updated changelog, limit, style] Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: stable@vger.kernel.org Cc: Jan Kara <jack@suse.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>	2015-12-23 11:47:55 +01:00
Linus Torvalds	0bee6ec80b	Just one fix for a NFSv4 callback bug introduced in 4.4. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWeF4NAAoJECebzXlCjuG+PvcQAL3AvxDzDnaNFhZJgWZMnRyC OlXlPE4clfiFXSB7C39xNBcn7eCJYLkINCQLu4ywAS+y7/22sX7unCTt7UXL99K3 GffV/QvxOatSssik+CtS9gIMkRLW9Fs6fuQZ4k5w+UtveISpyFoRfw8hbISABL1w NtgGIESXL8WXO+OSbVF/wRV8g1+FVi/gXWAOAoUtHBzyUho2JfECXO1XYz6mQ44M HN4Bvx75dU3SieECHRKsq8yRbkPYHP9ron/+MskBZm7VkV/6mboFlFfivNncid0Y ivpjeYP5xTj4KoXPlQ3feA9AbADNVshAKeDQYpDxRJimMjr6VVRFVDpzbKJc+5ou if9AjZUiX02mHZShKMDsJR3kHBu+OzWLtQIDJUtLTIAaeb+V/2NEScnCyCIibXv7 l52zqJ7upEYFuUGFYIZgsEKZgOAm7e3appIAtGG5Nt9ejUVR1LVPfsa8u2xXhUgp FN1TLmeQw6ZLRXcXa7vHcyQh/gJbPsm3PH514QYS+G3nMyXG8XnYKlMe98uhReno A3MH5MxfgyiuUITJopVpZfKoEFpYcid21osmVqiZfawoxr4iTocogDArETW7prCL QjN9sF+drlG70m/unDBKpQMPI0fhlmjY/VrK9YNlgvNaYKsJFVJnVFE1rCOuzj01 ekT3egZmGUR7kX94DuTt =UJhV -----END PGP SIGNATURE----- Merge tag 'nfsd-4.4-1' of git://linux-nfs.org/~bfields/linux Pull nfsd fix from Bruce Fields: "Just one fix for a NFSv4 callback bug introduced in 4.4" * tag 'nfsd-4.4-1' of git://linux-nfs.org/~bfields/linux: nfsd: don't hold ls_mutex across a layout recall	2015-12-22 15:52:32 -08:00
Jaegeuk Kim	7441ccef33	f2fs: use atomic variable for total_extent_tree It would be better to use atomic variable for total_extent_tree. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-12-22 10:31:41 -08:00
Junxiao Bi	a93a998382	gfs2: fix flock panic issue Commit `4f6563677a` ("Move locks API users to locks_lock_inode_wait()") moved flock/posix lock identify code to locks_lock_inode_wait(), but missed to set fl_flags to FL_FLOCK which will cause kernel panic in locks_lock_inode_wait(). Fixes: `4f6563677a` ("Move locks API users to locks_lock_inode_wait()") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-12-22 08:06:08 -06:00
Filipe Manana	e44081ef61	Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups We call btrfs_write_dirty_block_groups() in the critical section of a transaction's commit, when no other tasks can join the transaction and add more block groups to the transaction's list of dirty block groups, so we not taking the dirty block groups spinlock when checking for the list's emptyness, grabbing its first element or deleting elements from it. However there's a special and rare case where we can have a concurrent task adding elements to this list. We trigger writeback for space caches before at btrfs_start_dirty_block_groups() and in past iterations of the loop at btrfs_write_dirty_block_groups(), this means that when the writeback finishes (which happens asynchronously) it creates a task for the endio free space work queue that executes btrfs_finish_ordered_io() - this function is able to join the transaction, through btrfs_join_transaction_nolock(), and update the free space cache's inode item in the root tree, which can result in COWing nodes of this tree and therefore allocation of a new block group can happen, which gets added to the transaction's list of dirty block groups while the transaction commit task is operating on it concurrently. So fix this by taking the dirty block groups spinlock before doing operations on the dirty block groups list at btrfs_write_dirty_block_groups(). Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-21 17:51:22 +00:00
Borislav Petkov	362f924b64	x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros Those are stupid and code should use static_cpu_has_safe() or boot_cpu_has() instead. Kill the least used and unused ones. The remaining ones need more careful inspection before a conversion can happen. On the TODO. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de Cc: David Sterba <dsterba@suse.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2015-12-19 11:49:55 +01:00
Linus Torvalds	fc315e3e5c	Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A couple of small fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: check prepare_uptodate_page() error code earlier Btrfs: check for empty bitmap list in setup_cluster_bitmaps btrfs: fix misleading warning when space cache failed to load Btrfs: fix transaction handle leak in balance Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list	2015-12-18 15:35:08 -08:00
Colin Ian King	41a0c249cb	proc: fix -ESRCH error when writing to /proc/$pid/coredump_filter Writing to /proc/$pid/coredump_filter always returns -ESRCH because commit `774636e19e` ("proc: convert to kstrto()/kstrto_from_user()") removed the setting of ret after the get_proc_task call and incorrectly left it as -ESRCH. Instead, return 0 when successful. Example breakage: echo 0 > /proc/self/coredump_filter bash: echo: write error: No such process Fixes: `774636e19e` ("proc: convert to kstrto()/kstrto_from_user()") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-12-18 14:25:40 -08:00
Chris Mason	f7d3d2f99e	Merge branch 'freespace-tree' into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>	2015-12-18 11:11:10 -08:00
Bob Peterson	6cc4b6e801	GFS2: Don't do glock put on when inode creation fails Currently the error path of function gfs2_inode_lookup calls function gfs2_glock_put corresponding to an earlier call to gfs2_glock_get for the inode glock. That's wrong because the error path also calls iget_failed() which eventually calls iput, which eventually calls gfs2_evict_inode, which does another gfs2_glock_put. This double-put can cause the glock reference count to get off. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-12-18 11:04:46 -06:00
Bob Peterson	5ea31bc0a6	GFS2: Always use iopen glock for gl_deletes Before this patch, when function try_rgrp_unlink queued a glock for delete_work to reclaim the space, it used the inode glock to do so. That's different from the iopen callback which uses the iopen glock for the same purpose. We should be consistent and always use the iopen glock. This may also save us reference counting problems with the inode glock, since clear_glock does an extra glock_put() for the inode glock. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-12-18 11:02:52 -06:00
Bob Peterson	783013c0f5	GFS2: Release iopen glock in gfs2_create_inode error cases Some error cases in gfs2_create_inode were not unlocking the iopen glock, getting the reference count off. This adds the proper unlock. The error logic in function gfs2_create_inode was also convoluted, so this patch simplifies it. It also takes care of a bug in which gfs2_qa_delete() was not called in an error case. Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2015-12-18 10:57:21 -06:00
Bob Peterson	ee530beafe	GFS2: Truncate address space mapping when deleting an inode In function gfs2_delete_inode() we write and flush the mapping for a glock, among other things. We truncate the mapping for the inode, but we never truncate the mapping for the glock. This patch makes it also truncate the metamapping. This avoid cases where the glock is reused by another process who is trying to recreate an inode in its place using the same block. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2015-12-18 10:52:21 -06:00
Bob Peterson	86d067a797	GFS2: Wait for iopen glock dequeues This patch changes every glock_dq for iopen glocks into a dq_wait. This makes sure that iopen glocks do not outlive the inode itself. In turn, that ensures that anyone trying to unlink the glock will be able to find the inode when it receives a remote iopen callback. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2015-12-18 10:49:22 -06:00
Paul Gortmaker	9189922675	fs: make locks.c explicitly non-modular The Kconfig currently controlling compilation of this code is: config FILE_LOCKING bool "Enable POSIX file locking API" if EXPERT ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modularity so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering gets bumped to one level earlier when we use the more appropriate fs_initcall here. However we've made similar changes before without any fallout and none is expected here either. Cc: Jeff Layton <jlayton@poochiereds.net> Acked-by: Jeff Layton <jlayton@poochiereds.net> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>	2015-12-18 07:05:06 -05:00
David S. Miller	b3e0d3d7ba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-12-17 22:08:28 -05:00
Filipe Manana	0376374a98	Btrfs: fix locking bugs when defragging leaves When running fstests btrfs/070, with a higher number of fsstress operations, I ran frequently into two different locking bugs when defragging directories. The first bug produced the following traces: [133860.229792] ------------[ cut here ]------------ [133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]() [133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.286827] 0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000 [133860.288341] ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00 [133860.294219] ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0 [133860.295831] Call Trace: [133860.306518] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [133860.307473] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [133860.308619] [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.310068] [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c [133860.312552] [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs] [133860.314630] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.323596] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.325233] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.332427] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.337259] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.340147] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.344833] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.346343] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.353248] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.354242] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.355232] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.356237] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.358587] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.360195] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.361380] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.363578] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f [133860.366217] ---[ end trace 2cadb2f653437e49 ]--- [133860.367399] ------------[ cut here ]------------ [133860.368162] kernel BUG at fs/btrfs/locking.c:307! [133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport [133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1 [133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000 [133860.370205] RIP: 0010:[<ffffffffa052d466>] [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs] [133860.370205] RSP: 0018:ffff880207697bc0 EFLAGS: 00010246 [133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000 [133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00 [133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000 [133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00 [133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000 [133860.370205] FS: 00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [133860.370205] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0 [133860.370205] Stack: [133860.370205] ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8 [133860.370205] ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590 [133860.370205] ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000 [133860.370205] Call Trace: [133860.370205] [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs] [133860.370205] [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs] [133860.370205] [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs] [133860.370205] [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs] [133860.370205] [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs] [133860.370205] [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs] [133860.370205] [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs] [133860.370205] [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7 [133860.370205] [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7 [133860.370205] [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174 [133860.370205] [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6 [133860.370205] [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e [133860.370205] [<ffffffff8118574d>] ? __fget_light+0x4d/0x71 [133860.370205] [<ffffffff8117c726>] SyS_ioctl+0x57/0x79 [133860.370205] [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f This bug happened because we assumed that by setting keep_locks to 1 in our search path, our path after a call to btrfs_search_slot() would have all nodes locked, which is not always true because unlock_up() (called by btrfs_search_slot()) will unlock a node in a path if the slot of the node below it doesn't point to the last item or beyond the last item. For example, when the tree has a heigth of 2 and path->slots[0] has a value smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2 will be unlocked (also because lowest_unlock is set to 1 due to the fact that the value passed as ins_len to btrfs_search_slot is 0). This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(), to release out path and call again btrfs_search_slot(), but this time with the cow parameter set to 0, meaning the resulting path got only read locks. Therefore when we called btrfs_realloc_node(), with path->nodes[1] having a read lock, it resulted in the warning and BUG_ON when calling btrfs_set_lock_blocking() against the node, as that function expects the node to have a write lock. The second bug happened often when the first bug didn't happen, and made us hang and hitting the following warning at fs/btrfs/locking.c: 251 void btrfs_tree_lock(struct extent_buffer *eb) 252 { 253 WARN_ON(eb->lock_owner == current->pid); This happened because the tree search we made at btrfs_defrag_leaves() before calling btrfs_find_next_key() locked a leaf and all the other nodes in the path, so btrfs_find_next_key() had no need to release the path and make a new search (with path->lowest_level set to 1). This made btrfs_realloc_node() attempt to write lock the same leaf again, resulting in a hang/deadlock. So fix these issues by calling btrfs_find_next_key() after calling btrfs_realloc_node() and setting the search path's lowest_level to 1 to avoid the hang/deadlock when attempting to write lock the leaves at btrfs_realloc_node(). Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-12-18 02:51:32 +00:00
Omar Sandoval	70f6d82ec7	Btrfs: add free space tree mount option Now we can finally hook up everything so we can actually use free space tree. The free space tree is enabled by passing the space_cache=v2 mount option. On the first mount with the this option set, the free space tree will be created and the FREE_SPACE_TREE read-only compat bit will be set. Any time the filesystem is mounted from then on, we must use the free space tree. The clear_cache option will also clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:47 -08:00
Omar Sandoval	1e144fb8f4	Btrfs: wire up the free space tree to the extent tree The free space tree is updated in tandem with the extent tree. There are only a handful of places where we need to hook in: 1. Block group creation 2. Block group deletion 3. Delayed refs (extent creation and deletion) 4. Block group caching Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:47 -08:00
Omar Sandoval	7c55ee0c4a	Btrfs: add free space tree sanity tests This tests the operations on the free space tree trying to excercise all of the main cases for both formats. Between this and xfstests, the free space tree should have pretty good coverage. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:47 -08:00
Omar Sandoval	a5ed918285	Btrfs: implement the free space B-tree The free space cache has turned out to be a scalability bottleneck on large, busy filesystems. When the cache for a lot of block groups needs to be written out, we can get extremely long commit times; if this happens in the critical section, things are especially bad because we block new transactions from happening. The main problem with the free space cache is that it has to be written out in its entirety and is managed in an ad hoc fashion. Using a B-tree to store free space fixes this: updates can be done as needed and we get all of the benefits of using a B-tree: checksumming, RAID handling, well-understood behavior. With the free space tree, we get commit times that are about the same as the no cache case with load times slower than the free space cache case but still much faster than the no cache case. Free space is represented with extents until it becomes more space-efficient to use bitmaps, giving us similar space overhead to the free space cache. The operations on the free space tree are: adding and removing free space, handling the creation and deletion of block groups, and loading the free space for a block group. We can also create the free space tree by walking the extent tree and clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:47 -08:00

... 2 3 4 5 6 ...

43101 Commits