linux

Commit Graph

Author	SHA1	Message	Date
Naoya Horiguchi	14eb6fdd42	smaps: remove mem_size_stats->vma and use walk_page_vma() pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:05 -08:00
Konstantin Khlebnikov	05fbf357d9	proc/pagemap: walk page tables under pte lock Lockless access to pte in pagemap_pte_range() might race with page migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page(): CPU A (pagemap) CPU B (migration) lock_page() try_to_unmap(page, TTU_MIGRATION...) make_migration_entry() set_pte_at() <read *pte> pte_to_pagemap_entry() remove_migration_ptes() unlock_page() if(is_migration_entry()) migration_entry_to_page() BUG_ON(!PageLocked(page)) Also lockless read might be non-atomic if pte is larger than wordsize. Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes. Fixes: `052fb0d635` ("proc: report file/anon bit in /proc/pid/pagemap") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:05 -08:00
Konstantin Khebnikov	8d38633c3b	page_writeback: put account_page_redirty() after set_page_dirty() Helper account_page_redirty() fixes dirty pages counter for redirtied pages. This patch puts it after dirtying and prevents temporary underflows of dirtied pages counters on zone/bdi and current->nr_dirtied. Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:04 -08:00
Kirill A. Shutemov	dc6c9a35b6	mm: account pmd page tables to the process Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror("re-mmap"), exit(1); } printf("PID %d consumed %lu KiB in PMD page tables\n", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:04 -08:00
Wang, Yalin	56873f43ab	mm:add KPF_ZERO_PAGE flag for /proc/kpageflags Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can detect zero_page in /proc/kpageflags, and then do memory analysis more accurately. Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:00 -08:00
Chao Yu	1a118ccfd6	f2fs: use spinlock for segmap_lock instead of rwlock rwlock can provide better concurrency when there are much more readers than writers because readers can hold the rwlock simultaneously. But now, for segmap_lock rwlock in struct free_segmap_info, there is only one reader 'mount' from below call path: ->f2fs_fill_super ->build_segment_manager ->build_dirty_segmap ->init_dirty_segmap ->find_next_inuse read_lock ... read_unlock Now that our concurrency can not be improved since there is no other reader for this lock, we do not need to use rwlock_t type for segmap_lock, let's replace it with spinlock_t type. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:51 -08:00
Jaegeuk Kim	f1a3b98e73	f2fs: fix accessing wrong indexed data blocks This patch fixes the following test. This causes: attempt to access beyond end of device sdb2: rw=16384, want=14413962000, limit=16777216 The reason is: - f2fs_write_begin - f2fs_convert_inline_inode returns -ENOSPC - f2fs_write_failed - truncate_blocks - truncate_partial_data_page - find_data_page - get_dnode_of_data returns wrong data index retrieved from inline_data - f2fs_submit_page_bio(wrong data index) - submit_bio(wrong data index) Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:51 -08:00
Jaegeuk Kim	60a3b782b1	f2fs: avoid variable length array Instead of using variable length array, this patch let preallocate memory for them. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:50 -08:00
Jaegeuk Kim	29e7043f40	f2fs: fix sparse warnings This patch resolves the following warnings. include/trace/events/f2fs.h:150:1: warning: expression using sizeof bool include/trace/events/f2fs.h:180:1: warning: expression using sizeof bool include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool include/trace/events/f2fs.h:990:1: warning: expression using sizeof bool include/trace/events/f2fs.h:150:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1) include/trace/events/f2fs.h:180:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1) include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1) include/trace/events/f2fs.h:990:1: warning: odd constant _Bool cast (ffffffffffffffff becomes 1) fs/f2fs/checkpoint.c:27:19: warning: symbol 'inode_entry_slab' was not declared. Should it be static? fs/f2fs/checkpoint.c:577:15: warning: cast to restricted __le32 fs/f2fs/checkpoint.c:592:15: warning: cast to restricted __le32 fs/f2fs/trace.c:19:1: warning: symbol 'pids' was not declared. Should it be static? fs/f2fs/trace.c:21:21: warning: symbol 'last_io' was not declared. Should it be static? Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:49 -08:00
Jaegeuk Kim	59b802e5a4	f2fs: allocate data blocks in advance for f2fs_direct_IO This patch adds preallocation for data blocks to prepare f2fs_direct_IO. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:49 -08:00
Jaegeuk Kim	f7ef9b83b5	f2fs: introduce macros to convert bytes and blocks in f2fs This patch adds two macros for transition between byte and block offsets. Currently, f2fs only supports 4KB blocks, so use the default size for now. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:48 -08:00
Jaegeuk Kim	da17eece03	f2fs: call set_buffer_new for get_block This patch fixes wrong handling of buffer_new flag in get_block. If f2fs allocates new blocks and mapped buffer_head, it needs to set buffer_new for the bh_result. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:47 -08:00
Jaegeuk Kim	aaf9607516	f2fs: check node page contents all the time In get_node_page, if the page is up-to-date, we assumed that the page was not reclaimed at all. But, sometimes it was reported that its contents was missing. So, just for sure, let's check its mapping and contents. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:46 -08:00
Chao Yu	2e023174a8	f2fs: avoid data offset overflow when lseeking huge file xfstest generic/285 complains our issue in lseeking huge file. Here is the detail output of generic/285: "./check -f2fs tests/generic/285 Ran: generic/285 Failures: generic/285 Failed 1 of 1 tests 10. Test a huge file for offset overflow 10.01 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ 10.02 SEEK_HOLE expected 65536 or 8589934592, got 65536. succ 10.03 SEEK_DATA expected 0 or 0, got 0. succ 10.04 SEEK_DATA expected 1 or 1, got 1. succ 10.05 SEEK_HOLE expected 8589934592 or 8589934592, got 0. FAIL 10.06 SEEK_DATA expected 8589869056 or 8589869056, got 8589869056. succ 10.07 SEEK_DATA expected 8589869057 or 8589869057, got 8589869057. succ 10.08 SEEK_DATA expected 8589869056 or 8589869056, got 4294901760. FAIL" The reason of this issue is: We will calculate current offset through left shifting page-offset with PAGE_CACHE_SHIFT bits, but our page-offset is a type of unsigned long, its size is 4 bytes in 32-bits machine. So if our page-offset is bigger than (1 << 32 / pagesize - 1), result of left shifting will overflow. Let's fix this issue by casting type of page-offset to type of current offset: loff_t. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:46 -08:00
Chao Yu	560d4672e2	f2fs: fix to use highmem for pages of newly created directory In commit `a78186ebe5` ("f2fs: use highmem for directory pages"), we have set __GFP_HIGHMEM into dir mapping's gfp flag in f2fs_iget, so high address memory could be used for these existing dir's page. But we forgot to set flag for newly created dir, due to this reason, our newly created dir pages could not be allocated from high address memory. Fix it. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:45 -08:00
Jaegeuk Kim	bba681cbb2	f2fs: introduce a batched trim This patch introduces a batched trimming feature, which submits split discard commands. This is to avoid long latency due to huge trim commands. If fstrim was triggered ranging from 0 to the end of device, we should lock all the checkpoint-related mutexes, resulting in very long latency. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:44 -08:00
Chao Yu	487261f39b	f2fs: merge {invalidate,release}page for meta/node/data pages This patch merges ->{invalidate,release}page function for meta/node/data pages. After this, duplication of codes could be removed. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:44 -08:00
Jaegeuk Kim	d24bdcbfc6	f2fs: show the number of writeback pages in stat This patch adds the # of writeback pages in stat info. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:43 -08:00
Jaegeuk Kim	f68daeebba	f2fs: keep PagePrivate during releasepage If PagePrivate is removed by releasepage, f2fs loses counting dirty pages. e.g., try_to_release_page will not release page when the page is dirty, but our releasepage removes PagePrivate. [<ffffffff81188d75>] try_to_release_page+0x35/0x50 [<ffffffff811996f9>] invalidate_inode_pages2_range+0x2f9/0x3b0 [<ffffffffa02a7f54>] ? truncate_blocks+0x384/0x4d0 [f2fs] [<ffffffffa02b7583>] ? f2fs_direct_IO+0x283/0x290 [f2fs] [<ffffffffa02b7fb0>] ? get_data_block_fiemap+0x20/0x20 [f2fs] [<ffffffff8118aa53>] generic_file_direct_write+0x163/0x170 [<ffffffff8118ad06>] __generic_file_write_iter+0x2a6/0x350 [<ffffffff8118adef>] generic_file_write_iter+0x3f/0xb0 [<ffffffff81203081>] new_sync_write+0x81/0xb0 [<ffffffff81203837>] vfs_write+0xb7/0x1f0 [<ffffffff81204459>] SyS_write+0x49/0xb0 [<ffffffff817c286d>] system_call_fastpath+0x16/0x1b Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:42 -08:00
Jaegeuk Kim	081d78c2fc	f2fs: should fail mount when trying to recover data on read-only dev If device is read-only, we should not proceed data recovery. But, if the previous checkpoint was done by normal clean shutdown, it's safe to proceed the recovery, since there will be no data to be recovered. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:42 -08:00
Jaegeuk Kim	119ee91445	f2fs: split UMOUNT and FASTBOOT flags This patch adds FASTBOOT flag into checkpoint as follows. - CP_UMOUNT_FLAG is set when system is umounted. - CP_FASTBOOT_FLAG is set when intermediate checkpoint having node summaries was done. So, if you get CP_UMOUNT_FLAG from checkpoint, the system was umounted cleanly. Instead, if there was sudden-power-off, you can get CP_FASTBOOT_FLAG or nothing. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:41 -08:00
Jaegeuk Kim	11504a8e7e	f2fs: avoid write_checkpoint if f2fs is mounted readonly Do not change any partition when f2fs is changed to readonly mode. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:40 -08:00
Jaegeuk Kim	2d834bf9ac	f2fs: support norecovery mount option This patch adds a mount option, norecovery, which is mostly same as disable_roll_forward. The only difference is that norecovery should be activated with read-only mount option. This can be used when user wants to check whether f2fs is mountable or not without any recovery process. (e.g., xfstests/200) Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:40 -08:00
Jaegeuk Kim	dabc4a5c60	f2fs: fix not to drop mount options when retrying fill_super If wrong mount option was requested, f2fs tries to fill_super again. But, during the next trial, f2fs has no valid mount options, since parse_options deleted all the separators in the original string. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:39 -08:00
Chao Yu	caf0047e7e	f2fs: merge flags in struct f2fs_sb_info Currently, there are several variables with Boolean type as below: struct f2fs_sb_info { ... int s_dirty; bool need_fsck; bool s_closing; ... bool por_doing; ... } For this there are some issues: 1. there are some space of f2fs_sb_info is wasted due to aligning after Boolean type variables by compiler. 2. if we continuously add new flag into f2fs_sb_info, structure will be messed up. So in this patch, we try to: 1. switch s_dirty to Boolean type variable since it has two status 0/1. 2. merge s_dirty/need_fsck/s_closing/por_doing variables into s_flag. 3. introduce an enum type which can indicate different states of sbi. 4. use new introduced universal interfaces is_sbi_flag_set/{set,clear}_sbi_flag to operate flags for sbi. After that, above issues will be fixed. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:38 -08:00
Chao Yu	88dd893419	f2fs: clean up {in,de}create_sleep_time Use pointer parameter @wait to pass result in {in,de}create_sleep_time for cleanup. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:37 -08:00
Chao Yu	feeb0debfb	f2fs: make truncate_inline_date static 1. make truncate_inline_date static; 2. remove parameter @from of truncate_inline_date as callers only pass zero. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:37 -08:00
Kinglong Mee	3b6709b771	f2fs: fix a bug of inheriting default ACL from parent Introduced by `a6dda0e63e` "f2fs: use generic posix ACL infrastructure". When testing default acl, gets in recent kernel (3.19.0-rc5), user::rwx group::r-x other::r-x default:user::rwx default:group::r-x default:group:root:rwx default:mask::rwx default:other::r-x ]# getfacl testdir/ user::rwx group::rwx // missing an acl "group:root:rwx" inherited from parent other::r-x default:user::rwx default:group::r-x default:group:root:rwx default:mask::rwx default:other::r-x Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:36 -08:00
Chao Yu	f28e503429	f2fs: use f2fs_radix_tree_insert to clean codes No modification in functionality, just clean codes with f2fs_radix_tree_insert. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:35 -08:00
Chao Yu	d49f3e8902	f2fs: add F2FS_IOC_GETVERSION support In this patch we add the FS_IOC_GETVERSION ioctl for getting i_generation from inode, after that, users can list file's generation number by using "lsattr -v". Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:35 -08:00
Jaegeuk Kim	bc4a1f873b	f2fs: leave comment for code readability During the recovery, any xattr blocks should not be found, since they are written into cold log, not the warm node chain. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:34 -08:00
Chao Yu	1601839e9e	f2fs: fix to release count of meta page in ->invalidatepage We will encounter deadloop in below scenario: 1. increase page count for F2FS_DIRTY_META type in following path: ->recover_fsync_data ->recover_data ->do_recover_data ->recover_data_page ->change_curseg ->write_sum_page ->set_page_dirty 2. fail in recover_data() 3. invalidate meta pages in truncate_inode_pages_final without decreasing page count. 4. deadloop when sync_meta_pages as page count will always be non-zero. message: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [<c1129a37>] pagevec_lookup_tag+0x27/0x30 [<f0e774c7>] sync_meta_pages+0x87/0x160 [f2fs] [<f0e86dd9>] recover_fsync_data+0xeb9/0xf10 [f2fs] [<f0e75398>] f2fs_fill_super+0x888/0x980 [f2fs] [<c11733ca>] mount_bdev+0x16a/0x1a0 [<f0e7180f>] f2fs_mount+0x1f/0x30 [f2fs] [<c1173da6>] mount_fs+0x36/0x170 [<c118b6f5>] vfs_kern_mount+0x55/0xe0 [<c118d63f>] do_mount+0x1df/0x9f0 [<c118e110>] SyS_mount+0x70/0xb0 [<c15a0c48>] sysenter_do_call+0x12/0x12 To avoid page count leak, let's add ->invalidatepage and ->releasepage in f2fs_meta_aops as f2fs_node_aops to release meta page count correctly. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:33 -08:00
Jaegeuk Kim	85dc2f2c6c	f2fs: do checkpoint when umount flag is not set If the previous checkpoint was done without CP_UMOUNT flag, it needs to do checkpoint with CP_UMOUNT for the next fast boot. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:33 -08:00
Jaegeuk Kim	30a5537f9a	f2fs: trigger correct checkpoint during umount This patch fixes to trigger checkpoint with umount flag when kill_sb was called. In kill_sb, f2fs_sync_fs was finally called, but at this time, f2fs can't do checkpoint with CP_UMOUNT. After then, f2fs_put_super is not doing checkpoint, since it is not dirty. So, this patch adds a flag to indicate f2fs_sync_fs is called during umount. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:32 -08:00
Jaegeuk Kim	6f0aacbc3c	f2fs: update memory footprint information This patch adds missing memory usages, and splits them in detail. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:31 -08:00
Chao Yu	9066c6a7eb	f2fs: fix wrong memory footprint statistics in debugfs Our value of memory footprint statistics showed in debugfs is not calculated correctly. Fix it in this patch. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:31 -08:00
Jaegeuk Kim	871f599f4a	f2fs: avoid infinite loop on cp_error If cp_error is set, we should avoid all the infinite loop. In f2fs_sync_file, there is a hole, and this patch fixes that. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2015-02-11 17:04:30 -08:00
Trond Myklebust	a4f743a6bb	NFSv4.1: Convert open-coded array allocation calls to kmalloc_array() For added overflow protection... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-11 19:02:52 -05:00
Trond Myklebust	d8ba1f9714	NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args If the call to decode_rc_list() fails due to a memory allocation error, then we need to truncate the array size to ensure that we only call kfree() on those pointer that were allocated. Reported-by: David Ramos <daramos@stanford.edu> Fixes: `4aece6a19c` ("nfs41: cb_sequence xdr implementation") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-11 19:02:51 -05:00
Linus Torvalds	c5ce28df0e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) More iov_iter conversion work from Al Viro. [ The "crypto: switch af_alg_make_sg() to iov_iter" commit was wrong, and this pull actually adds an extra commit on top of the branch I'm pulling to fix that up, so that the pre-merge state is ok. - Linus ] 2) Various optimizations to the ipv4 forwarding information base trie lookup implementation. From Alexander Duyck. 3) Remove sock_iocb altogether, from CHristoph Hellwig. 4) Allow congestion control algorithm selection via routing metrics. From Daniel Borkmann. 5) Make ipv4 uncached route list per-cpu, from Eric Dumazet. 6) Handle rfs hash collisions more gracefully, also from Eric Dumazet. 7) Add xmit_more support to r8169, e1000, and e1000e drivers. From Florian Westphal. 8) Transparent Ethernet Bridging support for GRO, from Jesse Gross. 9) Add BPF packet actions to packet scheduler, from Jiri Pirko. 10) Add support for uniqu flow IDs to openvswitch, from Joe Stringer. 11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman Kwok. 12) More sanely handle out-of-window dupacks, which can result in serious ACK storms. From Neal Cardwell. 13) Various rhashtable bug fixes and enhancements, from Herbert Xu, Patrick McHardy, and Thomas Graf. 14) Support xmit_more in be2net, from Sathya Perla. 15) Group Policy extensions for vxlan, from Thomas Graf. 16) Remove Checksum Offload support for vxlan, from Tom Herbert. 17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From Vlad Yasevich. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits) crypto: fix af_alg_make_sg() conversion to iov_iter ipv4: Namespecify TCP PMTU mechanism i40e: Fix for stats init function call in Rx setup tcp: don't include Fast Open option in SYN-ACK on pure SYN-data openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set ipv6: Make __ipv6_select_ident static ipv6: Fix fragment id assignment on LE arches. bridge: Fix inability to add non-vlan fdb entry net: Mellanox: Delete unnecessary checks before the function call "vunmap" cxgb4: Add support in cxgb4 to get expansion rom version via ethtool ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version net: dsa: Remove redundant phy_attach() IB/mlx4: Reset flow support for IB kernel ULPs IB/mlx4: Always use the correct port for mirrored multicast attachments net/bonding: Fix potential bad memory access during bonding events tipc: remove tipc_snprintf tipc: nl compat add noop and remove legacy nl framework tipc: convert legacy nl stats show to nl compat tipc: convert legacy nl net id get to nl compat tipc: convert legacy nl net id set to nl compat ...	2015-02-10 20:01:30 -08:00
Linus Torvalds	992de5a8ec	Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: "Bite-sized chunks this time, to avoid the MTA ratelimiting woes. - fs/notify updates - ocfs2 - some of MM" That laconic "some MM" is mainly the removal of remap_file_pages(), which is a big simplification of the VM, and which gets rid of a lot of random cruft and special cases because we no longer support the non-linear mappings that it used. From a user interface perspective, nothing has changed, because the remap_file_pages() syscall still exists, it's just done by emulating the old behavior by creating a lot of individual small mappings instead of one non-linear one. The emulation is slower than the old "native" non-linear mappings, but nobody really uses or cares about remap_file_pages(), and simplifying the VM is a big advantage. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits) memcg: zap memcg_slab_caches and memcg_slab_mutex memcg: zap memcg_name argument of memcg_create_kmem_cache memcg: zap __memcg_{charge,uncharge}_slab mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check mm: hugetlb: fix type of hugetlb_treat_as_movable variable mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? mm: memory: merge shared-writable dirtying branches in do_wp_page() mm: memory: remove ->vm_file check on shared writable vmas xtensa: drop _PAGE_FILE and pte_file()-related helpers x86: drop _PAGE_FILE and pte_file()-related helpers unicore32: drop pte_file()-related helpers um: drop _PAGE_FILE and pte_file()-related helpers tile: drop pte_file()-related helpers sparc: drop pte_file()-related helpers sh: drop _PAGE_FILE and pte_file()-related helpers score: drop _PAGE_FILE and pte_file()-related helpers s390: drop pte_file()-related helpers parisc: drop _PAGE_FILE and pte_file()-related helpers openrisc: drop _PAGE_FILE and pte_file()-related helpers nios2: drop _PAGE_FILE and pte_file()-related helpers ...	2015-02-10 16:45:56 -08:00
Linus Torvalds	b2718bffb4	Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Pull gfs2 updates from Steven Whitehouse: "This time we have mostly clean ups. There is a bug fix for a NULL dereference relating to ACLs, and another which improves (but does not fix entirely) an allocation fall-back code path. The other three patches are small clean ups" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: GFS2: Fix crash during ACL deletion in acl max entry check in gfs2_set_acl() GFS2: use __vmalloc GFP_NOFS for fs-related allocations. GFS2: Eliminate a nonsense goto GFS2: fix sprintf format specifier GFS2: Eliminate __gfs2_glock_remove_from_lru	2015-02-10 16:20:49 -08:00
Linus Torvalds	ae90fb1420	Merge tag 'xfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs update from Dave Chinner: "This update contains: - RENAME_EXCHANGE support - Rework of the superblock logging infrastructure - Rework of the XFS_IOCTL_SETXATTR implementation * enables use inside user namespaces * fixes inconsistencies setting extent size hints - fixes for missing buffer type annotations used in log recovery - more consolidation of libxfs headers - preparation patches for block based PNFS support - miscellaneous bug fixes and cleanups" * tag 'xfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (37 commits) xfs: only trace buffer items if they exist xfs: report proper f_files in statfs if we overshoot imaxpct xfs: fix panic_mask documentation xfs: xfs_ioctl_setattr_check_projid can be static xfs: growfs should use synchronous transactions xfs: fix behaviour of XFS_IOC_FSSETXATTR on directories xfs: factor projid hint checking out of xfs_ioctl_setattr xfs: factor extsize hint checking out of xfs_ioctl_setattr xfs: XFS_IOCTL_SETXATTR can run in user namespaces xfs: kill xfs_ioctl_setattr behaviour mask xfs: disaggregate xfs_ioctl_setattr xfs: factor out xfs_ioctl_setattr transaciton preamble xfs: separate xflags from xfs_ioctl_setattr xfs: FSX_NONBLOCK is not used xfs: don't allocate an ioend for direct I/O completions xfs: change kmem_free to use generic kvfree() xfs: factor out a xfs_update_prealloc_flags() helper xfs: remove incorrect error negation in attr_multi ioctl xfs: set superblock buffer type correctly xfs: set buf types when converting extent formats ...	2015-02-10 16:15:17 -08:00
Linus Torvalds	c5452a58db	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota interface unification and misc cleanups from Jan Kara: "The first part of the series unifying XFS and VFS quota interfaces. This part unifies turning quotas on and off so quota-tools and xfs_quota can be used to manage any filesystem. This is useful so that userspace doesn't have to distinguish which filesystem it is working with. As a result we can then easily reuse tests for project quotas in XFS for ext4. This also contains minor cleanups and fixes for udf, isofs, and ext3" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (23 commits) udf: remove bool assignment to 0/1 udf: use bool for done quota: Store maximum space limit in bytes quota: Remove quota_on_meta callback ocfs2: Use generic helpers for quotaon and quotaoff ext4: Use generic helpers for quotaon and quotaoff quota: Add ->quota_{enable,disable} callbacks for VFS quotas quota: Wire up ->quota_{enable,disable} callbacks into Q_QUOTA{ON,OFF} quota: Split ->set_xstate callback into two xfs: Remove some pointless quota checks xfs: Remove some useless flags tests xfs: Remove useless test quota: Verify flags passed to Q_SETINFO quota: Cleanup flags definitions ocfs2: Move OLQF_CLEAN flag out of generic quota flags quota: Don't store flags for v2 quota format jbd: drop jbd_ENOSYS debug udf: destroy sbi mutex in put_super udf: Check length of extended attributes and allocation descriptors udf: Remove repeated loads blocksize ...	2015-02-10 15:52:38 -08:00
Linus Torvalds	4b4f8580a4	File locking related changes for v3.20 (pile #1) Merge tag 'locks-v3.20-1' of git://git.samba.org/jlayton/linux Pull file locking related changes #1 from Jeff Layton: "This patchset contains a fairly major overhaul of how file locks are tracked within the inode. Rather than a single list, we now create a per-inode "lock context" that contains individual lists for the file locks, and a new dedicated spinlock for them. 
There are changes in other trees that are based on top of this set so it may be easiest to pull this in early" * tag 'locks-v3.20-1' of git://git.samba.org/jlayton/linux: locks: update comments that refer to inode->i_flock locks: consolidate NULL i_flctx checks in locks_remove_file locks: keep a count of locks on the flctx lists locks: clean up the lm_change prototype locks: add a dedicated spinlock to protect i_flctx lists locks: remove i_flock field from struct inode locks: convert lease handling to file_lock_context locks: convert posix locks to file_lock_context locks: move flock locks to file_lock_context ceph: move spinlocking into ceph_encode_locks_to_buffer and ceph_count_locks locks: add a new struct file_locking_context pointer to struct inode locks: have locks_release_file use flock_lock_file to release generic flock locks locks: add new struct list_head to struct file_lock	2015-02-10 15:34:42 -08:00
Kirill A. Shutemov	27ba0644ea	rmap: drop support of non-linear mappings We don't create non-linear mappings anymore. Let's drop code which handles them in rmap. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:31 -08:00
Kirill A. Shutemov	1da4b35b00	proc: drop handling non-linear mappings We have to handle non-linear mappings for /proc/PID/{smaps,clear_refs} which is unused now. Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:31 -08:00
Kirill A. Shutemov	d83a08db5b	mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:30 -08:00
Dmitry Monakhov	913e027ca1	fsioctl.c: make generic_block_fiemap() signal-tolerant __generic_block_fiemap may spin for a very long time on large sparse files. Without this patch an unprivileged user may abuse system resources simply by spawning a vast number of unkillable busyloops (works on ext2/ext3): truncate --size 1T test; for ((i=0;i<1024;i++)); do filefrag test > /dev/null & done Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:30 -08:00
Srinivas Eeda	99b8874e79	o2dlm: fix NULL pointer dereference in o2dlm_blocking_ast_wrapper A tiny race between BAST and unlock message causes the NULL dereference. A node sends an unlock request to master and receives a response. Before processing the response it receives a BAST from the master. Since both requests are processed by different threads it creates a race. While the BAST is being processed, lock can get freed by unlock code. This patch makes bast to return immediately if lock is found but unlock is pending. The code should handle this race. We also have to fix master node to skip sending BAST after receiving unlock message. Below is the crash stack BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 IP: o2dlm_blocking_ast_wrapper+0xd/0x16 dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm] dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm] o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager] o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager] worker_thread+0x14d/0x1ed [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:30 -08:00
alex chen	10ab88117d	ocfs2: prune the dcache before deleting the dentry of directory In ocfs2_dentry_convert_worker, we should prune the dcache before deleting the dentry of directory, otherwise, in the following cases the inode of directory will still remain in orphan directory until the device being umounted. Mount point: /mnt/ocfs2 Node A Node B mkdir /mnt/ocfs2/testdir ocfs2_mkdir ->ocfs2_mknod ->ocfs2_dentry_attach_lock ->ocfs2_dentry_lock(dentry, 0) ... ... touch /mnt/ocfs2/testdir/testfile unlink /mnt/test/testdir/testfile rmdir /mnt/ocfs2/testdir ocfs2_unlink ->ocfs2_remote_dentry_delete ->ocfs2_dentry_lock(dentry, 1) ... ... ... ... ocfs2_downconvert_thread ->ocfs2_unblock_lock ->ocfs2_dentry_convert_worker ->ocfs2_find_local_alias ->dget_dlock ->d_delete Here the dentry can not be released because the children's dentry is negative but still exist. Finally, this inode will still remain in orphan directory until its children are destroyed. So before deleting dentry of directory, we should prune the dcache to remove unused children of the parent dentry by shrink_dcache_parent(). Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:30 -08:00
Fabian Frederick	8989b67330	ocfs2: make resv_lock spinlock static resv_lock is only used in reservations.c Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Daeseok Youn	9d6008c759	ocfs2: remove unreachable code in __ocfs2_recovery_thread() Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Daeseok Youn	9b57269170	ocfs2: removes mlog_errno() call twice in ocfs2_find_dir_space_el() mlog_errno() is called twice when some functions fail. Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Daeseok Youn	592a202a3d	ocfs2: remove unreachable code Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Dan Carpenter	e6d9f86d6b	ocfs2: o2net: silence uninitialized variable warning Smatch complains that, if o2net_tx_can_proceed() returns false, then "sc" and "ret" are uninitialized or maybe we are re-using the data from the previous iteration. I do not know if we can hit this bug in real life but checking the return value is harmless and we may as well silence the static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Jan Kara	d9510a20f8	ocfs2: remove pointless assignment from ocfs2_calc_refcount_meta_credits() The assigned value is never used. Coverity-id 1226847. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
alex chen	1dfeb76847	ocfs2: add a mount option journal_async_commit on ocfs2 filesystem Add a mount option to support JBD2 feature: JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is enabled, the journal commit block can be written to disk without waiting for descriptor blocks, which can improve journal commit performance. This option will enable 'journal_checksum' internally. Using the fs_mark benchmark, using journal_async_commit shows a 50% improvement, the files per second go up from 215.2 to 317.5. test script: fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000 default: FSUse% Count Size Files/sec App Overhead 0 1000 10240 215.2 17878 with journal_async_commit option: FSUse% Count Size Files/sec App Overhead 0 1000 10240 317.5 17881 Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Weiwei Wang <wangww631@huawei.comm> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
alex chen	15eba0fe3e	ocfs2: fix journal commit deadlock in ocfs2_convert_inline_data_to_extents Similar to ocfs2_write_end_nolock(), which is mentioned in commit `136f49b917` ("ocfs2: fix journal commit deadlock"), we should unlock pages before ocfs2_commit_trans() in ocfs2_convert_inline_data_to_extents. Otherwise, it will cause a deadlock with journal commit threads. Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Rickard Strandqvist	95671c63d5	ocfs2: dlm: dlmdomain: remove unused function Remove dlm_joined() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Rickard Strandqvist	7ea62d7031	ocfs2: quota_local: remove unused function Remove ol_dqblk_file_block() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Rickard Strandqvist	d6edc87af8	ocfs2: xattr: remove unused function Remove ocfs2_xattr_bucket_get_val() that is not used anywhere. This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
alex chen	79c83ea1aa	ocfs2: fix snprintf format specifier in dlmdebug.c Use snprintf format specifier "%lu" instead of "%ld" for argument of type 'unsigned long'. Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:29 -08:00
Junxiao Bi	41d6247fdd	ocfs2: fix wrong comment O2NET_CONN_IDLE_DELAY is not defined, connection attempts will not be canceled due to timeout. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Junxiao Bi	696cdf730f	ocfs2: fix uninitialized variable access Variable "why" is not yet initialized at line 615, fix it. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Fabian Frederick	e6e746187d	ocfs2: remove unnecessary else in ocfs2_set_acl() else is unnecessary after return. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Xue jiufei	b934beaf4b	ocfs2/dlm: add missing dlm_lock_put() when recovery master down When the recovery master is down, the owner of $RECOVERY calls dlm_do_local_recovery_cleanup() to prune any $RECOVERY entries for dead nodes. The lock is in the granted list and the refcount must be 2. We should put twice to remove this lock. Otherwise, it will lead to a memory leak. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reported-by: yangwenfang <vicky.yangwenfang@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Lino Sanfilippo	66ba93c0d7	fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask Currently FAN_ONDIR is always set on a mark's ignored mask when the event mask is extended without FAN_MARK_ONDIR being set. This may result in events for directories being ignored unexpectedly for call sequences like fanotify_mark(fd, FAN_MARK_ADD, FAN_OPEN | FAN_ONDIR, AT_FDCWD, "dir"); fanotify_mark(fd, FAN_MARK_ADD, FAN_CLOSE, AT_FDCWD, "dir"); Also FAN_MARK_ONDIR is only honored when adding events to a mark's mask, but not for event removal. Fix both issues by not setting FAN_ONDIR implicitly on the ignore mask any more. Instead treat FAN_ONDIR as any other event flag and require FAN_MARK_ONDIR to be set by the user for both event mask and ignore mask. Furthermore take FAN_MARK_ONDIR into account when set for event removal. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Lino Sanfilippo	d2c1874ce6	fanotify: don't recalculate a marks mask if only the ignored mask changed If removing bits from a mark's ignored mask, the concerning inodes/vfsmounts mask is not affected. So don't recalculate it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Lino Sanfilippo	a118449a77	fanotify: only destroy mark when both mask and ignored_mask are cleared In fanotify_mark_remove_from_mask() a mark is destroyed if only one of both bitmasks (mask or ignored_mask) of a mark is cleared. However the other mask may still be set and contain information that should not be lost. So only destroy a mark if both masks are cleared. Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-10 14:30:28 -08:00
Dan Carpenter	4c21462acc	pnfs: delete an unintended goto There was an extra goto here where it shouldn't be, because of a merge error. Fixes: `e2c63e091e` ('Merge branch 'flexfiles'') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-10 08:41:23 -05:00
Andrew Elble	278702074f	GFS2: Fix crash during ACL deletion in acl max entry check in gfs2_set_acl() Fixes: `e01580bf9e` ("gfs2: use generic posix ACL infrastructure") Reported-by: Eric Meddaugh <etmsys@rit.edu> Tested-by: Eric Meddaugh <etmsys@rit.edu> Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2015-02-10 10:14:56 +00:00
hujianyang	88cff0f0fb	UBIFS: return -EINVAL if log head is empty CS node is recognized as a sign in UBIFS log replay mechanism. Log replaying during mount should find the CS node in the log head at the beginning and then replay the following uncommitted buds. Here is a bug in the log replay path: If the log head, which is indicated by @log_lnum in mst_node, is empty, current UBIFS replays nothing and directly mounts the partition without any warning. This action will put the filesystem in an abnormal state, e.g. space management in the LPT area does not match the real space usage in the main area. We reproduced this bug by fault injection: turn log head leb into all 0xFF. The UBIFS driver mounts the polluted partition normally. But errors occur while running fs_stress on this mount: [89068.055183] UBI error: ubi_io_read: error -74 (ECC error) while reading 59 bytes from PEB 711:33088, read 59 bytes [89068.179877] UBIFS error (pid 10517): ubifs_check_node: bad magic 0x101031, expected 0x6101831 [89068.179882] UBIFS error (pid 10517): ubifs_check_node: bad node at LEB 591:28992 [89068.179891] Not a node, first 24 bytes: [89068.179892] 00000000: 31 10 10 00 37 84 64 04 10 04 00 00 00 00 00 00 20 00 00 00 02 01 00 00 1...7.d......... ....... [89068.180282] UBIFS error (pid 10517): ubifs_read_node: expected node type 2 This patch fixes the problem by checking lnum to guarantee the empty leb is not the log head leb and returning an error if the log head leb is incorrectly empty. After this, we can catch the log head empty error in place. Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2015-02-10 10:06:24 +02:00
Tom Haynes	480486b473	pnfs/flexfiles: Do not dprintk after the free Found by 0-DAY kernel test infrastructure: fs/nfs/flexfilelayout/flexfilelayoutdev.c:520:13-16: ERROR: reference preceded by free on line 518 fs/nfs/flexfilelayout/flexfilelayoutdev.c:520:26-29: ERROR: reference preceded by free on line 518 fs/nfs/flexfilelayout/flexfilelayoutdev.c:520:39-42: ERROR: reference preceded by free on line 518 fs/nfs/flexfilelayout/flexfilelayoutdev.c:521:3-6: ERROR: reference preceded by free on line 518 Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-09 22:34:29 -05:00
Linus Torvalds	072bc448cc	Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "Main changes: - Move efivarfs from the misc filesystem section to pseudo filesystem - Expose firmware platform size in sysfs - Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t. - various cleanups and fixes The biggest risk is the get_memory_map() change, which changes the way that both the arm64 and x86 EFI boot stub build the early memory map. There are no known regressions with it at the moment, BYMMV" * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: Don't look for chosen@0 node on DT platforms firmware: efi: Remove unneeded guid unparse efi/libstub: Call get_memory_map() to obtain map and desc sizes efi: Small leak on error in runtime map code efi: rtc-efi: Mark UIE as unsupported arm64/efi: efistub: Apply __init annotation efi: Expose underlying UEFI firmware platform size to userland efi: Rename efi_guid_unparse to efi_guid_to_str efi: Update the URLs for efibootmgr fs: Make efivarfs a pseudo filesystem, built by default with EFI	2015-02-09 17:53:53 -08:00
Linus Torvalds	23e8fe2e16	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle are: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) rcu: Initialize tiny RCU stall-warning timeouts at boot rcu: Fix RCU CPU stall detection in tiny implementation rcu: Add GP-kthread-starvation checks to CPU stall warnings rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors rcu: Optionally run grace-period kthreads at real-time priority ksoftirqd: Use new cond_resched_rcu_qs() function ksoftirqd: Enable IRQs and call cond_resched() before poking RCU rcutorture: Add more diagnostics in rcu_barrier() test failure case torture: Flag console.log file to prevent holdovers from earlier runs torture: Add "-enable-kvm -soundhw pcspk" to qemu command line rcutorture: Handle different mpstat versions rcutorture: Check from beginning to end of grace period rcu: Remove redundant rcu_batches_completed() declaration rcutorture: Drop rcu_torture_completed() and friends rcu: Provide rcu_batches_completed_sched() for TINY_RCU rcutorture: Use unsigned for Reader Batch computations rcutorture: Make build-output parsing correctly flag RCU's warnings rcu: Make _batches_completed() functions return unsigned long rcutorture: Issue warnings on close calls due to Reader Batch blows documentation: Fix smp typo in memory-barriers.txt ...	2015-02-09 14:28:42 -08:00
Dave Chinner	bad962662d	Merge branch 'xfs-misc-fixes-for-3.20-4' into for-next	2015-02-10 09:24:25 +11:00
Dave Chinner	e9892d3cc8	xfs: only trace buffer items if they exist The commit `2d3d0c5` ("xfs: lobotomise xfs_trans_read_buf_map()") left a landmine in the tracing code: trace_xfs_trans_buf_read() is now called on all buffers that are read through this interface rather than just buffers in transactions. For buffers outside transaction context, bp->b_fspriv is null, and so the buf log item tracing functions cannot be called. This causes a NULL pointer dereference in the trace_xfs_trans_buf_read() function when tracing is turned on. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-10 09:23:40 +11:00
J. Bruce Fields	c23ae60178	nfsd: default NFSv4.2 to on The code seems to work. The protocol looks stable. The kernel's version defaults can be overridden by rpc.nfsd arguments. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-02-09 14:58:50 -05:00
Linus Torvalds	cdecbb336e	Merge git://git.kvack.org/~bcrl/aio-fixes Pull aio nested sleep annotation from Ben LaHaise, * git://git.kvack.org/~bcrl/aio-fixes: aio: annotate aio_read_event_ring for sleep patterns	2015-02-08 18:27:58 -08:00
Linus Torvalds	bdfeb5a104	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "Forrest Liu tracked down a missing blk_finish_plug in the btrfs logging code. This isn't a new bug, and it's hard to hit. But, it's safe enough for inclusion now, and in my for-linus branch" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: add missing blk_finish_plug in btrfs_sync_log()	2015-02-07 11:04:48 -08:00
Trond Myklebust	4ef2e4f84c	NFSv4.1: Fix pnfs_put_lseg races pnfs_layoutreturn_free_lseg_async() can also race with inode put in the general case. We can now fix this, and also simplify the code. Cc: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-05 23:44:18 -05:00
Trond Myklebust	e4af440aaf	NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS If we want to be able to call pnfs_send_layoutreturn() from within the writeback path, we really want it to use GFP_NOFS in order to prevent recursion. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-05 22:16:50 -05:00
Trond Myklebust	5a0ec8acb9	NFSv4.1: Pin the inode and super block in asynchronous layoutreturns If we're sending an asynchronous layoutreturn, then we need to ensure that the inode and the super block remain pinned. Cc: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com>	2015-02-05 22:16:45 -05:00
Trond Myklebust	472e259449	NFSv4.1: Pin the inode and super block in asynchronous layoutcommit If we're sending an asynchronous layoutcommit, then we need to ensure that the inode and the super block remain pinned. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com>	2015-02-05 22:16:33 -05:00
Trond Myklebust	ea7c38fef0	NFSv4: Ensure we reference the inode for return-on-close in delegreturn If we have to do a return-on-close in the delegreturn code, then we must ensure that the inode and super block remain referenced. Cc: Peng Tao <tao.peng@primarydata.com> Cc: stable@vger.kernel.org # 3.17.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com>	2015-02-05 21:31:06 -05:00
Eric Sandeen	01f9882eac	xfs: report proper f_files in statfs if we overshoot imaxpct Normally, a statfs syscall reports m_maxicount as f_files (total file nodes in file system) because it is supposed to be the upper limit for dynamically-allocated inodes. It's possible, however, to overshoot imaxpct / m_maxicount. If this happens, we should report the actual number of allocated inodes, which is contained in sb_icount. Add one more adjustment to the statfs code to make this happen. Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-06 09:53:02 +11:00
David S. Miller	6e03f896b5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endianness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-05 14:33:28 -08:00
Ryusuke Konishi	7ef3ff2fea	nilfs2: fix deadlock of segment constructor over I_SYNC flag Nilfs2 eventually hangs in a stress test with fsstress program. This issue was caused by the following deadlock over I_SYNC flag between nilfs_segctor_thread() and writeback_sb_inodes(): nilfs_segctor_thread() nilfs_segctor_thread_construct() nilfs_segctor_unlock() nilfs_dispose_list() iput() iput_final() evict() inode_wait_for_writeback() * wait for I_SYNC flag writeback_sb_inodes() * set I_SYNC flag on inode->i_state __writeback_single_inode() do_writepages() nilfs_writepages() nilfs_construct_dsync_segment() nilfs_segctor_sync() * wait for completion of segment constructor inode_sync_complete() * clear I_SYNC flag after __writeback_single_inode() completed writeback_sb_inodes() calls do_writepages() for dirty inodes after setting I_SYNC flag on inode->i_state. do_writepages() in turn calls nilfs_writepages(), which can run segment constructor and wait for its completion. On the other hand, segment constructor calls iput(), which can call evict() and wait for the I_SYNC flag on inode_wait_for_writeback(). Since segment constructor doesn't know when I_SYNC will be set, it cannot know whether iput() will block or not unless inode->i_nlink has a non-zero count. We can prevent evict() from being called in iput() by implementing sop->drop_inode(), but it's not preferable to leave inodes with i_nlink == 0 for long periods because it even defers file truncation and inode deallocation. So, this instead resolves the deadlock by calling iput() asynchronously with a workqueue for inodes with i_nlink == 0. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-05 13:35:29 -08:00
Fabian Frederick	6981498d79	udf: remove bool assignment to 0/1 Fix the following coccinelle warnings: fs/udf/inode.c:753:2-13: WARNING: Assignment of bool to 0/1 fs/udf/inode.c:795:2-13: WARNING: Assignment of bool to 0/1 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz>	2015-02-05 16:34:25 +01:00
Fabian Frederick	2b8f942111	udf: use bool for done variable 'done' is only used for true/false in loop. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz>	2015-02-05 16:34:22 +01:00
Christoph Hellwig	8650b8a058	nfsd: pNFS block layout driver Add a small shim between core nfsd and filesystems to translate the somewhat cumbersome pNFS data structures and semantics to something more palatable for Linux filesystems. Thanks to Rick McNeal for the old prototype pNFS blocklayout server code, which gave a lot of inspiration to this version even if no code is left from it. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-05 14:35:18 +01:00
Theodore Ts'o	a26f49926d	ext4: add optimization for the lazytime mount option Add an optimization for the MS_LAZYTIME mount option so that we will opportunistically write out any inodes with the I_DIRTY_TIME flag set in a particular inode table block when we need to update some inode in that inode table block anyway. Also add some temporary code so that we can set the lazytime mount option without needing a modified /sbin/mount program which can set MS_LAZYTIME. We can eventually make this go away once util-linux has added support. Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-05 02:45:00 -05:00
Theodore Ts'o	fe032c422c	vfs: add find_inode_nowait() function Add a new function find_inode_nowait() which is an even more general version of ilookup5_nowait(). It is designed for callers which need very fine grained control over when the function is allowed to block or increment the inode's reference count. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-05 02:45:00 -05:00
Theodore Ts'o	0ae45f63d4	vfs: add support for a lazytime mount option Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated (a) when the inode needs to be updated for some non-time related change, (b) when userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests them via an fsync(2) call. For workloads which feature a large number of random writes to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDDs, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-05 02:45:00 -05:00
Forrest Liu	3da5ab5648	Btrfs: add missing blk_finish_plug in btrfs_sync_log() Add missing blk_finish_plug in btrfs_sync_log() Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-04 18:02:37 -08:00
kbuild test robot	f92090e95c	xfs: xfs_ioctl_setattr_check_projid can be static fs/xfs/xfs_ioctl.c:1146:1: sparse: symbol 'xfs_ioctl_setattr_check_projid' was not declared. Should it be static? Also fix xfs_ioctl_setattr_check_extsize at the same time. Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-05 11:13:21 +11:00
Christoph Hellwig	f8079b850c	xfs: growfs should use synchronous transactions Growfs updates the secondary superblocks using synchronous unlogged buffer writes after committing the updates to the primary superblock. Mark the transaction to the primary superblock as synchronous so that we guarantee it is committed to disk before we update the secondary superblocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-05 11:13:21 +11:00
Linus Torvalds	5ee0e96260	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Three small cifs fixes. One fixes a hang under stress, and the other two are security related" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix MUST SecurityFlags filtering Complete oplock break jobs before closing file handle cifs: use memzero_explicit to clear stack buffer	2015-02-04 10:22:08 -08:00
Trond Myklebust	6ae373394c	NFSv4.1: Ask for no delegation on OPEN if using O_DIRECT If we're using NFSv4.1, then we have the ability to let the server know whether or not we believe that returning a delegation as part of our OPEN request would be useful. The feature needs to be used with care, since the client sending the request doesn't necessarily know how other clients are using that file, and how they may be affected by the delegation. For this reason, our initial use of the feature will be to let the server know when the client believes that handing out a delegation would not be useful. The first application for this function is when opening the file using O_DIRECT. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-04 10:35:32 -05:00
Oleg Drokin	7456a37d55	GFS2: use __vmalloc GFP_NOFS for fs-related allocations. leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so it clearly does not want any shrinker activity within the fs itself. Convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve this goal. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2015-02-04 09:58:41 +00:00
Al Viro	2e90b1c45e	rxrpc: make the users of rxrpc_kernel_send_data() set kvec-backed msg_iter properly Use iov_iter_kvec() there, get rid of set_fs() games - now that rxrpc_send_data() uses iov_iter primitives, it'll handle ITER_KVEC just fine. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:14 -05:00
Dave Chinner	9c9ce763b1	aio: annotate aio_read_event_ring for sleep patterns Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw warnings like the following due to being called from wait_event context: WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810d85a3>] prepare_to_wait_event+0x63/0x110 Modules linked in: CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8 ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8 ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300 Call Trace: [<ffffffff81daf2bd>] dump_stack+0x4c/0x65 [<ffffffff8109beda>] warn_slowpath_common+0x8a/0xc0 [<ffffffff8109bf56>] warn_slowpath_fmt+0x46/0x50 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810d85a3>] ? prepare_to_wait_event+0x63/0x110 [<ffffffff810bdfcf>] __might_sleep+0x7f/0x90 [<ffffffff81db8344>] mutex_lock+0x24/0x45 [<ffffffff81216b7c>] aio_read_events+0x4c/0x290 [<ffffffff81216fac>] read_events+0x1ec/0x220 [<ffffffff810d8650>] ? prepare_to_wait_event+0x110/0x110 [<ffffffff810fdb10>] ? hrtimer_get_res+0x50/0x50 [<ffffffff8121899d>] SyS_io_getevents+0x4d/0xb0 [<ffffffff81dba5a9>] system_call_fastpath+0x12/0x17 ---[ end trace bde69eaf655a4fea ]--- There is not actually a bug here, so annotate the code to tell the debug logic that everything is just fine and not to fire a false positive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>	2015-02-03 19:29:05 -05:00
Javi Merino	adf305f778	sysfs: fix warning when creating a sysfs group without attributes When attempting to create a group without attrs, the warning prints the name of the group. However, the check for name being a NULL pointer is wrong: it uses the pointer to the name when it's NULL. Fix it to use the name if present, otherwise just put an empty string. Cc: Bruno Prémont <bonbons@linux-vserver.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2015-02-03 15:50:31 -08:00
Trond Myklebust	03a9a42a1a	SUNRPC: NULL utsname dereference on NFS umount during namespace cleanup Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, which now apparently happens after the utsname has been freed. Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: stable@vger.kernel.org # 3.18 Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-03 16:40:17 -05:00
Trond Myklebust	e2c63e091e	Merge branch 'flexfiles' * flexfiles: (53 commits) pnfs: lookup new lseg at lseg boundary nfs41: .init_read and .init_write can be called with valid pg_lseg pnfs: Update documentation on the Layout Drivers pnfs/flexfiles: Add the FlexFile Layout Driver nfs: count DIO good bytes correctly with mirroring nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags nfs/flexfiles: send layoutreturn before freeing lseg nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE nfs41: allow async version layoutreturn nfs41: add range to layoutreturn args pnfs: allow LD to ask to resend read through pnfs nfs: add nfs_pgio_current_mirror helper nfs: only reset desc->pg_mirror_idx when mirroring is supported nfs41: add a debug warning if we destroy an unempty layout pnfs: fail comparison when bucket verifier not set nfs: mirroring support for direct io nfs: add mirroring support to pgio layer pnfs: pass ds_commit_idx through the commit path ... Conflicts: fs/nfs/pnfs.c fs/nfs/pnfs.h	2015-02-03 16:01:27 -05:00
Weston Andros Adamson	7c13789e3e	pnfs: lookup new lseg at lseg boundary Before mirroring support was added, the pageio descriptor's pg_lseg was set to null when an RPC was sent. Because of this, pg_init was called at lseg boundaries with pg_lseg = NULL, and it could be set to the new lseg. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:54 -08:00
Peng Tao	cb5d04bc39	nfs41: .init_read and .init_write can be called with valid pg_lseg With pgio refactoring in v3.15, .init_read and .init_write can be called with valid pgio->pg_lseg. file layout was fixed at that time by commit `c6194271f` (pnfs: filelayout: support non page aligned layouts). But the generic helper still needs to be fixed. Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:53 -08:00
Tom Haynes	d67ae825a5	pnfs/flexfiles: Add the FlexFile Layout Driver The flexfile layout is a new layout that extends the file layout. It is currently being drafted as a specification at https://datatracker.ietf.org/doc/draft-ietf-nfsv4-layout-types/ Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Tao Peng <bergwolf@primarydata.com>	2015-02-03 11:06:52 -08:00
Peng Tao	5fadeb47dc	nfs: count DIO good bytes correctly with mirroring When resending to MDS, we might resend multiple mirroring requests to MDS. As a result, nfs_direct_good_bytes() ends up counting bytes multiple times, causing application to get wrong return results in read/write syscalls. Fix it by tracking start of a dreq and checking the range of pgio header. Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:52 -08:00
Peng Tao	aa8a45ee97	nfs41: wait for LAYOUTRETURN before retrying LAYOUTGET Also take care to stop waiting if someone clears retry bit. Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:51 -08:00
Peng Tao	012fa16dca	nfs: add a helper to set NFS_ODIRECT_RESCHED_WRITES to direct writes To allow pnfs LD to ask for direct writes to be resent. Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:51 -08:00
Peng Tao	c829013dca	nfs41: add NFS_LAYOUT_RETRY_LAYOUTGET to layout header flags Use it to indicate that LD wants to retry layoutget. LD can set it whenever it wants the common pnfs code to return and retry pnfs path through a new layout. The bit gets cleared when client does a new layoutget, when client closes the file (ROC case), or when kernel needs to evict the inode (non-ROC case). Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:50 -08:00
Peng Tao	27b6f53987	nfs/flexfiles: send layoutreturn before freeing lseg Otherwise we'll lose error tracking information when encoding layoutreturn. pnfs_put_lseg may be called from rpc callbacks. So we should not call pnfs_send_layoutreturn directly because it can deadlock in the rpc layer. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:50 -08:00
Peng Tao	193e3aa2cc	nfs41: introduce NFS_LAYOUT_RETURN_BEFORE_CLOSE When it is set, generic pnfs would try to send layoutreturn right before last close/delegation_return regardless of whether NFS_LAYOUT_ROC is set. LD can then make sure layoutreturn is always sent rather than being omitted. The difference against NFS_LAYOUT_RETURN is that NFS_LAYOUT_RETURN_BEFORE_CLOSE does not block usage of the layout so LD can set it and expect generic layer to try pnfs path at the same time. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:50 -08:00
Peng Tao	6c16605d6e	nfs41: allow async version layoutreturn Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:49 -08:00
Peng Tao	15eb67c153	nfs41: add range to layoutreturn args So that callers can specify which range to return. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:49 -08:00
Peng Tao	ceb11e13df	pnfs: allow LD to ask to resend read through pnfs If current IO cannot be completed due to some transient errors, LD may want to ask generic layer to resend the request through pnfs again. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:48 -08:00
Peng Tao	48d635f14a	nfs: add nfs_pgio_current_mirror helper Let it return current nfs_pgio_mirror in use depending on pg_mirror_count. For read, we always use pg_mirrors[0], so this effectively gives us freedom to use pg_mirror_idx to track the actual mirror to read from throughout the IO stack. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:48 -08:00
Peng Tao	47af81f295	nfs: only reset desc->pg_mirror_idx when mirroring is supported so that we don't reset desc->pg_mirror_idx for read unnecessarily. Remove WARN_ON_ONCE from __nfs_pageio_add_request to allow LD to set pg_mirror_idx for read where pg_mirror_count is always 1. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:47 -08:00
Peng Tao	566f873763	nfs41: add a debug warning if we destroy an unempty layout So that we can detect the case if some layout segments are still pinned which is surely a bug that we need to fix. Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:47 -08:00
Weston Andros Adamson	80c76fe314	pnfs: fail comparison when bucket verifier not set This skips the WARN_ON_ONCE, but doesn't change behavior (the memcmp would fail). Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:46 -08:00
Weston Andros Adamson	0a00b77b33	nfs: mirroring support for direct io The current mirroring code only notices short writes to the first mirror. This patch keeps per-mirror byte counts and only considers a byte to be written once all mirrors report so. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:46 -08:00
Weston Andros Adamson	a7d42ddb30	nfs: add mirroring support to pgio layer This patch adds mirrored write support to the pgio layer. The default is to use one mirror, but pgio callers may define callbacks to change this to any value up to the (arbitrarily selected) limit of 16. The basic idea is to break out members of nfs_pageio_descriptor that cannot be shared between mirrored DSes and put them in a new structure. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:45 -08:00
Weston Andros Adamson	b57ff1303a	pnfs: pass ds_commit_idx through the commit path Pass ds_commit_idx through the nfs commit path. It's used to select the commit bucket when using pnfs and is ignored when not using pnfs. Several functions had to be changed: nfs_retry_commit, nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout driver .mark_request_commit functions. Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:45 -08:00
Weston Andros Adamson	6cccbb6f52	nfs: rename pgio header ds_idx to ds_commit_idx 'ds_commit_idx' is a better name - it is used to select the right commit bucket for pnfs. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:44 -08:00
Weston Andros Adamson	309a1d65b1	nfs: handle overlapping reqs in lock_and_join This is needed for mirrored DS support, where multiple requests cover the same range. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:44 -08:00
Weston Andros Adamson	180bb5ec06	pnfs: release lseg in pnfs_generic_pg_cleanup This is needed to support mirrored writes - the first write can't just trash the lseg, we need to keep it around until all mirrors have written. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:44 -08:00
Weston Andros Adamson	2176bf4269	nfs: introduce pg_cleanup op for pgio descriptors Add a new operation to nfs_pageio_ops that is called on nfs_pageio_complete. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>	2015-02-03 11:06:43 -08:00
Peng Tao	c220106fb4	nfs/filelayout: use pnfs_error_mark_layout_for_return Instead of calling layoutreturn directly, call pnfs_error_mark_layout_for_return to mark layouts for return and let generic code return layout when layout segments are freed. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com> Conflicts: fs/nfs/filelayout/filelayout.c	2015-02-03 11:06:43 -08:00
Peng Tao	e736a5b98c	nfs41: clear NFS_LAYOUT_RETURN if layoutreturn is sent or failed to send So that pnfs path is not disabled forever. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:42 -08:00
Peng Tao	aa1e0e3a8e	nfs41: send layoutreturn in last put_lseg If current lseg is the last lseg marked with NFS_LSEG_LAYOUTRETURN, send layoutreturn. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:42 -08:00
Peng Tao	ce6ab4f238	nfs41: don't use a layout if it is marked for returning And if we are to return the same type of layouts, don't bother sending more layoutgets. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:41 -08:00
Peng Tao	016256df3a	nfs41: add a helper to mark layout for return It marks all matching layout segments as NFS_LSEG_LAYOUTRETURN, which is an indicator for pnfs_put_lseg() to send layoutreturn, and also prevents pnfs_update_layout() from using the returning segments. Once it is set, it never gets cleared. It also sets the proper io failure bit so that pnfs path can be retried after PNFS_LAYOUTGET_RETRY_TIMEOUT seconds. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:41 -08:00
Peng Tao	f40eb5d044	nfs41: make a helper function to send layoutreturn It allows specifying a different iomode to return. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:41 -08:00
Peng Tao	4579d6b897	nfs41: pass iomode through layoutreturn args So that it is possible to return layouts of a specific iomode. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:40 -08:00
Peng Tao	aabff4ddca	nfs: save server READ/WRITE/COMMIT status Flexfiles layout would want to use them to report DS IO status. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:40 -08:00
Peng Tao	9bf87482dd	nfs41: serialize first layoutget of a file Per RFC 5661 Errata 3208: | A client MAY always forget its layout state and associated | layout stateid at any time (See also section 12.5.5.1). | In such case, the client MUST use a non-layout stateid for the next | LAYOUTGET operation. This will signal the server that the client has | no more layouts on the file and its respective layout state can be | released before issuing a new layout in response to LAYOUTGET. In order to make such a signal unique to server, client needs to serialize all layoutgets using non-layout stateid. We implement this by serializing layoutgets when client has no layout segments at hand. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:39 -08:00
Peng Tao	abb9a0079c	nfs41: close a small race window when adding new layout to global list Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:39 -08:00
Peng Tao	72cff4494e	nfs/flexclient: export pnfs_layoutcommit_inode flexfiles needs to start layoutcommit when necessary Signed-off-by: Peng Tao <tao.peng@primarydata.com>	2015-02-03 11:06:38 -08:00
Peng Tao	36d3e3dcc9	nfs: set hostname when creating nfsv3 ds connection lockd assumes hostname exists otherwise kernel oops. It can be reproduced by following steps: 1. mount flexfile MDS 2. write some files 3. mount DS via nfsv3 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8134f332>] strlen+0x2/0x20 PGD 0 Oops: 0000 [#1] SMP Modules linked in: nfsd(F) nfs_layout_flexfiles(F) rpcsec_gss_krb5(F) auth_rpcgss(F) nfsv4(F) dns_resolver(F) nfsv3(F) nfs_acl(F) nfs(F) lockd(F) sunrpc(F) fscache(F) ebtable_nat(F) nf_conntrack_netbios_ns(F) nf_conntrack_broadcast(F) ipt_MASQUERADE(F) ip6table_nat(F) nf_nat_ipv6(F) ip6table_mangle(F) ip6t_REJECT(F) nf_conntrack_ipv6(F) nf_defrag_ipv6(F) iptable_nat(F) nf_nat_ipv4(F) nf_nat(F) iptable_mangle(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F) ebtable_filter(F) ebtables(F) ip6table_filter(F) ip6_tables(F) bnep(F) snd_ens1371(F) snd_rawmidi(F) snd_ac97_codec(F) btusb(F) ac97_bus(F) snd_seq(F) snd_seq_device(F) snd_pcm(F) ppdev(F) bluetooth(F) 6lowpan_iphc(F) rfkill(F) vmw_balloon(F) snd_timer(F) snd(F) soundcore(F) gameport(F) i2c_piix4(F) e1000(F) vmw_vmci(F) parport_pc(F) parport(F) shpchp(F) uinput(F) xfs(F) libcrc32c(F) vmwgfx(F) ttm(F) drm(F) mptspi(F) scsi_transport_spi(F) mptscsih(F) mptbase(F) i2c_core(F) CPU: 0 PID: 10397 Comm: mount.nfs Tainted: GF 3.14.7-100.pd_client.001.fc16.x86_64 #1 Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 task: ffff880008942600 ti: ffff880007990000 task.ti: ffff880007990000 RIP: 0010:[<ffffffff8134f332>] [<ffffffff8134f332>] strlen+0x2/0x20 RSP: 0018:ffff880007991aa0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff880038d39c20 RCX: 0000000000000004 RDX: 0000000000000006 RSI: 0000000000000010 RDI: 0000000000000000 RBP: ffff880007991b38 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000014600 R11: 0000000000000400 R12: ffffffff81cc8580 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000004 FS: 00007f90cd2ef880(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000001710000 CR4: 00000000001407f0 Stack: ffffffffa045f52c ffff880001782230 ffff880004141e28 0006880007991ac8 ffffffff816dc14b ffff880000000000 ffff880038d39c20 0000000000000010 0000000481cc0006 0000000000000000 ffffffffa0410be8 000000000000c014 Call Trace: [<ffffffffa045f52c>] ? nlmclnt_lookup_host+0x4c/0x2c0 [lockd] [<ffffffff816dc14b>] ? _raw_spin_unlock_bh+0x1b/0x20 [<ffffffffa0410be8>] ? svc_destroy+0xb8/0x140 [sunrpc] [<ffffffffa045c323>] nlmclnt_init+0x53/0xc0 [lockd] [<ffffffffa047d2dc>] ? nfs_get_client+0x1cc/0x340 [nfs] [<ffffffffa047c2e7>] nfs_start_lockd+0xa7/0xd0 [nfs] [<ffffffffa047df71>] nfs_create_server+0x181/0x5c0 [nfs] [<ffffffffa04460f3>] nfs3_create_server+0x13/0x30 [nfsv3] [<ffffffffa048a0bc>] nfs_try_mount+0x21c/0x300 [nfs] [<ffffffff811ca32d>] ? __kmalloc_track_caller+0x1ad/0x240 [<ffffffffa048b677>] ? nfs_fs_mount+0xc37/0xd80 [nfs] [<ffffffffa048ad05>] nfs_fs_mount+0x2c5/0xd80 [nfs] [<ffffffffa048a830>] ? nfs_clone_super+0x140/0x140 [nfs] [<ffffffffa048a240>] ? nfs_clone_sb_security+0x40/0x40 [nfs] [<ffffffff811e7e43>] mount_fs+0x43/0x1b0 [<ffffffff81193100>] ? 
__alloc_percpu+0x10/0x20 [<ffffffff812026e6>] vfs_kern_mount+0x76/0x120 [<ffffffff81204917>] do_mount+0x237/0xa80 Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-03 11:06:38 -08:00
Trond Myklebust	16cecdf620	NFSv4.1/NFSv3: Add pNFS callbacks for nfs3_(read|write|commit)_done() Enable pNFS callbacks to allow flex files to work correctly with a NFSv3-enabled data server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-02-03 11:06:37 -08:00
Peng Tao	46a5ab4754	nfs: allow to specify cred in nfs_initiate_pgio so that flexfile layout client can pass in DS credential instead of using user cred, which will be done in the next patch. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:37 -08:00
Peng Tao	2c4b131dea	nfs4: export nfs4_sequence_done Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:36 -08:00
Peng Tao	cb04ad2a2b	nfs4: pass slot table to nfs40_setup_sequence flexclient needs this as there is no nfs_server to DS connection. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:36 -08:00
Peng Tao	c36aae9ad9	nfs: allow different protocol in nfs_initiate_commit pnfs flexfile layout client may want to use NFSv3 ops rather than the default MDS v4 ops. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:35 -08:00
Tom Haynes	abde71f4d3	pnfs: Add nfs_rpc_ops in calls to nfs_initiate_pgio Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:35 -08:00
Peng Tao	5f01d95394	nfs41: create NFSv3 DS connection if specified Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:35 -08:00
Peng Tao	30626f9c32	nfs41: allow LD to choose DS connection version/minor_version flexfile layout may need to set such when making DS connections. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:34 -08:00
Peng Tao	1a04c6e1a2	nfsv3: introduce nfs3_set_ds_client The flexfiles layout wants to create DS connection over NFSv3. Add nfs3_set_ds_client to allow that to happen. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:34 -08:00
Peng Tao	39280a5ae8	nfs41: move file layout macros to generic pnfs They can be reused by flexfile layout as well. Also add code such that if read fails on one DS and there are other DSes available to use, don't resend through MDS but through pNFS so that client can read from other DSes. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:33 -08:00
Peng Tao	064172f345	nfs41: allow LD to choose DS connection auth flavor flexfile layout may use different auth flavor as specified by MDS. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:33 -08:00
Peng Tao	7405f9e195	nfs41: pull nfs4_ds_connect from file layout to generic pnfs It can be reused by flexfiles layout client. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:32 -08:00
Peng Tao	6b7f3cf963	nfs41: pull decode_ds_addr from file layout to generic pnfs It can be reused by flexfile layout. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:32 -08:00
Peng Tao	875ae0694b	nfs41: pull data server cache from file layout to generic pnfs Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs. They can all be reused by flexfile layout as well. Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>	2015-02-03 11:06:32 -08:00
Tom Haynes	085d1e33a6	pnfs: Do not grab the commit_info lock twice when rescheduling writes Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:31 -08:00
Tom Haynes	f54bcf2ece	pnfs: Prepare for flexfiles by pulling out common code The flexfilelayout driver will share some common code with the filelayout driver. This set of changes refactors that common code out to avoid any module dependencies. Signed-off-by: Tom Haynes <loghyr@primarydata.com>	2015-02-03 11:06:31 -08:00
Gui Hecheng	b76808fc26	btrfs: cleanup init for list in free-space-cache o removed an unnecessary INIT_LIST_HEAD after LIST_HEAD o merged a declare & INIT_LIST_HEAD pair into one LIST_HEAD Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:25:50 -08:00
Shaohua Li	2f0810880f	btrfs: delete chunk allocation attempt when setting block group ro The test below currently fails: mkfs.ext4 -F /dev/sda btrfs-convert /dev/sda mount /dev/sda /mnt btrfs device add -f /dev/sdb /mnt btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt The reason is that some block groups have usage 0, but the whole disk has no free space to allocate a new chunk, so we can't even set such a block group read-only. This patch deletes the chunk allocation when setting block group ro. For META, we already have a reserve. For SYSTEM we don't, so check_system_chunk is still required. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:25:20 -08:00
Naohiro Aota	289454ad26	btrfs: clear bio reference after submit_one_bio() After submit_one_bio(), `bio' can go away. However, submit_extent_page() leaves `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM). This causes an invalid paging request the next time submit_extent_page() is called. I reproduced the ENOMEM case with the following script (needs CONFIG_FAIL_PAGE_ALLOC and CONFIG_FAULT_INJECTION_DEBUG_FS). #!/bin/bash dmesgout=dmesg.txt start=100000 end=300000 step=1000 # btrfs options device=/dev/vdb1 directory=/mnt/btrfs # fault-injection options percent=100 times=3 mkdir -p $directory || exit 1 mount -o compress $device $directory || exit 1 rm -f $directory/file || exit 1 dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1 for interval in `seq $start $step $end`; do dmesg -C echo 1 > /proc/sys/vm/drop_caches sync export FAILCMD_TYPE=fail_page_alloc ./failcmd.sh -p $percent -t $times -i $interval \ --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \ -- \ cat $directory/file > /dev/null dmesg > ${dmesgout} if grep -q BUG: ${dmesgout}; then cat ${dmesgout} exit 1 fi done umount $directory exit 0 Signed-off-by: Naohiro Aota <naota@elisp.net> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:51 -08:00
Filipe Manana	de554a4fa6	Btrfs: fix scrub race leading to use-after-free While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got the following trace: [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50 [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060 [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4 [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs] [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000 [68127.807663] RIP: 0010:[<ffffffff8107da31>] [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP: 0018:ffff8803f097fbb8 EFLAGS: 00010093 [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854 [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001 [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001 [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390 [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00 [68127.807663] FS: 0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000 [68127.807663] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0 [68127.807663] Stack: [68127.807663] ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000 [68127.807663] ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 
000000000034f314 [68127.807663] ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50 [68127.807663] Call Trace: [68127.807663] [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55 [68127.807663] [<ffffffff81072948>] ? __wake_up+0x22/0x4b [68127.807663] [<ffffffff81072948>] __wake_up+0x22/0x4b [68127.807663] [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs] [68127.807663] [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs] [68127.807663] [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28 [68127.807663] [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9 [68127.807663] [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs] [68127.807663] [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs] [68127.807663] [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6 [68127.807663] [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf [68127.807663] [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8 [68127.807663] [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219 [68127.807663] [<ffffffff8105cd75>] kthread+0xdb/0xe3 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0 [68127.807663] [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67 [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75 [68127.807663] RIP [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122 [68127.807663] RSP <ffff8803f097fbb8> [68127.807663] CR2: ffff8803f8947a50 [68127.807663] ---[ end trace d7045aac00a66cd8 ]--- This is due to a race that can happen in a very tiny time window and is illustrated by the following sequence diagram: CPU 1 CPU 2 btrfs_scrub_dev() scrub_bio_end_io_worker() scrub_pending_bio_dec() atomic_dec(&sctx->bios_in_flight) wait sctx->bios_in_flight == 0 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) 
mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) wake_up(&sctx->list_wait) __wake_up() spin_lock_irqsave(&sctx->list_wait->lock, flags) Another variation of this scenario that results in the same use-after-free issue is: CPU 1 CPU 2 btrfs_scrub_dev() wait sctx->bios_in_flight == 0 scrub_bio_end_io_worker() scrub_pending_bio_dec() __wake_up(&sctx->list_wait) spin_lock_irqsave(&sctx->list_wait->lock, flags) default_wake_function() wake up task at CPU 2 wait sctx->workers_pending == 0 mutex_lock(&fs_info->scrub_lock) (...) mutex_lock(&fs_info->scrub_lock) scrub_free_ctx(sctx) kfree(sctx) spin_unlock_irqrestore(&sctx->list_wait->lock, flags) Fix this by holding the scrub lock while doing the wakeup. This isn't a recent regression; the issue has been around since the scrub feature was added (2011, commit `a2de733c78`). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:50 -08:00
Filipe Manana	001a648df4	Btrfs: add missing cleanup on sysfs init failure If we failed during initialization of sysfs, we weren't unregistering the top level btrfs sysfs entry nor the debugfs stuff. Not unregistering the top level sysfs entry makes future attempts to reload the btrfs module impossible and the following is reported in dmesg: [ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0() [ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs' [ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs] [ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G W 3.13.0-fdm-btrfs-next-24+ #7 [ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2246.451312] 0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410 [ 2246.451314] ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290 [ 2246.451316] ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80 [ 2246.451318] Call Trace: [ 2246.451322] [<ffffffff816f1da6>] dump_stack+0x4e/0x68 [ 2246.451324] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [ 2246.451325] [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50 [ 2246.451328] [<ffffffff81367dc5>] ? strlcat+0x65/0x90 (....) This fixes the following change: btrfs: add simple debugfs interface commit `1bae30982b` Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:49 -08:00
Filipe Manana	d4b450cd4b	Btrfs: fix race between transaction commit and empty block group removal Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 
0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? 
__kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpinning the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs(). This race got introduced with the change: Btrfs: remove empty block groups automatically commit `47ab2a6c68` Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:48 -08:00
David Sterba	e3540eab29	btrfs: add more checks to btrfs_read_sys_array Verify that the sys_array has enough bytes to read the next item. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:39 -08:00
David Sterba	1ffb22cf8c	btrfs: cleanup, rename a few variables in btrfs_read_sys_array There's a pointer to buffer, integer offset and offset passed as pointer, try to find matching names for them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:28 -08:00
David Sterba	ce7fca5f57	btrfs: add checks for sys_chunk_array sizes Verify that the size lies within the possible minimum and maximum; the validity of the contents is checked in btrfs_read_sys_array. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:23:43 -08:00
David Sterba	75d6ad382b	btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize I received a few crafted images from Jiri; all of them got through the recently added superblock checks. The lower-bound checks for num_devices and the sector/node sizes were missing, which caused a crash during mount. Tools for symbolic code execution were used to prepare the image contents. Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:20:39 -08:00
chandan r	9cc97d6462	Btrfs: Add code to support file creation time This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <chandanrmail@gmail.com> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 18:39:16 -08:00
David Sterba	a937b9791e	btrfs: kill btrfs_inode_*time helpers They just opencode taking address of the timespec member. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 18:39:07 -08:00
Markus Elfring	648695c748	jfs: Deletion of an unnecessary check before the function call "unload_nls" The unload_nls() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2015-02-02 15:02:34 -06:00
Christoph Hellwig	31ef83dc05	nfsd: add trace events For now just a few simple events to trace the layout stateid lifetime, but these already were enough to find several bugs in the Linux client layout stateid handling. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:44 +01:00
Christoph Hellwig	c5c707f96f	nfsd: implement pNFS layout recalls Add support to issue layout recalls to clients. For now we only support full-file recalls to get a simple and stable implementation. This allows us to embed a nfsd4_callback structure in the layout_state and thus avoid any memory allocations under spinlocks during a recall. For normal use cases that do not intend to share a single file between multiple clients this implementation is fully sufficient. To ensure layouts are recalled on local filesystem access each layout state registers a new FL_LAYOUT lease with the kernel file locking code, which filesystems that support pNFS exports that require recalls need to break on conflicting access patterns. The XDR code is based on the old pNFS server implementation by Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman, Marc Eshel, Mike Sager and Ricardo Labiaga. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:43 +01:00
Christoph Hellwig	9cf514ccfa	nfsd: implement pNFS operations Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage outstanding layouts and devices. Layout management is very straightforward, with a nfs4_layout_stateid structure that extends nfs4_stid to manage layout stateids as the top-level structure. It is linked into the nfs4_file and nfs4_client structures like the other stateids, and contains a linked list of layouts that hang off the stateid. The actual layout operations are implemented in layout drivers that are not part of this commit, but will be added later. The worst part of this commit is the management of the pNFS device IDs, which suffers from a specification that is not sanely implementable due to the fact that the device-IDs are global and not bound to an export, and have a small enough size so that we can't store the fsid portion of a file handle, and must never be reused. As we still do need to perform all export authentication and validation checks on a device ID passed to GETDEVICEINFO we are caught between a rock and a hard place. To work around this issue we add a new hash that maps from a 64-bit integer to a fsid so that we can look up the export to authenticate against it, a 32-bit integer as a generation that we can bump when changing the device, and a currently unused 32-bit integer that could be used in the future to handle more than a single device per export. Entries in this hash table are never deleted as we can't reuse the ids anyway, and would have a severe lifetime problem anyway as Linux export structures are temporary structures that can go away under load. Parts of the XDR data, structures and marshaling/unmarshaling code, as well as many concepts are derived from the old pNFS server implementation from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman, Mike Sager, Ricardo Labiaga and many others. 
Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:42 +01:00
Christoph Hellwig	4d227fca1b	nfsd: make find_any_file available outside nfs4state.c Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:41 +01:00
Christoph Hellwig	e6ba76e194	nfsd: make find/get/put file available outside nfs4state.c Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:41 +01:00
Christoph Hellwig	cd61c52231	nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:40 +01:00
Christoph Hellwig	9558f2500a	nfsd: add fh_fsid_match helper Add a helper to check that the fsid parts of two file handles match. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:39 +01:00
Christoph Hellwig	4d94c2ef20	nfsd: move nfsd_fh_match to nfsfh.h The pnfs code will need it too. Also remove the nfsd_ prefix to match the other filehandle helpers in that file. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:39 +01:00
Christoph Hellwig	11afe9f76e	fs: add FL_LAYOUT lease type This (ab-)uses the file locking code to allow filesystems to recall outstanding pNFS layouts on a file. This new lease type is similar to, but not quite the same as, FL_DELEG. A FL_LAYOUT lease can always be granted, and a per-filesystem lock (XFS iolock for the initial implementation) ensures no FL_LAYOUT leases are granted when we would need to recall them. Also included are changes that allow multiple outstanding read leases of different types on the same file as long as they have a different owner. This wasn't a problem until now as nfsd never set FL_LEASE leases, and no one else used FL_DELEG leases, but given that nfsd will also issue FL_LAYOUT leases we will have to handle it now. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:38 +01:00
Christoph Hellwig	2ab99ee124	fs: track fl_owner for leases Just like for other lock types we should allow different owners to have a read lease on a file. Currently this can't happen, but with the addition of pNFS layout leases we'll need this feature. Signed-off-by: Christoph Hellwig <hch@lst.de>	2015-02-02 18:09:38 +01:00
Al Viro	15d0f5ea34	Make super_blocks and sb_lock static The only user outside of fs/super.c is gone now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-02 10:07:59 -07:00
J. Bruce Fields	a584143b01	Merge branch 'locks-3.20' of git://git.samba.org/jlayton/linux into for-3.20 Christoph's block pnfs patches have some minor dependencies on these lock patches.	2015-02-02 11:29:29 -05:00
Dave Chinner	179073620d	Merge branch 'xfs-ioctl-setattr-cleanup' into for-next	2015-02-02 10:57:30 +11:00
Iustin Pop	9b94fcc398	xfs: fix behaviour of XFS_IOC_FSSETXATTR on directories Currently, the ioctl handling code for XFS_IOC_FSSETXATTR treats all targets as regular files: it refuses to change the extent size if extents are allocated. This is wrong for directories, as there the extent size is only used as a default for children. The patch fixes this issue and improves validation of flag combinations: - only disallow extent size changes after extents have been allocated for regular files - only allow XFS_XFLAG_EXTSIZE for regular files - only allow XFS_XFLAG_EXTSZINHERIT for directories - automatically clear the flags if the extent size is zero Thanks to Dave Chinner for guidance on the proper fix for this issue. [dchinner: ported changes onto cleanup series. Makes changes clear and obvious.] [dchinner: added comments documenting validity checking rules.] Signed-off-by: Iustin Pop <iustin@k1024.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:26:26 +11:00
Dave Chinner	23bd0735cf	xfs: factor projid hint checking out of xfs_ioctl_setattr The project ID change checking is one of the few remaining open coded checks in xfs_ioctl_setattr(). Factor it into a helper function so that the setattr code mostly becomes a flow of check and action helpers, making it easier to read and follow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:22:53 +11:00
Dave Chinner	d4388d3c09	xfs: factor extsize hint checking out of xfs_ioctl_setattr The extent size hint change checking is fairly complex, so isolate that into its own function. This simplifies the logic flow of the setattr code, making it easier to read. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:22:20 +11:00
Dave Chinner	41c145271d	xfs: XFS_IOCTL_SETXATTR can run in user namespaces Currently XFS_IOCTL_SETXATTR will fail if run in a user namespace as it is not allowed to change project IDs. The current code, however, also prevents any other change being made as well, so things like extent size hints cannot be set in user namespaces. This is wrong, so only disallow access to project IDs and related flags from inside the init namespace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:17:51 +11:00
Dave Chinner	fd179b9c3b	xfs: kill xfs_ioctl_setattr behaviour mask Now that there is only one caller of xfs_ioctl_setattr, and it uses all the functionality of the function, we can kill the behaviour mask and start cleaning up the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:16:25 +11:00
Dave Chinner	f96291f6a3	xfs: disaggregate xfs_ioctl_setattr xfs_ioctl_setxflags doesn't need all of the functionality in xfs_ioctl_setattr() and now we have separate helper functions that share the checks and modifications that xfs_ioctl_setxflags requires. Hence disaggregate it from xfs_ioctl_setattr() to allow further work to be done on xfs_ioctl_setattr. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:15:56 +11:00
Dave Chinner	8f3d17ab06	xfs: factor out xfs_ioctl_setattr transaction preamble The setup of the transaction is done after a random smattering of checks and before another bunch of operation-specific validity checks. Pull all the preamble out into a helper function that returns a transaction or error. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:15:35 +11:00
Dave Chinner	29a17c00d4	xfs: separate xflags from xfs_ioctl_setattr The setting of the extended flags is done through two separate interfaces, but they are munged together into xfs_ioctl_setattr and make that function far more complex than it needs to be. Separate it out into a helper function along with all the other common inode changes and transaction manipulations in xfs_ioctl_setattr(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:14:25 +11:00
Dave Chinner	817b6c480e	xfs: FSX_NONBLOCK is not used It is set if the filp is set to non-blocking, but the flag is not used anywhere. Hence we can kill it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:14:04 +11:00
Dave Chinner	3fd1b0d158	Merge branch 'xfs-misc-fixes-for-3.20-3' into for-next	2015-02-02 10:03:18 +11:00
Christoph Hellwig	2ba6623702	xfs: don't allocate an ioend for direct I/O completions Back in the days when the direct I/O ->end_io callback could be called from interrupt context for AIO we needed a structure to hand off to the workqueue, and reused the ioend structure for this purpose. These days ->end_io is always called from user or workqueue context, which allows us to avoid this memory allocation and simplify the code significantly. [dchinner: removed now unused xfs_finish_ioend_sync() function after Brian Foster did an initial review. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 10:02:09 +11:00
Wang, Yalin	f3d215526e	xfs: change kmem_free to use generic kvfree() Change kmem_free to use kvfree() generic function, remove the duplicated code. Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 09:54:18 +11:00
Christoph Hellwig	8add71ca3f	xfs: factor out a xfs_update_prealloc_flags() helper This logic is duplicated in xfs_file_fallocate and xfs_ioc_space, and we'll need another copy of it for pNFS block support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2015-02-02 09:53:56 +11:00
Dan Carpenter	c7c545d4a3	NFS: a couple off by ones These tests are off by one because if len == sizeof(nfs_export_path) then we have truncated the name. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-30 20:43:30 -05:00
Omar Sandoval	3a7ed3fff3	nfs: prevent truncate on active swapfile Most filesystems prevent truncation of an active swapfile by way of inode_newsize_ok, called from inode_change_ok. NFS doesn't call either from nfs_setattr, presumably because most of these checks are expected to be done server-side. However, the IS_SWAPFILE check can only be done client-side, and truncating a swapfile can't possibly be good. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-30 20:43:29 -05:00
Jeff Layton	6ffa30d3f7	nfs: don't call blocking operations while !TASK_RUNNING Bruce reported seeing this warning pop when mounting using v4.1: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90 Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78 0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000 Call Trace: [<ffffffff8186ac78>] dump_stack+0x4c/0x65 [<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0 [<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90 [<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0 [<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430 [<ffffffff810d941e>] ? 
groups_alloc+0x3e/0x130 [<ffffffff810d941e>] groups_alloc+0x3e/0x130 [<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc] [<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc] [<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc] [<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc] [<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4] [<ffffffff810ff970>] ? wait_woken+0xc0/0xc0 [<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4] [<ffffffff810d45bf>] kthread+0x11f/0x140 [<ffffffff810ea815>] ? local_clock+0x15/0x30 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 [<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0 [<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250 ---[ end trace 675220a11e30f4f2 ]--- nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE, which is just wrong. Fix that by finishing the wait immediately if we've found that the list has something on it. Also, we don't expect this kthread to accept signals, so we should be using a TASK_UNINTERRUPTIBLE sleep instead. That, however, opens us up to hung task warnings from the watchdog, so have the schedule_timeout wake up every 60s if there's no callback activity. Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-30 20:39:50 -05:00
Linus Torvalds	bc208e0ee0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "We have one more fix for btrfs in my for-linus branch - this was a bug in the new raid5/6 scrubbing support" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix raid56 scrub failed in xfstests btrfs/072	2015-01-30 14:25:52 -08:00
Linus Torvalds	92ef9ce301	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and UDF fix from Jan Kara: "A fix for UDF to properly free preallocated blocks and a fix for quota so that Q_GETQUOTA quotactl reports correct numbers for XFS filesystem (and similarly Q_XGETQUOTA quotactl works properly for other filesystems)" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units udf: Release preallocation on last writeable close	2015-01-30 13:46:04 -08:00
Jan Kara	b10a08194c	quota: Store maximum space limit in bytes Currently the maximum space limit a quota format supports is expressed in blocks; however, since we store space limits in bytes, this is somewhat confusing. So store the maximum limit in bytes as well. Also rename the field to match the new unit, and the related inode field to match the new naming scheme. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:51:21 +01:00
Jan Kara	aaa3daed15	quota: Remove quota_on_meta callback There are no more users for quota_on_meta callback. Just remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:51:10 +01:00
Jan Kara	664dbd5fe5	ocfs2: Use generic helpers for quotaon and quotaoff Ocfs2 can just use the generic helpers provided by quota code for turning quotas on and off when quota files are stored as system inodes. The only difference is the feature test in ocfs2_quota_on() and that is covered by dquot_quota_enable() checking whether usage tracking is enabled (which can happen only if the filesystem has the quota feature set). Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:50:58 +01:00
Jan Kara	1fa5efe362	ext4: Use generic helpers for quotaon and quotaoff Ext4 can just use the generic helpers provided by quota code for turning quotas on and off when quota files are stored as system inodes. The only difference is the feature test in ext4_quota_on_sysfile() but the same is achieved in dquot_quota_enable() by checking whether usage tracking for the corresponding quota type is enabled (which can happen only if quota feature is set). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:50:42 +01:00
Jan Kara	3e2af67e66	quota: Add ->quota_{enable,disable} callbacks for VFS quotas Add functions which translate ->quota_enable / ->quota_disable calls into appropriate changes in VFS quota. This will enable filesystems supporting VFS quota files in system inodes to be controlled via Q_XQUOTA[ON\|OFF] quotactls for better userspace compatibility. Also provide a vector for quotactl using these functions which can be used by filesystems with quota files stored in hidden system files. Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:50:32 +01:00
Jan Kara	d3b8632485	quota: Wire up ->quota_{enable,disable} callbacks into Q_QUOTA{ON,OFF} Make Q_QUOTAON / Q_QUOTAOFF quotactl call ->quota_enable / ->quota_disable callback when provided. To match current behavior of ocfs2 & ext4 we make these quotactls turn on / off quota enforcement for appropriate quota type. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:50:16 +01:00
Jan Kara	38e478c448	quota: Split ->set_xstate callback into two Split ->set_xstate callback into two callbacks - one for turning quotas on (->quota_enable) and one for turning quotas off (->quota_disable). That way we don't have to pass quotactl command into the callback which seems cleaner. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-30 12:49:40 +01:00
Jan Kara	1cd6b7be92	Merge branch 'for_linus' into for_next	2015-01-30 10:16:33 +01:00
Linus Torvalds	353a0c6fcc	Merge tag 'nfs-for-3.19-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - Stable fix for a NFSv4.1 Oops on mount - Stable fix for an O_DIRECT deadlock condition - Fix an issue with submounted volumes and fake duplicate inode numbers" * tag 'nfs-for-3.19-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix use of nfs_attr_use_mounted_on_fileid() NFSv4.1: Fix an Oops in nfs41_walk_client_list nfs: fix dio deadlock when O_DIRECT flag is flipped	2015-01-29 15:18:12 -08:00
Ingo Molnar	3c01b74e81	Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi Pull EFI updates from Matt Fleming: " - Move efivarfs from the misc filesystem section to pseudo filesystem, since that's a more logical and accurate place - Leif Lindholm - Update efibootmgr URL in Kconfig help - Peter Jones - Improve accuracy of EFI guid function names - Borislav Petkov - Expose firmware platform size in sysfs for the benefit of EFI boot loader installers and other utilities - Steve McIntyre - Cleanup __init annotations for arm64/efi code - Ard Biesheuvel - Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel - Fix memory leak in error code path of runtime map code - Dan Carpenter - Improve robustness of get_memory_map() by removing assumptions on the size of efi_memory_desc_t (which could change in future spec versions) and querying the firmware instead of guessing about the memmap size - Ard Biesheuvel - Remove superfluous guid unparse calls - Ivan Khoronzhuk - Delete unnecessary chosen@0 DT node FDT code since was duplicated from code in drivers/of and is entirely unnecessary - Leif Lindholm There's nothing super scary, mainly cleanups, and a merge from Ricardo who kindly picked up some patches from the linux-efi mailing list while I was out on annual leave in December. Perhaps the biggest risk is the get_memory_map() change from Ard, which changes the way that both the arm64 and x86 EFI boot stub build the early memory map. It would be good to have it bake in linux-next for a while. " Signed-off-by: Ingo Molnar <mingo@kernel.org>	2015-01-29 19:16:40 +01:00
Christoph Hellwig	dbe4e192a2	fs: add vfs_iter_{read,write} helpers Simple helpers that pass an arbitrary iov_iter to filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-29 00:13:13 -05:00
Al Viro	05afcb77eb	new helper: iov_iter_bvec() similar to iov_iter_kvec(), for ITER_BVEC ones Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-29 00:13:11 -05:00
Artem Bityutskiy	fb4325a3d9	UBIFS: add a couple of extra asserts ... to catch possible memory corruptions. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2015-01-28 16:09:32 +01:00
Subodh Nijsure	fee1756d80	UBIFS: add ubifs_err() to print error reason This patch adds ubifs_err() output to some error paths to tell the user what's going on. Artem: improve the messages, rename too long variable Signed-off-by: Subodh Nijsure <snijsure@grid-net.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Shelton <ben.shelton@ni.com> Acked-by: Brad Mouring <brad.mouring@ni.com> Acked-by: Terry Wilcox <terry.wilcox@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2015-01-28 16:09:01 +01:00
Subodh Nijsure	d7f0b70d30	UBIFS: Add security.* XATTR support for the UBIFS Artem: rename static functions so that they do not use the "ubifs_" prefix - we only use this prefix for non-static functions. Artem: remove few junk white-space changes in file.c Signed-off-by: Subodh Nijsure <snijsure@grid-net.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Shelton <ben.shelton@ni.com> Acked-by: Brad Mouring <brad.mouring@ni.com> Acked-by: Terry Wilcox <terry.wilcox@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2015-01-28 16:08:54 +01:00
Subodh Nijsure	895d9db253	UBIFS: Add xattr support for symlinks Artem: rename the __ubifs_setxattr() functions to just 'setxattr()'. Signed-off-by: Subodh Nijsure <snijsure@grid-net.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Shelton <ben.shelton@ni.com> Acked-by: Terry Wilcox <terry.wilcox@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2015-01-28 16:08:46 +01:00
Jan Kara	14bf61ffe6	quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which tracks space limits and usage in 512-byte blocks. However VFS quotas track usage in bytes (as some filesystems require that) and we need to somehow pass this information. Up to now it wasn't a problem because we didn't do any unit conversion (thus VFS quota routines happily stuck number of bytes into d_bcount field of struct fs_disk_quota). Only if you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA / Q_SETQUOTA for XFS quotas), you got bogus results. Hardly anyone tried this but reportedly some Samba users hit the problem in practice. So when we want the interfaces to be compatible we need to fix this. We bite the bullet and define another quota structure used for passing information from/to ->get_dqblk()/->set_dqblk(). It's somewhat sad we have to have more conversion routines in fs/quota/quota.c and another copying of quota structure slows down getting of quota information by about 2% but it seems cleaner than overloading e.g. units of d_bcount to bytes. CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-28 09:01:40 +01:00
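The unit mismatch described in the quota commit above can be sketched in user-space C. This is only an illustration, not the kernel's conversion code: the helper names and the `BB_SHIFT` constant are assumptions made for the example, and only the 512-byte block size comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* 512-byte blocks, as stated in the commit message (2^9 = 512). */
#define BB_SHIFT 9
#define BB_SIZE  (1ULL << BB_SHIFT)

/* Hypothetical helper: bytes -> 512-byte blocks, rounding up. */
static inline uint64_t bytes_to_bb(uint64_t bytes)
{
	return (bytes >> BB_SHIFT) + ((bytes & (BB_SIZE - 1)) != 0);
}

/* Hypothetical helper: 512-byte blocks -> bytes. */
static inline uint64_t bb_to_bytes(uint64_t blocks)
{
	return blocks << BB_SHIFT;
}

/*
 * What a reader sees when a byte count is stored, unconverted, in a
 * field that the reader interprets as a count of 512-byte blocks.
 */
static inline uint64_t misread_bytes_as_blocks(uint64_t stored_bytes)
{
	return bb_to_bytes(stored_bytes);
}

/* Small self-check: 1 MiB of usage is 2048 blocks, not 1048576. */
static inline void quota_units_demo(void)
{
	uint64_t usage = 1ULL << 20;

	assert(bytes_to_bb(usage) == 2048);
	assert(misread_bytes_as_blocks(usage) == usage * BB_SIZE);
}
```

With these assumed helpers, a usage of 1 MiB that is written as a raw byte count and read back as blocks appears 512 times too large, which matches the "bogus results" the commit message mentions.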
Jan Kara	b07ef35244	udf: Release preallocation on last writeable close Commit `6fb1ca92a6` "udf: Fix race between write(2) and close(2)" changed the condition when preallocation is released. The idea was that we don't want to release the preallocation for an inode on close when there are other writeable file descriptors for the inode. However the condition was written in the opposite way so we released preallocation only if there were other writeable file descriptors. Fix the problem by changing the condition properly. CC: stable@vger.kernel.org Fixes: `6fb1ca92a6` Reported-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz>	2015-01-28 09:00:40 +01:00
David S. Miller	95f873f2ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: arch/arm/boot/dts/imx6sx-sdb.dts net/sched/cls_bpf.c Two simple sets of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-27 16:59:56 -08:00
Gui Hecheng	063c54dccd	btrfs: fix raid56 scrub failed in xfstests btrfs/072 The xfstests btrfs/072 reports uncorrectable read errors in dmesg, because scrub forgets to use commit_root for parity scrub routine and scrub attempts to scrub those extents items whose contents are not fully on disk. To fix it, we just add the @search_commit_root flag back. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-27 15:26:16 -08:00
Niklas Cassel	7a1ceba071	cifs: fix MUST SecurityFlags filtering If CONFIG_CIFS_WEAK_PW_HASH is not set, CIFSSEC_MUST_LANMAN and CIFSSEC_MUST_PLNTXT is defined as 0. When setting new SecurityFlags without any MUST flags, your flags would be overwritten with CIFSSEC_MUST_LANMAN (0). Signed-off-by: Niklas Cassel <niklass@axis.com> Signed-off-by: Steve French <steve.french@primarydata.com>	2015-01-26 19:38:26 -06:00
Bob Peterson	45094a58b1	GFS2: Eliminate a nonsense goto This patch just removes a goto that did nothing. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2015-01-26 20:58:54 +00:00
Jan Mrazek	04ecddb73d	ext4: change to use setup_timer() instead of init_timer() Signed-off-by: Jan Mrazek <email@honzamrazek.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2015-01-26 14:42:31 -05:00
Al Viro	87b95ce096	switch the IO-triggering parts of umount to fs_pin Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:17:29 -05:00
Al Viro	59eda0e07f	new fs_pin killing logics Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:17:28 -05:00
Al Viro	fdab684d72	allow attaching fs_pin to a group not associated with some superblock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:17:28 -05:00
Linus Torvalds	360f54796e	dcache: let the dentry count go down to zero without taking d_lock We can be more aggressive about this, if we are clever and careful. This is subtle. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:29 -05:00
Al Viro	32426f6653	pull bumping refcount into ->kill() there will be one more change of ->kill() calling conventions; this isn't final. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:29 -05:00
Al Viro	9e251d0204	kill pin_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:28 -05:00
Al Viro	d443b9fd56	gut proc_register() a bit There are only 3 callers and quite a bit of that thing is executed exactly in one of those. Just lift it there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:26 -05:00
Al Viro	d6cb125b99	kill d_validate() no users left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:26 -05:00
Al Viro	5e993e2534	ncpfs: get rid of d_validate() nonsense What we want is to have non-counting references to children in pagecache of parent directory, and avoid picking them after a child has been freed. Fine, so let's just have ->d_prune() clear parent's inode "has directory contents in page cache" flag. That way we don't need ->d_fsdata for storing offsets, so we can use it as a quick and dirty "is it referenced from page cache" flag. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 23:16:26 -05:00
Linus Torvalds	80a755545d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of fixes - deadlock in CIFS and build breakage in cris serial driver (resurfaced f_dentry in there)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Convert file->f_dentry->d_inode to file_inode() fix deadlock in cifs_ioctl_clone()	2015-01-25 17:27:18 -08:00
Al Viro	77b3da6e32	new primitive: debugfs_create_automount() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:53 -05:00
Al Viro	5233e31191	debugfs: split end_creating() into success and failure cases ... and don't bother with dput(dentry) in the former and with dget(dentry) preceding all its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:53 -05:00
Al Viro	edac65eaf8	debugfs: take mode-dependent parts of debugfs_get_inode() into callers ... and trim the arguments list Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:53 -05:00
Al Viro	680b302409	fold debugfs_mknod() into callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:52 -05:00
Al Viro	3473cde565	fold debugfs_create() into caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:52 -05:00
Al Viro	02538a75ba	fold debugfs_mkdir() into caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:51 -05:00
Al Viro	160f7592f2	debugfs_mknod(): get rid of useless arguments dev is always zero, dir was only used to get its ->i_sb, which is equal to ->d_sb of dentry... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:51 -05:00
Al Viro	9b73fab01b	fold debugfs_link() into caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:50 -05:00
Al Viro	ad5abd5ba8	debugfs: kill __create_file() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 16:52:31 -05:00
Al Viro	190afd81e4	debugfs: split the beginning and the end of __create_file() off Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 15:23:31 -05:00
Al Viro	e09ddf36dd	debugfs_{mkdir,create,link}(): get rid of redundant argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-25 15:23:31 -05:00
Trond Myklebust	cf6726e2ee	NFSv4: Deal with atomic upgrades of an existing delegation Ensure that we deal correctly with the case where the server sends us a newer instance of the same delegation. If the stateids match, but the sequence numbers differ, then treat the new delegation as if it were an atomic upgrade. Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>	2015-01-24 18:46:51 -05:00
Trond Myklebust	89f0ff386c	NFSv4.1: Replace usage of nfs_client->cl_addr in encode_create_session Replace the current code with something that is a little closer to what net/sunrpc/auth_unix.c uses. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:50 -05:00
Trond Myklebust	40dd4b7aee	NFSv4.1: Optimise layout return-on-close Optimise the layout return on close code by ensuring that 1) Add a check for whether we hold a layout before taking any spinlocks 2) Only take the spin lock once 3) Use nfs_state->state to speed up open file checks Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:48 -05:00
Trond Myklebust	b4019c0e21	NFSv4.1: Allow parallel LOCK/LOCKU calls Note, however, that we still serialise on the open stateid if the lock stateid is unconfirmed. Hopefully that will not prove too much of a burden for first time locks; it should leave the ability to parallelise OPENs unchanged, since they no longer call the serialisation primitives. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:48 -05:00
Trond Myklebust	c69899a17c	NFSv4: Update of VFS byte range lock must be atomic with the stateid update Ensure that we test the lock stateid remained unchanged while we were updating the VFS tracking of the byte range lock. Have the process replay the lock to the server if we detect that was not the case. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:47 -05:00
Trond Myklebust	425c1d4e5b	NFSv4: Fix lock on-wire reordering issues This patch ensures that the server cannot reorder our LOCK/LOCKU requests if they are sent in parallel on the wire. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:47 -05:00
Trond Myklebust	6b447539aa	NFSv4: Always do open_to_lock_owner if the lock stateid is uninitialised The original text in RFC3530 was terribly confusing since it conflated lockowners and lock stateids. RFC3530bis clarifies that you must use open_to_lock_owner when there is no lock state for that file+lockowner combination. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 18:46:46 -05:00
Trond Myklebust	39071e6fff	NFSv4: Fix atomicity problems with lock stateid updates When we update the lock stateid, we really do need to ensure that this is done under the state->state_lock, and that we are indeed only updating confirmed locks with a newer version of the same stateid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-24 17:27:41 -05:00
Trond Myklebust	63f5f796af	NFSv4.1: Allow parallel OPEN/OPEN_DOWNGRADE/CLOSE Remove the serialisation of OPEN/OPEN_DOWNGRADE and CLOSE calls for the case of NFSv4.1 and newer. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-23 23:06:45 -05:00
Trond Myklebust	a67964197c	NFSv4: Check for NULL argument in nfs_*_seqid() functions Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-23 23:06:45 -05:00
Trond Myklebust	badc76dd0d	NFSv4: Convert nfs_alloc_seqid() to return an ERR_PTR() if allocation fails When we relax the sequencing on the NFSv4.1 OPEN/CLOSE code, we will want to use the value NULL to indicate that no sequencing is needed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-23 23:06:44 -05:00
Trond Myklebust	f95549cf24	NFSv4: More CLOSE/OPEN races If an OPEN RPC call races with a CLOSE or OPEN_DOWNGRADE so that it updates the nfs_state structure before the CLOSE/OPEN_DOWNGRADE has a chance to do so, then we know that the state->flags need to be recalculated from scratch. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-23 23:06:44 -05:00
Linus Torvalds	c4e00f1d31	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have a few fixes in my for-linus branch. Qu Wenruo's batch fix a regression between some our merge window pull and the inode_cache feature. The rest are smaller bugs" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock. btrfs: Fix the bug that fs_info->pending_changes is never cleared. btrfs: fix state->private cast on 32 bit machines Btrfs: fix race deleting block group from space_info->ro_bgs list Btrfs: fix incorrect freeing in scrub_stripe btrfs: sync ioctl, handle errors after transaction start	2015-01-24 14:31:27 +12:00
Trond Myklebust	566fcec60b	NFSv4: Fix an atomicity problem in CLOSE If we are to remove the serialisation of OPEN/CLOSE, then we need to ensure that the stateid sent as part of a CLOSE operation does not change after we test the state in nfs4_close_prepare. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2015-01-23 19:22:39 -05:00
Christoph Hellwig	4c94e13e9c	nfsd: factor out a helper to decode nfstime4 values Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-01-23 10:29:13 -05:00
Jeff Layton	3c5199143b	sunrpc/lockd: fix references to the BKL The BKL is completely out of the picture in the lockd and sunrpc code these days. Update the antiquated comments that refer to it. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-01-23 10:29:12 -05:00
J. Bruce Fields	bbc7f33ac6	nfsd: fix year-2038 nfs4 state problem Someone with a weird time_t happened to notice this, it shouldn't really manifest till 2038. It may not be our only year-2038 problem. Reported-by: Aaron Pace <Aaron.Pace@alcatel-lucent.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2015-01-23 10:29:11 -05:00
Paul Moore	55422d0bd2	audit: replace getname()/putname() hacks with reference counters In order to ensure that filenames are not released before the audit subsystem is done with the strings there are a number of hacks built into the fs and audit subsystems around getname() and putname(). To say these hacks are "ugly" would be kind. This patch removes the filename hackery in favor of a more conventional reference count based approach. The diffstat below tells most of the story; lots of audit/fs specific code is replaced with a traditional reference count based approach that is easily understood, even by those not familiar with the audit and/or fs subsystems. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:23:58 -05:00
Paul Moore	fd3522fdc8	audit: enable filename recording via getname_kernel() Enable recording of filenames in getname_kernel() and remove the kludgy workaround in __audit_inode() now that we have proper filename logging for kernel users. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:23:52 -05:00
Al Viro	cbaab2db91	simpler calling conventions for filename_mountpoint() a) make it accept ERR_PTR() as filename (and return its PTR_ERR() in that case) b) make it putname() the sucker in the end otherwise simplifies life for callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:21 -05:00
Paul Moore	5168910413	fs: create proper filename objects using getname_kernel() There are several areas in the kernel that create temporary filename objects using the following pattern: int func(const char *name) { struct filename file = { .name = name }; ... return 0; } ... which for the most part works okay, but it causes havoc within the audit subsystem as the filename object does not persist beyond the lifetime of the function. This patch converts all of these temporary filename objects into proper filename objects using getname_kernel() and putname() which ensure that the filename object persists until the audit subsystem is finished with it. Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca for helping resolve a difficult kernel panic on boot related to a use-after-free problem in kern_path_create(); the thread can be seen at the link below: * https://lkml.org/lkml/2015/1/20/710 This patch includes code that was either based on, or directly written by Al in the above thread. CC: viro@zeniv.linux.org.uk CC: linux@roeck-us.net CC: sd@queasysnail.net CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:20 -05:00
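The lifetime problem in the commit above (a stack-allocated name object that a second subsystem still wants to read after the function returns) is commonly solved with a heap object plus a reference count. The following user-space sketch shows that general idea only. It is not the kernel's `getname_kernel()`/`putname()` implementation: `struct name_ref` and all three helpers are hypothetical names, and the counter is a plain `int` rather than anything atomic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted filename object. */
struct name_ref {
	int refcnt;
	char name[];	/* copy of the string, owned by this object */
};

/* Allocate an object holding a private copy of s; starts with one ref. */
static inline struct name_ref *name_ref_new(const char *s)
{
	size_t len = strlen(s) + 1;
	struct name_ref *n = malloc(sizeof(*n) + len);

	if (!n)
		return NULL;
	n->refcnt = 1;
	memcpy(n->name, s, len);
	return n;
}

/* A second user (an audit-like logger, say) takes its own reference. */
static inline void name_ref_hold(struct name_ref *n)
{
	assert(n && n->refcnt > 0);
	n->refcnt++;
}

/* Drop one reference; returns 1 if this call freed the object. */
static inline int name_ref_put(struct name_ref *n)
{
	assert(n && n->refcnt > 0);
	if (--n->refcnt > 0)
		return 0;
	free(n);
	return 1;
}
```

In this sketch the creator can call `name_ref_put()` as soon as it is done, and the string stays valid until the last holder has dropped its reference, which is the property the stack-based pattern lacks.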
Paul Moore	0851854972	fs: rework getname_kernel to handle up to PATH_MAX sized filenames In preparation for expanded use in the kernel, make getname_kernel() more useful by allowing it to handle any legal filename length. Thanks to Guenter Roeck for his suggestion to substitute memcpy() for strlcpy(). CC: linux@roeck-us.net CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:20 -05:00
Al Viro	fa14a0b8d2	cut down the number of do_path_lookup() callers ... and don't bother with new struct filename when we already have one Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-01-23 00:22:19 -05:00
Jens Axboe	b520252aa2	fs: make inode_to_bdi() handle NULL inode Running a heavy fs workload, I ran into a situation where we pass down a page for writeback/swap that doesn't have an inode mapping: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: [<ffffffff8119589f>] inode_to_bdi+0xf/0x50 PGD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: wl(O) tun cfg80211 btusb joydev hid_apple hid_generic usbhid hid bcm5974 usb_storage nouveau snd_hda_codec_hdmi snd_hda_codec_cirrus snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel kvm_intel snd_hda_controller snd_hda_codec kvm snd_hwdep snd_pcm applesmc input_polldev snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_timer snd_seq_device snd xhci_pci xhci_hcd ttm thunderbolt soundcore apple_gmux apple_bl bluetooth binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat [last unloaded: wl] CPU: 4 PID: 50 Comm: kswapd0 Tainted: G U O 3.19.0-rc5+ #60 Hardware name: Apple Inc. MacBookPro11,3/Mac-2BD1B31983FE1663, BIOS MBP112.88Z.0138.B02.1310181745 10/18/2013 task: ffff880462e917f0 ti: ffff880462edc000 task.ti: ffff880462edc000 RIP: 0010:[<ffffffff8119589f>] [<ffffffff8119589f>] inode_to_bdi+0xf/0x50 RSP: 0000:ffff880462edf8e8 EFLAGS: 00010282 RAX: ffffffff81c4cd80 RBX: ffffea0001b3abc0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880462edf8f8 R08: 00000000001e8500 R09: ffff880460f7cb68 R10: ffff880462edfa00 R11: 0000000000000101 R12: 0000000000000000 R13: ffffffff81c4cd98 R14: 0000000000000000 R15: ffff880460f7c9c0 FS: 0000000000000000(0000) GS:ffff88047f300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 00000002b6341000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffffea0001b3abc0 ffffffff81c4cd80 ffff880462edf948 ffffffff811244aa ffffffff811565b0 ffff880460f7c9c0 
ffff880462edf948 ffffea0001b3abc0 0000000000000001 ffff880462edfb40 ffff880008b999c0 ffff880460f7c9c0 Call Trace: [<ffffffff811244aa>] __test_set_page_writeback+0x3a/0x170 [<ffffffff811565b0>] ? SyS_madvise+0x790/0x790 [<ffffffff81156bb6>] __swap_writepage+0x216/0x280 [<ffffffff8133d592>] ? radix_tree_insert+0x32/0xe0 [<ffffffff81157741>] ? swap_info_get+0x61/0xf0 [<ffffffff81159bfc>] ? page_swapcount+0x4c/0x60 [<ffffffff81156c4d>] swap_writepage+0x2d/0x50 [<ffffffff81131658>] shmem_writepage+0x198/0x2c0 [<ffffffff8112cae4>] shrink_page_list+0x464/0xa00 [<ffffffff8112d666>] shrink_inactive_list+0x266/0x500 [<ffffffff8112e215>] shrink_lruvec+0x5d5/0x720 [<ffffffff8112e3bb>] shrink_zone+0x5b/0x190 [<ffffffff8112ee3f>] kswapd+0x48f/0x8d0 [<ffffffff8112e9b0>] ? try_to_free_pages+0x4c0/0x4c0 [<ffffffff81067be2>] kthread+0xd2/0xf0 [<ffffffff81060000>] ? workqueue_congested+0x30/0x80 [<ffffffff81067b10>] ? kthread_create_on_node+0x180/0x180 [<ffffffff816b556c>] ret_from_fork+0x7c/0xb0 [<ffffffff81067b10>] ? kthread_create_on_node+0x180/0x180 Code: 00 48 c7 c7 8d 8d a4 81 e8 3f 62 eb ff e9 fc fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 5f 28 48 89 df e8 15 f8 00 00 85 c0 75 11 48 8b 83 d8 00 RIP [<ffffffff8119589f>] inode_to_bdi+0xf/0x50 RSP <ffff880462edf8e8> CR2: 0000000000000028 ---[ end trace eb0e21aa7dad3ddf ]--- Handle this in inode_to_bdi() by punting it to noop_backing_dev_info, if mapping->host is NULL. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-22 08:13:17 -07:00
chandan	95449a1626	Btrfs: insert_new_root: Fix lock type of the extent buffer. btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has been taken. Hence assign the appropriate value to path->locks[level]. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-22 05:42:23 -08:00
Anand Jain	78f55e5e1f	Btrfs: fix unused members in struct btrfs_root There isn't any real use of following members of struct btrfs_root so delete them. struct kobject root_kobj; struct completion kobj_unregister; Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:22:37 -08:00
Yang Dongsheng	0ee13fe28c	btrfs: qgroup: move WARN_ON() to the correct location. In function qgroup_excl_accounting(), we need to WARN when qg->excl is less than what we want to free, same to child and parents. But currently, for parent qgroup, the WARN_ON() is located after freeing qg->excl. It will WARN out even if we free it normally. This patch moves this WARN_ON() before freeing qg->excl. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:22:37 -08:00
Liu Bo	26455d3318	Btrfs: cleanup unused run_most "run_most" is not used anymore. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:22:16 -08:00
Zhao Lei	570193454a	Rename all ref_count to refs in struct refs is better than ref_count to record a struct's ref count. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:50 -08:00
Zhao Lei	ffe2d2034b	Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply So we can check raid56 with: (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) instead of long: (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)) Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:49 -08:00
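The combined-mask idiom from the commit above can be sketched as follows. The `DEMO_` names and the bit positions are assumptions chosen for this example, not definitions copied from the btrfs headers; the point is only that one named mask replaces the repeated two-flag expression.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; the values are assumed for this sketch. */
#define DEMO_BLOCK_GROUP_RAID1	(1ULL << 4)
#define DEMO_BLOCK_GROUP_RAID5	(1ULL << 7)
#define DEMO_BLOCK_GROUP_RAID6	(1ULL << 8)

/* One mask covering both parity profiles. */
#define DEMO_BLOCK_GROUP_RAID56_MASK \
	(DEMO_BLOCK_GROUP_RAID5 | DEMO_BLOCK_GROUP_RAID6)

/* Returns nonzero when the type has either the RAID5 or RAID6 bit set. */
static inline int demo_is_raid56(uint64_t map_type)
{
	return (map_type & DEMO_BLOCK_GROUP_RAID56_MASK) != 0;
}

/* Small self-check of the helper against the long-hand expression. */
static inline void demo_raid56_selfcheck(void)
{
	uint64_t t = DEMO_BLOCK_GROUP_RAID6;

	assert(demo_is_raid56(t) ==
	       ((t & (DEMO_BLOCK_GROUP_RAID5 | DEMO_BLOCK_GROUP_RAID6)) != 0));
}
```

Callers then test `demo_is_raid56(type)` (or the mask directly) instead of spelling out both flags at every site.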
Zhao Lei	10f1190016	Btrfs: Include map_type in raid_bio Current code uses many kinds of "clever" ways to determine the operation target's raid type, such as: raid_map != NULL or raid_map[MAX_NR] == RAID[56]_Q_STRIPE To make the code easy to maintain, this patch puts the raid type into bbio, so we can always get the raid type from bbio in a "stupid" way. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:49 -08:00
Zhao Lei	be50a8ddaa	Btrfs: Simplify scrub_setup_recheck_block()'s argument scrub_setup_recheck_block() has many arguments, but most of them can be derived from one of them, so we can remove them to make the code clean. Some other cleanups for that function are also included in this patch. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:49 -08:00
Zhao Lei	b968fed1c3	Btrfs: Combine per-page recover in dev-replace and scrub The code is similar; combine them to make the code clean and easy to maintain. Some missing conditions are also completed as a benefit of this combination. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:49 -08:00
Zhao Lei	8d6738c1bd	Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block() In current code, code of finding-right-mirror and writing-to-target are mixed in logic, if we find a right mirror but failed in writing to target, it will treat as "hadn't found right block", and fill the target with sblock_bad. Actually, "failed in writing to target" does not mean "source block is wrong", this patch separate above two condition in logic, and do some cleanup to make code clean. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:49 -08:00
Zhao Lei	dc5f7a3bd8	Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block() Use break instead of useless loop should be more suitable in this case. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	7653947fe6	Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event() Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	09dd7a01c3	Btrfs: Cleanup btrfs_bio_counter_inc_blocked() 1: Remove no-need DEFINE_WAIT(wait) 2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition 3: Use while loop instead of goto Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	114ab50d82	Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replace It is always 1 in this place, because !1 case was already jumped out in previous code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	b25c94c580	Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON() if (sctx->is_dev_replace && !is_metadata && !have_csum) { ... goto nodatasum_case; } ... nodatasum_case: WARN_ON(sctx->is_dev_replace); In above code, nodatasum_case marker should be moved after WARN_ON(). Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	6e9606d2a2	Btrfs: add ref_count and free function for btrfs_bio 1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
Zhao Lei	8e5cfb55d3	Btrfs: Make raid_map array be inlined in btrfs_bio structure It can make code more simple and clear, we need not care about free bbio and raid_map together. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:47 -08:00
Zhao Lei	cc7539edea	Btrfs: sort raid_map before adding tgtdev stripes It can avoid complex calculation of real stripes in sort, moreover, we can clean up code of sorting tgtdev_map because it will be in order initially. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:47 -08:00
Zhao Lei	e34c330d63	Btrfs: fix a out-of-bound access of raid_map We add the number of stripes on target devices into bbio->num_stripes if we are under device replacement, and we just sort the raid_map of those stripes that not on the target devices, so if when we need real raid_map, we need skip the stripes on the target devices. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:47 -08:00
Filipe Manana	df8d116ffa	Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs If we have an inode with a large number of hard links, some of which may be extrefs, turn a regular ref into an extref, fsync the inode and then replay the fsync log (after a crash/reboot), we can end up with an fsync log that makes the replay code always fail with -EOVERFLOW when processing the inode's references. This is easy to reproduce with the test case I made for xfstests. Its steps are the following: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Now remove one link, add a new one with a new name, add another new one with # the same name as the one we just removed and fsync the inode. rm -f $SCRATCH_MNT/foo_link_0001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001 rm -f $SCRATCH_MNT/foo_link_0002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002 ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Check that the number of hard links is correct, we are able to remove all # the hard links and read the file's data. This is just to verify we don't # get stale file handle errors (due to dangling directory index entries that # point to inodes that no longer exist). 
echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)" [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing" for ((i = 1; i <= 3003; i++)); do name=foo_link_`printf "%04d" $i` if [ $i -eq 2 ]; then [ -f $SCRATCH_MNT/$name ] && echo "Link $name found" else [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing" fi done rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo rm -f $SCRATCH_MNT/foo status=0 exit The fix is simply to correct the overflow condition when overwriting a reference item because it was wrong, trying to increase the item in the fs/subvol tree by an impossible amount. Also ensure that we don't insert one normal ref and one ext ref for the same dentry - this happened because processing a dir index entry from the parent in the log happened when the normal ref item was full, which made the logic insert an extref and later when the normal ref had enough room, it would be inserted again when processing the ref item from the child inode in the log. This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. This test only passes if the previous patch titled "Btrfs: fix fsync when extend references are added to an inode" is applied too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:05 -08:00
Filipe Manana	2c2c452b0c	Btrfs: fix fsync when extend references are added to an inode If we added an extended reference to an inode and fsync'ed it, the log replay code would make our inode have an incorrect link count, which was lower than the expected/correct count. This resulted in stale directory index entries after deleting some of the hard links, and any access to the dangling directory entries resulted in -ESTALE errors because the entries pointed to inode items that don't exist anymore. This is easy to reproduce with the test case I made for xfstests, and the bulk of that test is: _scratch_mkfs "-O extref" >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 3001 hard links. This number is large enough to # make btrfs start using extrefs at some point even if the fs has the maximum # possible leaf/node size (64Kb). echo "hello world" > $SCRATCH_MNT/foo for i in `seq 1 3000`; do ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i` done # Make sure all metadata and data are durably persisted. sync # Add one more link to the inode that ends up being a btrfs extref and fsync # the inode. ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now after the fsync log replay btrfs left our inode with a wrong link count N, # which was smaller than the correct link count M (N < M). # So after removing N hard links, the remaining M - N directory entries were # still visible to user space but it was impossible to do anything with them # because they pointed to an inode that didn't exist anymore. This resulted in # stale file handle errors (-ESTALE) when accessing those dentries for example. 
# # So remove all hard links except the first one and then attempt to read the # file, to verify we don't get an -ESTALE error when accessing the inode. # # The btrfs fsck tool also detected the incorrect inode link count and it # reported an error message like the following: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref # # The fstests framework automatically calls fsck after a test is run, so we # don't need to call fsck explicitly here. rm -f $SCRATCH_MNT/foo_link_* cat $SCRATCH_MNT/foo status=0 exit So make sure an fsync always flushes the delayed inode item, so that the fsync log contains it (needed in order to trigger the link count fixup code) and fix the extref counting function, which always return -ENOENT to its caller (and made it assume there were always 0 extrefs). This issue has been present since the introduction of the extrefs feature (2012). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
Filipe Manana	d36808e0d4	Btrfs: fix directory inconsistency after fsync log replay If we have an inode (file) with a link count greater than 1, remove one of its hard links, fsync the inode, power fail/crash and then replay the fsync log on the next mount, we end up getting the parent directory's metadata inconsistent - its i_size still reflects the deleted hard link and has dangling index entries (with no matching inode reference entries). This prevents the directory from ever being deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE even if all of its children inodes are deleted, and the dangling index entries can never be removed (as they point to an inode that does not exist anymore). This is easy to reproduce with the following excerpt from the test case for xfstests that I just made: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create a test file with 2 hard links in the same directory. mkdir -p $SCRATCH_MNT/a/b echo "hello world" > $SCRATCH_MNT/a/b/foo ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar # Make sure all metadata and data are durably persisted. sync # Now remove one of the hard links and fsync the inode. rm -f $SCRATCH_MNT/a/b/bar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo # Simulate a crash/power loss. This makes sure the next mount # will see an fsync log and will replay that log. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Remove the last hard link of the file and attempt to remove its parent # directory - this failed in btrfs because the fsync log and replay code # didn't decrement the parent directory's i_size and left dangling directory # index entries - this made the btrfs rmdir implementation always fail with # the error -ENOTEMPTY. 
# # The dangling directory index entries were visible to user space, but it was # impossible to do anything on them (unlink, open, read, write, stat, etc) # because the inode they pointed to did not exist anymore. # # The parent directory's metadata inconsistency (stale index entries) was # also detected by btrfs' fsck tool, which is run automatically by the fstests # framework when the test finishes. The error message reported by fsck was: # # root 5 inode 259 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref # rm -f $SCRATCH_MNT/a/b/* rmdir $SCRATCH_MNT/a/b rmdir $SCRATCH_MNT/a To fix this just make sure that after an unlink, if the inode is fsync'ed, the parent inode is fully logged in the fsync log. A test case for xfstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
Filipe Manana	6219872dc6	Btrfs: lookup for block group only if needed when freeing a tree block Very often our extent buffer's header generation doesn't match the current transaction's id or it is also referenced by other trees (snapshots), so we don't need the corresponding block group cache object. Therefore only search for it if we are going to use it, so we avoid an unnecessary search in the block groups rbtree (and acquiring and releasing its spinlock). Freeing a tree block is performed when COWing or deleting a node/leaf, which implies we are holding the node/leaf's parent node lock, therefore reducing the amount of time spent when freeing a tree block helps reducing the amount of time we are holding the parent node's lock. For example, for a run of xfstests/generic/083, the block group cache object was needed only 682 times for a total of 226691 calls to free a tree block. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
David Sterba	730a78c741	btrfs: remove a no-op unfreeze superblock callback Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
David Sterba	9ee49a047d	btrfs: switch extent_state state to unsigned Currently there's a 4B hole in the structure between refs and state and there are only 16 bits used so we can make it unsigned. This will get a better packing and may save some stack space for local variables. The size of extent_state gets reduced by 8B and there are usually a lot of slab objects. struct extent_state { u64 start; /* 0 8 */ u64 end; /* 8 8 */ struct rb_node rb_node; /* 16 24 */ wait_queue_head_t wq; /* 40 24 */ /* --- cacheline 1 boundary (64 bytes) --- */ atomic_t refs; /* 64 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int state; /* 72 8 */ u64 private; /* 80 8 */ /* size: 88, cachelines: 2, members: 7 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
David Sterba	5efa0490cc	btrfs: set proper message level for skinny metadata This has been confusing people for too long, the message is really just informative. CC: <stable@vger.kernel.org> # 3.10+ Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:03 -08:00
David Sterba	f0954c6637	btrfs: update message levels after checksum errors The errors are worth noting and might get missed with INFO level. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:03 -08:00
David Sterba	aa8ee31209	btrfs: update message levels during failed mount All error conditions from open_ctree shall be ERR. Warning would suggest that something's wrong and we can continue. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:03 -08:00
David Sterba	68b663d13c	btrfs: update message levels for errors Several messages that point to some internal problem, level INFO is wrong here. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:03 -08:00
Filipe Manana	a8df6fe666	Btrfs: fix setup_leaf_for_split() to avoid leaf corruption We were incorrectly detecting when the target key didn't exist anymore after releasing the path and re-searching the tree. This could make us split or duplicate (btrfs_split_item() and btrfs_duplicate_item() are its only callers at the moment) an item when we should not. For the case of duplicating an item, we currently only duplicate checksum items (csum tree) and file extent items (fs/subvol trees). For the checksum items we end up overriding the item completely, but for file extent items we update only some of their fields in the copy (done in __btrfs_drop_extents), which means we can end up having a logical corruption for some values. Also for the case where we duplicate a file extent item it will make us produce a leaf with a wrong key order, as btrfs_duplicate_item() advances us to the next slot and then its caller sets a smaller key on the new item at that slot (like in __btrfs_drop_extents() e.g.). Alternatively if the tree search in setup_leaf_for_split() leaves with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end up accessing beyond the leaf's end (when we check if the item's size has changed) and make our caller insert an item at the invalid slot btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory access if the leaf is full or nearly full. This issue has been present since the introduction of this function in 2009: Btrfs: Add btrfs_duplicate_item commit `ad48fd7546` Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:03 -08:00
Chris Mason	57bbddd7fb	Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus	2015-01-21 17:49:35 -08:00

39705 Commits