Commit Graph

41037 Commits

Author SHA1 Message Date
Fabian Frederick a612543fd1 ocfs2: use swap() in swap_refcount_rec()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Fabian Frederick 2a28f98c49 ocfs2: use swap() in dx_leaf_sort_swap()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Joseph Qi ae1f081467 ocfs2: fix wrong check in ocfs2_direct_IO_get_blocks
contig_blocks gotten from ocfs2_extent_map_get_blocks cannot be compared
with clusters_to_alloc. So convert it to clusters first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:40 -07:00
Xue jiufei 74e364ad1b ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()
ocfs2_abort_trigger() use bh->b_assoc_map to get sb.  But there's no
function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
dereference while calling this function.  We can get sb from
bh->b_bdev->bd_super instead of b_assoc_map.

[akpm@linux-foundation.org: update comment, per Joseph]
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
alex chen fce56d841e ocfs2: o2net: should remove debugfs in o2net_init() out branch
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
WeiWei Wang fa5a0eb3b0 ocfs2: remove OCFS2_IOCB_SEM lock type in direct io
In ocfs2 direct read/write, OCFS2_IOCB_SEM lock type is used to protect
inode->i_alloc_sem rw semaphore lock in the earlier kernel version.
However, in the latest kernel, inode->i_alloc_sem rw semaphore lock is not
used at all, so OCFS2_IOCB_SEM lock type needs to be removed.

Signed-off-by: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Joseph Qi e272e7f0fb ocfs2: do not BUG if jbd2_journal_dirty_metadata fails
jbd2_journal_dirty_metadata may fail.  Currently it cannot take care of
non zero return value and just BUG in ocfs2_journal_dirty.  This patch is
aborting the handle and journal instead of BUG.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Xue jiufei 099768b0c6 ocfs2: remove BUG_ON(!empty_extent) in __ocfs2_rotate_tree_left()
ocfs2_rotate_tree_left() calls __ocfs2_rotate_tree_left() for left
rotation while non-rightmost path containing an empty extent in the leaf
block.  __ocfs2_rotate_tree_left() returns -EAGAIN if right subtree having an
empty extent and pass the empty_extent_path to caller.  The caller
ocfs2_rotate_tree_left() will restart rotation from the returned path.

It will trigger the BUG_ON(!ocfs2_is_empty_extent) when the et on disk
is as follows:

eb0 is the leaf block of path(say path_a) passed to
ocfs2_rotate_tree_left, which has an empty rec[0].

eb1 is the leaf block of path(say path_b) that just right to path_a, which
has no empty record.

eb2 is the leaf block of path(say path_c) that just right to path_b, which
has an empty rec[0].  And path_c is also the rightmost path.

Now we want to remove the empty rec[0] in eb0:

ocfs2_rotate_tree_left:
  -> call __ocfs2_rotate_tree_left with path_a as its input *path*
    -> call ocfs2_rotate_subtree_left with path_a as its input
       *left_path* and path_b as its input *right_path*. it will move
       rec[0] in eb1 to eb0, and rec[0] in eb0 is not empty now.
    -> continue to call ocfs2_rotate_subtree_left with path_b as its
       input *left_path* and path_c as its input *right_path*, and
       return -EAGAIN because eb2 has an empty rec[0]
  -> call __ocfs2_rotate_tree_left with path_c as it input, rotate all
     records in eb2 to left and return 0.
  -> call __ocfs2_rotate_tree_left with path_a as its input, and triggers
     the BUG_ON(!ocfs2_is_empty_extent) as the rec[0] in eb0 is not empty.

So the BUG_ON() should be removed and return 0 if rec[0] is no longer an
empty extent.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Xue jiufei 9f99ad0861 ocfs2: return error when ocfs2_figure_merge_contig_type() fails
ocfs2_figure_merge_contig_type() still returns CONTIG_NONE when some error
occurs which will cause an unpredictable error.  So return a proper errno
when ocfs2_figure_merge_contig_type() fails.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Joseph Qi 345dc681bd ocfs2/dlm: cleanup unused function __dlm_wait_on_lockres_flags_set
__dlm_wait_on_lockres_flags_set() is declared but not implemented and
used.  So remove it.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Daeseok Youn 2e17315242 ocfs2: use retval instead of status for checking error
The use of 'status' in __ocfs2_add_entry() can return wrong value.

Some functions' return value in __ocfs2_add_entry(), i.e
ocfs2_journal_access_di() is saved to 'status'.  But 'status' is not
used in 'bail' label for returning result of __ocfs2_add_entry().

So use retval instead of status.

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Joseph Qi cf1776a9e8 ocfs2: fix a tiny race when truncate dio orohaned entry
Once dio crashed it will leave an entry in orphan dir.  And orphan scan
will take care of the clean up.  There is a tiny race case that the same
entry will be truncated twice and then trigger the BUG in
ocfs2_del_inode_from_orphan.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Andrew Morton e327284abb ocfs2: remove __mlog_cpu_guess
raw_smp_processor_id() is the means of avoiding the runtime preemptibility
check.

[akpm@linux-foundation.org: fix printk warning]
Cc: Joe Perches <joe@perches.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Joe Perches 7c2bd2f930 ocfs2: reduce object size of mlog uses
Using a function for __mlog_printk instead of a macro reduces the object
size of built-in.o by about 190KB, or ~18% overall (x86-64 defconfig
with all ocfs2 options)

  $ size fs/ocfs2/built-in.o*
     text    data     bss     dec     hex filename
   870954  118471  134408 1123833  1125f9 fs/ocfs2/built-in.o,new
  1064081  118071  134408 1316560  1416d0 fs/ocfs2/built-in.o.old

Miscellanea:

 - Move the used-once __mlog_cpu_guess statement expression macro to the
   masklog.c file above the use in __mlog_printk function

 - Simplify the mlog macro moving the and/or logic and level code into
   __mlog_printk

[akpm@linux-foundation.org: export __mlog_printk() to other ocfs2 modules]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Fabian Frederick 5286d20c4e configfs: unexport/make static config_item_init()
config_item_init() is only used in item.c

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Pekka Enberg b0cbeee72f NTFS: use kvfree() in ntfs_free()
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:38 -07:00
Linus Torvalds e0456717e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add TX fast path in mac80211, from Johannes Berg.

 2) Add TSO/GRO support to ibmveth, from Thomas Falcon

 3) Move away from cached routes in ipv6, just like ipv4, from Martin
    KaFai Lau.

 4) Lots of new rhashtable tests, from Thomas Graf.

 5) Run ingress qdisc lockless, from Alexei Starovoitov.

 6) Allow servers to fetch TCP packet headers for SYN packets of new
    connections, for fingerprinting.  From Eric Dumazet.

 7) Add mode parameter to pktgen, for testing receive.  From Alexei
    Starovoitov.

 8) Cache access optimizations via simplifications of build_skb(), from
    Alexander Duyck.

 9) Move page frag allocator under mm/, also from Alexander.

10) Add xmit_more support to hv_netvsc, from KY Srinivasan.

11) Add a counter guard in case we try to perform endless reclassify
    loops in the packet scheduler.

12) Extern flow dissector to be programmable and use it in new "Flower"
    classifier.  From Jiri Pirko.

13) AF_PACKET fanout rollover fixes, performance improvements, and new
    statistics.  From Willem de Bruijn.

14) Add netdev driver for GENEVE tunnels, from John W Linville.

15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.

16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.

17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
    Borkmann.

18) Add tail call support to BPF, from Alexei Starovoitov.

19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.

20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.

21) Favor even port numbers for allocation to connect() requests, and
    odd port numbers for bind(0), in an effort to help avoid
    ip_local_port_range exhaustion.  From Eric Dumazet.

22) Add Cavium ThunderX driver, from Sunil Goutham.

23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
    from Alexei Starovoitov.

24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.

25) Double TCP Small Queues default to 256K to accomodate situations
    like the XEN driver and wireless aggregation.  From Wei Liu.

26) Add more entropy inputs to flow dissector, from Tom Herbert.

27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
    Jonassen.

28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.

29) Track and act upon link status of ipv4 route nexthops, from Andy
    Gospodarek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
  bridge: vlan: flush the dynamically learned entries on port vlan delete
  bridge: multicast: add a comment to br_port_state_selection about blocking state
  net: inet_diag: export IPV6_V6ONLY sockopt
  stmmac: troubleshoot unexpected bits in des0 & des1
  net: ipv4 sysctl option to ignore routes when nexthop link is down
  net: track link-status of ipv4 nexthops
  net: switchdev: ignore unsupported bridge flags
  net: Cavium: Fix MAC address setting in shutdown state
  drivers: net: xgene: fix for ACPI support without ACPI
  ip: report the original address of ICMP messages
  net/mlx5e: Prefetch skb data on RX
  net/mlx5e: Pop cq outside mlx5e_get_cqe
  net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
  net/mlx5e: Remove extra spaces
  net/mlx5e: Avoid TX CQE generation if more xmit packets expected
  net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
  net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
  net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
  net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
  net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
  ...
2015-06-24 16:49:49 -07:00
Dan Carpenter 5a5003df98 btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
There is a cut and paste error so instead of freeing "head_ref", we free
"ref" twice.

Fixes: 3368d001ba ('btrfs: qgroup: Record possible quota-related extent for qgroup.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-24 12:28:03 -07:00
Linus Torvalds 54245ed870 MTD fixes for 4.2
JFFS2
 
  * fix a theoretical unbalanced locking issue; the lock handling was a bit
    unclean, but AFAICT, it didn't actually lead to real deadlocks
 
 NAND
 
  * brcmnand driver: new driver supporting NAND controller found originally on
    Broadcom STB SoCs (BCM7xxx), but now also found on BCM63xxx, iProc (e.g.,
    Cygnus, BCM5301x), BCM3xxx, and more
 
  * Begin factoring out BBT code so it can be shared between traditional
    (parallel) NAND drivers and upcoming SPI NAND drivers (WIP)
 
  * Add common DT-based init support, so nand_base can pick up some flash
    properties automatically, using established common NAND DT properties
 
  * mxc_nand: support 8-bit ECC
 
  * pxa3xx_nand:
    - fix build for ARM64
    - use a jiffies-based timeout
 
 SPI NOR
 
  * Add a few new IDs
 
  * Clear out some unnecessary entries
 
  * Make sure SECT_4K flags are correct for all (?) entries
 
 Core
 
  * Fix mtd->usecount race conditions (BUG_ON())
 
  * Switch to modern PM ops
 
 Other
 
  * CFI: save code space by de-inlining large functions
 
  * Clean up some partition parser selection code across several drivers
 
  * Various miscellaneous changes, mostly minor
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJViZdKAAoJEFySrpd9RFgtaqoQAINkYUdxa8mOlmXQSIhWz19K
 5FJqJN+/lgrYmIGI1SzVy/TTB/V4hWH+9h1snU98ToaHFACqbKKeo6FLs6GRd+WJ
 adR9H4PUjtAZOt8trKzzygJnqvRPDWnn6H+0urxFv7kpyPTEorifiiI6+o3AM995
 vh+YsLJNFYHucIzi4TVkgwssLYi8I9TOb35pDnW2uT/ikDi+4Uu+Ocq6u7SuMPXV
 Zch6DcmBhtTQ9ghBwF/dMMAhyyMLyY4q0+CEzmJGxNU+g2o1njOTUDBv+lLqxxMB
 fUXMAPXIT3ndhWjMmzklG7AjOorbg44aG2UlqJWXx00VYJKNf+pljpgdHOXKiuWo
 xYkqOYnEM8gZ4qYNInKbbv34Q2+EjCxg+aAWxh+yRCJrfnF3p5/QSMVL2P2JNrc3
 Kg8W7pW42xOeGxTYpydBtvpq+41qRtdWMgNp47PHvKF0mhpeSCCvRf/YfU4cNcfv
 WfQzBLy/d7RMVFyvACdvzerF/fh9YX3yxV0B0LstytCSZO13617F6HHpm1pYfP0v
 d9GpLpdiFktoG/AXpPzlSl992ALf0SQaedamuxApfvLVymDkG9xoO0P5p6SbOiJ0
 QnBvEMkswEf7ExPzr5SYufSE93wikAEsAfjsLOHo+FQQ57eaRgpCOZHHgfuwcTra
 xUr6u2Rq9iDenhVYDqJU
 =JN5N
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20150623' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "JFFS2:
   - fix a theoretical unbalanced locking issue; the lock handling was a
     bit unclean, but AFAICT, it didn't actually lead to real deadlocks

  NAND:
   - brcmnand driver: new driver supporting NAND controller found
     originally on Broadcom STB SoCs (BCM7xxx), but now also found on
     BCM63xxx, iProc (e.g., Cygnus, BCM5301x), BCM3xxx, and more

   - begin factoring out BBT code so it can be shared between
     traditional (parallel) NAND drivers and upcoming SPI NAND drivers
     (WIP)

   - add common DT-based init support, so nand_base can pick up some
     flash properties automatically, using established common NAND DT
     properties

   - mxc_nand: support 8-bit ECC

   - pxa3xx_nand:
     * fix build for ARM64
     * use a jiffies-based timeout

  SPI NOR:
   - add a few new IDs

   - clear out some unnecessary entries

   - make sure SECT_4K flags are correct for all (?) entries

  Core:
   - fix mtd->usecount race conditions (BUG_ON())

   - switch to modern PM ops

  Other:
   - CFI: save code space by de-inlining large functions

   - clean up some partition parser selection code across several
     drivers

   - various miscellaneous changes, mostly minor"

* tag 'for-linus-20150623' of git://git.infradead.org/linux-mtd: (57 commits)
  mtd: docg3: Fix kasprintf() usage
  mtd: docg3: Don't leak docg3->bbt in error path
  mtd: nandsim: Fix kasprintf() usage
  mtd: cs553x_nand: Fix kasprintf() usage
  mtd: r852: Fix device_create_file() usage
  mtd: brcmnand: drop unnecessary initialization
  mtd: propagate error codes from add_mtd_device()
  mtd: diskonchip: remove two-phase partitioning / registration
  mtd: dc21285: use raw spinlock functions for nw_gpio_lock
  mtd: chips: fixup dependencies, to prevent build error
  mtd: cfi_cmdset_0002: Initialize datum before calling map_word_load_partial
  mtd: cfi: deinline large functions
  mtd: lantiq-flash: use default partition parsers
  mtd: plat_nand: use default partition probe
  mtd: nand: correct indentation within conditional
  mtd: remove incorrect file name
  mtd: blktrans: use better error code for unimplemented ioctl()
  mtd: maps: Spelling s/reseved/reserved/
  mtd: blktrans: change blktrans_getgeo return value
  mtd: mxc_nand: generate nand_ecclayout for 8 bit ECC
  ...
2015-06-23 17:38:39 -07:00
Theodore Ts'o a2fd66d069 ext4: set lazytime on remount if MS_LAZYTIME is set by mount
Newer versions of mount parse the lazytime feature and pass it to the
mount system call via the flags field in the mount system call,
removing the lazytime string from the mount options list.  So we need
to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
order for this flag to be set on a remount.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-23 11:03:54 -04:00
Chris Mason c40b7b064f Merge branch 'sysfs-fsdevices-4.2-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into anand 2015-06-23 05:34:39 -07:00
Linus Torvalds 43224b96af Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration
     disabled at runtime.

   - Optimize run time overhead of hrtimer interrupt by making the clock
     offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some
     problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by
     introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they
  depend on them plus changes all over the tree to cleanup API changes
  and redundant code, which got copied all over the place.  The y2038
  changes touch s390 to remove the last non 2038 safe code related to
  boot/persistant clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...
2015-06-22 18:57:44 -07:00
Dave Chinner de50e16ffa Merge branch 'xfs-misc-fixes-for-4.2-3' into for-next 2015-06-23 08:49:01 +10:00
Dave Chinner 3d238b7e0e Merge branch 'xfs-freelist-cleanup' into for-next 2015-06-23 08:48:43 +10:00
Brian Foster f66bf04269 xfs: don't truncate attribute extents if no extents exist
The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that
attribute blocks exist to invalidate. It is possible to have an
attribute fork without extents, however. Consider the case where the
attribute fork is created towards the beginning of xfs_attr_set() but
some part of the subsequent attribute set fails.

If an inode in such a state hits xfs_attr_inactive(), it eventually
calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a
filesystem corruption warning, returns an error that bubbles back up to
xfs_attr_inactive(), and leads to destruction of the in-core attribute
fork without an on-disk reset. If the inode happens to make it back
through xfs_inactive() in this state (e.g., via a concurrent bulkstat
that cycles the inode from the reclaim state and releases it), i_afp
might not exist when xfs_bmapi_read() is called and causes a NULL
dereference panic.

A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB)
reproduces these problems. The behavior is a regression caused by:

6dfe5a0 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind

... which removed logic that avoided the attribute extent truncate when
no extents exist. Restore this logic to ensure the attribute fork is
destroyed and reset correctly if it exists without any allocated
extents.

cc: stable@vger.kernel.org # 3.12 to 4.0.x
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-23 08:47:20 +10:00
Linus Torvalds 1bf7067c6e Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes are:

   - 'qspinlock' support, enabled on x86: queued spinlocks - these are
     now the spinlock variant used by x86 as they outperform ticket
     spinlocks in every category.  (Waiman Long)

   - 'pvqspinlock' support on x86: paravirtualized variant of queued
     spinlocks.  (Waiman Long, Peter Zijlstra)

   - 'qrwlock' support, enabled on x86: queued rwlocks.  Similar to
     queued spinlocks, they are now the variant used by x86:

       CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
       CONFIG_QUEUED_SPINLOCKS=y
       CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
       CONFIG_QUEUED_RWLOCKS=y

   - various lockdep fixlets

   - various locking primitives cleanups, further WRITE_ONCE()
     propagation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  locking/lockdep: Remove hard coded array size dependency
  locking/qrwlock: Don't contend with readers when setting _QW_WAITING
  lockdep: Do not break user-visible string
  locking/arch: Rename set_mb() to smp_store_mb()
  locking/arch: Add WRITE_ONCE() to set_mb()
  rtmutex: Warn if trylock is called from hard/softirq context
  arch: Remove __ARCH_HAVE_CMPXCHG
  locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
  locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
  locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
  locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
  locking/pvqspinlock, x86: Enable PV qspinlock for Xen
  locking/pvqspinlock, x86: Enable PV qspinlock for KVM
  locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
  locking/pvqspinlock: Implement simple paravirt support for the qspinlock
  locking/qspinlock: Revert to test-and-set on hypervisors
  locking/qspinlock: Use a simple write to grab the lock
  locking/qspinlock: Optimize for smaller NR_CPUS
  locking/qspinlock: Extract out code snippets for the next patch
  locking/qspinlock: Add pending bit
  ...
2015-06-22 14:54:22 -07:00
Linus Torvalds 052b398a43 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this pile: pathname resolution rewrite.

   - recursion in link_path_walk() is gone.

   - nesting limits on symlinks are gone (the only limit remaining is
     that the total amount of symlinks is no more than 40, no matter how
     nested).

   - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

   - stack footprint (independent of the nesting) is below kilobyte now,
     about on par with what it used to be with one level of nested
     symlinks and ~2.8 times lower than it used to be in the worst case.

   - struct nameidata is entirely private to fs/namei.c now (not even
     opaque pointers are being passed around).

   - ->follow_link() and ->put_link() calling conventions had been
     changed; all in-tree filesystems converted, out-of-tree should be
     able to follow reasonably easily.

     For out-of-tree conversions, see Documentation/filesystems/porting
     for details (and in-tree filesystems for examples of conversion).

  That has sat in -next since mid-May, seems to survive all testing
  without regressions and merges clean with v4.1"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
  turn user_{path_at,path,lpath,path_dir}() into static inlines
  namei: move saved_nd pointer into struct nameidata
  inline user_path_create()
  inline user_path_parent()
  namei: trim do_last() arguments
  namei: stash dfd and name into nameidata
  namei: fold path_cleanup() into terminate_walk()
  namei: saner calling conventions for filename_parentat()
  namei: saner calling conventions for filename_create()
  namei: shift nameidata down into filename_parentat()
  namei: make filename_lookup() reject ERR_PTR() passed as name
  namei: shift nameidata inside filename_lookup()
  namei: move putname() call into filename_lookup()
  namei: pass the struct path to store the result down into path_lookupat()
  namei: uninline set_root{,_rcu}()
  namei: be careful with mountpoint crossings in follow_dotdot_rcu()
  Documentation: remove outdated information from automount-support.txt
  get rid of assorted nameidata-related debris
  lustre: kill unused helper
  lustre: kill unused macro (LOOKUP_CONTINUE)
  ...
2015-06-22 12:51:21 -07:00
Christoph Hellwig 68e8bb0334 nfsd: wrap too long lines in nfsd4_encode_read
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-22 14:15:05 -04:00
Christoph Hellwig 96bcad5064 nfsd: fput rd_file from XDR encode context
Remove the hack where we fput the read-specific file in generic code.
Instead we can do it in nfsd4_encode_read as that gets called for all
error cases as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-22 14:15:04 -04:00
Christoph Hellwig af90f707fa nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
This patch changes nfs4_preprocess_stateid_op so it always returns
a valid struct file if it has been asked for that.  For that we
now allocate a temporary struct file for special stateids, and check
permissions if we got the file structure from the stateid.  This
ensures that all callers will get their handling of special stateids
right, and avoids code duplication.

There is a little wart in here because the read code needs to know
if we allocated a file structure so that it can copy around the
read-ahead parameters.  In the long run we should probably aim to
cache full file structures used with special stateids instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-22 14:15:03 -04:00
Anand Jain f90fc54728 Btrfs: Check if kobject is initialized before put
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-06-22 14:43:31 +02:00
Josef Bacik 3da40c7b08 ext4: only call ext4_truncate when size <= isize
At LSF we decided that if we truncate up from isize we shouldn't trim
fallocated blocks that were fallocated with KEEP_SIZE and are past the
new i_size.  This patch fixes ext4 to do this.

[ Completely reworked patch so that i_disksize would actually get set
  when truncating up.  Also reworked the code for handling truncate so
  that it's easier to handle. -- tytso ]

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2015-06-22 00:31:26 -04:00
Eric Whitney 04e22412f4 ext4: make online defrag error reporting consistent
Make the error reporting behavior resulting from the unsupported use
of online defrag on files with data journaling enabled consistent with
that implemented for bigalloc file systems. Difference found with
ext4/308.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-21 21:38:03 -04:00
Eric Whitney c27e43a10c ext4: minor cleanup of ext4_da_reserve_space()
Remove outdated comments and dead code from ext4_da_reserve_space.
Clean up its trace point, and relocate it to make it more useful.

While we're at it, fix a nearby conditional used to determine if
we have a non-bigalloc file system.  It doesn't match usage elsewhere
in the code, and misleadingly suggests that an s_cluster_ratio value
of 0 would be legal.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 21:37:05 -04:00
Darrick J. Wong 292db1bc6c ext4: don't retry file block mapping on bigalloc fs with non-extent file
ext4 isn't willing to map clusters to a non-extent file.  Don't signal
this with an out of space error, since the FS will retry the
allocation (which didn't fail) forever.  Instead, return EUCLEAN so
that the operation will fail immediately all the way back to userspace.

(The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-21 21:10:51 -04:00
Dave Chinner 496817b4be xfs: clean up XFS_MIN_FREELIST macros
We no longer calculate the minimum freelist size from the on-disk
AGF, so we don't need the macros used for this. That means the
nested macros can be cleaned up, and turn this into an actual
function so the logic is clear and concise. This will make it much
easier to add support for the rmap btree when the time comes.

This also gets rid of the XFS_AG_MAXLEVELS macro used by these
freelist macros as it is simply a wrapper around a single variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:13:30 +10:00
Dave Chinner 396503fc83 xfs: sanitise error handling in xfs_alloc_fix_freelist
The error handling is currently an inconsistent mess as every error
condition handles return values and releasing buffers individually.
Clean this up by using gotos and a sane error label stack.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:13:19 +10:00
Dave Chinner 72d552854b xfs: factor out free space extent length check
The longest extent length checks in xfs_alloc_fix_freelist() are now
essentially identical. Factor them out into a helper function, so we
know they are checking exactly the same thing before and after we
lock the AGF.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:04:42 +10:00
Dave Chinner 50adbcb4c4 xfs: xfs_alloc_fix_freelist() can use incore perag structures
At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based
access and agf buffer  based access to freelist and space usage
information. However, once the AGF buffer is locked inside this
function, it is guaranteed that both the in-memory and on-disk
values are identical. xfs_alloc_fix_freelist() doesn't modify the
values in the structures directly, so it is a read-only user of the
infomration, and hence can use the per-ag structure exclusively for
determining what it should do.

This opens up an avenue for cleaning up a lot of duplicated logic
whose only difference is the structure it gets the data from, and in
doing so removes a lot of needless byte swapping overhead when
fixing up the free list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 10:04:31 +10:00
Christoph Hellwig b2a922cd6c xfs: remove xfs_caddr_t
Just use char pointers directly instead of the confusing typedef to a
pointer type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:45:10 +10:00
Christoph Hellwig 5809d5e083 xfs: use void pointers in log validation helpers
Compared to char pointers this saves us a lot of casting effort.  Also
add another local variable to make the code easier to read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:44:47 +10:00
Christoph Hellwig 88ee2df7f2 xfs: return a void pointer from xfs_buf_offset
This avoids all kinds of unessecary casts in an envrionment like Linux where
we can assume that pointer arithmetics are support on void pointers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:44:29 +10:00
Christoph Hellwig fc51c2b5f8 xfs: remove inst_t
We can simply use a void pointer to pass a long return addresses in the
debugging helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:44:02 +10:00
Christoph Hellwig db9d67d6b0 xfs: remove __psint_t and __psunsigned_t
Replace uses of __psint_t with the proper uintptr_t and ptrdiff_t types,
and remove the defintions of __psint_t and __psunsigned_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:43:32 +10:00
Eric Sandeen 2ac56d3d4b xfs: fix remote symlinks on V5/CRC filesystems
If we create a CRC filesystem, mount it, and create a symlink with
a path long enough that it can't live in the inode, we get a very
strange result upon remount:

# ls -l mnt
total 4
lrwxrwxrwx. 1 root root 929 Jun 15 16:58 link -> XSLM

XSLM is the V5 symlink block header magic (which happens to be
followed by a NUL, so the string looks terminated).

xfs_readlink_bmap() advanced cur_chunk by the size of the header
for CRC filesystems, but never actually used that pointer; it
kept reading from bp->b_addr, which is the start of the block,
rather than the start of the symlink data after the header.

Looks like this problem goes back to v3.10.

Fixing this gets us reading the proper link target, again.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22 09:42:48 +10:00
Theodore Ts'o c5e298ae53 ext4: prevent ext4_quota_write() from failing due to ENOSPC
In order to prevent quota block tracking to be inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
file can now use the reserved block (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 01:25:29 -04:00
Theodore Ts'o 89d96a6f8e ext4: call sync_blockdev() before invalidate_bdev() in put_super()
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device.  So
try to force them out to memory before calling invalidate_bdev().

This fixes a warning triggered by generic/081:

WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-20 22:50:33 -04:00
Jan Kara 2143c1965a jbd2: speedup jbd2_journal_dirty_metadata()
It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-20 21:44:17 -04:00
Christoph Hellwig a0649b2d3f nfsd: refactor nfs4_preprocess_stateid_op
Split out two self contained helpers to make the function more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-19 15:39:52 -04:00
Christoph Hellwig e749a4621e nfsd: clean up raparams handling
Refactor the raparam hash helpers to just deal with the raparms,
and keep opening/closing files separate from that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-19 15:39:51 -04:00
Fabian Frederick 97b1f9aae9 nfsd: use swap() in sort_pacl_range()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-19 15:39:50 -04:00
Bob Peterson 39b0f1e929 GFS2: Don't brelse rgrp buffer_heads every allocation
This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
2015-06-19 07:40:22 -05:00
Anand Jain d2ff1b2008 Btrfs: sysfs: add support to show replacing target in the sysfs
This patch will add support to show the replacing target in sysfs
during the process of replacement.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-06-19 14:03:54 +02:00
Anand Jain 4fde46f0cc Btrfs: free the stale device
When btrfs on a device is overwritten with a new btrfs (mkfs),
the old btrfs instance in the kernel becomes stale. So with this
patch, if kernel finds device is overwritten then delete the stale
fsid/uuid.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
2015-06-19 14:03:13 +02:00
Peter Zijlstra a7c6f571ff seqcount: Rename write_seqcount_barrier()
I'll shortly be introducing another seqcount primitive that's useful
to provide ordering semantics and would like to use the
write_seqcount_barrier() name for that.

Seeing how there's only one user of the current primitive, lets rename
it to invalidate, as that appears what its doing.

While there, employ lockdep_assert_held() instead of
assert_spin_locked() to not generate debug code for regular kernels.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: wanpeng.li@linux.intel.com
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124743.279926217@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:09:56 +02:00
Tejun Heo fb02915f47 kernfs: make kernfs_get_inode() public
Move kernfs_get_inode() prototype from fs/kernfs/kernfs-internal.h to
include/linux/kernfs.h.  It obtains the matching inode for a
kernfs_node.

It will be used by cgroup for inode based permission checks for now
but is generally useful.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-18 16:54:28 -04:00
Bob Peterson e7ccaf5fe1 GFS2: Don't add all glocks to the lru
The glocks used for resource groups often come and go hundreds of
thousands of times per second. Adding them to the lru list just
adds unnecessary contention for the lru_lock spin_lock, especially
considering we're almost certainly going to re-use the glock and
take it back off the lru microseconds later. We never want the
glock shrinker to cull them anyway. This patch adds a new bit in
the glops that determines which glock types get put onto the lru
list and which ones don't.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-06-18 12:17:59 -05:00
Tejun Heo 46b15caa7c vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
per-super_block flag.

super_block->s_flags carries some internal flags in the high bits but
it's exposd to userland through uapi header and running out of space
anyway.  This patch adds a new field super_block->s_iflags to carry
kernel-internal flags.  It is currently only used by the new
SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
consistency with other super_block flags.

ext2_fill_super() is updated to set SB_I_CGROUPWB.

v2: Added super_block->s_iflags instead of stealing another high bit
    from sb->s_flags as suggested by Christoph and Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-17 12:47:39 -06:00
Tejun Heo dd73e4b7df writeback: do foreign inode detection iff cgroup writeback is enabled
Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
foreign inode detection and migration logic still ends up activating
cgroup writeback which is unexpected.  This patch ensures that the
foreign inode detection logic stays disabled when inode_cgwb_enabled()
is false by not associating writeback_control's with bdi_writeback's.

This also avoids unnecessary operations in wbc_init_bio(),
wbc_account_io() and wbc_detach_inode() for filesystems which don't
support cgroup writeback.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-17 12:47:37 -06:00
Michal Hocko 7b506b1035 jbd2: get rid of open coded allocation retry loop
insert_revoke_hash does an open coded endless allocation loop if
journal_oom_retry is true. It doesn't implement any allocation fallback
strategy between the retries, though. The memory allocator doesn't know
about the never fail requirement so it cannot potentially help to move
on with the allocation (e.g. use memory reserves).

Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the
debugging message but I am not sure it is anyhow helpful.

Do the same for journal_alloc_journal_head which is doing a similar
thing.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 15:45:58 -04:00
Andreas Dilger b03a2f7eb2 ext4: improve warning directory handling messages
Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner.  Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 14:50:26 -04:00
Joseph Qi 6f6a6fda29 jbd2: fix ocfs2 corrupt when updating journal superblock fails
If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case.  In ocfs2, the checkpoint will be treated successfully
and the other node can get the lock to update. Since the sb_start is
still pointing to the old log block, it will rewrite the journal data
during journal recovery by the other node. Thus the new updates will
be overwritten and ocfs2 corrupts.  So in above case we have to return
the error, and ocfs2_commit_cache will take care of the error and
prevent the other node to do update first.  And only after recovering
journal it can do the new updates.

The issue discussion mail can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[ Fixed bug in patch which allowed a non-negative error return from
  jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this
  was causing xfstests ext4/306 to fail. -- Ted ]

Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable@vger.kernel.org
2015-06-15 14:36:01 -04:00
Rasmus Villemoes 97b4af2f76 ext4: mballoc: avoid 20-argument function call
Making a function call with 20 arguments is rather expensive in both
stack and .text. In this case, doing the formatting manually doesn't
make it any less readable, so we might as well save 155 bytes of .text
and 112 bytes of stack.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
2015-06-15 00:32:58 -04:00
Lukas Czerner 0d306dcf86 ext4: wait for existing dio workers in ext4_alloc_file_blocks()
Currently existing dio workers can jump in and potentially increase
extent tree depth while we're allocating blocks in
ext4_alloc_file_blocks().  This may cause us to underestimate the
number of credits needed for the transaction because the extent tree
depth can change after our estimation.

Fix this by waiting for all the existing dio workers in the same way
as we do it in ext4_punch_hole.  We've seen errors caused by this in
xfstest generic/299, however it's really hard to reproduce.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:23:53 -04:00
Lukas Czerner 4134f5c88d ext4: recalculate journal credits as inode depth changes
Currently in ext4_alloc_file_blocks() the number of credits is
calculated only once before we enter the allocation loop. However within
the allocation loop the extent tree depth can change, hence the number
of credits needed can increase potentially exceeding the number of credits
reserved in the handle which can cause journal failures.

Fix this by recalculating number of credits when the inode depth
changes. Note that even though ext4_alloc_file_blocks() is only
currently used by extent base inodes we will avoid recalculating number
of credits unnecessarily in the case of indirect based inodes.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 00:20:46 -04:00
Dmitry Monakhov b4f1afcd06 jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()
jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start()
So allocations should be done with GFP_NOFS

[Full stack trace snipped from 3.10-rh7]
[<ffffffff815c4bd4>] dump_stack+0x19/0x1b
[<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80
[<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20
[<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210
[<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20
[<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20
[<ffffffff81147939>] mempool_alloc+0x69/0x170
[<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20
[<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150
[<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0
[<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120
[<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2]
[<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30
[<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210
[<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2]
[<ffffffff811532ce>] ? lru_cache_add+0xe/0x10
[<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4]
[<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4]
[<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270
[<ffffffff81146918>] __generic_file_write_iter+0x178/0x390
[<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0
[<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4]
[<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0
[<ffffffff811b93f0>] do_sync_write+0x90/0xe0
[<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0
[<ffffffff811ba5b8>] SyS_write+0x58/0xb0
[<ffffffff815d4799>] system_call_fastpath+0x16/0x1b

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-15 00:18:02 -04:00
Fabian Frederick bf86546760 ext4: use swap() in mext_page_double_lock()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:47:33 -04:00
Fabian Frederick 4b7e2db5c0 ext4: use swap() in memswap()
Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:46:33 -04:00
Theodore Ts'o bdf96838ae ext4: fix race between truncate and __ext4_journalled_writepage()
The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking to see if page had gotten truncated out from under
us.

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
 [<c027d883>] ? lock_buffer+0x36/0x36
 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
 [<c0229139>] do_invalidatepage+0x22/0x26
 [<c0229198>] truncate_inode_page+0x5b/0x85
 [<c022934b>] truncate_inode_pages_range+0x156/0x38c
 [<c0229592>] truncate_inode_pages+0x11/0x15
 [<c022962d>] truncate_pagecache+0x55/0x71
 [<c02b913b>] ext4_setattr+0x4a9/0x560
 [<c01ca542>] ? current_kernel_time+0x10/0x44
 [<c026c4d8>] notify_change+0x1c7/0x2be
 [<c0256a00>] do_truncate+0x65/0x85
 [<c0226f31>] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [<c01b879f>] ? console_unlock+0x3a1/0x3ce
 [<c082cbb4>] dump_stack+0x48/0x60
 [<c0178b65>] warn_slowpath_common+0x89/0xa0
 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c0178bef>] warn_slowpath_null+0x14/0x18
 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
 [<c02b2f44>] write_end_fn+0x40/0x53
 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
 [<c02b59e7>] ext4_writepage+0x354/0x3b8
 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b5a5b>] __writepage+0x10/0x2e
 [<c0225956>] write_cache_pages+0x22d/0x32c
 [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
 [<c02b6ee8>] ext4_writepages+0x102/0x607
 [<c019adfe>] ? sched_clock_local+0x10/0x10e
 [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
 [<c01a8ad5>] ? lock_is_held+0x43/0x51
 [<c0226dff>] do_writepages+0x1c/0x29
 [<c0276bed>] __writeback_single_inode+0xc3/0x545
 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
    ...

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-12 23:45:33 -04:00
Theodore Ts'o 1cb767cd4a ext4 crypto: fail the mount if blocksize != pagesize
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:44:33 -04:00
Josef Bacik 37b8d27de5 Btrfs: use received_uuid of parent during send
Neil Horman pointed out a problem where if he did something like this

receive A
snap A B
change B
send -p A B

and then on another box do

recieve A
receive B

the receive B would fail because we use the UUID of A for the clone sources for
B.  This makes sense most of the time because normally you are sending from the
original sources, not a received source.  However when you use a recieved subvol
its UUID is going to be something completely different, so if you then try to
receive the diff on a different volume it won't find the UUID because the new A
will be something else.  The only constant is the received uuid.  So instead
check to see if we have received_uuid set on the root, and if so use that as the
clone source, as btrfs receive looks for matches either in received_uuid or
uuid.  Thanks,

Reported-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 13:20:38 -07:00
Liu Bo 0eeff2362b Btrfs: fix use-after-free in btrfs_replay_log
@log_root_tree should not be referenced after kfree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-12 11:03:21 -07:00
Chao Yu 3c45414527 f2fs: do not trim preallocated blocks when truncating after i_size
When we perform generic/092 in xfstests, output is like below:

     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     0: [0..10239]: data
     0: [0..10239]: data
    -1: [10240..20479]: unwritten
    +1: [10240..14335]: unwritten

This is because with this testcase, we redefine the regulation for
truncate in perallocated space past i_size as below:

"There was some confused about what the fs was supposed to do when you
truncate at i_size with preallocated space past i_size. We decided on the
following things.

1) truncate(i_size) will trim all blocks past i_size.
2) truncate(x) where x > i_size will not trim all blocks past i_size.
"

This method is used in xfs, and then ext4/btrfs will follow the rule.

This patch fixes to follow the new rule for f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 18:30:49 -07:00
Jaegeuk Kim 43f54cd52f f2fs crypto: add alloc_bounce_page
This patch adds alloc_bounce_page likewise ext4.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 15:04:20 -07:00
Jaegeuk Kim 7e8e754a4b f2fs crypto: fix to handle errors likewise ext4
This patch makes some error handling policies same with ext4.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-06-11 15:00:36 -07:00
Zhao Lei 9a4e7276d3 btrfs: wait for delayed iputs on no space
btrfs will report no_space when we run following write and delete
file loop:
 # FILE_SIZE_M=[ 75% of fs space ]
 # DEV=[ some dev ]
 # MNT=[ some dir ]
 #
 # mkfs.btrfs -f "$DEV"
 # mount -o nodatacow "$DEV" "$MNT"
 # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
 #

Reason:
 iput() and evict() is run after write pages to block device, if
 write pages work is not finished before next write, the "rm"ed space
 is not freed, and caused above bug.

Fix:
 We can add "-o flushoncommit" mount option to avoid above bug, but
 it have performance problem. Actually, we can to wait for on-the-fly
 writes only when no-space happened, it is which this patch do.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:34 -07:00
Qu Wenruo d672633545 btrfs: qgroup: Make snapshot accounting work with new extent-oriented
qgroup.

Make snapshot accounting work with new extent-oriented mechanism by
skipping given root in new/old_roots in create_pending_snapshot().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:29 -07:00
Qu Wenruo 9086db86e0 btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
This is used by later qgroup fix patches for snapshot.

As current snapshot accounting is done by btrfs_qgroup_inherit(), but
new extent oriented quota mechanism will account extent from
btrfs_copy_root() and other snapshot things, causing wrong result.

So add this ability to handle snapshot accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:23 -07:00
Qu Wenruo d4b8040459 btrfs: ulist: Add ulist_del() function.
This function will delete unode with given (val,aux) pair.
And with this patch, seqnum for debug usage doesn't have any meaning
now, so remove them.

This is used by later patches to skip snapshot root.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:17 -07:00
Qu Wenruo e69bcee376 btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
Goodbye, the old mechanisim.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:11 -07:00
Qu Wenruo 442244c963 btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
Since the self test transaction don't have delayed_ref_roots, so use
find_all_roots() and export btrfs_qgroup_account_extent() to simulate it

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:26:05 -07:00
Qu Wenruo 0ed4792af0 btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
Switch from old ref_node based qgroup to extent based qgroup mechanism
for normal operations.

The new mechanism should hugely reduce the overhead of btrfs quota
system, and further more, the codes and logic should be more clean and
easier to maintain.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:59 -07:00
Qu Wenruo 9d220c95f5 btrfs: qgroup: Switch rescan to new mechanism.
Switch rescan to use the new new extent oriented mechanism.

As rescan is also based on extent, new mechanism is just a perfect match
for rescan.

With re-designed internal functions, rescan is quite easy, just call
btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:54 -07:00
Qu Wenruo 550d7a2ed5 btrfs: qgroup: Add new qgroup calculation function
btrfs_qgroup_account_extents().

The new btrfs_qgroup_account_extents() function should be called in
btrfs_commit_transaction() and it will update all the qgroup according
to delayed_ref_root->dirty_extent_root.

The new function can handle both normal operation during
commit_transaction() or in rescan in a unified method with clearer
logic.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:49 -07:00
Qu Wenruo 21633fc603 btrfs: backref: Add special time_seq == (u64)-1 case for
btrfs_find_all_roots().

Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree
lock to do tree search.

This is important for later qgroup implement which will call
find_all_roots() after fs trees are committed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:45 -07:00
Qu Wenruo 3b7d00f99c btrfs: qgroup: Add new function to record old_roots.
Add function btrfs_qgroup_prepare_account_extents() to get old_roots
which are needed for qgroup.

We do it in commit_transaction() and before switch_roots(), and only
search commit_root, so it gives a quite accurate view for previous
transaction.

With old_roots from previous transaction, we can use it to do accurate
account with current transaction.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:39 -07:00
Qu Wenruo 3368d001ba btrfs: qgroup: Record possible quota-related extent for qgroup.
Add hook in add_delayed_ref_head() to record quota-related extent record
into delayed_ref_root->dirty_extent_record rb-tree for later qgroup
accounting.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:32 -07:00
Qu Wenruo 823ae5b8e3 btrfs: qgroup: Add function qgroup_update_counters().
Add function qgroup_update_counters(), which will update related
qgroups' rfer/excl according to old/new_roots.

This is one of the two core functions for the new qgroup implement.

This is based on btrfs_adjust_coutners() but with clearer logic and
comment.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:28 -07:00
Qu Wenruo d810ef2be5 btrfs: qgroup: Add function qgroup_update_refcnt().
This function is used to update refcnt for qgroups.
And is one of the two core functions used in the new qgroup implement.

This is based on the old update_old/new_refcnt, but provides a unified
logic and behavior.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:24 -07:00
Qu Wenruo c682f9b3c2 btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent()
__btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too
many parameters, but three of them can be extracted from
btrfs_delayed_ref_node struct.

So use btrfs_delayed_ref_node struct as a single parameter to replace
the bytenr/num_byte/no_quota parameters.

The real objective of this patch is to allow btrfs_qgroup_record_ref()
get the delayed_ref_node in incoming qgroup patches.

Other functions calling btrfs_qgroup_record_ref() are not affected since
the rest will only add/sub exclusive extents, where node is not used.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:18 -07:00
Qu Wenruo 9c542136fd btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read.
Use inline functions to do such things, to improve readability.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:13 -07:00
Qu Wenruo c43d160fcd btrfs: delayed-ref: Cleanup the unneeded functions.
Cleanup the rb_tree merge/insert/update functions, since now we use list
instead of rb_tree now.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:09 -07:00
Qu Wenruo c6fc245499 btrfs: delayed-ref: Use list to replace the ref_root in ref_head.
This patch replace the rbtree used in ref_head to list.
This has the following advantage:
1) Easier merge logic.
With the new list implement, we only need to care merging the tail
ref_node with the new ref_node.
And this can be done quite easy at insert time, no need to do a
indicated merge at run_delayed_refs().

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:25:03 -07:00
Qu Wenruo 00db646d3f btrfs: backref: Don't merge refs which are not for same block.
Old __merge_refs() in backref.c will even merge refs whose root_id are
different, which makes qgroup gives wrong result.

Fix it by checking ref_for_same_block() before any mode specific works.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 09:24:59 -07:00
Zhao Lei 20b2e3029e btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx()
lockdep report following warning in test:
 [25176.843958] =================================
 [25176.844519] [ INFO: inconsistent lock state ]
 [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
 [25176.845591] ---------------------------------
 [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
 [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
 [25176.847838] {SOFTIRQ-ON-W} state was registered at:
 [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
 [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
 [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
 [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
 [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
 [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
 [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
 [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
 [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
 [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
 [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
 [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
 [25176.854935] irq event stamp: 51506
 [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
 [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
 [25176.857746]
 other info that might help us debug this:
 [25176.858845]  Possible unsafe locking scenario:
 [25176.859981]        CPU0
 [25176.860537]        ----
 [25176.861059]   lock(&wr_ctx->wr_lock);
 [25176.861705]   <Interrupt>
 [25176.862272]     lock(&wr_ctx->wr_lock);
 [25176.862881]
  *** DEADLOCK ***

Reason:
 Above warning is caused by:
 Interrupt
 -> bio_endio()
 -> ...
 -> scrub_put_ctx()
 -> scrub_free_ctx() *1
 -> ...
 -> mutex_lock(&wr_ctx->wr_lock);

 scrub_put_ctx() is allowed to be called in end_bio interrupt, but
 in code design, it will never call scrub_free_ctx(sctx) in interrupe
 context(above *1), because btrfs_scrub_dev() get one additional
 reference of sctx->refs, which makes scrub_free_ctx() only called
 withine btrfs_scrub_dev().

 Now the code runs out of our wish, because free sequence in
 scrub_pending_bio_dec() have a gap.

 Current code:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
 -> scrub_free_ctx()                |
 -----------------------------------+-----------------------------------

 We expected:
 -----------------------------------+-----------------------------------
 scrub_pending_bio_dec()            |  btrfs_scrub_dev
 -----------------------------------+-----------------------------------
 atomic_dec(&sctx->bios_in_flight); |
 wake_up(&sctx->list_wait);         |
 scrub_put_ctx(sctx);               |
 -> atomic_dec_and_test(&sctx->refs)|
                                    | scrub_put_ctx()
                                    | -> atomic_dec_and_test(&sctx->refs)
                                    | -> scrub_free_ctx()
 -----------------------------------+-----------------------------------

Fix:
 Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
 in interrupt context.
 Tested by check tracelog in debug.

Changelog v1->v2:
 Use workqueue instead of adjust function call sequence in v1,
 because v1 will introduce a bug pointed out by:
 Filipe David Manana <fdmanana@gmail.com>

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:04:52 -07:00
Mark Fasheh e1d227a42e btrfs: Handle unaligned length in extent_same
The extent-same code rejects requests with an unaligned length. This
poses a problem when we want to dedupe the tail extent of files as we
skip cloning the portion between i_size and the extent boundary.

If we don't clone the entire extent, it won't be deleted. So the
combination of these behaviors winds up giving us worst-case dedupe on
many files.

We can fix this by allowing a length that extents to i_size and
internally aligining those to the end of the block. This is what
btrfs_ioctl_clone() so we can just copy that check over.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:50 -07:00
chandan 070034bdf9 Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag.
max_to_defrag represents the number of pages to defrag rather than the last
page of the file range to be defragged.

Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the
defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag
should actually have the value 4. Instead in the current code we end up
setting it to 6.

Now, this does not (yet) cause an issue since the first part of the while loop
condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control
to flow out of the while loop before any buggy behavior is actually caused. So
the patch just makes sure that max_to_defrag ends up having the right value
rather than fixing a bug. I did run the xfstests suite to make sure that the
code does not regress.

Changelog: v1->v2:
Provide a much descriptive commit message.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:48 -07:00
chandan e4826a5b24 Btrfs: btrfs_defrag_file: Fix ra_index computation.
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster -
1]. So the next read-ahead should be starting from the page at index 'ra_index
+ cluster' (unless we deemed that the extent at 'ra_index + cluster' as
non-defraggable) rather than from the page at index 'ra_index +
max_cluster'. This patch fixes this. I did run the xfstests suite to make sure
that the code does not regress.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:47 -07:00
Filipe Manana 4617ea3a52 Btrfs: fix necessary chunk tree space calculation when allocating a chunk
When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:

  btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
     btrfs_calc_trans_metadata_size(chunk_root, 1)

That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, and each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary since updating the existing device items does
not result in splitting the nodes and leaf since after updating them
they remain with the same size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:46 -07:00
Filipe Manana 7558c8bc17 Btrfs: don't attach unnecessary extents to transaction on fsync
We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delaying the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:44 -07:00