Commit Graph

213 Commits

Author SHA1 Message Date
Christoph Hellwig feac470e36 xfs: convert COW blocks to real blocks before unwritten extent conversion
We need to splice COW blocks we've completed in xfs_end_io_direct_write
into the data fork before converting unwritten extents.  Otherwise
xfs_bmapi_write might first allocate blocks for any holes in the data
fork, which isn't only not needed but also harmful as it might cause
reserved block underruns in the transaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-11 09:03:19 +11:00
Darrick J. Wong 0613f16cd2 xfs: implement CoW for directio writes
For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes.  For writes that are not block-aligned,
just bounce them to the page cache.

For block-aligned writes, however, we can do better than that.  Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write.  This should be fairly performant.

Christoph discovered that xfs_reflink_allocate_cow_range may stumble
over invalid entries in the extent array given that it drops the ilock
but still expects the index to be stable.  Simple fixing it to a new
lookup for every iteration still isn't correct given that
xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
there is nothing preventing a xfs_bunmapi_cow call removing extents
once we dropped the ilock either.

This patch duplicates the inner loop of xfs_bmapi_allocate into a
helper for xfs_reflink_allocate_cow_range so that it can be done under
the same ilock critical section as our CoW fork delayed allocation.
The directio CoW warts will be revisited in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 16:26:04 -07:00
Darrick J. Wong db1327b16c xfs: report shared extent mappings to userspace correctly
Report shared extents through the iomap interface so that FIEMAP flags
shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
files because the bmap-based swap code requires static block mappings,
which is incompatible with copy on write.

NOTE: Existing userspace bmap users such as lilo will have the same
problem with reflink files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-10-05 16:26:04 -07:00
Darrick J. Wong 43caeb187d xfs: move mappings from cow fork to data fork after copy-write
After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind.  On error, we simply free the new blocks
and pass the error up.  If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05 13:55:40 -07:00
Darrick J. Wong ef4736678f xfs: allocate delayed extents in CoW fork
Modify the writepage handler to find and convert pending delalloc
extents to real allocations.  Furthermore, when we're doing non-cow
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can get
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.

Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:41 -07:00
Darrick J. Wong 60b4984fc3 xfs: support allocating delayed extents in CoW fork
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate().  In a subsequent
patch, we'll modify the writepage handler to call this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04 18:06:41 -07:00
Christoph Hellwig e372843a40 xfs: refactor xfs_setfilesize
Rename the current function to __xfs_setfilesize and add a non-static
wrapper that also takes care of creating the transaction.  This new
helper will be used by the new iomap-based DAX path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19 11:26:41 +10:00
Linus Torvalds 0e6acf0204 xfs: update for 4.8-rc1
Changes in this update:
 o generic iomap based IO path infrastructure
 o generic iomap based fiemap implementation
 o xfs iomap based Io path implementation
 o buffer error handling fixes
 o tracking of in flight buffer IO for unmount serialisation
 o direct IO and DAX io path separation and simplification
 o shortform directory format definition changes for wider platform compatibility
 o various buffer cache fixes
 o cleanups in preparation for rmap merge
 o error injection cleanups and fixes
 o log item format buffer memory allocation restructuring to prevent rare OOM
   reclaim deadlocks
 o sparse inode chunks are now fully supported.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmA5XAAoJEK3oKUf0dfodCc0QAKY5Jlfw5HwLria+Ad87HCcM
 Zi/LGMMC3CPh+vkbqsmDnLKHYjXRwi3HamBoXdufiE8E3UtOjp/sV98/fCw+zwhe
 tHDLmdAx23RLTn7gUhcsIXydKeXh0+HlRxPa4eBAlmnsJ3nGgrKrKQLgDT7Gjlum
 nPfRSTYjzm5gs2dpUTYhMV7MplenDW9GFz2uBMct6N9kYQ9m225I99fd/4nb/L7R
 o/8UocsK7iREUXP6decDoN9uIAzE2mYR720EL+Txy09CTYy+luNyGoNXOsQtxT5O
 plyoPZbzIIDvC44bvp6bZX96Udm7tAeTloieInCZG13I2zJy9gmTmLqkZ3M2at12
 kOyeAMSBOWQYSa3uh++FsEP+JGtBTlZXf+4DAYf+U08s8tMVE/61/RZrtJZF4OjW
 hyumRBD6zqZ9Y6Qtji2HaA3l9IGxOC2k4URw9JZdDDyMoRTQvawN1QWNAeZINXiv
 9ywqTruVsfQnoGDC1Gk1OEfQpubNztTAkEPqVM7ez5dkwOdwuOZXcZPL1Ltvb4Bt
 PLaWKLIYFYZKrM5kqgQlTERspSQA99++z8H9a21wFezfetaBby28fIqwMMfQAiSw
 nCq95WshJPwenogMtWjNfOgs/fqOBKdPdLFw0H6Jpmjwna2KpuFIZiTnwu25vvjz
 dHh4DVSuMTq1pBkXEU7B
 =vcSd
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "The major addition is the new iomap based block mapping
  infrastructure.  We've been kicking this about locally for years, but
  there are other filesystems want to use it too (e.g. gfs2).  Now it
  is fully working, reviewed and ready for merge and be used by other
  filesystems.

  There are a lot of other fixes and cleanups in the tree, but those are
  XFS internal things and none are of the scale or visibility of the
  iomap changes.  See below for details.

  I am likely to send another pull request next week - we're just about
  ready to merge some new functionality (on disk block->owner reverse
  mapping infrastructure), but that's a huge chunk of code (74 files
  changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
  separate to all the "normal" pull request changes so they don't get
  lost in the noise.

  Summary of changes in this update:
   - generic iomap based IO path infrastructure
   - generic iomap based fiemap implementation
   - xfs iomap based Io path implementation
   - buffer error handling fixes
   - tracking of in flight buffer IO for unmount serialisation
   - direct IO and DAX io path separation and simplification
   - shortform directory format definition changes for wider platform
     compatibility
   - various buffer cache fixes
   - cleanups in preparation for rmap merge
   - error injection cleanups and fixes
   - log item format buffer memory allocation restructuring to prevent
     rare OOM reclaim deadlocks
   - sparse inode chunks are now fully supported"

* tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
  xfs: remove EXPERIMENTAL tag from sparse inode feature
  xfs: bufferhead chains are invalid after end_page_writeback
  xfs: allocate log vector buffers outside CIL context lock
  libxfs: directory node splitting does not have an extra block
  xfs: remove dax code from object file when disabled
  xfs: skip dirty pages in ->releasepage()
  xfs: remove __arch_pack
  xfs: kill xfs_dir2_inou_t
  xfs: kill xfs_dir2_sf_off_t
  xfs: split direct I/O and DAX path
  xfs: direct calls in the direct I/O path
  xfs: stop using generic_file_read_iter for direct I/O
  xfs: split xfs_file_read_iter into buffered and direct I/O helpers
  xfs: remove s_maxbytes enforcement in xfs_file_read_iter
  xfs: kill ioflags
  xfs: don't pass ioflags around in the ioctl path
  xfs: track and serialize in-flight async buffers against unmount
  xfs: exclude never-released buffers from buftarg I/O accounting
  xfs: don't reset b_retries to 0 on every failure
  xfs: remove extraneous buffer flag changes
  ...
2016-07-27 09:53:35 -07:00
Dave Chinner f2bdfda9a1 Merge branch 'xfs-4.8-misc-fixes-4' into for-next 2016-07-22 14:10:56 +10:00
Dave Chinner 28b783e47a xfs: bufferhead chains are invalid after end_page_writeback
In xfs_finish_page_writeback(), we have a loop that looks like this:

        do {
                if (off < bvec->bv_offset)
                        goto next_bh;
                if (off > end)
                        break;
                bh->b_end_io(bh, !error);
next_bh:
                off += bh->b_size;
        } while ((bh = bh->b_this_page) != head);

The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have marked as no longer
under IO.  This issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.

While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these are valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:

.....
 INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
  free_buffer_head+0x41/0x90
  __slab_free+0x1ed/0x340
  kmem_cache_free+0x270/0x300
  free_buffer_head+0x41/0x90
  try_to_free_buffers+0x171/0x240
  xfs_vm_releasepage+0xcb/0x3b0
  try_to_release_page+0x106/0x190
  shrink_page_list+0x118e/0x1a10
  shrink_inactive_list+0x42c/0xdf0
  shrink_zone_memcg+0xa09/0xfa0
  shrink_zone+0x2c3/0xbc0
.....
 Call Trace:
  <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
  [<ffffffff8153a995>] print_trailer+0x115/0x1a0
  [<ffffffff81541174>] object_err+0x34/0x40
  [<ffffffff815436e7>] kasan_report_error+0x217/0x530
  [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
  [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
  [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
  [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
  [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
  [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
  [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
  [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
  [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340


Where the access is occuring during IO completion after the buffer
had been freed from direct memory reclaim.

Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the the bh->b_end_io() callout.

Yet another example of why Bufferheads Must Die.

cc: <stable@vger.kernel.org> # 4.7
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:56:38 +10:00
Brian Foster 99579ccec4 xfs: skip dirty pages in ->releasepage()
XFS has had scattered reports of delalloc blocks present at
->releasepage() time. This results in a warning with a stack trace
similar to the following:

 ...
 Call Trace:
  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
  [<ffffffffa2168539>] kswapd+0x4f9/0x970
  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100

This occurs because it is possible for shrink_active_list() to send
pages marked dirty to ->releasepage() when certain buffer_head threshold
conditions are met. shrink_active_list() doesn't check the page dirty
state apparently to handle an old ext3 corner case where in some cases
clean pages would not have the dirty bit cleared, thus it is up to the
filesystem to determine how to handle the page.

XFS currently handles the delalloc case properly, but this behavior
makes the warning spurious. Update the XFS ->releasepage() handler to
explicitly skip dirty pages. Retain the existing delalloc/unwritten
checks so we continue to warn if such buffers exist on clean pages when
they shouldn't.

Diagnosed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-22 09:50:38 +10:00
Dave Chinner b47ec80bfe Merge branch 'xfs-4.8-split-dax-dio' into for-next 2016-07-20 11:54:37 +10:00
Christoph Hellwig fa8d972d05 xfs: direct calls in the direct I/O path
We control both the callers and callees of ->direct_IO, so remove the
indirect calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20 11:38:01 +10:00
Christoph Hellwig 6e8a27a816 xfs: remove buffered write support from __xfs_get_blocks
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 09:53:45 +10:00
Christoph Hellwig 68a9f5e700 xfs: implement iomap based buffered write path
Convert XFS to use the new iomap based multipage write path. This involves
implementing the ->iomap_begin and ->iomap_end methods, and switching the
buffered file write, page_mkwrite and xfs_iozero paths to the new iomap
helpers.

With this change __xfs_get_blocks will never be used for buffered writes,
and the code handling them can be removed.

Based on earlier code from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21 09:53:44 +10:00
Mike Christie 50bfcd0cbf xfs: use bio op accessors
Separate the op from the rq_flag_bits and have xfs
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Linus Torvalds 0b9210c9c8 xfs: update for 4.7-rc1
Changes in this update:
 o fixes for mount line parsing, sparse warnings, read-only compat
   feature remount behaviour
 o allow fast path symlink lookups for inline symlinks.
 o attribute listing cleanups
 o writeback goes direct to bios rather than indirecting through
   bufferheads
 o transaction allocation cleanup
 o optimised kmem_realloc
 o added configurable error handling for metadata write errors,
   changed default error handling behaviour from "retry forever" to
   "retry until unmount then fail"
 o fixed several inode cluster writeback lookup vs reclaim race
   conditions
 o fixed inode cluster writeback checking wrong inode after lookup
 o fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
 o cleaned up inode reclaim tagging
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXRo8LAAoJEK3oKUf0dfodxLgP+wQMd46i/nCncr6jMcdoVXfL
 rE6cL1LJWWVOWzax/bmdlV1VNXqqW7n0ABAVMqikzqSEd+fBQe/HOkdBeVUywu7o
 bmqgNxuofMqHaiYhiTvUijHLHWLFxIgd/jNT7L5oGRzZdmP260VGf3EPipN7aA9U
 Y3KVhFQCqohpeIUeSV4Z/eIDdHN5LyUI1s+7zMLquHKCWyO4aO4GBX8YlyXdRRVe
 cwCZb4TBryS0PBCIra31MZ5wBRwLx8PBXqcNsnTQSR5Uu+WeuwxofXz5q3kdmNOU
 XGTWiabQbcvaC4twrzqnErHEX41PAs43tWxsI/qJH49QIFdfOYM+t8ERhEa2Q3DW
 Ihl+Q/2qiOuZZterG/t5MrxhybrmQhEFVJT6Ib8b/CnwpRm+K8kWTead1YJL8Xzd
 F9k8e57BCgTbDA7jWxWDbp7eQ1/4KglBD4sefFPpsuFgO882mmo5GmymALGjmitw
 JH1v3HL3PeTkQoHfcay8ZM/zNjX643CXHwCWYEOAgD+e77TPiOs/cHLZaXbrBkLK
 PpSJNfYiBe31eeSOEGsxivMapLpus+cHZyK3uR+XU+naJhjOdaBDTTo8RsAD9jS5
 C/dzxc4l7o+gYT+UjV5KtyfEeVwtGo5mtR9XozPbNDjNQor8Vo6NQMZXMXpFYDZI
 2XgzVNpkEf/74kexdEzV
 =0tYo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "A pretty average collection of fixes, cleanups and improvements in
  this request.

  Summary:
   - fixes for mount line parsing, sparse warnings, read-only compat
     feature remount behaviour
   - allow fast path symlink lookups for inline symlinks.
   - attribute listing cleanups
   - writeback goes direct to bios rather than indirecting through
     bufferheads
   - transaction allocation cleanup
   - optimised kmem_realloc
   - added configurable error handling for metadata write errors,
     changed default error handling behaviour from "retry forever" to
     "retry until unmount then fail"
   - fixed several inode cluster writeback lookup vs reclaim race
     conditions
   - fixed inode cluster writeback checking wrong inode after lookup
   - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
   - cleaned up inode reclaim tagging"

* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...
2016-05-26 10:13:40 -07:00
Dave Chinner d6bd9615ab Merge branch 'xfs-4.7-trans-type-cleanup' into for-next 2016-05-20 10:31:52 +10:00
Christoph Hellwig 690a787122 xfs: fix warning in xfs_finish_page_writeback for non-debug builds
blockmask is unused if ASSERTs are disabled.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-05-20 10:29:15 +10:00
Christoph Hellwig c8b8e32d70 direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io.  It has to be ki_pos to actually
work, so eliminate the superflous argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig 253f4911f2 xfs: better xfs_trans_alloc interface
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.

While we're at it we also get rid of the transaction type argument that has
been superflous since we stopped supporting the non-CIL logging mode.  The
guts of it will be removed in another patch.

[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 09:19:55 +10:00
Christoph Hellwig 0e51a8e191 xfs: optimize bio handling in the buffer writeback path
This patch implements two closely related changes:  First it embeds
a bio the ioend structure so that we don't have to allocate one
separately.  Second it uses the block layer bio chaining mechanism
to chain additional bios off this first one if needed instead of
manually accounting for multiple bio completions in the ioend
structure.  Together this removes a memory allocation per ioend and
greatly simplifies the ioend setup and I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:34:30 +10:00
Dave Chinner 37992c18bb xfs: don't release bios on completion immediately
Completion of an ioend requires us to walk the bufferhead list to
end writback on all the bufferheads. This, in turn, is needed so
that we can end writeback on all the pages we just did IO on.

To remove our dependency on bufferheads in writeback, we need to
turn this around the other way - we need to walk the pages we've
just completed IO on, and then walk the buffers attached to the
pages and complete their IO. In doing this, we remove the
requirement for the ioend to track bufferheads directly.

To enable IO completion to walk all the pages we've submitted IO on,
we need to keep the bios that we used for IO around until the ioend
has been completed. We can do this simply by chaining the bios to
the ioend at completion time, and then walking their pages directly
just before destroying the ioend.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: changed the xfs_finish_page_writeback calling convention]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:12:28 +10:00
Dave Chinner bb18782aa4 xfs: build bios directly in xfs_add_to_ioend
Currently adding a buffer to the ioend and then building a bio from
the buffer list are two separate operations. We don't build the bios
and submit them until the ioend is submitted, and this places a
fixed dependency on bufferhead chaining in the ioend.

The first step to removing the bufferhead chaining in the ioend is
on the IO submission side. We can build the bio directly as we add
the buffers to the ioend chain, thereby removing the need for a
latter "buffer-to-bio" submission loop. This allows us to submit
bios on large ioends as soon as we cannot add more data to the bio.

These bios then get captured by the active plug, and hence will be
dispatched as soon as either the plug overflows or we schedule away
from the writeback context. This will reduce submission latency for
large IOs, but will also allow more timely request queue based
writeback blocking when the device becomes congested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: various small updates]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 08:11:25 +10:00
Kirill A. Shutemov ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Linus Torvalds 53d2e6976b xfs: Changes for 4.6-rc1
Change summary:
 o error propagation for direct IO failures fixes for both XFS and ext4
 o new quota interfaces and XFS implementation for iterating all the quota IDs
   in the filesystem
 o locking fixes for real-time device extent allocation
 o reduction of duplicate information in the xfs and vfs inode, saving roughly
   100 bytes of memory per cached inode.
 o buffer flag cleanup
 o rework of the writepage code to use the generic write clustering mechanisms
 o several fixes for inode flag based DAX enablement
 o rework of remount option parsing
 o compile time verification of on-disk format structure sizes
 o delayed allocation reservation overrun fixes
 o lots of little error handling fixes
 o small memory leak fixes
 o enable xfsaild freezing again
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW71DQAAoJEK3oKUf0dfodyiwP/0Tou9f1huzLC0kd7kmEoKKC
 BWQmtJGEdo0iSpJNZhg/EJmjvRtbBiOB9CRcEyG8d71kqZ+MKW7t/4JjNvNG34aE
 vHjhwMBVVqkw/q6azi2LiEDsVcOe5bXxUrXNZi18/09OAl4pHm+X8VERLnnC5y+i
 QIHAOdB5R+36cXcceJm1HR6jTZedbNdQkT/ndhm5S60FGhvVI29cs9NwYwoi5aif
 O55r6krSWBj6U/X6MsLvr+lNb6+1Sd1hyE8dGTE7lOUX/crFIysaDPEuQmWvDjsO
 M1ulVfzKoBJHcyvpbdHwdBEyiBjzvETcrgndMRoWOjZiOLqNtWYsgIEiC+Nlidwd
 +T4XhkJJJg5UUQ4r6Hs85SQn/THanzR5KoN5nbTsFtFkCKw1DRkUSNuh2mXP2xVG
 JcNDCjDvvHG76EfQ1otlYf7ru79Ck+hjVs+szaEVPpOzAwz8yOtD+L7I8f73gQ6a
 ayP8W2oZQpYvQRv+smgvt+HwQA4fNJk9ZseY3QD5+z5snJz7JEhZogqW+ngFYkNQ
 dtA5Y7gpTkKfo3mKO0XmE5+3fcSXhGHGYQzmUgJFlgWTK7+E8fuDhn6D66wFcZSq
 QhyRk9J7Xb7ZWuP5PlOkxb9DLd4hnuyie2bYw/0hVtOatjE/Em4gRJ3Oq3ZANwZx
 OeMGj4Uyb3/MKAJwy3Gq
 =ZoiX
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There's quite a lot in this request, and there's some cross-over with
  ext4, dax and quota code due to the nature of the changes being made.

  As for the rest of the XFS changes, there are lots of little things
  all over the place, which add up to a lot of changes in the end.

  The major changes are that we've reduced the size of the struct
  xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
  >10%), the writepage code now only does a single set of mapping tree
  lockups so uses less CPU, delayed allocation reservations won't
  overrun under random write loads anymore, and we added compile time
  verification for on-disk structure sizes so we find out when a commit
  or platform/compiler change breaks the on disk structure as early as
  possible.

  Change summary:

   - error propagation for direct IO failures fixes for both XFS and
     ext4
   - new quota interfaces and XFS implementation for iterating all the
     quota IDs in the filesystem
   - locking fixes for real-time device extent allocation
   - reduction of duplicate information in the xfs and vfs inode, saving
     roughly 100 bytes of memory per cached inode.
   - buffer flag cleanup
   - rework of the writepage code to use the generic write clustering
     mechanisms
   - several fixes for inode flag based DAX enablement
   - rework of remount option parsing
   - compile time verification of on-disk format structure sizes
   - delayed allocation reservation overrun fixes
   - lots of little error handling fixes
   - small memory leak fixes
   - enable xfsaild freezing again"

* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
  xfs: always set rvalp in xfs_dir2_node_trim_free
  xfs: ensure committed is initialized in xfs_trans_roll
  xfs: borrow indirect blocks from freed extent when available
  xfs: refactor delalloc indlen reservation split into helper
  xfs: update freeblocks counter after extent deletion
  xfs: debug mode forced buffered write failure
  xfs: remove impossible condition
  xfs: check sizes of XFS on-disk structures at compile time
  xfs: ioends require logically contiguous file offsets
  xfs: use named array initializers for log item dumping
  xfs: fix computation of inode btree maxlevels
  xfs: reinitialise per-AG structures if geometry changes during recovery
  xfs: remove xfs_trans_get_block_res
  xfs: fix up inode32/64 (re)mount handling
  xfs: fix format specifier , should be %llx and not %llu
  xfs: sanitize remount options
  xfs: convert mount option parsing to tokens
  xfs: fix two memory leaks in xfs_attr_list.c error paths
  xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
  xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
  ...
2016-03-21 11:53:05 -07:00
Johannes Weiner 62cccb8c8e mm: simplify lock_page_memcg()
Now that migration doesn't clear page->mem_cgroup of live pages anymore,
it's safe to make lock_page_memcg() and the memcg stat functions take
pages, and spare the callers from memcg objects.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Johannes Weiner 81f8c3a461 mm: memcontrol: generalize locking for the page->mem_cgroup binding
These patches tag the page cache radix tree eviction entries with the
memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
work properly and be as adaptive to new cache workingsets as global
reclaim already is.

This should have been part of the original thrash detection patch
series, but was deferred due to the complexity of those patches.

This patch (of 5):

So far the only sites that needed to exclude charge migration to
stabilize page->mem_cgroup have been per-cgroup page statistics, hence
the name mem_cgroup_begin_page_stat().  But per-cgroup thrash detection
will add another site that needs to ensure page->mem_cgroup lifetime.

Rename these locking functions to the more generic lock_page_memcg() and
unlock_page_memcg().  Since charge migration is a cgroup1 feature only,
we might be able to delete it at some point, and these now easy to
identify locking sites along with it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Dave Chinner 2cdb958aba Merge branch 'xfs-misc-fixes-4.6-4' into for-next 2016-03-15 11:44:35 +11:00
Brian Foster 801cc4e17a xfs: debug mode forced buffered write failure
Add a DEBUG mode-only sysfs knob to enable forced buffered write
failure. An additional side effect of this mode is brute force killing
of delayed allocation blocks in the range of the write. The latter is
the prime motiviation behind this patch, as userspace test
infrastructure requires a reliable mechanism to create and split
delalloc extents without causing extent conversion.

Certain fallocate operations (i.e., zero range) were used for this in
the past, but the implementations have changed such that delalloc
extents are flushed and converted to real blocks, rendering the test
useless.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 11:42:44 +11:00
Dave Chinner 3d93ec0364 Merge branch 'xfs-writepage-rework-4.6' into for-next 2016-03-07 09:34:02 +11:00
Darrick J. Wong 0df61da8ac xfs: ioends require logically contiguous file offsets
We need to create a new ioend if the current writepage call isn't
logically contiguous with the range contained in the previous ioend.
Hopefully writepage gets called in order of increasing file offset.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 09:32:14 +11:00
Dave Chinner 6d247d47fb Merge branch 'xfs-misc-fixes-4.6' into for-next 2016-03-07 09:30:12 +11:00
Ross Zwisler 7f6d5b529b dax: move writeback calls into the filesystems
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().

dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.

Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device.  This also fixes DAX code to properly flush caches in response
to sync(2).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Ross Zwisler 20a90f5899 dax: give DAX clearing code correct bdev
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases.  This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.

Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block.  This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
Dave Chinner e10de3723c xfs: don't chain ioends during writepage submission
Currently we can build a long ioend chain during ->writepages that
gets attached to the writepage context. IO submission only then
occurs when we finish all the writepage processing. This means we
can have many ioends allocated and pending, and this violates the
mempool guarantees that we need to give about forwards progress.
i.e. we really should only have one ioend being built at a time,
otherwise we may drain the mempool trying to allocate a new ioend
and that blocks submission, completion and freeing of ioends that
are already in progress.

To prevent this situation from happening, we need to submit ioends
for IO as soon as they are ready for dispatch rather than queuing
them for later submission. This means the ioends have bios built
immediately and they get queued on any plug that is current active.
Hence if we schedule away from writeback, the ioends that have been
built will make forwards progress due to the plug flushing on
context switch. This will also prevent context switches from
creating unnecessary IO submission latency.

We can't completely avoid having nested IO allocation - when we have
a block size smaller than a page size, we still need to hold the
ioend submission until after we have marked the current page dirty.
Hence we may need multiple ioends to be held while the current page
is completely mapped and made ready for IO dispatch. We cannot avoid
this problem - the current code already has this ioend chaining
within a page so we can mostly ignore that it occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:23:12 +11:00
Dave Chinner bfce7d2e2d xfs: factor mapping out of xfs_do_writepage
Separate out the bufferhead based mapping from the writepage code so
that we have a clear separation of the page operations and the
bufferhead state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:37 +11:00
Dave Chinner ad68972acb xfs: xfs_cluster_write is redundant
xfs_cluster_write() is not necessary now that xfs_vm_writepages()
aggregates writepage calls across a single mapping. This means we no
longer need to do page lookups in xfs_cluster_write, so writeback
only needs to look up th epage cache once per page being written.
This also removes a large amount of mostly duplicate code between
xfs_do_writepage() and xfs_convert_page().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:31 +11:00
Dave Chinner fbcc025613 xfs: Introduce writeback context for writepages
xfs_vm_writepages() calls generic_writepages to writeback a range of
a file, but then xfs_vm_writepage() clusters pages itself as it does
not have any context it can pass between->writepage calls from
__write_cache_pages().

Introduce a writeback context for xfs_vm_writepages() and call
__write_cache_pages directly with our own writepage callback so that
we can pass that context to each writepage invocation. This
encapsulates the current mapping, whether it is valid or not, the
current ioend and it's IO type and the ioend chain being built.

This requires us to move the ioend submission up to the level where
the writepage context is declared. This does mean we do not submit
IO until we packaged the entire writeback range, but with the block
plugging in the writepages call this is the way IO is submitted,
anyway.

It also means that we need to handle discontiguous page ranges.  If
the pages sent down by write_cache_pages to the writepage callback
are discontiguous, we need to detect this and put each discontiguous
page range into individual ioends. This is needed to ensure that the
ioend accurately represents the range of the file that it covers so
that file size updates during IO completion set the size correctly.
Failure to take into account the discontiguous ranges results in
files being too small when writeback patterns are non-sequential.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:19 +11:00
Dave Chinner 150d5be09c xfs: remove xfs_cancel_ioend
We currently have code to cancel ioends being built because we
change bufferhead state as we build the ioend. On error, this needs
to be unwound and so we have cancelling code that walks the buffers
on the ioend chain and undoes these state changes.

However, the IO submission path already handles state changes for
buffers when a submission error occurs, so we don't really need a
separate cancel function to do this - we can simply submit the
ioend chain with the specific error and it will be cancelled rather
than submitted.

Hence we can remove the explicit cancel code and just rely on
submission to deal with the error correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:21:12 +11:00
Dave Chinner 988ef92792 xfs: remove nonblocking mode from xfs_vm_writepage
Remove the nonblocking optimisation done for mapping lookups during
writeback. It's not clear that leaving a hole in the writeback range
just because we couldn't get a lock is really a win, as it makes us
do another small random IO later on rather than a large sequential
IO now.

As this gets in the way of sane error handling later on, just remove
for the moment and we can re-introduce an equivalent optimisation in
future if we see problems due to extent map lock contention.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-15 17:20:50 +11:00
Brian Foster af055e37a9 xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown
If the filesystem has shut down, xfs_end_io() currently sets an
error on the ioend and proceeds to ioend destruction. The ioend
might contain a truncate transaction if the I/O extended the size of
the file. This transaction is only cleaned up in
xfs_setfilesize_ioend(), however, which is skipped in this case.
This results in an xfs_log_ticket leak message when the associate
cache slab is destroyed (e.g., on rmmod).

This was originally reproduced by xfs/141 on a distro kernel. The
problem is reproducible on an upstream kernel, but not easily
detected in current upstream if the xfs_log_ticket cache happens to
be merged with another cache. This can be reproduced more
deterministically with the 'slab_nomerge' kernel boot option.

Update xfs_end_io() to proceed with normal end I/O processing after
an error is set on an ioend due to fs shutdown. The I/O type-based
processing is already designed to handle an I/O error and ensure
that the ioend is cleaned up correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 15:00:02 +11:00
Brian Foster 60630fe66e xfs: clean up unwritten buffers on write failure
The xfs_vm_write_failed() handler is currently responsible for cleaning
up any delalloc blocks over the range of a failed write beyond EOF.
Failure to do so results in warning messages and other inconsistencies
between buffer and extent state. The ->releasepage() handler currently
warns in the event of a page being released with either unwritten or
delalloc buffers, as neither is ever expected by the time a page is
released.

As has been reproduced by generic/083 on a -bsize=1k fs, it is currently
possible to trigger the ->releasepage() warning for a page with
unwritten buffers when a filesystem is near ENOSPC. This is reproduced
by the following sequence:

  $ mkfs.xfs -f -b size=1k -d size=100m <dev>
  $ mount <dev> /mnt/
  $
  $ xfs_io -fc "falloc -k 0 1k" /mnt/file
  $ dd if=/dev/zero of=/mnt/enospc conv=notrunc oflag=append
  $
  $ xfs_io -c "pwrite 512 1k" /mnt/file
  $ xfs_io -d -c "pwrite 16k 1k" /mnt/file

The first pwrite command attempts a block unaligned write across an
unwritten block and a hole. The delalloc for the hole fails with ENOSPC
and the subsequent error handling does not clean up the unwritten buffer
that was instantiated during the first ->get_block() call.

The second pwrite triggers a warning as part of the inode mapping
invalidation that occurs prior to direct I/O. The releasepage() handler
detects the unwritten buffer at this time, warns and prevents the
release of the page.

To deal with this problem, update xfs_vm_write_failed() to clean up
unwritten as well as delalloc buffers that are beyond EOF and within the
range of the failed write.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 15:00:02 +11:00
Christoph Hellwig c19b104a67 xfs: fold xfs_vm_do_dio into xfs_vm_direct_IO
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
Christoph Hellwig 273dda76f7 xfs: don't use ioends for direct write completions
We only need to communicate two bits of information to the direct I/O
completion handler:

 (1) do we need to convert any unwritten extents in the range
 (2) do we need to check if we need to update the inode size based
     on the range passed to the completion handler

We can use the private data passed to the get_block handler and the
completion handler as a simple bitmask to communicate this information
instead of the current complicated infrastructure reusing the ioends
from the buffer I/O path, and thus avoiding a memory allocation and
a context switch for any non-trivial direct write.  As a nice side
effect we also decouple the direct I/O path implementation from that
of the buffered I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2016-02-08 14:40:51 +11:00
Christoph Hellwig 187372a3b9 direct-io: always call ->end_io if non-NULL
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 14:40:51 +11:00
Dave Chinner 121e213eab xfs: add tracepoints to readpage calls
This allows us to see page cache driven readahead in action as it
passes through XFS. This helps to understand buffered read
throughput problems such as readahead IO IO sizes being too small
for the underlying device to reach max throughput.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-08 11:28:35 +11:00
Dave Chinner 264e89ad34 Merge branch 'xfs-dax-updates' into for-next 2015-11-03 13:28:41 +11:00
Dave Chinner 01a155e6cf xfs: DAX does not use IO completion callbacks
For DAX, we are now doing block zeroing during allocation. This
means we no longer need a special DAX fault IO completion callback
to do unwritten extent conversion. Because mmap never extends the
file size (it SEGVs the process) we don't need a callback to update
the file size, either. Hence we can remove the completion callbacks
from the __dax_fault and __dax_mkwrite calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:02 +11:00
Dave Chinner 1ca191576f xfs: Don't use unwritten extents for DAX
DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one faul will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:37:00 +11:00
Dave Chinner 3e12dbbdbd xfs: fix inode size update overflow in xfs_map_direct()
Both direct IO and DAX pass an offset and count into get_blocks that
will overflow a s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.

This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.

There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO patch allocation and dax fault path allocation to
avoid leaking ioend structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03 12:27:22 +11:00
Brian Foster 009c6e871e xfs: add missing ilock around dio write last extent alignment
The iomap codepath (via get_blocks()) acquires and release the inode
lock in the case of a direct write that requires block allocation. This
is because xfs_iomap_write_direct() allocates a transaction, which means
the ilock must be dropped and reacquired after the transaction is
allocated and reserved.

xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before
the transaction is created and thus before the ilock is reacquired. This
can lead to calls to xfs_iread_extents() and reads of the in-core extent
list without any synchronization (via xfs_bmap_eof() and
xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock
is not held, but this is not currently seen in practice as the current
callers had already invoked xfs_bmapi_read().

What has been seen in practice are reports of crashes down in the
xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer
references from xfs_iext_get_ext(). While an explicit reproducer is not
currently available to confirm the cause of the problem, crash analysis
and code inspection from David Jeffrey had identified the insufficient
locking.

xfs_iomap_eof_align_last_fsb() is called from other contexts with the
inode lock already held, so we cannot acquire it therein.
__xfs_get_blocks() acquires and drops the ilock with variable flags to
cover the event that the extent list must be read in. The common case is
that __xfs_get_blocks() acquires the shared ilock. To provide locking
around the last extent alignment call without adding more lock cycles to
the dio path, update xfs_iomap_write_direct() to expect the shared ilock
held on entry and do the extent alignment under its protection. Demote
the lock, if necessary, from __xfs_get_blocks() and push the
xfs_qm_dqattach() call outside of the shared lock critical section.
Also, add an assert to document that the extent list is always expected
to be present in this path. Otherwise, we risk a call to
xfs_iread_extents() while under the shared ilock. This is safe as all
current callers have executed an xfs_bmapi_read() call under the current
iolock context.

Reported-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:34:20 +11:00
Zhaohongjiang 5cb13dcd0f cancel the setfilesize transation when io error happen
When I ran xfstest/073 case, the remount process was blocked to wait
transactions to be zero. I found there was a io error happened, and
the setfilesize transaction was not released properly. We should add
the changes to cancel the io error in this case.

Reproduction steps:
1. dd if=/dev/zero of=xfs1.img bs=1M count=2048
2. mkfs.xfs xfs1.img
3. losetup -f ./xfs1.img /dev/loop0
4. mount -t xfs /dev/loop0 /home/test_dir/
5. mkdir /home/test_dir/test
6. mkfs.xfs -dfile,name=image,size=2g
7. mount -t xfs -o loop image /home/test_dir/test
8. cp a file bigger than 2g to /home/test_dir/test
9. mount -t xfs -o remount,ro /home/test_dir/test

[ dchinner: moved io error detection to xfs_setfilesize_ioend() after
  transaction context restoration. ]

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12 15:28:39 +11:00
Linus Torvalds 77a78806c7 xfs: updates for 4.3-rc1
This update contains:
 o large rework of EFI/EFD lifecycle handling to fix log recovery corruption
   issues, crashes and unmount hangs
 o separate metadata UUID on disk to enable changing boot label UUID for v5
   filesystems
 o fixes for gcc miscompilation on certain platforms and optimisation levels
 o remote attribute allocation and recovery corruption fixes
 o inode lockdep annotation rework to fix bugs with too many subclasses
 o directory inode locking changes to prevent lockdep false positives
 o a handful of minor corruption fixes
 o various other small cleanups and bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJV7VlyAAoJEK3oKUf0dfodLDAQAMTYAERZGp8sI1ZZo9qTtMis
 HE3X7X1jpA2/CrSlsQtw5FEahl9NDoVZKInEFzpDeFogloOwLy+aNz6F9s6SQvSO
 p7r+Tkv8k5WCWCpYhm6N5yVSwMRkCVBJ9+DsxKeabQaNobu2nBRYWA7RcTPbwhL6
 eZZ41NT/1x4Di3MppjRcSHMRxq+DOYsoTj7ey2tB3jFK4w99pfhBqhMsxOMCyThQ
 g61Rj/IIwbUKWDZNBP1vdG9y8eN9xEan7+uQRJYpwjrdPAXeZMg9J0U5dIoZXmOA
 o7UDvyhxZP06vZGG52rMCMWl5kWbEyFGAa/bzh+L+O3/5DZAdoJQxZUF00AsLaxQ
 2kQ4L8vUEuGvPpUcFFopSjvpJmjmdg4O8KCkxKp4bcONA+2Z108e68zVxffnQPgm
 0d2msqRRHCVRSw+o52Nf8R1A29cjhShyxBq4Xw154zrK2lJNwWWx36LG+XgrW2R6
 CHXj2OoMvQZIJWpbhwZqJCcl1dmhcjES082Wvb+RyKvvcQzerOjb5p2R7uqwXVg+
 uR27KstQ3tJ3J+hmq2FwhB7E2GMnvYDL9qt+3RgMIJrM7rxAOB0b/QS+yO9hzgQH
 /I0KzyX72Lcwwxqd0aWLqlqoIWfn44eBK+V2vdXFRNTeWu3kDEW9q0JRQjxVBsFt
 /SMKetOh+gj7yAs+kgOh
 =Eikc
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs updates from Dave Chinner:
 "There isn't a whole lot to this update - it's mostly bug fixes and
  they are spread pretty much all over XFS.  There are some corruption
  fixes, some fixes for log recovery, some fixes that prevent unount
  from hanging, a lockdep annotation rework for inode locking to prevent
  false positives and the usual random bunch of cleanups and minor
  improvements.

  Deatils:

   - large rework of EFI/EFD lifecycle handling to fix log recovery
     corruption issues, crashes and unmount hangs

   - separate metadata UUID on disk to enable changing boot label UUID
     for v5 filesystems

   - fixes for gcc miscompilation on certain platforms and optimisation
     levels

   - remote attribute allocation and recovery corruption fixes

   - inode lockdep annotation rework to fix bugs with too many
     subclasses

   - directory inode locking changes to prevent lockdep false positives

   - a handful of minor corruption fixes

   - various other small cleanups and bug fixes"

* tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits)
  xfs: fix error gotos in xfs_setattr_nonsize
  xfs: add mssing inode cache attempts counter increment
  xfs: return errors from partial I/O failures to files
  libxfs: bad magic number should set da block buffer error
  xfs: fix non-debug build warnings
  xfs: collapse allocsize and biosize mount option handling
  xfs: Fix file type directory corruption for btree directories
  xfs: lockdep annotations throw warnings on non-debug builds
  xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()
  xfs: inode lockdep annotations broke non-lockdep build
  xfs: flush entire file on dio read/write to cached file
  xfs: Fix xfs_attr_leafblock definition
  libxfs: readahead of dir3 data blocks should use the read verifier
  xfs: stop holding ILOCK over filldir callbacks
  xfs: clean up inode lockdep annotations
  xfs: swap leaf buffer into path struct atomically during path shift
  xfs: relocate sparse inode mount warning
  xfs: dquots should be stamped with sb_meta_uuid
  xfs: log recovery needs to validate against sb_meta_uuid
  xfs: growfs not aware of sb_meta_uuid
  ...
2015-09-07 13:28:32 -07:00
Linus Torvalds 7d9071a095 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this one:

   - d_move fixes (Eric Biederman)

   - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
     handling ought to be fixed)

   - switch of sb_writers to percpu rwsem (Oleg Nesterov)

   - superblock scalability (Josef Bacik and Dave Chinner)

   - swapon(2) race fix (Hugh Dickins)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
  vfs: Test for and handle paths that are unreachable from their mnt_root
  dcache: Reduce the scope of i_lock in d_splice_alias
  dcache: Handle escaped paths in prepend_path
  mm: fix potential data race in SyS_swapon
  inode: don't softlockup when evicting inodes
  inode: rename i_wb_list to i_io_list
  sync: serialise per-superblock sync operations
  inode: convert inode_sb_list_lock to per-sb
  inode: add hlist_fake to avoid the inode hash lock in evict
  writeback: plug writeback at a high level
  change sb_writers to use percpu_rw_semaphore
  shift percpu_counter_destroy() into destroy_super_work()
  percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  percpu-rwsem: introduce percpu_down_read_trylock()
  document rwsem_release() in sb_wait_write()
  fix the broken lockdep logic in __sb_start_write()
  introduce __sb_writers_{acquired,release}() helpers
  ufs_inode_get{frag,block}(): get rid of 'phys' argument
  ufs_getfrag_block(): tidy up a bit
  ...
2015-09-05 20:34:28 -07:00
David Jeffery c9eb256eda xfs: return errors from partial I/O failures to files
There is an issue with xfs's error reporting in some cases of I/O partially
failing and partially succeeding. Calls like fsync() can report success even
though not all I/O was successful in partial-failure cases such as one disk of
a RAID0 array being offline.

The issue can occur when there are more than one bio per xfs_ioend struct.
Each call to xfs_end_bio() for a bio completing will write a value to
ioend->io_error.  If a successful bio completes after any failed bio, no
error is reported do to it writing 0 over the error code set by any failed bio.
The I/O error information is now lost and when the ioend is completed
only success is reported back up the filesystem stack.

xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE
being clear.  ioend->io_error is initialized to 0 at allocation so only needs
to be updated by a failed bio. Also check that ioend->io_error is 0 so that
the first error reported will be the error code returned.

Cc: stable@vger.kernel.org
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-28 14:50:45 +10:00
Oleg Nesterov bee9182d95 introduce __sb_writers_{acquired,release}() helpers
Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jan Kara <jack@suse.com>
2015-08-15 13:52:08 +02:00
Kent Overstreet b54ffb73ca block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.

Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:32:04 -06:00
Christoph Hellwig 4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Linus Torvalds 68b4449d79 xfs: update for 4.2-rc1
This update contains:
 
 o A new sparse on-disk inode record format to allow small extents to
   be used for inode allocation when free space is fragmented.
 o DAX support. This includes minor changes to the DAX core code to
   fix problems with lock ordering and bufferhead mapping abuse.
 o transaction commit interface cleanup
 o removal of various unnecessary XFS specific type definitions
 o cleanup and optimisation of freelist preparation before allocation
 o various minor cleanups
 o bug fixes for
 	- transaction reservation leaks
 	- incorrect inode logging in unwritten extent conversion
 	- mmap lock vs freeze ordering
 	- remote symlink mishandling
 	- attribute fork removal issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJVkhI0AAoJEK3oKUf0dfod45MQAJCOEkNduBdlfPvTCMPjj/7z
 vzcfDdzgKwhpPTMXSDRvw4zDPt3C2FLMBJqxtPpC4sKGKG/8G0kFvw8bDtBag1m9
 ru5nI5LaQ6LC5RcU40zxBx1s/L8qYvyfUlxeoOT5lSwN9c6ENGOCQ3bUk4pSKaee
 pWDplag9LbfQomW2GHtxd8agMUZEYx0R1vgfv88V8xgPka8CvQo81XUgkb4PcDZV
 ugR+wDUsvwMS01aLYBmRFkMXuExNuCJVwtvdTJS+ZWGHzyTpulFoANUW6QT24gAM
 eP4yRXN4bv9vXrXpg8JkF25DHsfw4HBwNEL17ZvoB8t3oJp1/NYaH8ce1jS0+I8i
 NCtaO+qUqDSTGQZKgmeDPwCciQp54ra9LEdmIJFxpZxiBof9g/tIYEFgRklyFLwR
 GZU6Io6VpBa1oTGlC4D1cmG6bdcnhMB9MGVVCbqnB5mRRDKCmVgCyJwusd1pi7Re
 G4O6KkFt21O7+fP13VsjP57KoaJzsIgZ/+H3Ff/fJOJ33AKYTRCmwi8+IMi2n5JI
 zz+V0AIBQZAx9dlVyENnxufh9eJYcnwta0lUSLCCo91fZKxbo3ktK1kVHNZP5EGs
 IMFM1Ka6hibY20rWlR3GH0dfyP5/yNcvNgTMYPKjj9SVjTar1aSfF2rGpkqYXYyH
 D4FICbtDgtOc2ClfpI2k
 =3x+W
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pul xfs updates from Dave Chinner:
 "There's a couple of small API changes to the core DAX code which
  required small changes to the ext2 and ext4 code bases, but otherwise
  everything is within the XFS codebase.

  This update contains:

   - A new sparse on-disk inode record format to allow small extents to
     be used for inode allocation when free space is fragmented.

   - DAX support.  This includes minor changes to the DAX core code to
     fix problems with lock ordering and bufferhead mapping abuse.

   - transaction commit interface cleanup

   - removal of various unnecessary XFS specific type definitions

   - cleanup and optimisation of freelist preparation before allocation

   - various minor cleanups

   - bug fixes for
	- transaction reservation leaks
	- incorrect inode logging in unwritten extent conversion
	- mmap lock vs freeze ordering
	- remote symlink mishandling
	- attribute fork removal issues"

* tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
  xfs: don't truncate attribute extents if no extents exist
  xfs: clean up XFS_MIN_FREELIST macros
  xfs: sanitise error handling in xfs_alloc_fix_freelist
  xfs: factor out free space extent length check
  xfs: xfs_alloc_fix_freelist() can use incore perag structures
  xfs: remove xfs_caddr_t
  xfs: use void pointers in log validation helpers
  xfs: return a void pointer from xfs_buf_offset
  xfs: remove inst_t
  xfs: remove __psint_t and __psunsigned_t
  xfs: fix remote symlinks on V5/CRC filesystems
  xfs: fix xfs_log_done interface
  xfs: saner xfs_trans_commit interface
  xfs: remove the flags argument to xfs_trans_cancel
  xfs: pass a boolean flag to xfs_trans_free_items
  xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
  xfs: check min blks for random debug mode sparse allocations
  xfs: fix sparse inodes 32-bit compile failure
  xfs: add initial DAX support
  xfs: add DAX IO path support
  ...
2015-06-30 20:16:08 -07:00
Dave Chinner 4ea7976616 Merge branch 'xfs-commit-cleanup' into for-next
Conflicts:
	fs/xfs/xfs_attr_inactive.c
2015-06-04 13:55:48 +10:00
Christoph Hellwig 70393313dd xfs: saner xfs_trans_commit interface
The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES.  So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 13:48:08 +10:00
Christoph Hellwig 4906e21545 xfs: remove the flags argument to xfs_trans_cancel
xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
state, and can be deducted:

 - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
   and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
 - any transaction with a permanent log reservation needs
   XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
   XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
   log reservation is invalid.

So just remove the flags argument and do the right thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 13:47:56 +10:00
Dave Chinner 6e1ba0bcb8 xfs: add DAX IO path support
DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path.  Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:19:15 +10:00
Dave Chinner 6b698edeee xfs: add DAX file operations support
Add the initial support for DAX file operations to XFS. This
includes the necessary block allocation and mmap page fault hooks
for DAX to function.

Note that there are changes to the splice interfaces to ensure that
for DAX splice avoids direct page cache manipulations and instead
takes the DAX IO paths for read/write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:18:53 +10:00
Greg Thelen c4843a7593 memcg: add per cgroup dirty page accounting
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
per memcg memory.stat cgroupfs file.  The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback.  It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter.  The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg  task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by for mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                            +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                            +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952.  Lower is better for
all metrics, they're all wall clock or cycle counts.  The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                            baseline                              patched
  kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
  dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
  dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
  read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
  write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                            baseline                              patched
  kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
  dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
  dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
  read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
  write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                            baseline                              patched
  kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
  dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
  dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
  dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
  read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
  write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:33 -06:00
Jens Axboe dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Linus Torvalds 1aef882f02 xfs: update for 4.1-rc1
This update contains:
 o RENAME_WHITEOUT support
 o conversion of per-cpu superblock accounting to use generic counters
 o new inode mmap lock so that we can lock page faults out of truncate, hole
   punch and other direct extent manipulation functions to avoid racing mmap
   writes from causing data corruption
 o rework of direct IO submission and completion to solve data corruption issue
   when running concurrent extending DIO writes. Also solves problem of running
   IO completion transactions in interrupt context during size extending AIO
   writes.
 o FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct
   extent manipulation to avoid needing to copy data within the file
 o attribute block header field overflow fix for 64k block size filesystems
 o Lots of changes to log messaging to be more informative and concise when
   errors occur. Also prevent a lot of unnecessary log spamming due to cascading
   failures in error conditions.
 o lots of cleanups and bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJVOE8oAAoJEK3oKUf0dfodx1kQAIIH8CwqcBrIslOntfHlFPHz
 P9aQl5uiI6JcnFqMiHG6mfnjWGpn+Z6XMDGIBwrSTzHj8IEnHTeXqYiS6SDPAnrH
 +VmlJEvW01ucAv7vcXKPrfutcc8dxLpy4fs63HOWmXh4rmrTcpel5S+0JSQxyGd6
 OriLg1nfD4Sid7R9CFEXAKLghJFK+gbao2CmT0wo6ZrTwiZl2p62Y187ou+d+u3k
 BRol99pI/Sp9bKpWZpUv3q2RnfD1v/k4oDP/JG4Ohdt2dx+nDqCjLvL8B5hJu74B
 ZI+R+N28sAkMmbtR61kk06F7MS9RZqzBNIZalugaSuspKoenDZzmURZX+i77ogPQ
 Ii3XLUMUzdwmi55/tBhpI7VkpFxahaEbWzTT1sMBh/Ka3GXO56BMIYTPvntjoN4w
 ElcbFAMAZl8O56ruGBnc/k72CfFbq8qp93KkOfBGIKwwiPN+eCK8bQYL4G3sIZzx
 f6k/WLbbShyViX9qoWLiX7qUfvh0NU/EcmGcJBsTmn0NFNOP4WmuojAq6SrvTgEz
 No6zYJtnJvEPDa/v5A0dZyYfLqz2cTkEyTM9uwSixcCa1qAS+8IBcCGgTKfQOYkV
 hCUWugiHwj4OQ/6WgP6oYLtIYdw6gqXgUKZy1Iy+ThDRwLbg9emYWixQTi4GAuRO
 2SEBbFGSk7KIpoPENDUC
 =WE6f
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs update from Dave Chinner:
 "This update contains:

   - RENAME_WHITEOUT support

   - conversion of per-cpu superblock accounting to use generic counters

   - new inode mmap lock so that we can lock page faults out of
     truncate, hole punch and other direct extent manipulation functions
     to avoid racing mmap writes from causing data corruption

   - rework of direct IO submission and completion to solve data
     corruption issue when running concurrent extending DIO writes.
     Also solves problem of running IO completion transactions in
     interrupt context during size extending AIO writes.

   - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via
     direct extent manipulation to avoid needing to copy data within the
     file

   - attribute block header field overflow fix for 64k block size
     filesystems

   - Lots of changes to log messaging to be more informative and concise
     when errors occur.  Also prevent a lot of unnecessary log spamming
     due to cascading failures in error conditions.

   - lots of cleanups and bug fixes

  One thing of note is the direct IO fixes that we merged last week
  after the window opened.  Even though a little late, they fix a user
  reported data corruption and have been pretty well tested.  I figured
  there was not much point waiting another 2 weeks for -rc1 to be
  released just so I could send them to you..."

* tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
  xfs: using generic_file_direct_write() is unnecessary
  xfs: direct IO EOF zeroing needs to drain AIO
  xfs: DIO write completion size updates race
  xfs: DIO writes within EOF don't need an ioend
  xfs: handle DIO overwrite EOF update completion correctly
  xfs: DIO needs an ioend for writes
  xfs: move DIO mapping size calculation
  xfs: factor DIO write mapping from get_blocks
  xfs: unlock i_mutex in xfs_break_layouts
  xfs: kill unnecessary firstused overflow check on attr3 leaf removal
  xfs: use larger in-core attr firstused field and detect overflow
  xfs: pass attr geometry to attr leaf header conversion functions
  xfs: disallow ro->rw remount on norecovery mount
  xfs: xfs_shift_file_space can be static
  xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate
  fs: Add support FALLOC_FL_INSERT_RANGE for fallocate
  xfs: Fix incorrect positive ENOMEM return
  xfs: xfs_mru_cache_insert() should use GFP_NOFS
  xfs: %pF is only for function pointers
  xfs: fix shadow warning in xfs_da3_root_split()
  ...
2015-04-24 07:08:41 -07:00
Dave Chinner b9d59846f7 xfs: DIO write completion size updates race
xfs_end_io_direct_write() can race with other IO completions when
updating the in-core inode size. The IO completion processing is not
serialised for direct IO - they are done either under the
IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all
during AIO DIO completion. Hence the non-atomic test-and-set update
of the in-core inode size is racy and can result in the in-core
inode size going backwards if the race if hit just right.

If the inode size goes backwards, this can trigger the EOF zeroing
code to run incorrectly on the next IO, which then will zero data
that has successfully been written to disk by a previous DIO.

To fix this bug, we need to serialise the test/set updates of the
in-core inode size. This first patch introduces locking around the
relevant updates and checks in the DIO path. Because we now have an
ioend in xfs_end_io_direct_write(), we know exactly then we are
doing an IO that requires an in-core EOF update, and we know that
they are not running in interrupt context. As such, we do not need to
use irqsave() spinlock variants to protect against interrupts while
the lock is held.

Hence we can use an existing spinlock in the inode to do this
serialisation and so not need to grow the struct xfs_inode just to
work around this problem.

This patch does not address the test/set EOF update in
generic_file_write_direct() for various reasons - that will be done
as a followup with separate explanation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 22:03:07 +10:00
Dave Chinner a06c277a13 xfs: DIO writes within EOF don't need an ioend
DIO writes that lie entirely within EOF have nothing to do in IO
completion. In this case, we don't need no steekin' ioend, and so we
can avoid allocating an ioend until we have a mapping that spans
EOF.

This means that IO completion has two contexts - deferred completion
to the dio workqueue that uses an ioend, and interrupt completion
that does nothing because there is nothing that can be done in this
context.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 22:00:00 +10:00
Dave Chinner 6dfa1b67e3 xfs: handle DIO overwrite EOF update completion correctly
Currently a DIO overwrite that extends the EOF (e.g sub-block IO or
write into allocated blocks beyond EOF) requires a transaction for
the EOF update. Thi is done in IO completion context, but we aren't
explicitly handling this situation properly and so it can run in
interrupt context. Ensure that we defer IO that spans EOF correctly
to the DIO completion workqueue, and now that we have an ioend in IO
completion we can use the common ioend completion path to do all the
work.

Note: we do not preallocate the append transaction as we can have
multiple mapping and allocation calls per direct IO. hence
preallocating can still leave us with nested transactions by
attempting to map and allocate more blocks after we've preallocated
an append transaction.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 21:59:34 +10:00
Dave Chinner d5cc2e3f96 xfs: DIO needs an ioend for writes
Currently we can only tell DIO completion that an IO requires
unwritten extent completion. This is done by a hacky non-null
private pointer passed to Io completion, but the private pointer
does not actually contain any information that is used.

We also need to pass to IO completion the fact that the IO may be
beyond EOF and so a size update transaction needs to be done. This
is currently determined by checks in the io completion, but we need
to determine if this is necessary at block mapping time as we need
to defer the size update transactions to a completion workqueue,
just like unwritten extent conversion.

To do this, first we need to allocate and pass an ioend to to IO
completion. Add this for unwritten extent conversion; we'll do the
EOF updates in the next commit.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 21:59:07 +10:00
Dave Chinner 1fdca9c211 xfs: move DIO mapping size calculation
The mapping size calculation is done last in __xfs_get_blocks(), but
we are going to need the actual mapping size we will use to map the
direct IO correctly in xfs_map_direct(). Factor out the calculation
for code clarity, and move the call to be the first operation in
mapping the extent to the returned buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 21:58:21 +10:00
Dave Chinner a719370be5 xfs: factor DIO write mapping from get_blocks
Clarify and separate the buffer mapping logic so that the direct IO mapping is
not tangled up in propagating the extent status to teh mapping buffer. This
makes it easier to extend the direct IO mapping to use an ioend in future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-04-16 21:57:48 +10:00
Omar Sandoval 22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval 6f67376318 direct_IO: use iov_iter_rw() instead of rw everywhere
The rw parameter to direct_IO is redundant with iov_iter->type, and
treated slightly differently just about everywhere it's used: some users
do rw & WRITE, and others do rw == WRITE where they should be doing a
bitwise check. Simplify this with the new iov_iter_rw() helper, which
always returns either READ or WRITE.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval 17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Christoph Hellwig e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Christoph Hellwig 2ba6623702 xfs: don't allocate an ioend for direct I/O completions
Back in the days when the direct I/O ->end_io callback could be called
from interrupt context for AIO we needed a structure to hand off to the
workqueue, and reused the ioend structure for this purpose.  These days
->end_io is always called from user or workqueue context, which allows us
to avoid this memory allocation and simplify the code significantly.

[dchinner: removed now unused xfs_finish_ioend_sync() function after
	   Brian Foster did an initial review. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-02 10:02:09 +11:00
Christoph Hellwig bb58e6188a xfs: move most of xfs_sb.h to xfs_format.h
More on-disk format consolidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:09 +11:00
Christoph Hellwig 4fb6e8ade2 xfs: merge xfs_ag.h into xfs_format.h
More on-disk format consolidation.  A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:25:04 +11:00
Christoph Hellwig 6d3ebaae7c xfs: merge xfs_dinode.h into xfs_format.h
More consolidatation for the on-disk format defintions.  Note that the
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
to the on disk format, but depends on a CONFIG_ option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:06 +11:00
Dave Chinner 6889e783cd Merge branch 'xfs-misc-fixes-for-3.18-3' into for-next 2014-10-13 10:22:45 +11:00
Brian Foster 07d08681d2 xfs: restore buffer_head unwritten bit on ioend cancel
xfs_vm_writepage() walks each buffer_head on the page, maps to the block
on disk and attaches to a running ioend structure that represents the
I/O submission. A new ioend is created when the type of I/O (unwritten,
delayed allocation or overwrite) required for a particular buffer_head
differs from the previous. If a buffer_head is a delalloc or unwritten
buffer, the associated bits are cleared by xfs_map_at_offset() once the
buffer_head is added to the ioend.

The process of mapping each buffer_head occurs in xfs_map_blocks() and
acquires the ilock in blocking or non-blocking mode, depending on the
type of writeback in progress. If the lock cannot be acquired for
non-blocking writeback, we cancel the ioend, redirty the page and
return. Writeback will revisit the page at some later point.

Note that we acquire the ilock for each buffer on the page. Therefore
during non-blocking writeback, it is possible to add an unwritten buffer
to the ioend, clear the unwritten state, fail to acquire the ilock when
mapping a subsequent buffer and cancel the ioend. If this occurs, the
unwritten status of the buffer sitting in the ioend has been lost. The
page will eventually hit writeback again, but xfs_vm_writepage() submits
overwrite I/O instead of unwritten I/O and does not perform unwritten
extent conversion at I/O completion. This leads to data corruption
because unwritten extents are treated as holes on reads and zeroes are
returned instead of reading from disk.

Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
occurs once the page is eventually written back.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:42:06 +10:00
Dave Chinner 0d085a529b xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.

This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.

cc: stable@vger.kernel.org # dependent on 1c8349a171
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:36:27 +10:00
Dave Chinner 22e757a49c xfs: don't dirty buffers beyond EOF
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:51 +10:00
Dave Chinner 2451337dd0 xfs: global error sign conversion
Convert all the errors the core XFs code to negative error signs
like the rest of the kernel and remove all the sign conversion we
do in the interface layers.

Errors for conversion (and comparison) found via searches like:

$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs

Negation points found via searches like:

$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs

[ with some bits I missed from Brian Foster ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:58:08 +10:00
Eric Sandeen b474c7ae43 xfs: Nuke XFS_ERROR macro
XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.

Just nuke XFS_ERROR and associated bits.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:04:54 +10:00
Linus Torvalds 16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Dave Chinner 7691283d05 Merge branch 'xfs-misc-fixes-3-for-3.16' into for-next 2014-06-10 07:32:56 +10:00
Christoph Hellwig 448011e2ab xfs: tone down writepage/releasepage WARN_ONs
I recently ran into the issue fixed by

  "xfs: kill buffers over failed write ranges properly"

which spams the log with lots of backtraces.  Make debugging any
issues like that easier by using WARN_ON_ONCE in the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:05:15 +10:00
Jie Liu 8695d27ec3 xfs: fix infinite loop at xfs_vm_writepage on 32bit system
Write to a file with an offset greater than 16TB on 32-bit system and
then trigger page write-back via sync(1) will cause task hang.

# block_size=4096
# offset=$(((2**32 - 1) * $block_size))
# xfs_io -f -c "pwrite $offset $block_size" /storage/test_file
# sync

INFO: task sync:2590 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync            D c1064a28     0  2590   2097 0x00000000
.....
Call Trace:
[<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
[<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
[<c1066dbf>] ? wake_up_process+0x1f/0x40
[<c104fc2e>] ? wake_up_worker+0x1e/0x30
[<c15b6083>] schedule+0x23/0x60
[<c15b3c2d>] schedule_timeout+0x18d/0x1f0
[<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
[<c10515f1>] ? __queue_delayed_work+0x91/0x150
[<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
[<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
[<c15b5b5d>] wait_for_completion+0x7d/0xc0
[<c1066d60>] ? try_to_wake_up+0x220/0x220
[<c116a4d2>] sync_inodes_sb+0x92/0x180
[<c116fb05>] sync_inodes_one_sb+0x15/0x20
[<c114a8f8>] iterate_supers+0xb8/0xc0
[<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
[<c116fc21>] sys_sync+0x31/0x80
[<c15be18d>] sysenter_do_call+0x12/0x28

This issue can be triggered via xfstests/generic/308.

The reason is that the end_index is unsigned long with maximum value
'2^32-1=4294967295' on 32-bit platform, and the given offset cause it
wrapped to 0, so that the following codes will repeat again and again
until the task schedule time out:

end_index = offset >> PAGE_CACHE_SHIFT;
last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
if (page->index >= end_index) {
	unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
        /*
         * Just skip the page if it is fully outside i_size, e.g. due
         * to a truncate operation that is in progress.
         */
        if (page->index >= end_index + 1 || offset_into_page == 0) {
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		unlock_page(page);
		return 0;
	}

In order to check if a page is fully outsids i_size or not, we can fix
the code logic as below:
	if (page->index > end_index ||
	    (page->index == end_index && offset_into_page == 0))

Secondly, there still has another similar issue when calculating the
end offset for mapping the filesystem blocks to the file blocks for
delalloc.  With the same tests to above, run unmount(8) will cause
kernel panic if CONFIG_XFS_DEBUG is enabled:

XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
	ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964

kernel BUG at fs/xfs/xfs_message.c:108!
invalid opcode: 0000 [#1] SMP
task: edddc100 ti: ec6ee000 task.ti: ec6ee000
EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
EIP is at assfail+0x2b/0x30 [xfs]
..............
Call Trace:
[<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
[<c115ddf1>] destroy_inode+0x31/0x50
[<c115deff>] evict+0xef/0x170
[<c115dfb2>] dispose_list+0x32/0x40
[<c115ea3a>] evict_inodes+0xca/0xe0
[<c1149706>] generic_shutdown_super+0x46/0xd0
[<c11497b9>] kill_block_super+0x29/0x70
[<c1149a14>] deactivate_locked_super+0x44/0x70
[<c114a427>] deactivate_super+0x47/0x60
[<c1161c3d>] mntput_no_expire+0xcd/0x120
[<c1162ae8>] SyS_umount+0xa8/0x370
[<c1162dce>] SyS_oldumount+0x1e/0x20
[<c15be18d>] sysenter_do_call+0x12/0x28

That because the end_offset is evaluated to 0 which is the same reason
to above, hence the mapping and covertion for dealloc file blocks to
file system blocks did not happened.

This patch just fixed both issues.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:24:26 +10:00
Al Viro 31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Dave Chinner 0e1f789d0d xfs: don't map ranges that span EOF for direct IO
Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. If is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it as buffer-new() so that the direct IO code does not zero
the tail of the block at the new EOF. This is a long standing bug
that has been around for many, many years.

Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes he direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.

Fix this by detecting a mapping that spans EOF and truncate it down
to EOF. This results in the the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF which XFS treats correctly
by setting buffer_new() on it. This makes direct Io behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.

Again, thanks to Al Viro for finding what I couldn't.

[ dchinner: Fix for __divdi3 build error:

	Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Signed-off-by: Mark Tinguely <tinguely@sgi.com>
	Reviewed-by: Eric Sandeen <sandeen@redhat.com>
]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:19 +10:00
Dave Chinner aad3f3755e xfs: xfs_vm_write_end truncates too much on failure
Similar to the write_begin problem, xfs-vm_write_end will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:14:11 +10:00
Dave Chinner 72ab70a19b xfs: write failure beyond EOF truncates too much data
If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.

Fix it by only truncating away the exact range we are updating state
for in this write_begin call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:13:29 +10:00