linux

Commit Graph

Author	SHA1	Message	Date
David Sterba	3d4b9496e8	btrfs: remove unused parameter from update_nr_written The logic has been updated in "Btrfs: make mapping->writeback_index point to the last written page" (`a91326679f`) and page is not needed anymore. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	c2df8bb43f	btrfs: remove unused parameter from submit_extent_page This used to hold number of maximum pages to allocate, but this is now limited by BIO_MAX_PAGES. The local are now unused and removed as well. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	53d3235995	btrfs: embed extent_changeset::range_changed to the structure We can embed range_changed to the extent changeset to address following problems: - no need to allocate ulist dynamically, we also get rid of the GFP_NOFS for free - fix lack of allocation failure checking in btrfs_qgroup_reserve_data The stack consuption where extent_changeset is used slightly increases: before: 16 after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40 Which is bearable. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
Takafumi Kubota	fe01aa6538	Btrfs: add another missing end_page_writeback on submit_extent_page failure If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns without clearing the writeback bit of the failed page. __extent_writepage_io, that is a caller of submit_extent_page, does not clear the remaining writeback bit anywhere. As a result, this will cause the hang at filemap_fdatawait_range, because it waits the writeback bit to be cleared from the failed page. So, we have to call end_page_writeback to clear the writeback bit. For reproducing the hang, we inject a fault like if (should_failtest()) { // I define should_failtest() bio = NULL; } else { bio = btrfs_bio_alloc(...); } in submit_extent_page. We should also check whether page has the bit before end_page_writeback, to avoid the conflict against the other end_page_writeback in bio_endio. Thus, we add PageWriteback checks not only in __extent_writepage_io, but also in write_one_eb too, because it misses the check. Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:47 +01:00
Liu Bo	76c0021db8	Btrfs: use helper to simplify lock/unlock pages Since we have a helper to set page bits, let lock_delalloc_pages and __unlock_for_delalloc use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:47 +01:00
Liu Bo	da2c7009f6	btrfs: teach __process_pages_contig about PAGE_LOCK operation Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changes to the helper separated from the following patch ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:35 +01:00
Liu Bo	873695b301	Btrfs: create helper for processing bits on contiguous pages This introduces a new helper which can be used to process pages bits. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:51:00 +01:00
Liu Bo	bcf934894f	Btrfs: cleanup unused cached_state in __extent_writepage_io @cached_state is no more required in __extent_writepage_io, also remove the goto label. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
David Sterba	f85b7379cd	btrfs: fix over-80 lines introduced by previous cleanups This goes as a separate patch because fixing that inside the patches caused too many many conflicts. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00
Nikolay Borisov	4a0cc7ca6c	btrfs: Make btrfs_ino take a struct btrfs_inode Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:51 +01:00
Michal Hocko	1aceabf362	btrfs: drop gfp mask tweaking in try_release_extent_state try_release_extent_state reduces the gfp mask to GFP_NOFS if it is compatible. This is true for GFP_KERNEL as well. There is no real reason to do that though. There is no new lock taken down the the only consumer of the gfp mask which is try_release_extent_state clear_extent_bit __clear_extent_bit alloc_extent_state So this seems just unnecessary and confusing. Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:49 +01:00
Michal Hocko	3ba7ab220e	btrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage `b335b0034e` ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator") has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just to prevent from giving an unappropriate gfp mask to the slab allocator deeper down the callchain (in alloc_extent_state). This is wrong for two reasons a) GFP_NOFS might be just too restrictive for the calling context b) it is better to tweak the gfp mask down when it needs that. So just remove the mask tweaking from btrfs_releasepage and move it down to alloc_extent_state where it is needed. Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:49 +01:00
Linus Torvalds	087a76d390	Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in here, and Christoph pitched in corrections/improvements to make btrfs use proper helpers for bio walking instead of doing it by hand. There are some key fixes as well, including some long standing bugs that took forever to track down in btrfs_drop_extents and during balance" * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits) btrfs: limit async_work allocation and worker func duration Revert "Btrfs: adjust len of writes if following a preallocated extent" Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors btrfs: opencode chunk locking, remove helpers btrfs: remove root parameter from transaction commit/end routines btrfs: split btrfs_wait_marked_extents into normal and tree log functions btrfs: take an fs_info directly when the root is not used otherwise btrfs: simplify btrfs_wait_cache_io prototype btrfs: convert extent-tree tracepoints to use fs_info btrfs: root->fs_info cleanup, access fs_info->delayed_root directly btrfs: root->fs_info cleanup, add fs_info convenience variables btrfs: root->fs_info cleanup, update_block_group{,flags} btrfs: root->fs_info cleanup, lock/unlock_chunks btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size btrfs: pull node/sector/stripe sizes out of root and into fs_info btrfs: root->fs_info cleanup, io_ctl_init btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere btrfs: struct reada_control.root -> reada_control.fs_info btrfs: struct btrfsic_state->root should be an fs_info btrfs: alloc_reserved_file_extent trace point should use extent_root ...	2016-12-16 10:53:01 -08:00
Linus Torvalds	36869cb93d	Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...	2016-12-13 10:19:16 -08:00
Jeff Mahoney	3a45bb207e	btrfs: remove root parameter from transaction commit/end routines Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:07:00 +01:00
Jeff Mahoney	2ff7e61e0d	btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	0b246afa62	btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	da17066c40	btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
David Sterba	58e8012cc1	btrfs: add optimized version of eb to eb copy Using copy_extent_buffer is suitable for copying betwenn buffers from an arbitrary offset and deals with page boundaries. This is not necessary when doing a full extent_buffer-to-extent_buffer copy. We can utilize the copy_page helper as well. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:17 +01:00
David Sterba	b159fa2808	btrfs: remove constant parameter to memset_extent_buffer and rename it The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:17 +01:00
David Sterba	fba1acf9ff	btrfs: use specialized page copying helpers in btrfs_clone_extent_buffer The copy_page is usually optimized and can be faster than memcpy. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:17 +01:00
David Sterba	f157bf765b	btrfs: introduce helpers for updating eb uuids The fsid and chunk tree uuid are always located in the first page, we don't need the to use write_extent_buffer. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:17 +01:00
Christoph Hellwig	cf8cddd38b	btrfs: don't abuse REQ_OP_* flags for btrfs_map_block btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-29 14:10:38 +01:00
Christoph Hellwig	70fd76140a	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Dan Carpenter	9c894696f5	Btrfs: remove some no-op casts We cast 0 to a u8 but then because of type promotion, it's immediately cast to int back to int before we do a bitwise negate. The cast doesn't matter in this case, the code works as intended. It causes a static checker warning though so let's remove it. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Chris Mason	d9ed71e545	Merge branch 'fst-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 Signed-off-by: Chris Mason <clm@fb.com>	2016-10-12 13:16:00 -07:00
Omar Sandoval	2fe1d55134	Btrfs: fix free space tree bitmaps on big-endian systems In convert_free_space_to_{bitmaps,extents}(), we buffer the free space bitmaps in memory and copy them directly to/from the extent buffers with {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte granularity, which is equivalent to a little-endian bitmap. This means that on big-endian systems, the in-memory bitmaps will be written to disk byte-swapped. To fix this, use byte-granularity for the bitmaps in memory. Fixes: `a5ed918285` ("Btrfs: implement the free space B-tree") Cc: stable@vger.kernel.org # 4.5+ Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Liu Bo	851cd173f0	Btrfs: memset to avoid stale content in btree leaf This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Jeff Mahoney	ab8d0fc48d	btrfs: convert pr_* to btrfs_* where possible For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:04 +02:00
Jeff Mahoney	62e855771d	btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Jeff Mahoney	5d163e0e68	btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Liu Bo	3eb548ee3a	Btrfs: memset to avoid stale content in btree node block During updating btree, we could push items between sibling nodes/leaves, for leaves data sections starts reversely from the end of the block while for nodes we only have key pairs which are stored one by one from the start of the block. So we could do try to push key pairs from one node to the next node right in the tree, and after that, we update the node's nritems to reflect the correct end while leaving the stale content in the node. One may intentionally corrupt the fs image and access the stale content by bumping the nritems and causes various crashes. This takes the in-memory @nritems as the correct one and gets to memset the unused part of a btree node. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:03:47 +02:00
Josef Bacik	8436ea91a1	Btrfs: kill the start argument to read_extent_buffer_pages Nobody uses this, it makes no sense to do partial reads of extent buffers. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Josef Bacik	afcdd129e0	Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Qu Wenruo	ba8b04c1d4	btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc() parameters for both in-band dedupe and subpage sector size patchset. This should reduce conflict of both patchset and the effort to rebase them. Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	2571e73967	Btrfs: fix memory leak in reading btree blocks So we can read a btree block via readahead or intentional read, and we can end up with a memory leak when something happens as follows, 1) readahead starts to read block A but does not wait for read completion, 2) btree_readpage_end_io_hook finds that block A is corrupted, and it needs to clear all block A's pages' uptodate bit. 3) meanwhile an intentional read kicks in and checks block A's pages' uptodate to decide which page needs to be read. 4) when some pages have the uptodate bit during 3)'s check so 3) doesn't count them for eb->io_pages, but they are later cleared by 2) so we has to readpage on the page, we get the wrong eb->io_pages which results in a memory leak of this block. This fixes the problem by firstly getting all pages's locking and then checking pages' uptodate bit. t1(readahead) t2(readahead endio) t3(the following read) read_extent_buffer_pages end_bio_extent_readpage for pg in eb: for page 0,1,2 in eb: if pg is uptodate: btree_readpage_end_io_hook(pg) num_reads++ if uptodate: eb->io_pages = num_reads SetPageUptodate(pg) _______________ for pg in eb: for page 3 in eb: read_extent_buffer_pages if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb: __extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate: clear_extent_buffer_uptodate(eb) num_reads++ for pg in eb: eb->io_pages = num_reads ClearPageUptodate(page) _______________ for pg in eb: if pg is NOT uptodate: __extent_read_full_page(pg) So t3's eb->io_pages is not consistent with the number of pages it's reading, and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative number so that we're not able to free the eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Lu Fengqi	afce772e87	btrfs: fix check_shared for fiemap ioctl Only in the case of different root_id or different object_id, check_shared identified extent as the shared. However, If a extent was referred by different offset of same file, it should also be identified as shared. In addition, check_shared's loop scale is at least n^3, so if a extent has too many references, even causes soft hang up. First, add all delayed_ref to the ref_tree and calculate the unqiue_refs, if the unique_refs is greater than one, return BACKREF_FOUND_SHARED. Then individually add the on-disk reference(inline/keyed) to the ref_tree and calculate the unique_refs of the ref_tree to check if the unique_refs is greater than one.Because once there are two references to return SHARED, so the time complexity is close to the constant. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Linus Torvalds	fff648da96	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Here's the second round of block updates for this merge window. It's a mix of fixes for changes that went in previously in this round, and fixes in general. This pull request contains: - Fixes for loop from Christoph - A bdi vs gendisk lifetime fix from Dan, worth two cookies. - A blk-mq timeout fix, when on frozen queues. From Gabriel. - Writeback fix from Jan, ensuring that __writeback_single_inode() does the right thing. - Fix for bio->bi_rw usage in f2fs from me. - Error path deadlock fix in blk-mq sysfs registration from me. - Floppy O_ACCMODE fix from Jiri. - Fix to the new bio op methods from Mike. One more followup will be coming here, ensuring that we don't propagate the block types outside of block. That, and a rename of bio->bi_rw is coming right after -rc1 is cut. - Various little fixes" * 'for-linus' of git://git.kernel.dk/linux-block: mm/block: convert rw_page users to bio op use loop: make do_req_filebacked more robust loop: don't try to use AIO for discards blk-mq: fix deadlock in blk_mq_register_disk() error path Include: blkdev: Removed duplicate 'struct request;' declaration. Fixup direct bi_rw modifiers block: fix bdi vs gendisk lifetime mismatch blk-mq: Allow timeouts to run while queue is freezing nbd: fix race in ioctl block: fix use-after-free in seq file f2fs: drop bio->bi_rw manual assignment block: add missing group association in bio-cloning functions blkcg: kill unused field nr_undestroyed_grps writeback: Write dirty times for WB_SYNC_ALL writeback floppy: fix open(O_ACCMODE) for ioctl-only open	2016-08-05 23:31:51 -04:00
Linus Torvalds	d58b0d980f	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull more btrfs updates from Chris Mason: "This is part two of my btrfs pull, which is some cleanups and a batch of fixes. Most of the code here is from Jeff Mahoney, making the pointers we pass around internally more consistent and less confusing overall. I noticed a small problem right before I sent this out yesterday, so I fixed it up and re-tested overnight" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits) Btrfs: fix __MAX_CSUM_ITEMS btrfs: btrfs_abort_transaction, drop root parameter btrfs: add btrfs_trans_handle->fs_info pointer btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction btrfs: convert nodesize macros to static inlines btrfs: introduce BTRFS_MAX_ITEM_SIZE btrfs: cleanup, remove prototype for btrfs_find_root_ref btrfs: copy_to_sk drop unused root parameter btrfs: simpilify btrfs_subvol_inherit_props btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root btrfs: tests, require fs_info for root btrfs: tests, move initialization into tests/ btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs: prefix fsid to all trace events btrfs: plumb fs_info into btrfs_work btrfs: remove obsolete part of comment in statfs btrfs: hide test-only member under ifdef btrfs: Ratelimit "no csum found" info message btrfs: Add ratelimit to btrfs printing Btrfs: fix unexpected balance crash due to BUG_ON ...	2016-08-04 19:56:16 -04:00
Shaun Tancheff	b571bc606e	Fixup direct bi_rw modifiers bi_rw should be using bio_set_op_attrs to set bi_rw. Signed-off-by: Shaun Tancheff <shaun@tancheff.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-04 14:19:16 -06:00
Paolo Valente	20bd723ec6	block: add missing group association in bio-cloning functions When a bio is cloned, the newly created bio must be associated with the same blkcg as the original bio (if BLK_CGROUP is enabled). If this operation is not performed, then the new bio is not associated with any group, and the group of the current task is returned when the group of the bio is requested. Depending on the cloning frequency, this may cause a large percentage of the bios belonging to a given group to be treated as if belonging to other groups (in most cases as if belonging to the root group). The expected group isolation may thereby be broken. This commit adds the missing association in bio-cloning functions. Fixes: `da2f0f74cf` ("Btrfs: add support for blkio controllers") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-04 14:19:16 -06:00
Linus Torvalds	0e06f5c0de	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - a few misc bits - ocfs2 - most(?) of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits) thp: fix comments of __pmd_trans_huge_lock() cgroup: remove unnecessary 0 check from css_from_id() cgroup: fix idr leak for the first cgroup root mm: memcontrol: fix documentation for compound parameter mm: memcontrol: remove BUG_ON in uncharge_list mm: fix build warnings in <linux/compaction.h> mm, thp: convert from optimistic swapin collapsing to conservative mm, thp: fix comment inconsistency for swapin readahead functions thp: update Documentation/{vm/transhuge,filesystems/proc}.txt shmem: split huge pages beyond i_size under memory pressure thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE khugepaged: add support of collapse for tmpfs/shmem pages shmem: make shmem_inode_info::lock irq-safe khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page() thp: extract khugepaged from mm/huge_memory.c shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings shmem: add huge pages support shmem: get_unmapped_area align huge page shmem: prepare huge= mount option and sysfs knob mm, rmap: account shmem thp pages ...	2016-07-26 19:55:54 -07:00
Michal Hocko	8a5c743e30	mm, memcg: use consistent gfp flags during readahead Vladimir has noticed that we might declare memcg oom even during readahead because read_pages only uses GFP_KERNEL (with mapping_gfp restriction) while __do_page_cache_readahead uses page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from OOMs. This gfp mask discrepancy is really unfortunate and easily fixable. Drop page_cache_alloc_readahead() which only has one user and outsource the gfp_mask logic into readahead_gfp_mask and propagate this mask from __do_page_cache_readahead down to read_pages. This alone would have only very limited impact as most filesystems are implementing ->readpages and the common implementation mpage_readpages does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to use readahead_gfp_mask instead as this function is called only during readahead as well. The same applies to read_cache_pages. ext4 has its own ext4_mpage_readpages but the path which has pages != NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are doing a very similar pattern to mpage_readpages so the same can be applied to them as well. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.com: restrict gfp mask in mpage_alloc] Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Chris Mason <clm@fb.com> Cc: Steve French <sfrench@samba.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Cc: Mike Marshall <hubcap@omnibond.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Changman Lee <cm224.lee@samsung.com> Cc: Chao Yu <yuchao0@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-26 16:19:19 -07:00
Linus Torvalds	d05d7f4079	Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...	2016-07-26 15:03:07 -07:00
Liu Bo	baf863b9c2	Btrfs: fix eb memory leak due to readpage failure eb->io_pages is set in read_extent_buffer_pages(). In case of readpage failure, for pages that have been added to bio, it calls bio_endio and later readpage_io_failed_hook() does the work. When this eb's page (couldn't be the 1st page) fails to add itself to bio due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio, and ends up with a memory leak eventually. This lets __do_readpage propagate errors to callers and adds the 'atomic_dec(&eb->io_pages)'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Liu Bo	6f034ece34	Btrfs: cleanup BUG_ON in merge_bio One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(), thus this aims to stop anyone to panic the whole system by using their btrfs. Since the error in merge_bio can only come from __btrfs_map_block() when chunk tree mapping has something insane and __btrfs_map_block() has already had printed the reason, we can just return errors in merge_bio. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Nikolay Borisov	fba4b69771	btrfs: Fix slab accounting flags BTRFS is using a variety of slab caches to satisfy internal needs. Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT, meaning allocations from the caches are going to be accounted as SReclaimable. At the same time btrfs is not registering any shrinkers whatsoever, thus preventing memory from the slabs to be shrunk. This means those caches are not in fact reclaimable. To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the inode cache, since this one is being freed by the generic VFS super_block shrinker. Also set the transaction related caches as SLAB_TEMPORARY, to better document the lifetime of the objects (it just translates to SLAB_RECLAIM_ACCOUNT). Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Liu Bo	415b35a55b	Btrfs: fix error handling in map_private_extent_buffer map_private_extent_buffer() can return -EINVAL in two different cases, 1. when the requested contents span two pages if nodesize is larger than pagesize, 2. when it detects something insane. The 2nd one used to be only a WARN_ON(1), and we decided to return a error to callers, but we didn't fix up all its callers, which will be addressed by this patch. Without this, btrfs may end up with 'general protection', ie. reading invalid memory. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-06-23 10:44:40 -07:00
Liu Bo	c871b0f2fd	Btrfs: check if extent buffer is aligned to sectorsize Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-06-17 18:32:40 +02:00
Chris Mason	4c52990080	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7	2016-06-08 14:35:11 -07:00
Mike Christie	81a75f6781	btrfs: use bio fields for op and flags The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is no need to pass around the rq_flag_bits bits too. btrfs users should should access the bio insead. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	1f7ad75b13	btrfs: have submit_one_bio users use bio op accessors This patch has btrfs's submit_one_bio users set the bio op using bio_set_op_attrs and get the op using bio_op. The next patches will continue to convert btrfs, so submit_bio_hook and merge_bio_hook related code will be modified to take only the bio. I did not do it in this patch to try and keep it smaller. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	4e49ea4a3d	block/fs/drivers: remove rw argument from submit_bio This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Feifei Xu	b9ef22dedd	Btrfs: self-tests: Support non-4k page size self-tests code assumes 4k as the sectorsize and nodesize. This commit fix hardcoded 4K. Enables the self-tests code to be executed on non-4k page sized systems (e.g. ppc64). Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-06-02 19:23:14 +02:00
Filipe Manana	b5de8d0df8	Btrfs: fix race between device replace and read repair While we are finishing a device replace operation we can have a concurrent task trying to do a read repair operation, in which case it will call btrfs_map_block() to get a struct btrfs_bio which can have a stripe that points to the source device of the device replace operation. This allows for the read repair task to dereference the stripe's device pointer after the device replace operation has freed the source device, resulting in an invalid memory access. This is similar to the problem solved by my previous patch in the same series and named "Btrfs: fix race between device replace and discard". So fix this by surrounding the call to btrfs_map_block() and the code that uses the returned struct btrfs_bio with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the proper serialization with the finishing phase of the device replace operation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>	2016-05-31 01:00:03 +01:00
David Sterba	42f31734eb	Merge branch 'cleanups-4.7' into for-chris-4.7-20160525	2016-05-25 22:51:03 +02:00
Nicholas D Steeves	0132761017	btrfs: fix string and comment grammatical issues and typos Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-05-25 22:35:14 +02:00
Liu Bo	2d324f59f3	Btrfs: fix unexpected return value of fiemap btrfs's fiemap is supposed to return 0 on success and return < 0 on error. however, ret becomes 1 after looking up the last file extent: btrfs_lookup_file_extent -> btrfs_search_slot(..., ins_len=0, cow=0) and if the offset is beyond EOF, we'll get 'path' pointed to the place of potentail insertion, and ret == 1. This may confuse applications using ioctl(FIEL_IOC_FIEMAP). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-05-25 19:53:54 +02:00
David Sterba	5ef64a3e75	Merge branch 'cleanups-4.7' into for-chris-4.7-20160516	2016-05-16 15:46:24 +02:00
David Sterba	e1860a7724	btrfs: GFP_NOFS does not GFP_HIGHMEM Masking HIGHMEM out of NOFS does not make sense. Signed-off-by: David Sterba <dsterba@suse.com>	2016-05-10 09:44:21 +02:00
David Sterba	58409edd2d	btrfs: kill unused writepage_io_hook callback It seems to be long time unused, since 2008 and `6885f308b5` ("Btrfs: Misc 2.6.25 updates"). Propagating the removal touches some code but has no functional effect. Signed-off-by: David Sterba <dsterba@suse.com>	2016-05-06 14:57:57 +02:00
David Sterba	210aa27768	btrfs: sink gfp parameter to convert_extent_bit Single caller passes GFP_NOFS. We can get rid of the gfpflags_allow_blocking checks as NOFS can block but does not recurse to filesystem through reclaim. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 13:48:14 +02:00
David Sterba	059f791c6b	btrfs: make state preallocation more speculative in __set_extent_bit Similar to __clear_extent_bit, do not fail if the state preallocation fails as we might not need it. One less BUG_ON. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	03bf538770	btrfs: untangle gotos a bit in convert_extent_bit Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	7ab5cb2a9e	btrfs: untangle gotos a bit in __clear_extent_bit Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	b5a4ba14e0	btrfs: untangle gotos a bit in __set_extent_bit Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	2c53b912ae	btrfs: sink gfp parameter to set_record_extent_bits Single caller passes GFP_NOFS. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	f734c44a1b	btrfs: sink gfp parameter to clear_record_extent_bits Callers pass GFP_NOFS. No need to pass the flags around. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	91166212e0	btrfs: sink gfp parameter to clear_extent_bits Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
David Sterba	ceeb0ae7bf	btrfs: sink gfp parameter to set_extent_bits All callers pass GFP_NOFS. Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-29 11:01:47 +02:00
Liu Bo	894b36e35a	Btrfs: cleanup error handling in extent_write_cached_pages Now that we bail out immediately if ->writepage() returns an error, we don't need an extra error to retain the error code. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-28 10:41:47 +02:00
Liu Bo	a91326679f	Btrfs: make mapping->writeback_index point to the last written page If sequential writer is writing in the middle of the page and it just redirties the last written page by continuing from it. In the above case this can end up with seeking back to that firstly redirtied page after writing all the pages at the end of file because btrfs updates mapping->writeback_index to 1 past the current one. For non-cow filesystems, the cost is only about extra seek, while for cow filesystems such as btrfs, it means unnecessary fragments. To avoid it, we just need to continue writeback from the last written page. This also updates btrfs to behave like what write_cache_pages() does, ie, bail out immediately if there is an error in writepage(). <Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html> Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-04-28 10:41:47 +02:00
Kirill A. Shutemov	ea1754a084	mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
Kirill A. Shutemov	09cbfeaf1a	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced long time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-04-04 10:41:08 -07:00
David Sterba	f004fae0cf	Merge branch 'cleanups-4.6' into for-chris-4.6	2016-02-26 15:38:33 +01:00
David Sterba	675d276b32	Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6	2016-02-26 15:38:32 +01:00
Arnd Bergmann	f827ba9a64	btrfs: avoid uninitialized variable warning With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides to partially inline the get_state_failrec() function but cannot figure out that means the failrec pointer is always valid if the function returns success, which causes a harmless warning: fs/btrfs/extent_io.c: In function 'clean_io_failure': fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized] This marks get_state_failrec() and set_state_failrec() both as 'noinline', which avoids the warning in all cases for me, and seems less ugly than adding a fake initialization. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `47dc196ae7` ("btrfs: use proper type for failrec in extent_state") Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-23 12:42:46 +01:00
Kinglong Mee	5598e9005a	btrfs: drop null testing before destroy functions Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
David Sterba	47dc196ae7	btrfs: use proper type for failrec in extent_state We use the private member of extent_state to store the failrec and play pointless pointer games. Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-18 11:46:03 +01:00
Filipe Manana	7f042a8370	Btrfs: remove no longer used function extent_read_full_page_nolock() Not needed after the previous patch named "Btrfs: fix page reading in extent_same ioctl leading to csum errors". Signed-off-by: Filipe Manana <fdmanana@suse.com>	2016-02-03 19:27:10 +00:00
Chandan Rajendra	dbfdb6d1b3	Btrfs: Search for all ordered extents that could span across a page In subpagesize-blocksize scenario it is not sufficient to search using the first byte of the page to make sure that there are no ordered extents present across the page. Fix this. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-02-01 19:24:29 +01:00
Chris Mason	b28cf57246	Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>	2016-01-11 06:08:37 -08:00
Byongho Lee	ee22184b53	Btrfs: use linux/sizes.h to represent constants We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-01-07 14:38:02 +01:00
Chris Mason	f0f76413d3	Merge branch 'freespace-4.5' into for-linus-4.5	2015-12-23 13:29:09 -08:00
Chris Mason	bb9d687618	Merge branch 'dev/simplify-set-bit' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>	2015-12-23 13:17:42 -08:00
Chris Mason	f7d3d2f99e	Merge branch 'freespace-tree' into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>	2015-12-18 11:11:10 -08:00
Omar Sandoval	0f3312295d	Btrfs: add extent buffer bitmap sanity tests Sanity test the extent buffer bitmap operations (test, set, and clear) against the equivalent standard kernel operations. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:46 -08:00
Omar Sandoval	3e1e8bb770	Btrfs: add extent buffer bitmap operations These are going to be used for the free space tree bitmap items. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-12-17 12:16:46 -08:00
David Sterba	35de6db28f	btrfs: make set_range_writeback return void Does not return any errors, nor anything from the callgraph. There's a BUG_ON but it's a sanity check and not an error condition we could recover from. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	f631157276	btrfs: make extent_range_redirty_for_io return void Does not return any errors, nor anything from the callgraph. There's a BUG_ON but it's a sanity check and not an error condition we could recover from. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	bd1fa4f0b0	btrfs: make extent_range_clear_dirty_for_io return void Does not return any errors, nor anything from the callgraph. There's a BUG_ON but it's a sanity check and not an error condition we could recover from. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	b5227c075b	btrfs: make end_extent_writepage return void Does not return any errors, nor anything from the callgraph. The branch in end_bio_extent_writepage has been skipped since `5fd0204355` ("Btrfs: finish ordered extents in their own thread"). Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	a9d93e1778	btrfs: make extent_clear_unlock_delalloc return void Does not return any errors, nor anything from the callgraph. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	69ba39272c	btrfs: make clear_extent_buffer_uptodate return void Does not return any errors, nor anything from the callgraph. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	09c25a8cda	btrfs: make set_extent_buffer_uptodate return void Does not return any errors, nor anything from the callgraph. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-07 15:06:45 +01:00
David Sterba	cd716d8fea	btrfs: make lock_extent static inline One call less reduces stack usage, code slightly reduced as well. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-03 14:44:59 +01:00
David Sterba	ff13db41f1	btrfs: drop unused parameter from lock_extent_bits We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-03 14:30:40 +01:00
David Sterba	e83b1d91f8	btrfs: make clear_extent_bit helpers static inline The funcions just wrap the clear_extent_bit API and generate function calls. This increases stack consumption and may negatively affect performance due to icache misses. We can simply make the helpers static inline and keep the type checking and API untouched. The code slightly decreases: text data bss dec hex filename 938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko.before 939651 43670 23144 1006465 f5b81 fs/btrfs/btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-03 14:17:30 +01:00
David Sterba	c63179556a	btrfs: make set_extent_bit helpers static inline The funcions just wrap the set_extent_bit API and generate function calls. This increases stack consumption and may negatively affect performance due to icache misses. We can simply make the helpers static inline and keep the type checking and API untouched. The code slightly increases: text data bss dec hex filename 938427 43670 23144 1005241 f56b9 fs/btrfs/btrfs.ko.before 938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko Signed-off-by: David Sterba <dsterba@suse.com>	2015-12-03 14:08:11 +01:00
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Qu Wenruo	fefdc55702	btrfs: extent_io: Introduce new function clear_record_extent_bits() Introduce new function clear_record_extent_bits(), which will clear bits for given range and record the details about which ranges are cleared and how many bytes in total it changes. This provides the basis for later qgroup reserve codes. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-10-21 18:37:44 -07:00
Qu Wenruo	d38ed27f04	btrfs: extent_io: Introduce new function set_record_extent_bits Introduce new function set_record_extent_bits(), which will not only set given bits, but also record how many bytes are changed, and detailed range info. This is quite important for later qgroup reserve framework. The number of bytes will be used to do qgroup reserve, and detailed range info will be used to cleanup for EQUOT case. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-10-21 18:37:44 -07:00
Filipe Manana	5e6ecb362b	Btrfs: fix double range unlock of hole region when reading page If when reading a page we find a hole and our caller had already locked the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end up unlocking the hole's range and then later our caller unlocks it again, which might have already been locked by some other task once the first unlock happened. Currently this can only happen during a call to the extent_same ioctl, as it's the only caller of __do_readpage() that sets the bit EXTENT_BIO_PARENT_LOCKED for bio flags. Fix this by leaving the unlock exclusively to the caller. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2015-10-14 04:37:00 +01:00
Chris Mason	640926ffdd	Merge branch 'cleanup/messages' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4	2015-10-12 16:22:26 -07:00
Linus Torvalds	175d58cfed	Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are small and assorted. Neil's is the oldest, I dropped the ball thinking he was going to send it in" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: support NFSv2 export Btrfs: open_ctree: Fix possible memory leak Btrfs: fix deadlock when finalizing block group creation Btrfs: update fix for read corruption of compressed and shared extents Btrfs: send, fix corner case for reference overwrite detection	2015-10-09 16:39:35 -07:00
David Sterba	f14d104dbd	btrfs: switch more printks to our helpers Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>	2015-10-08 13:08:03 +02:00
David Sterba	9464732266	btrfs: switch message printers to ratelimited variants Signed-off-by: David Sterba <dsterba@suse.com>	2015-10-08 13:04:06 +02:00
David Sterba	b14af3b46f	btrfs: switch message printers to ratelimited _in_rcu variants Signed-off-by: David Sterba <dsterba@suse.com>	2015-10-08 11:07:55 +02:00
Filipe Manana	808f80b467	Btrfs: update fix for read corruption of compressed and shared extents My previous fix in commit `005efedf2c` ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo \| _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo \| _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo \| _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Timofey Titovets <nefelim4ag@gmail.com>	2015-10-05 16:56:27 -07:00
Linus Torvalds	03e8f64486	Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC	2015-09-25 12:08:41 -07:00
Filipe Manana	005efedf2c	Btrfs: fix read corruption of compressed and shared extents If a file has a range pointing to a compressed extent, followed by another range that points to the same compressed extent and a read operation attempts to read both ranges (either completely or part of them), the pages that correspond to the second range are incorrectly filled with zeroes. Consider the following example: File layout [0 - 8K] [8K - 24K] \| \| \| \| points to extent X, points to extent X, offset 4K, length of 8K offset 0, length 16K [extent X, compressed length = 4K uncompressed length = 16K] If a readpages() call spans the 2 ranges, a single bio to read the extent is submitted - extent_io.c:submit_extent_page() would only create a new bio to cover the second range pointing to the extent if the extent it points to had a different logical address than the extent associated with the first range. This has a consequence of the compressed read end io handler (compression.c:end_compressed_bio_read()) finish once the extent is decompressed into the pages covering the first range, leaving the remaining pages (belonging to the second range) filled with zeroes (done by compression.c:btrfs_clear_biovec_end()). So fix this by submitting the current bio whenever we find a range pointing to a compressed extent that was preceded by a range with a different extent map. This is the simplest solution for this corner case. Making the end io callback populate both ranges (or more, if we have multiple pointing to the same extent) is a much more complex solution since each bio is tightly coupled with a single extent map and the extent maps associated to the ranges pointing to the shared extent can have different offsets and lengths. The following test case for fstests triggers the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create a test file with a single extent that is compressed (the # data we write into it is highly compressible no matter which # compression algorithm is used, zlib or lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \ -c "pwrite -S 0xbb 4K 8K" \ -c "pwrite -S 0xcc 12K 4K" \ $SCRATCH_MNT/foo \| _filter_xfs_io # Now clone our extent into an adjacent offset. $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Same as before but for this file we clone the extent into a lower # file offset. $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \ -c "pwrite -S 0xbb 12K 8K" \ -c "pwrite -S 0xcc 20K 4K" \ $SCRATCH_MNT/bar \| _filter_xfs_io $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/bar echo "File digests before unmounting filesystem:" md5sum $SCRATCH_MNT/foo \| _filter_scratch md5sum $SCRATCH_MNT/bar \| _filter_scratch # Evicting the inode or clearing the page cache before reading # again the file would also trigger the bug - reads were returning # all bytes in the range corresponding to the second reference to # the extent with a value of 0, but the correct data was persisted # (it was a bug exclusively in the read path). The issue happened # only if the same readpages() call targeted pages belonging to the # first and second ranges that point to the same compressed extent. _scratch_remount echo "File digests after mounting filesystem again:" # Must match the same digests we got before. md5sum $SCRATCH_MNT/foo \| _filter_scratch md5sum $SCRATCH_MNT/bar \| _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2015-09-15 00:59:31 +01:00
Linus Torvalds	22365979ab	Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has Jeff Mahoney's long standing trim patch that fixes corners where trims were missing. Omar has some raid5/6 fixes, especially for using scrub and device replace when devices are missing. Zhao Lie continues cleaning and fixing things, this series fixes some really hard to hit corners in xfstests. I had to pull it last merge window due to some deadlocks, but those are now resolved. I added support for Tejun's new blkio controllers. It seems to work well for single devices, we'll expand to multi-device as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits) btrfs: fix compile when block cgroups are not enabled Btrfs: fix file read corruption after extent cloning and fsync Btrfs: check if previous transaction aborted to avoid fs corruption btrfs: use __GFP_NOFAIL in alloc_btrfs_bio btrfs: Prevent from early transaction abort btrfs: Remove unused arguments in tree-log.c btrfs: Remove useless condition in start_log_trans() Btrfs: add support for blkio controllers Btrfs: remove unused mutex from struct 'btrfs_fs_info' Btrfs: fix parity scrub of RAID 5/6 with missing device Btrfs: fix device replace of a missing RAID 5/6 device Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation Btrfs: count devices correctly in readahead during RAID 5/6 replace Btrfs: remove misleading handling of missing device scrub btrfs: fix clone / extent-same deadlocks Btrfs: fix defrag to merge tail file extent Btrfs: fix warning in backref walking btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() btrfs: Remove root argument in extent_data_ref_count() btrfs: Fix wrong comment of btrfs_alloc_tree_block() ...	2015-09-05 15:14:43 -07:00
Chris Mason	3a9508b022	btrfs: fix compile when block cgroups are not enabled bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on. This adds an ifdef around them. It's not perfect, but our use of bi_ioc is being removed in the 4.3 merge window. The bi_css usage really should go into bio_clone, but I want to make sure that doesn't introduce problems for other bio_clone use cases. Signed-off-by: Chris Mason <clm@fb.com>	2015-08-21 10:08:13 -07:00
Michal Hocko	d1b5c5671d	btrfs: Prevent from early transaction abort Btrfs relies on GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem currently but there is an attempt to move away from the default no-fail behavior and allow these allocation to fail more eagerly. And this would lead to a pre-mature transaction abort as follows: [ 55.328093] Call Trace: [ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363 [ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4 [ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c [ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs] [ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs] [ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs] [ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs] [ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs] [ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e [ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs] [ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs] [ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory [ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction. [ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure). [ 55.384280] ------------[ cut here ]------------ [ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]() [...] [ 55.384337] Call Trace: [ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37 [ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb [ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c [ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs] [ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs] [ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [...] [ 55.384608] ---[ end trace c29799da1d4dd621 ]--- [ 55.437323] BTRFS info (device hdb1): forced readonly [ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry Fix this by being explicit about the no-fail behavior of this allocation path and use __GFP_NOFAIL. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-08-19 14:25:15 -07:00
Kent Overstreet	b54ffb73ca	block: remove bio_get_nr_vecs() We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:32:04 -06:00
Chris Mason	da2f0f74cf	Btrfs: add support for blkio controllers This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>	2015-08-09 07:35:06 -07:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
Linus Torvalds	043cd04950	Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Outside of our usual batch of fixes, this integrates the subvolume quota updates that Qu Wenruo from Fujitsu has been working on for a few releases now. He gets an extra gold star for making btrfs smaller this time, and fixing a number of quota corners in the process. Dave Sterba tested and integrated Anand Jain's sysfs improvements. Outside of exporting a symbol (ack'd by Greg) these are all internal to btrfs and it's mostly cleanups and fixes. Anand also attached some of our sysfs objects to our internal device management structs instead of an object off the super block. It will make device management easier overall and it's a better fit for how the sysfs files are used. None of the existing sysfs files are moved around. Thanks for all the fixes everyone" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits) btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref() Btrfs: Check if kobject is initialized before put lib: export symbol kobject_move() Btrfs: sysfs: add support to show replacing target in the sysfs Btrfs: free the stale device Btrfs: use received_uuid of parent during send Btrfs: fix use-after-free in btrfs_replay_log btrfs: wait for delayed iputs on no space btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup. btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. btrfs: ulist: Add ulist_del() function. btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. btrfs: qgroup: Switch rescan to new mechanism. btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents(). btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots(). btrfs: qgroup: Add new function to record old_roots. btrfs: qgroup: Record possible quota-related extent for qgroup. btrfs: qgroup: Add function qgroup_update_counters(). ...	2015-06-30 20:07:45 -07:00
Linus Torvalds	bfffa1cc9d	Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...	2015-06-25 14:29:53 -07:00
Josef Bacik	0d2b2372e0	Btrfs: set UNWRITTEN for prealloc'ed extents in fiemap We should be doing this, it's weird we hadn't been doing this. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:03:03 -07:00
Filipe Manana	0f31871f44	Btrfs: wake up extent state waiters on unlock through clear_extent_bits When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-06-03 04:02:56 -07:00
Christoph Hellwig	b25de9d6da	block: remove BIO_EOPNOTSUPP Since the big barrier rewrite/removal in 2007 we never fail FLUSH or FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag to help propagating those to the buffer_head layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-19 09:17:03 -06:00
Filipe Manana	062c19e9dd	Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON There's a race between releasing extent buffers that are flagged as stale and recycling them that makes us it the following BUG_ON at btrfs_release_extent_buffer_page: BUG_ON(extent_buffer_under_io(eb)) The BUG_ON is triggered because the extent buffer has the flag EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made dirty by another concurrent task. Here follows a sequence of steps that leads to the BUG_ON. CPU 0 CPU 1 CPU 2 path->nodes[0] == eb X X->refs == 2 (1 for the tree, 1 for the path) btrfs_header_generation(X) == current trans id flag EXTENT_BUFFER_DIRTY set on X btrfs_release_path(path) unlocks X reads eb X X->refs incremented to 3 locks eb X btrfs_del_items(X) X becomes empty clean_tree_block(X) clear EXTENT_BUFFER_DIRTY from X btrfs_del_leaf(X) unlocks X extent_buffer_get(X) X->refs incremented to 4 btrfs_free_tree_block(X) X's range is not pinned X's range added to free space cache free_extent_buffer_stale(X) lock X->refs_lock set EXTENT_BUFFER_STALE on X release_extent_buffer(X) X->refs decremented to 3 unlocks X->refs_lock btrfs_release_path() unlocks X free_extent_buffer(X) X->refs becomes 2 __btrfs_cow_block(Y) btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() gets offset == X->start btrfs_init_new_buffer(X->start) btrfs_find_create_tree_block(X->start) alloc_extent_buffer(X->start) find_extent_buffer(X->start) finds eb X in radix tree free_extent_buffer(X) lock X->refs_lock test X->refs == 2 test bit EXTENT_BUFFER_STALE is set test !extent_buffer_under_io(eb) increments X->refs to 3 mark_extent_buffer_accessed(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set in X clear EXTENT_BUFFER_STALE from X locks X btrfs_mark_buffer_dirty() set_extent_buffer_dirty(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set sets EXTENT_BUFFER_DIRTY on X test and clear EXTENT_BUFFER_TREE_REF decrements X->refs to 2 release_extent_buffer(X) decrements X->refs to 1 unlock X->refs_lock unlock X free_extent_buffer(X) lock X->refs_lock release_extent_buffer(X) decrements X->refs to 0 btrfs_release_extent_buffer_page(X) BUG_ON(extent_buffer_under_io(X)) --> EXTENT_BUFFER_DIRTY set on X Fix this by making find_extent buffer wait for any ongoing task currently executing free_extent_buffer()/free_extent_buffer_stale() if the extent buffer has the stale flag set. A more clean alternative would be to always increment the extent buffer's reference count while holding its refs_lock spinlock but find_extent_buffer is a performance critical area and that would cause lock contention whenever multiple tasks search for the same extent buffer concurrently. A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream btrfs patches backported from newer kernels) was hitting this often: [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507! (...) [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1 [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006 [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000 [1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs] (...) [1212302.630008] Call Trace: [1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs] [1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs] [1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs] [1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs] [1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs] [1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs] [1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs] [1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs] [1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190 [1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0 I was also able to reproduce this on a 3.19 kernel, corresponding to Chris' integration branch from about a month ago, running the following stress test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram): while true; do mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt snapshot_cmd="btrfs subvolume snapshot -r /mnt" snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`" fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100 umount /mnt done Which usually triggers the BUG_ON within less than 24 hours: [49558.618097] ------------[ cut here ]------------ [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551! (...) [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3 [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000 [49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs] (...) [49558.620031] Call Trace: [49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs] [49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43 [49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs] [49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs] [49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs] [49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs] [49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs] [49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs] [49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs] [49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs] [49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs] [49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4 [49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91 [49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22 [49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4 [49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83 [49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-05-11 07:59:11 -07:00
Forrest Liu	5d2361db48	Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent btrfs_release_extent_buffer_page() can't handle dummy extent that allocated by btrfs_clone_extent_buffer() properly. That is because reference count of pages that allocated by btrfs_clone_extent_buffer() was 2, 1 by alloc_page(), and another by attach_extent_buffer_page(). Running following command repeatly can check this memory leak problem btrfs inspect-internal inode-resolve 256 /mnt/btrfs Signed-off-by: Chien-Kuan Yeh <ckya@synology.com> Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-04-29 13:22:09 -07:00
Omar Sandoval	5ca64f45e9	btrfs: fix race on ENOMEM in alloc_extent_buffer Consider the following interleaving of overlapping calls to alloc_extent_buffer: Call 1: - Successfully allocates a few pages with find_or_create_page - find_or_create_page fails, goto free_eb - Unlocks the allocated pages Call 2: - Calls find_or_create_page and gets a page in call 1's extent_buffer - Finds that the page is already associated with an extent_buffer - Grabs a reference to the half-written extent_buffer and calls mark_extent_buffer_accessed on it mark_extent_buffer_accessed will then try to call mark_page_accessed on a null page and panic. The fix is to decrement the reference count on the half-written extent_buffer before unlocking the pages so call 2 won't use it. We should also set exists = NULL in the case that we don't use exists to avoid accidentally returning a freed extent_buffer in an error case. Signed-off-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-04-26 06:27:00 -07:00
Chengyu Song	26e726afe0	btrfs: incorrect handling for fiemap_fill_next_extent return fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was the last extent that will fit in user array. If 1 is returned, the return value may eventually returned to user space, which should not happen, according to manpage of ioctl. Signed-off-by: Chengyu Song <csong84@gatech.edu> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-03-26 18:10:24 -07:00
Linus Torvalds	521d474631	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Most of these are fixing extent reservation accounting, or corners with tree writeback during commit. Josef's set does add a test, which isn't strictly a fix, but it'll keep us from making this same mistake again" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix outstanding_extents accounting in DIO Btrfs: add sanity test for outstanding_extents accounting Btrfs: just free dummy extent buffers Btrfs: account merges/splits properly Btrfs: prepare block group cache before writing Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) Btrfs: account for the correct number of extents for delalloc reservations Btrfs: fix merge delalloc logic Btrfs: fix comp_oper to get right order Btrfs: catch transaction abortion after waiting for it btrfs: fix sizeof format specifier in btrfs_check_super_valid()	2015-03-21 10:53:37 -07:00
Josef Bacik	bcb7e449ec	Btrfs: just free dummy extent buffers If we fail during our sanity tests we could get NULL deref's because we unload the module before the dummy extent buffers are free'd via RCU. So check for this case and just free the things directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>	2015-03-17 16:30:18 -04:00
Linus Torvalds	2b9fb532d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...	2015-02-19 14:36:00 -08:00
Josef Bacik	dcab6a3b2a	Btrfs: account for large extents with enospc On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-14 08:22:48 -08:00
Konstantin Khebnikov	8d38633c3b	page_writeback: put account_page_redirty() after set_page_dirty() Helper account_page_redirty() fixes dirty pages counter for redirtied pages. This patch puts it after dirtying and prevents temporary underflows of dirtied pages counters on zone/bdi and current->nr_dirtied. Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:04 -08:00
Naohiro Aota	289454ad26	btrfs: clear bio reference after submit_one_bio() After submit_one_bio(), `bio' can go away. However submit_extent_page() leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM). It will cause invalid paging request when submit_extent_page() is called next time. I reproduced ENOMEM case with the following script (need CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS). #!/bin/bash dmesgout=dmesg.txt start=100000 end=300000 step=1000 # btrfs options device=/dev/vdb1 directory=/mnt/btrfs # fault-injection options percent=100 times=3 mkdir -p $directory \|\| exit 1 mount -o compress $device $directory \|\| exit 1 rm -f $directory/file \|\| exit 1 dd if=/dev/zero of=$directory/file bs=1M count=512 \|\| exit 1 for interval in `seq $start $step $end`; do dmesg -C echo 1 > /proc/sys/vm/drop_caches sync export FAILCMD_TYPE=fail_page_alloc ./failcmd.sh -p $percent -t $times -i $interval \ --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \ -- \ cat $directory/file > /dev/null dmesg > ${dmesgout} if grep -q BUG: ${dmesgout}; then cat ${dmesgout} exit 1 fi done umount $directory exit 0 Signed-off-by: Naohiro Aota <naota@elisp.net> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-02-02 19:24:51 -08:00
Zhao Lei	6e9606d2a2	Btrfs: add ref_count and free function for btrfs_bio 1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:06:48 -08:00
David Sterba	9ee49a047d	btrfs: switch extent_state state to unsigned Currently there's a 4B hole in the structure between refs and state and there are only 16 bits used so we can make it unsigned. This will get a better packing and may save some stack space for local variables. The size of extent_state gets reduced by 8B and there are usually a lot of slab objects. struct extent_state { u64 start; /* 0 8 / u64 end; / 8 8 / struct rb_node rb_node; / 16 24 / wait_queue_head_t wq; / 40 24 / / --- cacheline 1 boundary (64 bytes) --- / atomic_t refs; / 64 4 / / XXX 4 bytes hole, try to pack / long unsigned int state; / 72 8 / u64 private; / 80 8 / / size: 88, cachelines: 2, members: 7 / / sum members: 84, holes: 1, sum holes: 4 / / last cacheline: 24 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-21 18:02:04 -08:00
Satoru Takeuchi	6e1103a6e9	btrfs: fix state->private cast on 32 bit machines Suppress the following warning displayed on building 32bit (i686) kernel. =============================================================================== ... CC [M] fs/btrfs/extent_io.o fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’: fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] failrec = (struct io_failure_record *)state->private; ... =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Chris Mason <clm@fb.com>	2015-01-19 13:06:06 -08:00
David Sterba	ce3e69847e	btrfs: sink parameter len to alloc_extent_buffer Because we're using globally known nodesize. Do the same for the sanity test function variant. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-12-12 18:26:57 +01:00
David Sterba	3f556f7853	btrfs: unify extent buffer allocation api Make the extent buffer allocation interface consistent. Cloned eb will set a valid fs_info. For dummy eb, we can drop the length parameter and set it from fs_info. The built-in sanity checks may pass a NULL fs_info that's queried for nodesize, but we know it's 4096. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-12-12 18:26:55 +01:00
David Sterba	23d79d81b1	btrfs: use GFP_NOFS in __alloc_extent_buffer directly Same mask from all callers. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-12-12 18:07:23 +01:00
Filipe Manana	c7bc6319c5	Btrfs: avoid premature -ENOMEM in clear_extent_bit() We try to allocate an extent state structure before acquiring the extent state tree's spinlock as we might need a new one later and therefore avoid doing later an atomic allocation while holding the tree's spinlock. However we returned -ENOMEM if that initial non-atomic allocation failed, which is a bit excessive since we might end up not needing the pre-allocated extent state at all - for the case where the tree doesn't have any extent states that cover the input range and cover too any other range. Therefore don't return -ENOMEM if that pre-allocation fails. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:20:06 -08:00
Filipe Manana	c8fd3de79f	Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early We try to allocate an extent state before acquiring the tree's spinlock just in case we end up needing to split an existing extent state into two. If that allocation failed, we would return -ENOMEM. However, our only single caller (transaction/log commit code), passes in an extent state that was cached from a call to find_first_extent_bit() and that has a very high chance to match exactly the input range (always true for a transaction commit and very often, but not always, true for a log commit) - in this case we end up not needing at all that initial extent state used for an eventual split. Therefore just don't return -ENOMEM if we can't allocate the temporary extent state, since we might not need it at all, and if we end up needing one, we'll do it later anyway. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Filipe Manana	e38e2ed701	Btrfs: make find_first_extent_bit be able to cache any state Right now the only caller of find_first_extent_bit() that is interested in caching extent states (transaction or log commit), never gets an extent state cached. This is because find_first_extent_bit() only caches states that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and the transaction/log commit caller always passes a tree that doesn't have ever extent states with any of those flags (they can only have one of the following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT). This change together with the following one in the patch series (titled "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will help reduce significantly the chances of calls to convert_extent_bit() fail with -ENOMEM when called from the transaction/log commit code. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:29 -08:00
Filipe Manana	704de49d2b	Btrfs: set page and mapping error on compressed write failure If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(), we start and end the writeback for the pages (clear their dirty flag, unlock them, etc) but we don't tag the pages, nor the inode's mapping, with an error. This makes it impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit for e.g.) know that there was an error. Note that the return value of submit_compressed_extents() is useless, as that function is executed by a workqueue task and not directly by the fill_delalloc callback. This means the writepage/s callbacks of the inode's address space operations don't get that return value. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-11-20 17:14:25 -08:00
Chris Mason	bbf65cf0b5	Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus Signed-off-by: Chris Mason <clm@fb.com> Conflicts: fs/btrfs/extent_io.c	2014-10-04 09:56:45 -07:00
Filipe Manana	656f30dba7	Btrfs: be aware of btree inode write errors to avoid fs corruption While we have a transaction ongoing, the VM might decide at any time to call btree_inode->i_mapping->a_ops->writepages(), which will start writeback of dirty pages belonging to btree nodes/leafs. This call might return an error or the writeback might finish with an error before we attempt to commit the running transaction. If this happens, we might have no way of knowing that such error happened when we are committing the transaction - because the pages might no longer be marked dirty nor tagged for writeback (if a subsequent modification to the extent buffer didn't happen before the transaction commit) which makes filemap_fdata[write\|wait]_range unable to find such pages (even if they're marked with SetPageError). So if this happens we must abort the transaction, otherwise we commit a super block with btree roots that point to btree nodes/leafs whose content on disk is invalid - either garbage or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like "parent transid verify failed on 10826481664 wanted 25748 found 29562" when reading btree nodes/leafs from disk). Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's i_mapping would not be enough because we need to distinguish between log tree extents (not fatal) vs non-log tree extents (fatal) and because the next call to filemap_fdatawait_range() will catch and clear such errors in the mapping - and that call might be from a log sync and not from a transaction commit, which means we would not know about the error at transaction commit time. Also, checking for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't done and would not be completely reliable, as the eb might be removed from memory and read back when trying to get it, which clears that flag right before reading the eb's pages from disk, making us not know about the previous write error. Using the new 3 flags for the btree inode also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages() returns success, started writeback for all dirty pages and before filemap_fdatawait_range() is called, the writeback for all dirty pages had already finished with errors - because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it could not know that writeback errors happened (the pages were no longer tagged for writeback). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-03 16:14:59 -07:00
Liu Bo	8146502820	Btrfs: fix crash of btrfs_release_extent_buffer_page This is actually inspired by Filipe's patch. When write_one_eb() fails on submit_extent_page(), it'll give up writing this eb and mark it with EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure, there are some left pages which remain DIRTY, and if a later COW on this eb happens, ie. eb is COWed and freed, it'd run into BUG_ON in btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page)); This adds the missing clear_page_dirty_for_io() for the rest pages of eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-03 16:14:58 -07:00
Filipe Manana	55e3bd2e0c	Btrfs: add missing end_page_writeback on submit_extent_page failure If submit_extent_page() fails in write_one_eb(), we end up with the current page not marked dirty anymore, unlocked and marked for writeback. But we never end up calling end_page_writeback() against the page, which will make calls to filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting for the writeback bit to be cleared from the page. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-10-03 16:14:58 -07:00
David Sterba	fb85fc9a67	btrfs: kill extent_buffer_page helper It used to be more complex but now it's just a simple array access. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-10-02 17:30:32 +02:00
David Sterba	a50924e3a4	btrfs: drop constant param from btrfs_release_extent_buffer_page All callers use the same value, simplify the function. Signed-off-by: David Sterba <dsterba@suse.cz>	2014-10-02 17:30:32 +02:00
Miao Xie	f612496bca	Btrfs: cleanup the read failure record after write or when the inode is freeing After the data is written successfully, we should cleanup the read failure record in that range because - If we set data COW for the file, the range that the failure record pointed to is mapped to a new place, so it is invalid. - If we set no data COW for the file, and if there is no error during writting, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it because the failure record will be recreated if we read the same place again. Sometimes, we may fail to correct the data, so the failure records will be left in the tree, we need free them when we free the inode or the memory leak happens. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:39:02 -07:00

1 2 3 4 5 ...

645 Commits