linux

Commit Graph

Author	SHA1	Message	Date
Wang Shilong	88e081bf82	Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic The original code is a little confusing and not clear, The right way to deal with the kernel code like this: [...] if (ret) goto out; [...] So i move the common clean_up code to the place labeled with out_fail, this will be easier to maintain. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Wang Shilong	a9870c0e03	Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails commit `eb6b88d92c` leads into another bug. If it is just because qgroup_reserve fails, the function btrfs_qgroup_free should not be called, otherwise, it will cause the wrong quota accounting. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Linus Torvalds	de1a2262b0	2 writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle() -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRLrFaAAoJECvKgwp+S8JaV2IP/jo34e3Ht0gvIfxz9rh05dvR LqBmSAXXJQYgxUKUjYECuyLahIciniKYZp/fS6s5myOPAayiirB70rC1W85Kz8Sm uR1wDvG0g1AyK39kJas+WZw2fJFicthSSp29jhTH0upbEcMX+/tzsHTsJRH1WqI0 rtV8wHVxDu4+njz44hZIVxmJ9S7XZCuw8D6NfbyobmAqOm35j0VJ7uzQOxrNoJDe lvnwEGXfSU9KTfOIxt4K0d+lovXT6IRfN0qfdgcrWwxx9QJ/cU5F5b6cjdN9BsEF oq2UKSihbU55PdgUk6DfMJ3t7AXS/u2/P5a8PNfoNL9ovKQMJMHPXXDtxXmwCvcI aaYbULbwojMWZyrijViJpkftVKKtM/96X/jyCsof96UhJdah8c9wM44k1LDRBYXi WbQbD+doUII+pEmxUxF3Chrk/Yi3T5q2IWiVsixUEGewrSChOSqMIXOcSpgz97lL eGmNHgC/rn5TdDx8J3u0V+1+QYCvNxC25GG4E2+9QhU+mecLKt+IG1Dhn35xUjq1 kjgfrNWJC6zxlIq7owk8pTI7DxiV/iMqogR5mMDz0umrPrid/J/xb6zxuAcnk3WU j0clNu7gzIYB8NjxBskO3Fg2AWKJxSohpu+r9wcjmjf0T5uEUmLwpI0i4tdDlYNw IvcmOpF1I2Ja5TrW8HWw =j9Sn -----END PGP SIGNATURE----- Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fixes from Wu Fengguang: "Two writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: Negative (setpoint-dirty) in bdi_position_ratio() vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them	2013-02-28 13:21:44 -08:00
Miao Xie	d5c1207017	Btrfs: fix wrong reserved space in qgroup during snap/subv creation There are two problems in the space reservation of the snapshot/ subvolume creation. - don't reserve the space for the root item insertion - the space which is reserved in the qgroup is different with the free space reservation. we need reserve free space for 7 items, but in qgroup reservation, we need reserve space only for 3 items. So we implement new metadata reservation functions for the snapshot/subvolume creation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:54 -05:00
Qu Wenruo	fda2832feb	btrfs: cleanup for open-coded alignment Though most of the btrfs codes are using ALIGN macro for page alignment, there are still some codes using open-coded alignment like the following: ------ u64 mask = ((u64)root->stripesize - 1); u64 ret = (val + mask) & ~mask; ------ Or even hidden one: ------ num_bytes = (end - start + blocksize) & ~(blocksize - 1); ------ Sometimes these open-coded alignment is not so easy to understand for newbie like me. This commit changes the open-coded alignment to the ALIGN macro for a better readability. Also there is a previous patch from David Sterba with similar changes, but the patch is for 3.2 kernel and seems not merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:04:13 -05:00
Alexandre Oliva	a81cb9a2d9	clear chunk_alloc flag on retryable failure I've experienced filesystem freezes with permanent spikes in the active process count for quite a while, particularly on filesystems whose available raw space has already been fully allocated to chunks. While looking into this, I found a pretty obvious error in do_chunk_alloc: it sets space_info->chunk_alloc, but if btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving that flag set, which causes any other threads waiting for space_info->chunk_alloc to become zero to spin indefinitely. I haven't double-checked that this patch fixes the failure I've observed fully (it's not exactly trivial to trigger), but it surely is a bug and the fix is trivial, so... Please put it in :-) What I saw in that function also happens to explain why in some cases I see filesystems allocate a huge number of chunks that remain unused (leading to the scenario above, of not having more chunks to allocate). It happens for data and metadata, but not necessarily both. I'm guessing some thread sets the force_alloc flag on the corresponding space_info, and then several threads trying to get disk space end up attempting to allocate a new chunk concurrently. All of them will see the force_alloc flag and bump their local copy of force up to the level they see first, and they won't clear it even if another thread succeeds in allocating a chunk, thus clearing the force flag. Then each thread that observed the force flag will, on its turn, force the allocation of a new chunk. And any threads that come in while it does that will see the force flag still set and pick it up, and so on. This sounds like a problem to me, but... what should the correct behavior be? Clear force_flag once we copy it to a local force? Reset force to the incoming value on every loop? Set the flag to our incoming force if we have it at first, clear our local flag, and move it from the space_info when we determined that we are the thread that's going to perform the allocation? btrfs: clear chunk_alloc flag on retryable failure From: Alexandre Oliva <oliva@gnu.org> If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc without clearing chunk_alloc in space_info. As a result, any further calls to do_chunk_alloc on that filesystem will start busy-waiting for chunk_alloc to be cleared, but it never will be. This patch adjusts do_chunk_alloc so that it clears this flag in case of an error. Signed-off-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:00:51 -05:00
Linus Torvalds	9afa3195b9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Assorted tiny fixes queued in trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits) DocBook: update EXPORT_SYMBOL entry to point at export.h Documentation: update top level 00-INDEX file with new additions ARM: at91/ide: remove unsused at91-ide Kconfig entry percpu_counter.h: comment code for better readability x86, efi: fix comment typo in head_32.S IB: cxgb3: delay freeing mem untill entirely done with it net: mvneta: remove unneeded version.h include time: x86: report_lost_ticks doesn't exist any more pcmcia: avoid static analysis complaint about use-after-free fs/jfs: Fix typo in comment : 'how may' -> 'how many' of: add missing documentation for of_platform_populate() btrfs: remove unnecessary cur_trans set before goto loop in join_transaction sound: soc: Fix typo in sound/codecs treewide: Fix typo in various drivers btrfs: fix comment typos Update ibmvscsi module name in Kconfig. powerpc: fix typo (utilties -> utilities) of: fix spelling mistake in comment h8300: Fix home page URL in h8300/README xtensa: Fix home page URL in Kconfig ...	2013-02-21 17:40:58 -08:00
Zach Brown	24542bf7ea	btrfs: limit fallocate extent reservation to 256MB Very large fallocate requests are cpu bound and result in extents with a repeating pattern of ever decreasing size: $ time fallocate -l 1T file real 0m13.039s ( an excerpt of the extents from btrfs-debug-tree: ) prealloc data disk byte 1536292564992 nr 397312 prealloc data disk byte 1536292962304 nr 196608 prealloc data disk byte 1536293158912 nr 98304 prealloc data disk byte 1536293257216 nr 49152 prealloc data disk byte 1536293306368 nr 24576 prealloc data disk byte 1536293330944 nr 12288 prealloc data disk byte 1536293343232 nr 8192 prealloc data disk byte 1536293351424 nr 4096 prealloc data disk byte 1536293355520 nr 4096 prealloc data disk byte 1536293359616 nr 4096 The excessive cpu use comes from __btrfs_prealloc_file_range() trying to allocate the entire remaining size after each extent is allocated. btrfs_reserve_extent() repeatedly cuts this requested size in half until it gets down to the size that the allocators can return. We limit the problem for now by capping each reservation at 256 meg. The small extents come from a masking bug when decreasing the requested reservation size. The high 32bits are cleared and the remaining low bits might happen to reserve a small size. Fix this by using round_down() which properly casts the mask. After these fixes huge fallocate requests are fast and result in nice large extents: $ time fallocate -l 1T file real 0m0.082s prealloc data disk byte 1112425889792 nr 268435456 prealloc data disk byte 1112694325248 nr 268435456 prealloc data disk byte 1112962760704 nr 268435456 Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-20 14:06:25 -05:00
Chris Mason	e942f883bc	Merge branch 'raid56-experimental' into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/ctree.h fs/btrfs/extent-tree.c fs/btrfs/inode.c fs/btrfs/volumes.c	2013-02-20 14:06:05 -05:00
Chris Mason	b2c6b3e061	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/disk-io.c	2013-02-20 14:05:45 -05:00
David Sterba	b069e0c345	btrfs: put some enospc messages under enospc_debug The warning in use_block_rsv is not useful for users and may fill the logs unnecessarily. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:49 -05:00
Miao Xie	0934856d46	Btrfs: fix deadlock due to unsubmitted The deadlock problem happened when running fsstress(a test program in LTP). Steps to reproduce: # mkfs.btrfs -b 100M <partition> # mount <partition> <mnt> # <Path>/fsstress -p 3 -n 10000000 -d <mnt> The reason is: btrfs_direct_IO() \|->do_direct_IO() \|->get_page() \|->get_blocks() \| \|->btrfs_delalloc_resereve_space() \| \|->btrfs_add_ordered_extent() ------- Add a new ordered extent \|->dio_send_cur_page(page0) -------------- We didn't submit bio here \|->get_page() \|->get_blocks() \|->btrfs_delalloc_resereve_space() \|->flush_space() \|->btrfs_start_ordered_extent() \|->wait_event() ---------- Wait the completion of the ordered extent that is mentioned above But because we didn't submit the bio that is mentioned above, the ordered extent can not complete, we would wait for its completion forever. There are two methods which can fix this deadlock problem: 1. submit the bio before we invoke get_blocks() 2. reserve the space before we do dio Though the 1st is the simplest way, we need modify the code of VFS, and it is likely to break contiguous requests, and introduce performance regression for the other filesystems. So we have to choose the 2nd way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:45 -05:00
Josef Bacik	5d80366e9b	Btrfs: steal from global reserve if we are cleaning up orphans Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot cleanup the orphan items. In this case check to see if we're doing the orphan cleanup and if we are allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:42 -05:00
Josef Bacik	70afa3998c	Btrfs: rework the overcommit logic to be based on the total size People have been complaining about random ENOSPC errors that will clear up after a umount or just a given amount of time. Chris was able to reproduce this with stress.sh and lots of processes and so was I. Basically the overcommit stuff would really let us get out of hand, in my tests I saw up to 30 gigs of outstanding reservations with only 2 gigs total of metadata space. This usually worked out fine but with so much outstanding reservation the flushing stuff short circuits to make sure we don't hang forever flushing when we really need ENOSPC. Plus we allocate chunks in order to alleviate the pressure, but this doesn't actually help us since we only use the non-allocated area in our over commit logic. So instead of basing overcommit on the amount of non-allocated space, instead just do it based on how much total space we have, and then limit it to the non-allocated space in case we are short on space to spill over into. This allows us to have the same performance as well as no longer giving random ENOSPC. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:32 -05:00
Eric Sandeen	1971e917c8	btrfs: remove unnecessary DEFINE_WAIT() declarations No point in DEFINE_WAIT(wait) if it's not used! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:24 -05:00
Josef Bacik	96f1bb5777	Btrfs: do not overcommit if we don't have enough space for global rsv Because of how little we allocate chunks now we can get really tight on metadata space before we will allocate a new chunk. This resulted in being unable to add device extents when allocating a new metadata chunk as we did not have enough space. This is because we were allowed to overcommit too much metadata without actually making sure we had enough space to make allocations. The idea behind overcommit is that we are allowed to say "sure you can have that reservation" when most of the free space is occupied by reservations, not actual allocations. But in this case where a majority of the total space is in use by actual allocations we can screw ourselves by not being able to make real allocations when it matters. So make sure we have enough real space for our global reserve, and if not then don't allow overcommitting. Thanks, Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:13 -05:00
Miao Xie	de98ced9e7	Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:08 -05:00
Miao Xie	963d678b0f	Btrfs: use percpu counter for fs_info->delalloc_bytes fs_info->delalloc_bytes is accessed very frequently, so use percpu counter instead of the u64 variant for it to reduce the lock contention. This patch also fixed the problem that we access the variant without the lock protection.At worst, we would not flush the delalloc inodes, and just return ENOSPC error when we still have some free space in the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:05 -05:00
Miao Xie	e6ec716f0d	Btrfs: make raid attr array more readable The current code of raid attr arry is hard to understand and it is easy to introduce some problem if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:19 -05:00
Liu Bo	a1897fddd2	Btrfs: record first logical byte in memory This'd save us a rbtree search which may become expensive in large filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:18 -05:00
Liu Bo	dcfac4156f	Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:14 -05:00
Liu Bo	c53d613e52	Btrfs: kill unused argument of update_block_group Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:13 -05:00
Liu Bo	f6373bf3dc	Btrfs: kill unused arguments of cache_block_group Argument 'trans' and 'root' are not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:11 -05:00
Liu Bo	17b85495cf	Btrfs: remove deprecated comments commit `d53ba47484` (Btrfs: use commit root when loading free space cache) has remove the deadlock check, and the related comments can be removed as well. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:10 -05:00
Josef Bacik	c6b305a89b	Btrfs: don't re-enter when allocating a chunk If we start running low on metadata space we will try to allocate a chunk, which could then try to allocate a chunk to add the device entry. The thing is we allocate a chunk before we try really hard to make the allocation, so we should be able to find space for the device entry. Add a flag to the trans handle so we know we're currently allocating a chunk so we can just bail out if we try to allocate another chunk. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:09 -05:00
Miao Xie	da633a4217	Btrfs: flush all dirty inodes if writeback can not start We may try to flush some dirty pages when there is no enough space to reserve. But it is possible that this operation fails, in order to get enough space to reserve successfully, we will sync all the delalloc file. This operation is safe, we needn't worry about the case that the filesystem goes from r/w to r/o. because the filesystem should guarantee all the dirty pages have been written into the disk after it becomes readonly, so the sync operation will do nothing if the filesystem is already readonly. Though it may waste lots of time, as a corner case, we needn't care. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:42 -05:00
Miao Xie	093486c453	Btrfs: make delayed ref lock logic more readable Locking and unlocking delayed ref mutex are in the different functions, and the name of lock functions is not uniform, so the readability is not so good, this patch optimizes the lock logic and makes it more readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:41 -05:00
Miao Xie	78a6184a3f	Btrfs: use slabs for delayed reference allocation The delayed reference allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:34 -05:00
Linus Torvalds	8d19514fad	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: move d_instantiate outside the transaction during mksubvol Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata Btrfs: fix possible stale data exposure Btrfs: fix missing i_size update Btrfs: fix race between snapshot deletion and getting inode Btrfs: fix missing release of the space/qgroup reservation in start_transaction() Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() Btrfs: do not merge logged extents if we've removed them from the tree btrfs: don't try to notify udev about missing devices	2013-02-08 12:06:46 +11:00
Jan Schmidt	eb6b88d92c	Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by: Lev Vainblat <lev@zadarastorage.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 09:24:40 -05:00
Chris Mason	0e4e026366	Merge branch 'for-linus' into raid56-experimental Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:04:03 -05:00
Chris Mason	bb721703aa	Btrfs: reduce CPU contention while waiting for delayed extent operations We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	8de972b4fa	Btrfs: fix cluster alignment for mount -o ssd With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
David Woodhouse	53b381b3ab	Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
Jiri Kosina	617677295b	Merge branch 'master' into for-next Conflicts: drivers/devfreq/exynos4_bus.c Sync with Linus' tree to be able to apply patches that are against newer code (mvneta).	2013-01-29 10:48:30 +01:00
Linus Torvalds	d7df025eb4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits) Btrfs: fix repeated delalloc work allocation Btrfs: fix wrong max device number for single profile Btrfs: fix missed transaction->aborted check Btrfs: Add ACCESS_ONCE() to transaction->abort accesses Btrfs: put csums on the right ordered extent Btrfs: use right range to find checksum for compressed extents Btrfs: fix panic when recovering tree log Btrfs: do not allow logged extents to be merged or removed Btrfs: fix a regression in balance usage filter Btrfs: prevent qgroup destroy when there are still relations Btrfs: ignore orphan qgroup relations Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Btrfs: fix unlock order in btrfs_ioctl_rm_dev Btrfs: fix unlock order in btrfs_ioctl_resize Btrfs: fix "mutually exclusive op is running" error code Btrfs: bring back balance pause/resume logic btrfs: update timestamps on truncate() btrfs: fix btrfs_cont_expand() freeing IS_ERR em Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents Btrfs: fix off-by-one in lseek ...	2013-01-25 10:55:21 -08:00
Liu Bo	3268a2468e	Btrfs: reset path lock state to zero We forgot to reset the path lock state to zero after we unlock the path block, and this can lead to the ASSERT checker in tree unlock API. Reported-by: Slava Barinov <rayslava@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:53 -05:00
Liu Bo	ac5c93005b	Btrfs: let allocation start from the right raid type This'd avoid us empty looping. Say we have only one disk and the metadata raid type will be defaultly DUP, and we do not need to start from index=0(RAID10) and get over two empty loops to index=2(DUP). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:52 -05:00
Josef Bacik	72bcd99d45	Btrfs: set flushing if we're limited flushing We still need to say we're flushing if we're limit flushing to keep somebody from coming in and stealing our reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:51 -05:00
Miao Xie	10ee27a06c	vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read() with down_read_trylock() because - If ->s_umount is write locked, then the sb is not idle. That is writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock. - writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start writeback, it may bring us deadlock problem when doing umount. In order to fix the problem, ext4 and btrfs implemented their own writeback functions instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle(). The name of these two functions is cumbersome, so rename them to try_to_writeback_inodes_sb(_nr). This idea came from Christoph Hellwig. Some code is from the patch of Kamal Mostafa. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2013-01-12 10:47:43 +08:00
Liu Bo	2c016dc2cb	btrfs: fix comment typos Convert 'hepler' to 'helper'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-01-09 11:40:52 +01:00
Linus Torvalds	a22180d266	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "A big set of fixes and features. In terms of line count, most of the code comes from Stefan, who added the ability to replace a single drive in place. This is different from how btrfs normally replaces drives, and is much much much faster. Josef is plowing through our synchronous write performance. This pull request does not include the DIO_OWN_WAITING patch that was discussed on the list, but it has a number of other improvements to cut down our latencies and CPU time during fsync/O_DIRECT writes. Miao Xie has a big series of fixes and is spreading out ordered operations over more CPUs. This improves performance and reduces contention. I've put in fixes for error handling around hash collisions. These are going back to individual stable kernels as I test against them. Otherwise we have a lot of fixes and cleanups, thanks everyone! raid5/6 is being rebased against the device replacement code. I'll have it posted this Friday along with a nice series of benchmarks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits) Btrfs: fix a bug of per-file nocow Btrfs: fix hash overflow handling Btrfs: don't take inode delalloc mutex if we're a free space inode Btrfs: fix autodefrag and umount lockup Btrfs: fix permissions of empty files not affected by umask Btrfs: put raid properties into global table Btrfs: fix BUG() in scrub when first superblock reading gives EIO Btrfs: do not call file_update_time in aio_write Btrfs: only unlock and relock if we have to Btrfs: use tokens where we can in the tree log Btrfs: optimize leaf_space_used Btrfs: don't memset new tokens Btrfs: only clear dirty on the buffer if it is marked as dirty Btrfs: move checks in set_page_dirty under DEBUG Btrfs: log changed inodes based on the extent map tree Btrfs: add path->really_keep_locks Btrfs: do not mark ems as prealloc if we are writing to them Btrfs: keep track of the extents original block length Btrfs: inline csums if we're fsyncing Btrfs: don't bother copying if we're only logging the inode ...	2012-12-18 09:42:05 -08:00
Josef Bacik	c64c2bd890	Btrfs: don't take inode delalloc mutex if we're a free space inode This confuses and angers lockdep even though it's ok. We don't really need the lock for free space inodes since only the transaction committer will be reserving space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:29 -05:00
Josef Bacik	1135d6df22	Btrfs: fix autodefrag and umount lockup This happens because writeback_inodes_sb_nr_if_idle does down_read. This doesn't work for us and it has not been fixed upstream yet, so do it ourselves and use that instead so we can stop having this stupid long standing lockup. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:29 -05:00
Liu Bo	31e502298d	Btrfs: put raid properties into global table Raid properties can be shared among raid calculation code, we can put them into a global table to keep it simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:28 -05:00
Miao Xie	4b5829a8e3	Btrfs: fix missing reserved space release in error path of delalloc reservation We forget to release the reserved space in the error path of delalloc reservatiom, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:18 -05:00
Stefan Behrens	63a212abc2	Btrfs: disallow some operations on the device replace target device This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:39 -05:00
Stefan Behrens	3ec706c831	Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree This is required for the device replace procedure in a later step. Two calling functions also had to be changed to have the fs_info pointer: repair_io_failure() and scrub_setup_recheck_block(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:34 -05:00
Liu Bo	37c4146d22	Btrfs: fix a deadlock in aborting transaction due to ENOSPC When committing a transaction, we may bail out of running delayed refs due to ENOSPC, and then abort the current transaction to flip into readonly. But we'll hit a deadlock on ref head's lock since we forget to release its lock and other cleanup stuff. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:25 -05:00
Julia Lawall	31b1a2bd75	fs/btrfs: use WARN Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:23 -05:00
Josef Bacik	7b398f8e58	Btrfs: fill the global reserve when unpinning space Dave gave me an image of a very full file system that would abort the transaction because it ran out of space while committing the transaction. This is because we would think there was plenty of room to create a snapshot even though the global reserve was not full. This happens because we calculate the global reserve size before we unpin any space, so after we unpin the space we allow reservations to occur even though we haven't reserved all of the space for our global reserve. Fix this by adding to the global reserve while unpinning in order to make sure we always have enough space to do our work. With this patch we no longer end up with an aborted transaction, we return ENOSPC properly to the person trying to create the snapshot. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:36 -05:00
Miao Xie	08e007d2e5	Btrfs: improve the noflush reservation In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:31 -05:00
Miao Xie	561c294d4c	Btrfs: fix wrong comment in can_overcommit() The comment is not coincident with the code. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:30 -05:00
Miao Xie	3fed40cc97	Btrfs: cleanup duplicated division functions div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:30 -05:00
Adam Buchbinder	48fc7f7e78	Fix misspellings of "whether" in comments. "Whether" is misspelled in various comments across the tree; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-11-19 14:31:35 +01:00
Josef Bacik	44734ed1ca	Btrfs: don't commit instead of overcommitting I don't think we have the same problem that this was supposed to fix originally since we can allocate chunks in the enospc path now. This code is causing us to constantly commit the transaction as we get close to using all of our available space in our currently allocated chunks, instead of allocating another chunk and carrying on with life, which is not nice for performance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:42 -04:00
Josef Bacik	e6138876ad	Btrfs: cache extent state when writing out dirty metadata pages Everytime we write out dirty pages we search for an offset in the tree, convert the bits in the state, and then when we wait we search for the offset again and clear the bits. So for every dirty range in the io tree we are doing 4 rb searches, which is suboptimal. With this patch we are only doing 2 searches for every cycle (modulo weird things happening). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:41 -04:00
Josef Bacik	67b0fd63d5	Btrfs: run delayed refs first when out of space Running delayed refs is faster than running delalloc, so lets do that first to try and reclaim space. This makes my fs_mark test about 20% faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:39 -04:00
David Sterba	005d6427ac	btrfs: move transaction aborts to the point of failure Call btrfs_abort_transaction as early as possible when an error condition is detected, that way the line number reported is useful and we're not clueless anymore which error path led to the abort. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-08 20:09:02 -04:00
Liu Bo	6bbe3a9c80	Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents nocow_only is now an obsolete argument. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Liu Bo	ab26e9d6c8	Btrfs: cleanup for duplicated code in find_free_extent There is already an 'add free space' phrase in front of this one, we needn't to redo it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Josef Bacik	698d0082c4	Btrfs: remove bytes argument from do_chunk_alloc Everybody is just making stuff up, and it's just used to see if we really do need to alloc a chunk, and since we do this when we already know we really do it's just a waste of space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Josef Bacik	ea658badc4	Btrfs: delay block group item insertion So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Josef Bacik	a80c8dcf7e	Btrfs: fix our overcommit math I noticed I was seeing large lags when running my torrent test in a vm on my laptop. While trying to make it lag less I noticed that our overcommit math was taking into account the number of bytes we wanted to reclaim, not the number of bytes we actually wanted to allocate, which means we wouldn't overcommit as often. This patch fixes the overcommit math and makes shrink_delalloc() use that logic so that it will stop looping faster. We still have pretty high spikes of latency, but the test now takes 3 minutes less time (about 5% faster). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:16 -04:00
Josef Bacik	dea31f5233	Btrfs: wait on async pages when shrinking delalloc Mitch reported a problem where you could get an ENOSPC error when untarring a kernel git tree onto a 16gb file system with compress-force=zlib. This is because compression is a huge pain, it will return from ->writepages() without having actually created any ordered extents. To get around this we check to see if the async submit counter is up, and if it is wait until it drops to 0 before doing our normal ordered wait dance. With this patch I can now untar a kernel git tree onto a 16gb file system without getting ENOSPC errors. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:15 -04:00
Miao Xie	48c03c4bcf	Btrfs: fix wrong size for the reservation of the, snapshot creation We should insert/update 6 items(root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	66d8f3dd1c	Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Josef Bacik	2aaa665581	Btrfs: add hole punching This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	ca7e70f590	Btrfs: do not needlessly restart the transaction for enospc We will stop and restart a transaction every time we move to a different leaf when truncating a file. This is for enospc reasons, but really we could probably get away with doing this a little better by actually working until we hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do truncates which will fail as soon as the block rsv runs out of space, and then at that point we can stop and restart the transaction and refill the block rsv and carry on. This will make rm'ing of a file with lots of extents a bit faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:04 -04:00
Josef Bacik	54338b5cc4	Btrfs: do not allocate chunks as agressively Swinging this pendulum back the other way. We've been allocating chunks up to 2% of the disk no matter how much we actually have allocated. So instead fix this calculation to only allocate chunks if we have more than 80% of the space available allocated. Please test this as it will likely cause all sorts of ENOSPC problems to pop up suddenly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	ae1e206b80	Btrfs: allow delayed refs to be merged Daniel Blueman reported a bug with fio+balance on a ramdisk setup. Basically what happens is the balance relocates a tree block which will drop the implicit refs for all of its children and adds a full backref. Once the block is relocated we have to add the implicit refs back, so when we cow the block again we add the implicit refs for its children back. The problem comes when the original drop ref doesn't get run before we add the implicit refs back. The delayed ref stuff will specifically prefer ADD operations over DROP to keep us from freeing up an extent that will have references to it, so we try to add the implicit ref before it is actually removed and we panic. This worked fine before because the add would have just canceled the drop out and we would have been fine. But the backref walking work needs to be able to freeze the delayed ref stuff in time so we have this ever increasing sequence number that gets attached to all new delayed ref updates which makes us not merge refs and we run into this issue. So to fix this we need to merge delayed refs. So everytime we run a clustered ref we need to try and merge all of its delayed refs. The backref walking stuff locks the delayed ref head before processing, so if we have it locked we are safe to merge any refs inside of the sequence number. If there is no sequence number we can merge all refs. Doing this not only fixes our bug but keeps the delayed ref code from adding and removing useless refs and batching together multiple refs into one search instead of one search per delayed ref, which will really help our commit times. I ran this with Daniels test and 276 and I haven't seen any problems. Thanks, Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:38 -04:00
Arne Jansen	22cd2e7de7	Btrfs: fix race in run_clustered_refs With commit commit `d1270cd91f` Author: Arne Jansen <sensille@gmx.net> Date: Tue Sep 13 15:16:43 2011 +0200 Btrfs: put back delayed refs that are too new I added a window where the delayed_ref's head->ref_mod code can diverge from the sum of the remaining refs, because we release the head->mutex in the middle. This leads to btrfs_lookup_extent_info returning wrong numbers. This patch fixes this by adjusting the head's ref_mod with each delayed ref we run. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:35 -04:00
Josef Bacik	6fc823b10f	Btrfs: increase the size of the free space cache Arne was complaining about the space cache having mismatching generation numbers when debugging a deadlock. This is because we can run out of space in our preallocated range for our space cache if you have a pretty fragmented amount of space in your pinned space. So just increase the amount of space we preallocate for space cache so we can be sure to have enough space. This will only really affect data ranges since their the only chunks that end up larger than 256MB. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:34 -04:00
Arne Jansen	1fa11e265f	Btrfs: fix deadlock in wait_for_more_refs Commit `a168650c` introduced a waiting mechanism to prevent busy waiting in btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations, where a tree_mod_seq is held while waiting for the io to complete, while the end_io calls btrfs_run_delayed_refs. This whole mechanism is unnecessary. If not enough runnable refs are available to satisfy count, just return as count is more like a guideline than a strict requirement. In case we have to run all refs, commit transaction makes sure that no other threads are working in the transaction anymore, so we just assert here that no refs are blocked. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:32 -04:00
Dan Carpenter	55e591ffde	Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() We should release this mutex before returning the error code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-28 16:53:25 -04:00
Chris Mason	cd1cfc4915	Btrfs: add a barrier before a waitqueue_active check We were missing wakeups on the delayed ref waitqueue due to races on waitqueue_active. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 16:15:08 -04:00
Chris Mason	b478b2baa3	Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linus Conflicts: fs/btrfs/ioctl.c fs/btrfs/ioctl.h fs/btrfs/transaction.c fs/btrfs/transaction.h Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 16:11:38 -04:00
Liu Bo	df57dbe6bf	Btrfs: make btrfs's allocation smoothly with preallocation For backref walking, we've introduce delayed ref's sequence. However, it changes our preallocation behavior. The story is that when we preallocate an extent and then mark it written piece by piece, the ideal case should be that we don't need to COW the extent, which is why we use 'preallocate'. But we may not make use of preallocation, since when we check for cross refs on the extent, we may have two ref entries which have the same content except the sequence value, and we recognize them as cross refs and do COW to allocate another extent. So we end up with several pieces of space instead of an whole extent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:10 -04:00
Li Zefan	b4d7c3c945	Btrfs: kill free_space pointer from inode structure Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-07-23 16:28:05 -04:00
Liu Bo	799ffc3c31	Btrfs: add ro notification to dump_space_info Block group has ro attributes, make dump_space_info show it. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:02 -04:00
Liu Bo	cf7c1ef6e1	Btrfs: fix a bug of writting free space cache during balance Here is the whole story: 1) A free space cache consists of two parts: o free space cache inode, which is special becase it's stored in root tree. o free space info, which is stored as the above inode's file data. But we only build up another new inode and does not flush its free space info onto disk when we _clear and setup_ free space cache, and this ends up with that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN. And holding DC_SETUP means that we will not truncate this free space cache inode, which means the disk offset of its file extent will remain _unchanged_ at least until next transaction finishes committing itself. 2) We can set a block group readonly when we relocate the block group. However, if the readonly block group covers the disk offset where our free space cache inode is going to write, it will force the free space cache inode into cow_file_range() and it'll end up hitting a BUG_ON. 3) Due to the above analysis, we fix this bug by adding the missing dirty flag. 4) However, it's not over, there is still another case, nospace_cache. With nospace_cache, we do not want to set dirty flag, instead we just truncate free space cache inode and bail out with setting cache state DC_WRITTEN. We can benifit from it since it saves us another 'pre-allocation' part which usually costs a lot. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:02 -04:00
Liu Bo	0678938423	Btrfs: do not abort transaction in prealloc case During disk balance, we prealloc new file extent for file data relocation, but we may fail in 'no available space' case, and it leads to flipping btrfs into readonly. It is not necessary to bail out and abort transaction since we do have several ways to rescue ourselves from ENOSPC case. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:01 -04:00
Liu Bo	83eea1f1ba	Btrfs: kill root from btrfs_is_free_space_inode Since root can be fetched via BTRFS_I macro directly, we can save an args for btrfs_is_free_space_inode(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:00 -04:00
Josef Bacik	f4c738c2e7	Btrfs: rework shrink_delalloc So shrink_delalloc has grown all sorts of cruft over the years thanks to many reworkings of how we track enospc. What happens now as we fill up the disk is we will loop for freaking ever hoping to reclaim a arbitrary amount of space of metadata, this was from when everybody flushed at the same time. Now we only have people flushing one at a time. So instead of trying to reclaim a huge amount of space, just try to flush a decent chunk of space, and stop looping as soon as we have enough free space to satisfy our reservation. This makes xfstests 224 go much faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:58 -04:00
Josef Bacik	0e72110692	Btrfs: change how we indicate we're adding csums There is weird logic I had to put in place to make sure that when we were adding csums that we'd used the delalloc block rsv instead of the global block rsv. Part of this meant that we had to free up our transaction reservation before we ran the delayed refs since csum deletion happens during the delayed ref work. The problem with this is that when we release a reservation we will add it to the global reserve if it is not full in order to keep us going along longer before we have to force a transaction commit. By releasing our reservation before we run delayed refs we don't get the opportunity to drain down the global reserve for the work we did, so we won't refill it as often. This isn't a problem per-se, it just results in us possibly committing transactions more and more often, and in rare cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran out of space in our block rsv. This also helps us by holding onto space while the delayed refs run so we don't end up with as many people trying to do things at the same time, which again will help us not force commits or hit the use_block_rsv warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:55 -04:00
Josef Bacik	96c3f4331a	Btrfs: flush delayed inodes if we're short on space Those crazy gentoo guys have been complaining about ENOSPC errors on their portage volumes. This is because doing things like untar tends to create lots of new files which will soak up all the reservation space in the delayed inodes. Usually this gets papered over by the fact that we will try and commit the transaction, however if this happens in the wrong spot or we choose not to commit the transaction you will be screwed. So add the ability to expclitly flush delayed inodes to free up space. Please test this out guys to make sure it works since as usual I cannot reproduce. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 15:41:40 -04:00
Arne Jansen	c556723794	Btrfs: hooks to reserve qgroup space Like block reserves, reserve a small piece of space on each transaction start and for delalloc. These are the hooks that can actually return EDQUOT to the user. The amount of space reserved is tracked in the transaction handle. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-12 10:54:39 +02:00
Jan Schmidt	edf39272db	Btrfs: call the qgroup accounting functions Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-12 10:54:37 +02:00
Arne Jansen	bed92eae26	Btrfs: qgroup implementation and prototypes Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-12 10:54:21 +02:00
Arne Jansen	709c0486b9	Btrfs: Test code to change the order of delayed-ref processing Normally delayed refs get processed in ascending bytenr order. This correlates in most cases to the order added. To expose dependencies on this order, we start to process the tree in the middle instead of the beginning. This code is only effective when SCRAMBLE_DELAYED_REFS is defined. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:44 +02:00
Jan Schmidt	097b8a7c9e	Btrfs: join tree mod log code with the code holding back delayed refs We've got two mechanisms both required for reliable backref resolving (tree mod log and holding back delayed refs). You cannot make use of one without the other. So instead of requiring the user of this mechanism to setup both correctly, we join them into a single interface. Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list as we did before, which was of no value. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-10 15:14:41 +02:00
Jan Schmidt	8ca78f3eda	Btrfs: avoid waiting for delayed refs when we must not We track two conditions to decide if we should sleep while waiting for more delayed refs, the number of delayed refs (num_refs) and the first entry in the list of blockers (first_seq). When we suspect staleness, we save num_refs and do one more cycle. If nothing changes, we then save first_seq for later comparison and do wait_event. We ought to save first_seq the very same moment we're saving num_refs. Otherwise we cannot be sure that nothing has changed and we might start waiting when we shouldn't, which could lead to starvation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:35 +02:00
Chris Mason	1e20932a23	Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus Conflicts: fs/btrfs/ulist.h Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-31 16:49:53 -04:00
Josef Bacik	72ac3c0d79	Btrfs: convert the inode bit field to use the actual bit operations Miao pointed this out while I was working on an orphan problem that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever where we have different parts of the bit field protected by either no lock at all or different locks which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on that are all atomic so we can keep having our no/different locking for the different flags and then make force_compress it's own thing so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:36 -04:00
Jan Schmidt	5581a51a59	Btrfs: don't set for_cow parameter for tree block functions Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:53 +02:00
Dan Carpenter	a25c75d5ad	Btrfs: cleanup: use consistent lock naming It confuses Smatch that we use two names for the same lock. Plus the shorter name is nicer. This doesn't change how the code works, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-05-11 10:56:41 -04:00
Chris Mason	b9fab919b7	Btrfs: avoid sleeping in verify_parent_transid while atomic verify_parent_transid needs to lock the extent range to make sure no IO is underway, and so it can safely clear the uptodate bits if our checks fail. But, a few callers are using it with spinlocks held. Most of the time, the generation numbers are going to match, and we don't want to switch to a blocking lock just for the error case. This adds an atomic flag to verify_parent_transid, and changes it to return EAGAIN if it needs to block to properly verifiy things. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-06 07:23:47 -04:00
Stefan Behrens	1f699d38b6	Btrfs: fix block_rsv and space_info lock ordering may_commit_transaction() calls spin_lock(&space_info->lock); spin_lock(&delayed_rsv->lock); and update_global_block_rsv() calls spin_lock(&block_rsv->lock); spin_lock(&sinfo->lock); Lockdep complains about this at run time. Everywhere except in update_global_block_rsv(), the space_info lock is the outer lock, therefore the locking order in update_global_block_rsv() is changed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:14 -04:00
Arne Jansen	b9688bb845	btrfs: don't return EINTR It is basically a good thing if we are interruptible when waiting for free space, but the generality in which it is implemented currently leads to system calls being interruptible that are not documented this way. For example git can't handle interrupted unlink(), leading to corrupt repos under space pressure. Instead we raise the bar to only be interruptible by SIGKILL. Thanks to David Sterba for suggesting this. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-04-18 19:22:33 +02:00
Dan Carpenter	253beebd5a	Btrfs: double unlock bug in error handling The caller expects this function to return with the lock held and releases it immediately on error. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-04-18 19:22:31 +02:00
Josef Bacik	d53ba47484	Btrfs: use commit root when loading free space cache A user reported that booting his box up with btrfs root on 3.4 was way slower than on 3.3 because I removed the ideal caching code. It turns out that we don't load the free space cache if we're in a commit for deadlock reasons, but since we're reading the cache and it hasn't changed yet we are safe reading the inode and free space item from the commit root, so do that and remove all of the deadlock checks so we don't unnecessarily skip loading the free space cache. The user reported this fixed the slowness. Thanks, Tested-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-12 20:54:01 -04:00
Ilya Dryomov	c6664b42c4	Btrfs: remove lock assert from get_restripe_target() This fixes a regression introduced by `fc67c450`. spin_is_locked() always returns 0 on UP kernels, which caused assert in get_restripe_target() to be fired on every call from btrfs_reduce_alloc_profile() on UP systems. Remove it completely for now, it's not clear if it's going to be needed in future. Reported-by: Bobby Powers <bobbypowers@gmail.com> Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Tested-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-12 16:03:56 -04:00
Chris Mason	8e62c2de6e	Revert "Btrfs: increase the global block reserve estimates" This reverts commit `5500cdbe14`. We've had a number of complaints of early enospc that bisect down to this patch. We'll hae to fix the reservations differently. CC: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-12 13:46:48 -04:00
Liu Bo	15d1ff8111	Btrfs: fix deadlock during allocating chunks This deadlock comes from xfstests 251. We'll hold the chunk_mutex throughout the whole of a chunk allocation. But if we find that we've used up system chunk space, we need to allocate a new system chunk, but this will lead to a recursion of chunk allocation and end up with a deadlock on chunk_mutex. So instead we need to allocate the system chunk first if we find we're in ENOSPC. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-29 09:57:44 -04:00
Liu Bo	2bcc0328c3	Btrfs: show useful info in space reservation tracepoint o For space info, the type of space info is useful for debug. o For transaction handle, its transid is useful. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-29 09:57:44 -04:00
Chris Mason	1c691b330a	Merge branch 'for-chris' of git://github.com/idryomov/btrfs-unstable into for-linus	2012-03-28 20:32:46 -04:00
Chris Mason	1d4284bd6e	Merge branch 'error-handling' into for-linus Conflicts: fs/btrfs/ctree.c fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/inode.c fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-03-28 20:31:37 -04:00
Ilya Dryomov	4a5e98f5d6	Btrfs: improve the logic in btrfs_can_relocate() Currently if we don't have enough space allocated we go ahead and loop though devices in the hopes of finding enough space for a chunk of the same type as the one we are trying to relocate. The problem with that is that if we are trying to restripe the chunk its target type can be more relaxed than the current one (eg require less devices or less space). So, when restriping, run checks against the target profile instead of the current one. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	7738a53a3a	Btrfs: add __get_block_group_index() helper Add __get_block_group_index() helper to be able to derive block group index from an arbitary set of flags. Implement get_block_group_index() in terms of it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	fc67c45083	Btrfs: add get_restripe_target() helper Add get_restripe_target() helper and switch everybody to use it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	0c460c0d70	Btrfs: move alloc_profile_is_valid() to volumes.c Header file is not a good place to define functions. This also moves a call to alloc_profile_is_valid() down the stack and removes a redundant check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it into account. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	e8920a640b	Btrfs: make profile_is_valid() check more strict "0" is a valid value for an on-disk chunk profile, but it is not a valid extended profile. (We have a separate bit for single chunks in extended case) Also rename it to alloc_profile_is_valid() for clarity. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:17 +03:00
Ilya Dryomov	899c81eac8	Btrfs: add wrappers for working with alloc profiles Add functions to abstract the conversion between chunk and extended allocation profile formats and switch everybody to use them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:16 +03:00
Ilya Dryomov	e3176ca276	Btrfs: stop silently switching single chunks to raid0 on balance This has been causing a lot of confusion for quite a while now and a lot of users were surprised by this (some of them were even stuck in a ENOSPC situation which they couldn't easily get out of). The addition of restriper gives users a clear choice between raid0 and drive concat setup so there's absolutely no excuse for us to keep doing this. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-03-27 17:09:16 +03:00
Josef Bacik	3083ee2e18	Btrfs: introduce free_extent_buffer_stale Because btrfs cow's we can end up with extent buffers that are no longer necessary just sitting around in memory. So instead of evicting these pages, we could end up evicting things we actually care about. Thus we have free_extent_buffer_stale for use when we are freeing tree blocks. This will make it so that the ref for the eb being in the radix tree is dropped as soon as possible and then is freed when the refcount hits 0 instead of waiting to be released by releasepage. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-03-26 16:51:08 -04:00
Josef Bacik	81c9ad237c	Btrfs: remove search_start and search_end from find_free_extent and callers We have been passing nothing but (u64)-1 to find_free_extent for search_end in all of the callers, so it's completely useless, and we've always been passing 0 in as search_start, so just remove them as function arguments and move search_start into find_free_extent. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-03-26 14:42:51 -04:00
Josef Bacik	285ff5af6c	Btrfs: remove the ideal caching code This is a relic from before we had the disk space cache and it was to make bootup times when you had btrfs as root not be so damned slow. Now that we have the disk space cache this isn't a problem anymore and really having this code casues uneeded fragmentation and complexity, so just remove it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-03-26 14:42:51 -04:00
Jeff Mahoney	79787eaab4	btrfs: replace many BUG_ONs with proper error handling btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 11:52:54 +01:00
Jeff Mahoney	2c536799f1	btrfs: btrfs_drop_snapshot should return int Commit `cb1b69f4` (Btrfs: forced readonly when btrfs_drop_snapshot() fails) made btrfs_drop_snapshot return void because there were no callers checking the return value. That is the wrong order to handle error propogation since the caller will have no idea that an error has occured and continue on as if nothing went wrong. Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:36 +01:00
Jeff Mahoney	143bede527	btrfs: return void in functions without error conditions Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:34 +01:00
Jeff Mahoney	538042801a	btrfs: avoid NULL deref in btrfs_reserve_extent with DEBUG_ENOSPC __find_space_info can return NULL but we don't check it before calling dump_space_info(). Signed-off-by: Jeff Mahoney <jeffm@suse.com>	2012-03-22 01:45:32 +01:00
Chris Mason	e77266e4c4	Btrfs: fix compiler warnings on 32 bit systems The enospc tracing code added some interesting uses of u64 pointer casts. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-02-24 10:39:05 -05:00
Liu Bo	5500cdbe14	Btrfs: increase the global block reserve estimates When doing IO with large amounts of data fragmentation, the global block reserve calulations are too low. This increases them to avoid ENOSPC crashes. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-02-23 10:49:04 -05:00
Liu Bo	d9b0218f6c	Btrfs: fix a bug on overcommit stuff When overcommitting, we should check the sum of pinned space and bytes for delayed item. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-02-16 17:23:18 +01:00
Liu Bo	2cac13e41b	Btrfs: fix trim 0 bytes after a device delete A user reported a bug of btrfs's trim, that is we will trim 0 bytes after a device delete. The reproducer: $ mkfs.btrfs disk1 $ mkfs.btrfs disk2 $ mount disk1 /mnt $ fstrim -v /mnt $ btrfs device add disk2 /mnt $ btrfs device del disk1 /mnt $ fstrim -v /mnt This is because after we delete the device, the block group may start from a non-zero place, which will confuse trim to discard nothing. Reported-by: Lutz Euler <lutz.euler@freenet.de> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-02-15 16:40:23 +01:00
Miao Xie	9e622d6bea	Btrfs: fix enospc error caused by wrong checks of the chunk When we did sysbench test for inline files, enospc error happened easily though there was lots of free disk space which could be allocated for new chunks. Reproduce steps: # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition> # mount <test partition> /mnt # ulimit -n 102400 # cd /mnt # sysbench --num-threads=1 --test=fileio --file-num=81920 \ > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \ > --file-test-mode=seqwr prepare # sysbench --num-threads=1 --test=fileio --file-num=81920 \ > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \ > --file-test-mode=seqwr run <soon later, BUG_ON() was triggered by enospc error> The reason of this bug is: Now, we can reserve space which is larger than the free space in the chunks if we have enough free disk space which can be used for new chunks. By this way, the space allocator should allocate a new chunk by force if there is no free space in the free space cache. But there are two wrong checks which break this operation. One is if (ret == -ENOSPC && num_bytes > min_alloc_size) in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk even we fail to allocate free space by minimum allocable size. The other is if (space_info->force_alloc) force = space_info->force_alloc; in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen. Fix these two wrong checks. Especially the second one, we fix it by changing the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has higher priority. And if the value which is passed in by the caller is greater than ->force_alloc, use the passed value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-26 15:01:12 -05:00
Chris Mason	96bdc7dc61	Btrfs: use larger system chunks system chunks by default are very small. This makes them slightly larger and also fixes the conditional checks to make sure we don't allocate a billion of them at once. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-16 15:38:24 -05:00
Josef Bacik	f248679e86	Btrfs: add a delalloc mutex to inodes for delalloc reservations I was using i_mutex for this, but we're getting bogus lockdep warnings by doing that and theres no real way to get rid of those, so just stop using i_mutex to protect delalloc metadata reservations and use a delalloc mutex instead. This shouldn't be contended often at all, only if you are writing and mmap writing to the file at the same time. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-01-16 15:29:43 -05:00
Josef Bacik	8c2a3ca20f	Btrfs: space leak tracepoints This in addition to a script in my btrfs-tracing tree will help track down space leaks when we're getting space left over in block groups on umount. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-01-16 15:29:43 -05:00
Josef Bacik	3f7de037fb	Btrfs: add allocator tracepoints I used these tracepoints when figuring out what the cluster stuff was doing, so add them to mainline in case we need to profile this stuff again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-01-16 15:29:42 -05:00
Chris Mason	9785dbdf26	Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration	2012-01-16 15:26:31 -05:00
Chris Mason	d756bd2d93	Merge branch 'for-chris' of git://repo.or.cz/linux-btrfs-devel into integration Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-16 15:26:17 -05:00
Chris Mason	27263e2832	Merge branch 'restriper' of git://github.com/idryomov/btrfs-unstable into integration	2012-01-16 15:26:02 -05:00
Ilya Dryomov	e4d8ec0f65	Btrfs: implement online profile changing Profile changing is done by launching a balance with BTRFS_BALANCE_CONVERT bits set and target fields of respective btrfs_balance_args structs initialized. Profile reducing code in this case will pick restriper's target profile if it's available instead of doing a blind reduce. If target profile is not yet available it goes back to a plain reduce. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:48 +02:00
Ilya Dryomov	70922617b0	Btrfs: do not reduce profile in do_chunk_alloc() Every caller of do_chunk_alloc() feeds it the reduced allocation profile, so stop trying to reduce it one more time. Instead check the validity of the passed profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:48 +02:00
Ilya Dryomov	10ea00f55a	Btrfs: make avail_*_alloc_bits fields dynamic Currently when new chunks are created respective avail_alloc_bits field is updated to reflect profiles of all chunks present in the system. However when chunks are removed profile bits are never cleared. This patch clears profile bit of respective avail_alloc_bits field when the last chunk with that profile is removed. Restriper needs this to properly operate when "downgrading". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	a46d11a8b0	Btrfs: add BTRFS_AVAIL_ALLOC_BIT_SINGLE bit Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for avail_{data,metadata,system}_alloc_bits fields, which gather info about available allocation profiles in the FS. When chunk is created or read from disk, its profile is OR'ed with the corresponding avail_alloc_bits field. Since SINGLE is denoted by 0 in the on-disk format, currently there is no way to tell when such chunks become avaialble. Restriper needs that information, so add a separate bit for SINGLE profile. This bit is going to be in-memory only, it should never be written out to disk, so it's not a disk format change. However to avoid remappings in future, reserve corresponding on-disk bit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	52ba692972	Btrfs: introduce masks for chunk type and profile Chunk's type and profile are encoded in u64 flags field. Introduce masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_* constants, it should be ULL. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Ilya Dryomov	6fef8df1dc	Btrfs: get rid of *_alloc_profile fields {data,metadata,system}_alloc_profile fields have been unused for a long time now. Get rid of them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-01-16 22:04:47 +02:00
Li Zefan	c7c144db53	Btrfs: update global block_rsv when creating a new block group A bug was triggered while using seed device: # mkfs.btrfs /dev/loop1 # btrfstune -S 1 /dev/loop1 # mount -o /dev/loop1 /mnt # btrfs dev add /dev/loop2 /mnt btrfs: block rsv returned -28 ------------[ cut here ]------------ WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]() ... Call Trace: ... [<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs] [<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs] [<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs] [<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs] [<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs] [<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs] [<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs] [<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs] [<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs] [<c04f55f7>] do_vfs_ioctl+0x496/0x4cc [<c04f5660>] sys_ioctl+0x33/0x4f [<c07b9edf>] sysenter_do_call+0x12/0x38 ---[ end trace 906adac595facc7d ]--- Since seed device is readonly, there's no usable space in the filesystem. Afterwards we add a sprout device to it, and the kernel creates a METADATA block group and a SYSTEM block group where comes free space we can reserve, but we still get revervation failure because the global block_rsv hasn't been updated accordingly. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2012-01-11 10:26:52 +08:00
Li Zefan	125ccb0ae6	Btrfs: don't pass a trans handle unnecessarily in volumes.c Some functions never use the transaction handle passed to them. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2012-01-11 10:26:42 +08:00
Alexandre Oliva	fc7c1077ce	Btrfs: don't set up allocation result twice We store the allocation start and length twice in ins, once right after the other, but with intervening calls that may prevent the duplicate from being optimized out by the compiler. Remove one of the assignments. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-07 19:15:14 -05:00
Alexandre Oliva	a5f6f719a5	Btrfs: test free space only for unclustered allocation Since the clustered allocation may be taking extents from a different block group, there's no point in spin-locking and testing the current block group free space before attempting to allocate space from a cluster, even more so when we might refrain from even trying the cluster in the current block group because, after the cluster was set up, not enough free space remained. Furthermore, cluster creation attempts fail fast when the block group doesn't have enough free space, so the test was completely superfluous. I've move the free space test past the cluster allocation attempt, where it is more useful, and arranged for a cluster in the current block group to be released before trying an unclustered allocation, when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in the cluster stands a chance of being combined with additional free space in the block group so as to succeed in the allocation attempt. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-06 15:48:21 -05:00
Chris Mason	cf1d72c9ce	Btrfs: lower the bar for chunk allocation The chunk allocation code has tried to keep a pretty tight lid on creating new metadata chunks. This is partially because in the past the reservation code didn't give us an accurate idea of how much space was being used. The new code is much more accurate, so we're able to get rid of some of these checks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-06 15:41:34 -05:00
Chris Mason	203bf287cb	Btrfs: run chunk allocations while we do delayed refs Btrfs tries to batch extent allocation tree changes to improve performance and reduce metadata trashing. But it doesn't allocate new metadata chunks while it is doing allocations for the extent allocation tree. This commit changes the delayed refence code to do chunk allocations if we're getting low on room. It prevents crashes and improves performance. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-01-06 15:23:57 -05:00
Jan Schmidt	a168650c08	Btrfs: add waitqueue instead of doing busy waiting for more delayed refs Now that we may be holding back delayed refs for a limited period, we might end up having no runnable delayed refs. Without this commit, we'd do busy waiting in that thread until another (runnable) ref arives. Instead, we're detecting this situation and use a waitqueue, such that we only try to run more refs after a) another runnable ref was added or b) delayed refs are no longer held back Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-01-04 16:12:48 +01:00
Arne Jansen	d1270cd91f	Btrfs: put back delayed refs that are too new When processing a delayed ref, first check if there are still old refs in the process of being added. If so, put this ref back to the tree. To avoid looping on this ref, choose a newer one in the next loop. btrfs_find_ref_cluster has to take care of that. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-01-04 16:12:45 +01:00
Arne Jansen	66d7e7f09f	Btrfs: mark delayed refs as for cow Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value from every call site. The for_cow parameter will later on be used to determine if a ref will change anything with respect to qgroups. Delayed refs coming from relocation are always counted as for_cow, as they don't change subvol quota. Also pass in the fs_info for later use. btrfs_find_all_roots() will use this as an optimization, as changes that are for_cow will not change anything with respect to which root points to a certain leaf. Thus, we don't need to add the current sequence number to those delayed refs. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2011-12-22 16:22:27 +01:00
Linus Torvalds	c9a7fe9672	Merge branches 'for-linus' and 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: unplug every once and a while Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code Btrfs: only set cache_generation if we setup the block group Btrfs: don't panic if orphan item already exists Btrfs: fix leaked space in truncate Btrfs: fix how we do delalloc reservations and how we free reservations on error Btrfs: deal with enospc from dirtying inodes properly Btrfs: fix num_workers_starting bug and other bugs in async thread BTRFS: Establish i_ops before calling d_instantiate Btrfs: add a cond_resched() into the worker loop Btrfs: fix ctime update of on-disk inode btrfs: keep orphans for subvolume deletion Btrfs: fix inaccurate available space on raid0 profile Btrfs: fix wrong disk space information of the files Btrfs: fix wrong i_size when truncating a file to a larger size Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror * 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: lower the dirty balance poll interval	2011-12-16 12:15:50 -08:00
Josef Bacik	e65cbb94e0	Btrfs: only set cache_generation if we setup the block group A user reported a problem booting into a new kernel with the old format inodes. He was panicing in cow_file_range while writing out the inode cache. This is because if the block group is not cached we'll just skip writing out the cache, however if it gets dirtied again in the same transaction and it finished caching we'd go ahead and write it out, but since we set cache_generation to the transid we think we've already truncated it and will just carry on, running into cow_file_range and blowing up. We need to make sure we only set cache_generation if we've done the truncate. The user tested this patch and verified that the panic no longer occured. Thanks, Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-12-15 11:04:24 -05:00
Josef Bacik	660d3f6cde	Btrfs: fix how we do delalloc reservations and how we free reservations on error Running xfstests 269 with some tracing my scripts kept spitting out errors about releasing bytes that we didn't actually have reserved. This took me down a huge rabbit hole and it turns out the way we deal with reserved_extents is wrong, we need to only be setting it if the reservation succeeds, otherwise the free() method will come in and unreserve space that isn't actually reserved yet, which can lead to other warnings and such. The math was all working out right in the end, but it caused all sorts of other issues in addition to making my scripts yell and scream and generally make it impossible for me to track down the original issue I was looking for. The other problem is with our error handling in the reservation code. There are two cases that we need to deal with 1) We raced with free. In this case free won't free anything because csum_bytes is modified before we dro the lock in our reservation path, so free rightly doesn't release any space because the reservation code may be depending on that reservation. However if we fail, we need the reservation side to do the free at that point since that space is no longer in use. So as it stands the code was doing this fine and it worked out, except in case #2 2) We don't race with free. Nobody comes in and changes anything, and our reservation fails. In this case we didn't reserve anything anyway and we just need to clean up csum_bytes but not free anything. So we keep track of csum_bytes before we drop the lock and if it hasn't changed we know we can just decrement csum_bytes and carry on. Because of the case where we can race with free()'s since we have to drop our spin_lock to do the reservation, I'm going to serialize all reservations with the i_mutex. We already get this for free in the heavy use paths, truncate and file write all hold the i_mutex, just needed to add it to page_mkwrite and various ioctl/balance things. With this patch my space leak scripts no longer scream bloody murder. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-12-15 11:04:22 -05:00
Linus Torvalds	fb38f9b8fe	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: drop spin lock when memory alloc fails Btrfs: check if the to-be-added device is writable Btrfs: try cluster but don't advance in search list Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE	2011-12-08 13:18:59 -08:00
Alexandre Oliva	274bd4fb3e	Btrfs: try cluster but don't advance in search list When we find an existing cluster, we switch to its block group as the current block group, possibly skipping multiple blocks in the process. Furthermore, under heavy contention, multiple threads may fail to allocate from a cluster and then release just-created clusters just to proceed to create new ones in a different block group. This patch tries to allocate from an existing cluster regardless of its block group, and doesn't switch to that group, instead proceeding to try to allocate a cluster from the group it was iterating before the attempt. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-12-08 08:55:40 -05:00
Alexandre Oliva	062c05c46b	Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that others might have set up. Odds are that there won't be one, but if someone else succeeded in setting it up, we might as well use it, even if we don't try to set up a cluster again. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-12-07 19:50:42 -05:00
Linus Torvalds	b930c26416	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix meta data raid-repair merge problem Btrfs: skip allocation attempt from empty cluster Btrfs: skip block groups without enough space for a cluster Btrfs: start search for new cluster at the beginning Btrfs: reset cluster's max_size when creating bitmap Btrfs: initialize new bitmaps' list Btrfs: fix oops when calling statfs on readonly device Btrfs: Don't error on resizing FS to same size Btrfs: fix deadlock on metadata reservation when evicting a inode Fix URL of btrfs-progs git repository in docs btrfs scrub: handle -ENOMEM from init_ipath()	2011-12-01 08:28:53 -08:00
Alexandre Oliva	be064d1139	Btrfs: skip allocation attempt from empty cluster If we don't have a cluster, don't bother trying to allocate from it, jumping right away to the attempt to allocate a new cluster. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-30 13:43:00 -05:00
Alexandre Oliva	425d83156c	Btrfs: skip block groups without enough space for a cluster We test whether a block group has enough free space to hold the requested block, but when we're doing clustered allocation, we can save some cycles by testing whether it has enough room for the cluster upfront, otherwise we end up attempting to set up a cluster and failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered allocation, and by then we'll have zeroed the cluster size, so this patch won't stop us from using the block group as a last resort. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-30 13:43:00 -05:00
Alexandre Oliva	1b22bad779	Btrfs: start search for new cluster at the beginning Instead of starting at zero (offset is always zero), request a cluster starting at search_start, that denotes the beginning of the current block group. Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-30 13:43:00 -05:00
Miao Xie	aa38a711a8	Btrfs: fix deadlock on metadata reservation when evicting a inode When I ran the xfstests, I found the test tasks was blocked on meta-data reservation. By debugging, I found the reason of this bug: start transaction \| v reserve meta-data space \| v flush delay allocation -> iput inode -> evict inode ^ \| \| v wait for delay allocation flush <- reserve meta-data space And besides that, the flush on evicting inode will block the thread, which is reclaiming the memory, and make oom happen easily. Fix this bug by skipping the flush step when evicting inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2011-11-30 18:46:03 +01:00
Linus Torvalds	af36d15f58	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: remove free-space-cache.c WARN during log replay Btrfs: sectorsize align offsets in fiemap Btrfs: clear pages dirty for io and set them extent mapped Btrfs: wait on caching if we're loading the free space cache Btrfs: prefix resize related printks with btrfs: btrfs: fix stat blocks accounting Btrfs: avoid unnecessary bitmap search for cluster setup Btrfs: fix to search one more bitmap for cluster setup btrfs: mirror_num should be int, not u64 btrfs: Fix up 32/64-bit compatibility for new ioctls Btrfs: fix barrier flushes Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush	2011-11-22 08:53:40 -08:00
Josef Bacik	291c7d2f57	Btrfs: wait on caching if we're loading the free space cache We've been hitting panics when running xfstest 13 in a loop for long periods of time. And actually this problem has always existed so we've been hitting these things randomly for a while. Basically what happens is we get a thread coming into the allocator and reading the space cache off of disk and adding the entries to the free space cache as we go. Then we get another thread that comes in and tries to allocate from that block group. Since block_group->cached != BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because if we're doing the old slow way of caching we don't want to hold people up and wait for everything to finish. The problem with this is we could end up discarding the space cache at some arbitrary point in the future, which means we could very well end up allocating space that is either bad, or when the real caching happens it could end up thinking the space isn't in use when it really is and cause all sorts of other problems. The solution is to add a new flag to indicate we are loading the free space cache from disk, and always try to cache the block group if cache->cached != BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else who tries to allocate from the block group will have to wait until it's finished to make sure it completes successfully. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-11-20 07:42:16 -05:00
Linus Torvalds	c1f4246716	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: rename the option to nospace_cache Btrfs: handle bio_add_page failure gracefully in scrub Btrfs: fix deadlock caused by the race between relocation Btrfs: only map pages if we know we need them when reading the space cache Btrfs: fix orphan backref nodes Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush} Btrfs: fix unreleased path in btrfs_orphan_cleanup() Btrfs: fix no reserved space for writing out inode cache Btrfs: fix nocow when deleting the item Btrfs: tweak the delayed inode reservations again Btrfs: rework error handling in btrfs_mount() Btrfs: close devices on all error paths in open_ctree() Btrfs: avoid null dereference and leaks when bailing from open_ctree() Btrfs: fix subvol_name leak on error in btrfs_mount() Btrfs: fix memory leak in btrfs_parse_early_options() Btrfs: fix our reservations for updating an inode when completing io Btrfs: fix oops on NULL trans handle in btrfs_truncate btrfs: fix double-free 'tree_root' in 'btrfs_mount()'	2011-11-11 23:47:06 -02:00
Miao Xie	61b520a9d0	Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush} btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-10 20:45:05 -05:00
Josef Bacik	7fd2ae21a4	Btrfs: fix our reservations for updating an inode when completing io People have been reporting ENOSPC crashes in finish_ordered_io. This is because we try to steal from the delalloc block rsv to satisfy a reservation to update the inode. The problem with this is we don't explicitly save space for updating the inode when doing delalloc. This is kind of a problem and we've gotten away with this because way back when we just stole from the delalloc reserve without any questions, and this worked out fine because generally speaking the leaf had been modified either by the mtime update when we did the original write or because we just updated the leaf when we inserted the file extent item, only on rare occasions had the leaf not actually been modified, and that was still ok because we'd just use a block or two out of the over-reservation that is delalloc. Then came the delayed inode stuff. This is amazing, except it wants a full reservation for updating the inode since it may do it at some point down the road after we've written the blocks and we have to recow everything again. This worked out because the delayed inode stuff just stole from the global reserve, that is until recently when I changed that because it caused other problems. So here we are, we're doing everything right and being screwed for it. So take an extra reservation for the inode at delalloc reservation time and carry it through the life of the delalloc reservation. If we need it we can steal it in the delayed inode stuff. If we have already stolen it try and do a normal metadata reservation. If that fails try to steal from the delalloc reservation. If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to solve this and in the meantime we'll steal from the global reserve. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see any problems. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-08 15:47:34 -05:00
Linus Torvalds	6a6662ced4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits) Btrfs: check for a null fs root when writing to the backup root log Btrfs: fix race during transaction joins Btrfs: fix a potential btrfs_bio leak on scrub fixups Btrfs: rename btrfs_bio multi -> bbio for consistency Btrfs: stop leaking btrfs_bios on readahead Btrfs: stop the readahead threads on failed mount Btrfs: fix extent_buffer leak in the metadata IO error handling Btrfs: fix the new inspection ioctls for 32 bit compat Btrfs: fix delayed insertion reservation Btrfs: ClearPageError during writepage and clean_tree_block Btrfs: be smarter about committing the transaction in reserve_metadata_bytes Btrfs: make a delayed_block_rsv for the delayed item insertion Btrfs: add a log of past tree roots btrfs: separate superblock items out of fs_info Btrfs: use the global reserve when truncating the free space cache inode Btrfs: release metadata from global reserve if we have to fallback for unlink Btrfs: make sure to flush queued bios if write_cache_pages waits Btrfs: fix extent pinning bugs in the tree log Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN Btrfs: don't wait as long for more batches during SSD log commit ...	2011-11-06 20:03:41 -08:00
Chris Mason	806468f8bf	Merge git://git.jan-o-sch.net/btrfs-unstable into integration Conflicts: fs/btrfs/Makefile fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:07:10 -05:00
Josef Bacik	c06a0e120a	Btrfs: fix delayed insertion reservation We all keep getting those stupid warnings from use_block_rsv when running stress.sh, and it's because the delayed insertion stuff is being stupid. It's not the delayed insertion stuffs fault, it's all just stupid. When marking an inode dirty for oh say updating the time on it, we just do a btrfs_join_transaction, which doesn't reserve any space. This is stupid because we're going to have to have space reserve to make this change, but we do it because it's fast because chances are we're going to call it over and over again and it doesn't matter. Well thanks to the delayed insertion stuff this is mostly the case, so we do actually need to make this reservation. So if trans->bytes_reserved is 0 then try to do a normal reservation. If not return ENOSPC which will make the btrfs_dirty_inode start a proper transaction which will let it do the whole ENOSPC dance and reserve enough space for the delayed insertion to steal the reservation from the transaction. The other stupid thing we do is not reserve space for the inode when writing to the thing. Usually this is ok since we have to update the time so we'd have already done all this work before we get to the endio stuff, so it doesn't matter. But this is stupid because we could write the data after the transaction commits where we changed the mtime of the inode so we have to cow all the way down to the inode anyway. This used to be masked by the delalloc reservation stuff, but because we delay the update it doesn't get masked in this case. So again the delayed insertion stuff bites us in the ass. So if our trans->block_rsv is delalloc, just steal the reservation from the delalloc reserve. Hopefully this won't bite us in the ass, but I've said that before. With this patch stress.sh no longer spits out those stupid warnings (famous last words). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:20 -05:00
Josef Bacik	663350ac38	Btrfs: be smarter about committing the transaction in reserve_metadata_bytes Because of the overcommit stuff I had to make it so that we committed the transaction all the time in reserve_metadata_bytes in case we had overcommitted because of delayed items. This was because previously we had no way of knowing how much space was reserved for delayed items. Now that we have the delayed_block_rsv we can check it to see if committing the transaction would get us anywhere. This patch breaks out the committing logic into a helper function that will check to see if committing the transaction would free enough space for us to get anything done. With this patch xfstests 83 goes from taking 445 seconds to taking 28 seconds on my box. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:19 -05:00
Josef Bacik	6d668dda0c	Btrfs: make a delayed_block_rsv for the delayed item insertion I've been hitting warnings in use_block_rsv when running the delayed insertion stuff. It's because we will readjust global block rsv based on what is in use, which means we could end up discarding reservations that are for the delayed insertion stuff. So instead create a seperate block rsv for the delayed insertion stuff. This will also make it easier to debug problems with the delayed insertion reservations since we will know that only the delayed insertion code touches this block_rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-11-06 03:04:18 -05:00
David Sterba	6c41761fc6	btrfs: separate superblock items out of fs_info fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-11-06 03:04:01 -05:00
Chris Mason	e688b7252f	Btrfs: fix extent pinning bugs in the tree log The tree log had two important bugs that could cause corruptions after a crash. Sometimes we were allowing tree log blocks to be reused after the tree log was committed but before the transaction commit was done. This allowed a future metadata write to overwrite the tree log data. It is fixed by adding a new variant of freeing reserved extents that always pins them. Credit goes to Stefan Behrens and Arne Jansen for many many hours spent tracking this bug down. During tree log replay, we do a pass through the tree log and pin all the extents we find. This makes sure the replay code won't go in and use any of those blocks for new allocations during replay. The problem is the free space cache isn't honoring these pinned extents. So the allocator can end up handing them out, leading to all kinds of problems during replay. The fix here is to force any free space cache to load while we pin the extents, and then to make sure we remove the pinned extents from the free space rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>	2011-11-06 03:03:48 -05:00
Curt Wohlgemuth	0e175a1835	writeback: Add a 'reason' to wb_writeback_work This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>	2011-10-31 00:33:36 +08:00
David Sterba	dff51cd1c6	btrfs: ratelimit WARN_ON in use_block_rsv The WARN_ON under some circumstances heavily polute log and slow down the machine. This is just a safety, as the warning should be fixed by another patch, nevertheless, it still pops up during testing. Signed-off-by: David Sterba <dsterba@suse.cz>	2011-10-24 14:48:00 +02:00
Miao Xie	60d2adbb1e	Btrfs: fix race between multi-task space allocation and caching space The task may fail to get free space though it is enough when multi-task space allocation and caching space happen at the same time. Task1 Caching Thread Task2 ------------------------------------------------------------------------ find_free_extent The space has not be cached, and start caching thread. And wait for it. cache space, if the space is > 2MB wake up Task1 find_free_extent get all the space that is cached. try to allocate space, but there is no space now. trigger BUG_ON() The message is following: btrfs allocation failed flags 1, wanted 4096 space_info has 1040187392 free, is not full space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0 block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved block group has cluster?: no 0 blocks of free space at or bigger than bytes is block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved block group has cluster?: no 0 blocks of free space at or bigger than bytes is ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:835! [<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs] [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108 [<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs] [<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2 [<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs] [<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs] [<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103 [<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs] [<ffffffff811380fa>] ? fsnotify+0x236/0x266 [<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs] [<ffffffff810cc215>] do_writepages+0x1c/0x25 [<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50 [<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51 [<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs] [<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21 [<ffffffff8112d170>] vfs_fsync+0x17/0x19 [<ffffffff8112d316>] do_fsync+0x29/0x3e [<ffffffff8112d348>] sys_fsync+0xb/0xf [<ffffffff81468352>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs] We fix this bug by trying to allocate the space again if there are block groups in caching. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2011-10-20 18:10:49 +02:00
Ilya Dryomov	10b2f34d6e	Btrfs: pass the correct root to lookup_free_space_inode() Free space items are located in tree of tree roots, not in the extent tree. It didn't pop up because lookup_free_space_inode() grabs the inode all the time instead of actually searching the tree. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2011-10-20 18:10:46 +02:00
Josef Bacik	7e355b83ef	Btrfs: if we have a lot of pinned space, commit the transaction Mitch kept hitting a panic because he was getting ENOSPC. One of my previous patches makes it so we are much better at not allocating new metadata chunks. Unfortunately coupled with the overcommit patch this works us into a bit of a problem if we are removing a bunch of space and end up chewing up all of our space with pinned extents. We can allocate chunks fine and overflow is ok, but the only way to reclaim this space is to commit the transaction. So if we go to overcommit, first check and see how much pinned space we have. If we have more than 80% of the free space chewed up with pinned extents, just commit the transaction, this will free up enough space for our reservation and we won't have this problem anymore. With this patch Mitch's test doesn't blow up anymore. Thanks, Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:13:00 -04:00
Josef Bacik	36ba022ac0	Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:59 -04:00
Josef Bacik	b24e03db0d	Btrfs: release trans metadata bytes before flushing delayed refs We started setting trans->block_rsv = NULL to allow the delayed refs flushing stuff to use the right block_rsv and then just made btrfs_trans_release_metadata() unconditionally use the trans block rsv. The problem with this is we need to reserve some space in the transaction and then migrate it to the global block rsv, so we need to be able to free that out properly. So instead just move btrfs_trans_release_metadata() before the delayed ref flushing and use trans->block_rsv for the freeing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:58 -04:00
Josef Bacik	877da17430	Btrfs: allow shrink_delalloc flush the needed reclaimed pages Currently we only allow a maximum of 2 megabytes of pages to be flushed at a time. This was ok before, but now we have overcommit which will screw us in a heartbeat if we are quickly filling the disk. So instead pick either 2 megabytes or the number of pages we need to reclaim to be safe again, which ever is larger. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:58 -04:00
Josef Bacik	f104d04437	Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc The only way we actually reclaim delalloc space is waiting for the IO to completely finish. Usually we kick off a bunch of IO and wait for a little bit and hope we can make our reservation, and usually this works out pretty well. With overcommit however we can get seriously underwater if we're filling up the disk quickly, so we need to be able to force the delalloc shrinker to wait for the ordered IO to finish to give us a better chance of actually reclaiming enough space to get our reservation. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:57 -04:00
Josef Bacik	bbb495c2ed	Btrfs: don't check bytes_pinned to determine if we should commit the transaction Before the only reason to commit the transaction to recover space in reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our reservation. But now we have the delayed inode stuff which will hold it's reservations until we commit the transaction. So say we max out our reservation by creating a bunch of files but don't have any pinned bytes we will ENOSPC out early even though we could commit the transaction and get that space back. So now just unconditionally commit the transaction since currently there is no way to know how much metadata space is being reserved by delayed inode stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:56 -04:00
Josef Bacik	4b91c14f91	Btrfs: wait for ordered extents if we didn't reclaim enough I noticed recently that my overcommit patch was causing one of my enospc tests to fail 25% of the time with early ENOSPC. This is because my overcommit patch was letting us go way over board, but it wasn't waiting long enough to let the delalloc shrinker do it's job. The problem is we just start writeback and wait a little bit hoping we flush enough, but we only free up delalloc space by having the writes complete all the way. We do this by waiting for ordered extents, which we do but only if we already free'd enough for the reservation, which isn't right, we should flush ordered extents if we didn't reclaim enough in case that will push us over the edge. With this patch I've not seen a failure in this enospc test after running it in a loop for an hour. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:55 -04:00
Josef Bacik	5b0e95bf60	Btrfs: inline checksums into the disk free space cache Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of it's checksums and re-write it which will write new checksums which could dirty a blockg roup that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take it's sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:54 -04:00
Josef Bacik	9a82ca659d	Btrfs: take overflow into account in reserving space My overcommit stuff can be a little racy when we're filling up the disk with fs_mark and we overcommit into things that quickly get used up for data. So use num_bytes to see if we have enough available space so we're less likely to overcommit ourselves out of the ability to make reservations. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:53 -04:00
Josef Bacik	73bc187680	Btrfs: introduce mount option no_space_cache Some users have requested this and I've found I needed a way to disable cache loading without actually clearing the cache, so introduce the no_space_cache option. Before we check the super blocks cache generation field and if it was populated we always turned space caching on. Now we check this and set the space cache option on, and then parse the mount options so that if we want it off it get's turned off. Then we check the mount option all the places we do the caching work instead of checking the super's cache generation. This makes things more consistent and lets us turn space caching off. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:51 -04:00
Josef Bacik	2bf64758fd	Btrfs: allow us to overcommit our enospc reservations One of the things that kills us is the fact that our ENOSPC reservations are horribly over the top in most normal cases. There isn't too much that can be done about this because when we are completely full we really need them to work like this so we don't under reserve. However if there is plenty of unallocated chunks on the disk we can use that to gauge how much we can overcommit. So this patch adds chunk free space accounting so we always know how much unallocated space we have. Then if we fail to make a reservation within our allocated space, check to see if we can overcommit. In the normal flushing case (like with delalloc metadata reservations) we'll take the free space and divide it by 2 if our metadata profile is setup for DUP or any of those, and then divide it by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing case (we really need this reservation now!) we only limit ourselves to half of the free space. This makes this fio test [torrent] filename=torrent-test rw=randwrite size=4g ioengine=sync directory=/mnt/btrfs-test go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB file system. This doesn't seem to break my other enospc tests, but could really use some more testing as this is a super scary change. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:50 -04:00
Josef Bacik	ef3be45722	Btrfs: check unused against how much space we actually want There is a bug that may lead to early ENOSPC in our reservation code. We've been checking against num_bytes which may be above and beyond what we want to actually reserve, which could give us a false ENOSPC. Fix this by making sure the unused space is above how much we want to reserve and not how much we're trying to flush. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:47 -04:00
Josef Bacik	455757c322	Btrfs: delay iput when deleting a block group I kept getting warnings from evict because we were calling btrfs_start_transaction() with a transaction already started when doing a balance. This is because we remove a block group which requires a transaction, and the put the last reference on the cache inode. Instead of doing this we need to delay the iput so it is done not within a transaction having started. This gets rid of our warnings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:45 -04:00
Josef Bacik	4a92b1b8d2	Btrfs: stop passing a trans handle all around the reservation code The only thing that we need to have a trans handle for is in reserve_metadata_bytes and thats to know how much flushing we can do. So instead of passing it around, just check current->journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:44 -04:00
Josef Bacik	d02c9955de	Btrfs: don't get the block_rsv in btrfs_free_tree_block Since the durable block rsv stuff has been killed there is no need to get the block_rsv in btrfs_free_tree_block anymore. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:43 -04:00
Josef Bacik	4c13d758b7	Btrfs: use the transactions block_rsv for the csum root The alloc warnings everybody has been seeing is because we have been reserving space for csums, but we weren't actually using that space. So make get_block_rsv() return the trans->block_rsv if we're modifying the csum root. Also set the trans->block_rsv to NULL so that if we modify the csum root when running delayed ref's that comes out of the global reserve like it's supposed to. With this patch I'm not seeing those alloc warnings anymore. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:42 -04:00
Josef Bacik	c09544e07f	Btrfs: handle enospc accounting for free space inodes Since free space inodes now use normal checksumming we need to make sure to account for their metadata use. So reserve metadata space, and then if we fail to write out the metadata we can just release it, otherwise it will be freed up when the io completes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:42 -04:00
Josef Bacik	7f70150896	Btrfs: don't increase the block_rsv's size when emergency allocating space If we have to emergency reserve space we need to not increase the block_rsv size, otherwise we'll leak space. Take for instance delalloc, say we reserve 4k, and we use that 4k, and then we have to emergency allocate another 4k, we bump the size up to 8k, however we've only accounted for 4k in reservations in all of our supporting logic, so we'll go to free the 4k and end up having a size of 4k, which will cause us to later not free as much space. I saw this doing testing where I wasn't reserving enough space for something but was still leaking space, very frustrating. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:40 -04:00
Josef Bacik	7ed49f187c	Btrfs: fix space leak when we fail to make an allocation When changing back to using a spin_lock to protect the extent counters I decided that since we would only be dropping our original extent, it was ok to just drop the extent and return. However since somebody else could have come in and done a reservation, we need to do the normal song and dance to clear the reservation out properly. So calculate how much space we need to free, and then subtract what we just attempted to reserve. If it's more then we know we need to drop those bytes from the delalloc block rsv. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:39 -04:00
Josef Bacik	482e6dc526	Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check If you run xfstest 224 it you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in truncate and evict case we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:38 -04:00
Josef Bacik	5e962c7850	Btrfs: kill btrfs_truncate_reserve_metadata Since we've optimized the truncate path, we no longer require this function. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:36 -04:00
Josef Bacik	13553e5221	Btrfs: don't try to commit in btrfs_block_rsv_check We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot because we have a transaction open it will return EAGAIN, so we do not need to try and commit the transaction again. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:35 -04:00
Josef Bacik	dabdb6408c	Btrfs: kill unused parts of block_rsv The priority and refill_used flags are not used anymore, and neither is the usage counter, so just remove them from btrfs_block_rsv. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:34 -04:00
Josef Bacik	37be25bcb6	Btrfs: kill the durable block rsv stuff This is confusing code and isn't used by anything anymore, so delete it. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:32 -04:00
Josef Bacik	7709cde33f	Btrfs: calculate checksum space correctly We have not been reserving enough space for checksums. We were just reserving bytes for the checksum items themselves, we were not taking into account having to cow the tree and such. This patch adds a csum_bytes counter to the inode for keeping track of the number of bytes outstanding we have for checksums. Then we calculate how many leaves would be required for the checksums we are given and use that to reserve space. This adds a significant amount of bytes to our reservations, but we will handle this later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-10-19 15:12:31 -04:00

... 2 3 4 5 6 ...

871 Commits