linux

Commit Graph

Author	SHA1	Message	Date
Chris Mason	771ed689d2	Btrfs: Optimize compressed writeback and reads When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-06 22:02:51 -05:00
Chris Mason	4a69a41009	Btrfs: Add ordered async work queues Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here) ->ordered_func (order sensitive processing here) ->ordered_free (free the work struct, all processing is done) The work struct has three callbacks. The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-11-06 22:03:00 -05:00
Yan Zheng	84234f3a1f	Btrfs: Add root tree pointer transaction ids This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-29 14:49:05 -04:00
Josef Bacik	2517920135	Btrfs: nuke fs wide allocation mutex V2 This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space rebalancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: Josef Bacik <jbacik@redhat.com>	2008-10-29 14:49:05 -04:00
Yan Zheng	f82d02d9d8	Btrfs: Improve space balancing code This patch improves the space balancing code to keep more sharing of tree blocks. The only case that breaks sharing of tree blocks is data extents get fragmented during balancing. The main changes in this patch are: Add a 'drop sub-tree' function. This solves the problem in old code that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block. Remove relocation mapping tree. Relocation mappings are stored in struct btrfs_ref_path and updated dynamically during walking up/down the reference path. This reduces CPU usage and simplifies code. This patch also fixes a bug. Root items for reloc trees should be updated in btrfs_free_reloc_root. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>	2008-10-29 14:49:05 -04:00
Chris Mason	c8b978188c	Btrfs: Add zlib compression support This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption nor the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically single threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-29 14:49:59 -04:00
Jim Meyering	83afeac42c	Btrfs: disk-io.c (open_ctree): avoid leaks upon allocation failure Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-01 19:09:51 -04:00
Jim Meyering	0463bb4e8d	Btrfs: disk-io.c (open_ctree): Don't deref. NULL upon failed kzalloc Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-10-01 19:09:04 -04:00
Chris Mason	d352ac6814	Btrfs: add and improve comments This improves the comments at the top of many functions. It didn't dive into the guts of functions because I was trying to avoid merging problems with the new allocator and back reference work. extent-tree.c and volumes.c were both skipped, and there is definitely more work to do in cleaning and commenting the code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-29 15:18:18 -04:00
Chris Mason	8c8bee1d7c	Btrfs: Wait for IO on the block device inodes of newly added devices btrfs-vol -a /dev/xxx will zero the first and last two MB of the device. The kernel code needs to wait for this IO to finish before it adds the device. btrfs metadata IO does not happen through the block device inode. A separate address space is used, allowing the zero filled buffer heads in the block device inode to be written to disk after FS metadata starts going down to the disk via the btrfs metadata inode. The end result is zero filled metadata blocks after adding new devices into the filesystem. The fix is a simple filemap_write_and_wait on the block device inode before actually inserting it into the pool of available devices. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-29 11:19:10 -04:00
Zheng Yan	1a40e23b95	Btrfs: update space balancing code This patch updates the space balancing code to utilize the new backref format. Before, btrfs-vol -b would break any COW links on data blocks or metadata. This was slow and caused the amount of space used to explode if a large number of snapshots were present. The new code keeps the sharing of all data extents and most of the tree blocks. To maintain the sharing of data extents, the space balance code uses a separate inode to hold data extent pointers, then updates the references to point to the new location. To maintain the sharing of tree blocks, the space balance code uses reloc trees to relocate tree blocks in reference counted roots. There is one reloc tree for each subvol, and all reloc trees share the same root key objectid. Reloc trees are snapshots of the latest committed roots of subvols (root->commit_root). To relocate a tree block referenced by a subvol, there are two steps. COW the block through the subvol's reloc tree, then update the block pointer in the subvol to point to the new block. Since all reloc trees share the same root key objectid, doing special handling for tree blocks owned by them is easy. Once a tree block has been COWed in one reloc tree, we can use the resulting new block directly when the same block is required to COW again through other reloc trees. In this way, relocated tree blocks are shared between reloc trees, so they are also shared between subvols. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-26 10:09:34 -04:00
Zheng Yan	e465768938	Btrfs: Add shared reference cache Btrfs has a cache of reference counts in leaves, allowing it to avoid reading tree leaves while deleting snapshots. To reduce contention with multiple subvolumes, this cache is private to each subvolume. This patch adds shared reference cache support. The new space balancing code plays with multiple subvols at the same time, So the old per-subvol reference cache is not well suited. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-26 10:04:53 -04:00
Chris Mason	24ab9cd85c	Btrfs: Raise thresholds for metadata writeback Btrfs metadata writeback is fairly expensive. Once a tree block is written it must be cowed before it can be changed again. The btree writepages code has a threshold based on a count of dirty btree bytes which is updated as IO is sent out. This changes btree_writepages to skip the writeout if there are less than 32MB of dirty bytes from the btrees, improving performance across many workloads. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 15:41:59 -04:00
Chris Mason	2b1f55b0f0	Remove Btrfs compat code for older kernels Btrfs had compatibility code for kernels back to 2.6.18. These have been removed, and will be maintained in a separate backport git tree from now on. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 15:41:59 -04:00
Zheng Yan	31840ae1a6	Btrfs: Full back reference support This patch makes the back reference system explicitly record the location of the parent node for all types of extents. The location of the parent node is placed into the offset field of the backref key. Every time a tree block is balanced, the back references for the affected lower level extents are updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	ce3ed71a58	Btrfs: Checksum tree blocks in the background Tree blocks were using async bio submission, but the sum was still being done directly during writepage. This moves the checksumming into the worker thread. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Josef Bacik	0f9dd46cda	Btrfs: free space accounting redo 1) replace the per fs_info extent_io_tree that tracked free space with two rb-trees per block group to track free space areas via offset and size. The reason to do this is because most allocations come with a hint byte where to start, so we can usually find a chunk of free space at that hint byte to satisfy the allocation and get good space packing. If we cannot find free space at or after the given offset we fall back on looking for a chunk of the given size as close to that given offset as possible. When we fall back on the size search we also try to find a slot as close to the size we want as possible, to avoid breaking small chunks off of huge areas if possible. 2) remove the extent_io_tree that tracked the block group cache from fs_info and replaced it with an rb-tree that tracks block group cache via offset. also added a per space_info list that tracks the block group cache for the particular space so we can lookup related block groups easily. 3) cleaned up the allocation code to make it a little easier to read and a little less complicated. Basically there are 3 steps, first look from our provided hint. If we couldn't find from that given hint, start back at our original search start and look for space from there. If that fails try to allocate space if we can and start looking again. If not we're screwed and need to start over again. 4) small fixes. there were some issues in volumes.c where we wouldn't allocate the rest of the disk. fixed cow_file_range to actually pass the alloc_hint, which has helped a good bit in making the fs_mark test I run have semi-normal results as we run out of space. Generally with data allocations we don't track where we last allocated from, so every time we did a data allocation we'd search through every block group that we have looking for free space. Now searching a block group with no free space isn't terribly time consuming, it was causing a slight degradation as we got more data block groups. The alloc_hint has fixed this slight degradation and made things semi-normal. There is still one nagging problem I'm working on where we will get ENOSPC when there is definitely plenty of space. This only happens with metadata allocations, and only when we are almost full. So you generally hit the 85% mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm still tracking it down, but until then this seems to be pretty stable and make a significant performance gain. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	23a07867b7	Btrfs: Fix mismerge in block header checks I had incorrectly disabled the check for the block number being correct in the header block. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	d0c803c404	Btrfs: Record dirty pages tree-log pages in an extent_io tree This is the same way the transaction code makes sure that all the other tree blocks are safely on disk. There's an extent_io tree for each root, and any blocks allocated to the tree logs are recorded in that tree. At tree-log sync, the extent_io tree is walked to flush down the dirty pages and wait for them. The main benefit is less time spent walking the tree log and skipping clean pages, and getting sequential IO down to the drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	d00aff0013	Btrfs: Optimize tree log block allocations Since tree log blocks get freed every transaction, they never really need to be written to disk. This skips the step where we update metadata to record they were allocated. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	3a5f1d458a	Btrfs: Optimize btree walking while logging inodes Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	98509cfc5a	Btrfs: Fix releasepage to properly keep dirty and writeback pages Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	4bef084857	Btrfs: Tree logging fixes * Pin down data blocks to prevent them from being reallocated like so: trans 1: allocate file extent trans 2: free file extent trans 3: free file extent during old snapshot deletion trans 3: allocate file extent to new file trans 3: fsync new file Before the tree logging code, this was legal because the fsync would commit the transaction that did the final data extent free and the transaction that allocated the extent to the new file at the same time. With the tree logging code, the tree log subtransaction can commit before the transaction that freed the extent. If we crash, we're left with two different files using the extent. * Don't wait in start_transaction if log replay is going on. This avoids deadlocks from iput while we're cleaning up link counts in the replay code. * Don't deadlock in replay_one_name by trying to read an inode off the disk while holding paths for the directory * Hold the buffer lock while we mark a buffer as written. This closes a race where someone is changing a buffer while we write it. They are supposed to mark it dirty again after they change it, but this violates the cow rules. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	e02119d5a7	Btrfs: Add a write ahead tree log to optimize synchronous operations File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	a1b32a5932	Btrfs: Add debugging checks to track down corrupted metadata Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	9473f16c75	Btrfs: Throttle for async bio submits higher up the chain The current code waits for the count of async bio submits to get below a given threshold if it is too high right after adding the latest bio to the work queue. This isn't optimal because the caller may have sequential adjacent bios pending they are waiting to send down the pipe. This changeset requires the caller to wait on the async bio count, and changes the async checksumming submits to wait for async bios any time they self throttle. The end result is much higher sequential throughput. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:07 -04:00
Chris Mason	b64a2851ba	Btrfs: Wait for async bio submissions to make some progress at queue time Before, the btrfs bdi congestion function was used to test for too many async bios. This keeps that check to throttle pdflush, but also adds a check while queuing bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	53863232ef	Btrfs: Lower contention on the csum mutex This takes the csum mutex deeper in the call chain and releases it more often. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	4854ddd0ed	Btrfs: Wait for kernel threads to make progress during async submission Before this change, btrfs would use a bdi congestion function to make sure there weren't too many pending async checksum work items. This change makes the process creating async work items wait instead, leading to fewer congestion returns from the bdi. This improves pdflush background_writeout scanning. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	5443be45f5	Btrfs: Give all the worker threads descriptive names Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	777e6bd706	Btrfs: Transaction commit: don't use filemap_fdatawait After writing out all the remaining btree blocks in the transaction, the commit code would use filemap_fdatawait to make sure it was all on disk. This means it would wait for blocks written by other procs as well. The new code walks the list of blocks for this transaction again and waits only for those required by this transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	0986fe9eac	Btrfs: Count async bios separately from async checksum work items Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	b720d20952	Btrfs: Limit the number of async bio submission kthreads to the number of devices Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	4ca8b41e3f	Btrfs: Avoid calling into the FS for the final iput on fake root inodes Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	ea8c281947	Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	2dd3e67b1e	Btrfs: More throttle tuning * Make walk_down_tree wake up throttled tasks more often * Make walk_down_tree call cond_resched during long loops * As the size of the ref cache grows, wait longer in throttle * Get rid of the reada code in walk_down_tree, the leaves don't get read anymore, thanks to the ref cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:06 -04:00
Chris Mason	61b4944018	Btrfs: Fix streaming read performance with checksumming on Large streaming reads make for large bios, which means each entry on the list async work queues represents a large amount of data. IO congestion throttling on the device was kicking in before the async worker threads decided a single thread was busy and needed some help. The end result was that a streaming read would result in a single CPU running at 100% instead of balancing the work off to other CPUs. This patch also changes the pre-IO checksum lookup done by reads to work on a per-bio basis instead of a per-page. This results in many extra btree lookups on large streaming reads. Doing the checksum lookup right before bio submit allows us to reuse searches while processing adjacent offsets. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Yan	bcc63abbf3	Btrfs: implement memory reclaim for leaf reference cache The memory reclaiming issue happens when a snapshot exists. In that case, some cache entries may not be used during old snapshot dropping, so they will remain in the cache until umount. The patch adds a field to struct btrfs_leaf_ref to record create time. Besides, the patch makes all dead roots of a given snapshot linked together in order of create time. After an old snapshot is completely dropped, we check the dead root list and remove all cache entries created before the oldest dead root in the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	33958dc6d3	Btrfs: Fix verify_parent_transid It was incorrectly clearing the up to date flag on the buffer even when the buffer properly verified. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	ab78c84de1	Btrfs: Throttle operations if the reference cache gets too large A large reference cache is directly related to a lot of work pending for the cleaner thread. This throttles back new operations based on the size of the reference cache so the cleaner thread will be able to keep up. Overall, this actually makes the FS faster because the cleaner thread will be more likely to find things in cache. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	017e5369eb	Btrfs: Leaf reference cache update This changes the reference cache to make a single cache per root instead of one cache per transaction, and to key by the byte number of the disk block instead of the keys inside. This makes it much less likely to have cache misses if a snapshot or something has an extra reference on a higher node or a leaf while the first transaction that added the leaf into the cache is dropping. Some throttling is added to functions that free blocks heavily so they wait for old transactions to drop. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Yan Zheng	31153d8128	Btrfs: Add a leaf reference cache Much of the IO done while dropping snapshots is done looking up leaves in the filesystem trees to see if they point to any extents and to drop the references on any extents found. This creates a cache so that IO isn't required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Josef Bacik	7b12876623	Btrfs: Create orphan inode records to prevent lost files after a crash Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	3eaa288527	Btrfs: Fix the defragmentation code and the block relocation code for data=ordered Before setting an extent to delalloc, the code needs to wait for pending ordered extents. Also, the relocation code needs to wait for ordered IO before scanning the block group again. This is because the extents are not removed until the IO for the new extents is finished. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	89642229a5	Btrfs: Search data ordered extents first for checksums on read Checksum items are not inserted into the tree until all of the io from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted. The checksums themselves are stored in the ordered extent so they can be inserted in bulk when IO is complete. On read, if a checksum item isn't found, the ordered extents were being searched for a checksum record. This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors. This means the read code might find a checksum item that hasn't yet really been filled in. This commit changes things to check the ordered extents first and only dive into the btree if nothing was found. This removes the need for extra locking and is more reliable. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	6af118ce51	Btrfs: Index extent buffers in an rbtree Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them. But, a few extra fields have crept in, and they are also the best place to store a per-tree block lock field as well. This commit puts the extent buffers into an rbtree, and ensures a single extent buffer for each tree block. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:05 -04:00
Chris Mason	f929574938	btrfs_start_transaction: wait for commits in progress to finish btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	247e743cbe	Btrfs: Use async helpers to deal with pages that have been improperly dirtied Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	e6dcd2dc9c	Btrfs: New data=ordered implementation The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated and inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, the previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	77a41afb7d	Btrfs: Drop some verbose printks Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	7d9eb12c87	Btrfs: Add locking around volume management (device add/remove/balance) Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	3f157a2fd2	Btrfs: Online btree defragmentation fixes The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:04 -04:00
Chris Mason	a74a4b97b6	Btrfs: Replace the transaction work queue with kthreads This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	89ce8a63d0	Add btrfs_end_transaction_throttle to force writers to wait for pending commits The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	333db94cdd	Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	051e1b9f74	Drop locks in btrfs_search_slot when reading a tree block. One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	a213501153	Btrfs: Replace the big fs_mutex with a collection of other locks Extent allocations are still protected by a large alloc_mutex. Objectid allocations are covered by an objectid mutex. Other btree operations are protected by a lock on individual btree nodes. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	925baeddc5	Btrfs: Start btree concurrency work. The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	1cc127b5d1	Btrfs: Add a thread pool just for submit_bio If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	4543df7ecc	Btrfs: Add a mount option to control worker thread pool size mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	8b71284292	Btrfs: Add async worker threads for pre and post IO checksumming Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Christoph Hellwig	edf24abe51	btrfs: sanity mount option parsing and early mount code Also adds lots of comments to describe what's going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Jan Engelhardt	51ebc0d3d5	Btrfs: bdi_init and bdi_destroy come with 2.6.23 Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	da496f2acf	Btrfs: Always use the async submission queue for checksummed writes This avoids IO stalls and poorly ordered IO from inline writers mixing in with the async submission queue Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	1c8cfcc159	Btrfs: Enable btree balancing on old kernels again Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	cb03c743c6	Btrfs: Change the congestion functions to meter the number of async submits as well The async submit workqueue was absorbing too many requests, leading to long stalls where the async submitters were stalling. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	a0af469b58	Fix btrfs_open_devices to deal with changes since the scan ioctls Devices can change after the scan ioctls are done, and btrfs_open_devices needs to be able to verify them as they are opened and used by the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	dfe2502068	Btrfs: Add mount -o degraded to allow mounts to continue with missing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	1259ab75c6	Btrfs: Handle write errors on raid1 and raid10 When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	ca7a79ad8d	Btrfs: Pass down the expected generation number when reading tree blocks Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	188de649c5	Btrfs: Don't do btree balance_dirty_pages on old kernels, it stalls forever Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:03 -04:00
Chris Mason	a061fc8da7	Btrfs: Add support for online device removal This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	d6bfde8765	Btrfs: Fixes for 2.6.18 enterprise kernels 2.6.18 seems to get caught in an infinite loop when cancel_rearming_delayed_workqueue is called more than once, so this switches to cancel_delayed_work, which is arguably more correct. Also, balance_dirty_pages can run into problems with 2.6.18 based kernels because it doesn't have the per-bdi dirty limits. This avoids calling balance_dirty_pages on the btree inode unless there is actually something to balance, which is a good optimization in general. Finally there's a compile fix for ordered-data.h Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	a236aed14c	Btrfs: Deal with failed writes in mirrored configurations Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	4235298e4f	Btrfs: Drop some verbose printks Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	8f18cf1339	Btrfs: Make the resizer work based on shrinking and growing devices Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	84eed90fac	Btrfs: Add failure handling for read_sys_array Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	bcbfce8abd	Btrfs: Fix the unplug_io_fn to grab a consistent copy of page->mapping Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	38b669880d	Deal with page == NULL in the btrfs_unplug_io_fn Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	f2d8d74d78	Btrfs: Make an unplug function that doesn't unplug every spindle Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	4ef64eae28	Btrfs: Remove debugging statements from the invalidatepage calls Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	4575c9ccee	Btrfs: Scale the bdi ra_pages by the number of devices in the FS Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	9ad6b7bc2e	Force page->private removal in btrfs_invalidatepage btrfs_invalidatepage is not allowed to leave pages around on the lru. Any such pages will trigger an oops later on because the VM will see page->private and assume it is a buffer head. This also forces extra flushes of the async work queues before dropping all the pages on the btree inode during unmount. Left over items on the work queues are one possible cause of busy state ranges during truncate_inode_pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	0afbaf8c82	Btrfs: Set the btree inode i_size to OFFSET_MAX Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	7b13b7b119	Btrfs: Don't drop extent_map cache during releasepage on the btree inode The btree inode should only have a single extent_map in the cache, it doesn't make sense to ever drop it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:02 -04:00
Chris Mason	7b859fe7cd	Btrfs: Only do async bio submission for pdflush Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	44b8bd7edd	Btrfs: Create a work queue for bio writes This allows checksumming to happen in parallel among many cpus, and keeps us from bogging down pdflush with the checksumming code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	e17cade25f	Btrfs: Add chunk uuids and update multi-device back references Block headers now store the chunk tree uuid. Chunk items record the device uuid for each stripe. Device extent items record better back refs to the chunk tree. Block groups record better back refs to the chunk tree. The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	b248a41529	Btrfs: A few updates for 2.6.18 and versions older than 2.6.25 This includes fixing a missing spinlock init call that caused oops on mount for most kernels other than 2.6.25. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Miguel	73f61b2a64	Btrfs: bio_endio support for linux 2.6.23 and older. bio_endio() changed prototype on linux 2.6.24, support older kernels using the older prototype. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Miguel	a5eb62e345	Btrfs: Endianness bug fix for v0.13 with kernels Fix for an endianness bug when using btrfs v0.13 with kernels older than 2.6.23. Problem: As of v0.13, btrfs-progs is using a crc32c.c equivalent to the one found in linux-2.6.23/lib/libcrc32c.c. Since crc32c_le() changed in linux-2.6.23, when running btrfs v0.13 with older kernels we have a mismatch between the versions of crc32c_le() from btrfs-progs and libcrc32c in the kernel. This mismatch causes a bug when using btrfs on big endian machines. Solution: a btrfs_crc32c() macro that, when compiling for kernels older than 2.6.23, does endianness conversion on the parameters and return value of crc32c(). This endianness conversion nullifies the differences in implementation of crc32c_le(). On kernel 2.6.23 or newer, it calls crc32c(). Signed-off-by: Miguel Sousa Filipe <miguel.filipe@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	3dd39914bc	Btrfs: Add extra checks to avoid removing extent_state from pages we can't free Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	f29844623d	Btrfs: Write out all super blocks on commit, and bring back proper barrier support Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	f188591e98	Btrfs: Retry metadata reads in the face of checksum failures Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	22c599485b	Btrfs: Handle data block end_io through the async work queue Before, it was done by the bio end_io routine. The work queue code is able to scale much better with faster IO subsystems. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	ce9adaa5a7	Btrfs: Do metadata checksums for reads via a workqueue Before, metadata checksumming was done by the callers of read_tree_block, which would set EXTENT_CSUM bits in the extent tree to show that a given range of pages was already checksummed and didn't need to be verified again. But, those bits could go away via try_to_releasepage, and the end result was bogus checksum failures on pages that never left the cache. The new code validates checksums when the page is read. It is a little tricky because metadata blocks can span pages and a single read may end up going via multiple bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	728131d8e4	Btrfs: Add additional debugging for metadata checksum failures Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	d18a2c4475	Btrfs: Fix allocation profile init Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	611f0e00a2	Btrfs: Add support for duplicate blocks on a single spindle Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00
Chris Mason	8790d502e4	Btrfs: Add support for mirroring across drives Signed-off-by: Chris Mason <chris.mason@oracle.com>	2008-09-25 11:04:01 -04:00

298 Commits