This adds basic O_DIRECT read and write support. In the write case, we
just do a normal buffered write followed by a cache flush. O_DIRECT +
O_SYNC are required to trigger metadata syncs.
In the read case, there is a basic btrfs_get_block call for use by
the generic O_DIRECT code. This does honor multi-volume mapping rules
but it skips all checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before it was done by the bio end_io routine, the work queue code is able
to scale much better with faster IO subsystems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.
But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.
The new code validates checksums when the page is read. It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a block is freed, it can be immediately reused if it is from
the current transaction. But, an extra check is required to make sure
the block had not been written yet. If it were reused after being written,
the transid in the block header might match the transid of the
next time the block was allocated.
The parent node records the transaction ID of the block it is pointing to,
and this is used as part of validating the block on reads. So, there
can only be one version of a block per transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.
But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified. This patch
makes sure all buffers go through checksum verification before they
are used.
This is safer, and it prevents modification of buffers before they go
through the csum code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There was an optimization to drop the fs_mutex when doing snapshot deletion
reads, but this can lead to false positives on checksumming errors. Keep
the lock for now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_name_hash, Local variable 'buf' is declared as
__u32 buf[2];
but we then try to do this:
buf[0] = 0x67452301;
buf[1] = 0xefcdab89;
buf[2] = 0x98badcfe;
buf[3] = 0x10325476;
Oops. Fix buf to be the proper size.
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here's a patch against the unstable tree that gets the code to build
against Linus's current tree (2.6.24-git12). This is needed as the
kobject/kset api has changed there.
I tried to make the smallest changes needed, and it builds and loads
successfully, but I don't have a btrfs volume anywhere (yet) to try to
see if things still work properly :)
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we checkum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.
This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.
Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that we don't clear the extent state tree dirty and delalloc
bits when we clear the dirty bits on the page during file write.
This leads to csum errors later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit
Fix find_free_extent to make better use of the last_alloc hint. Before it
was often finding blocks just before the hint.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs set/get macros lose type information needed to avoid
unaligned accesses on sparc64.
ere is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.
btrfs_name_hash is modified to return the hash value instead
of getting a return location via a (potentially unaligned)
pointer.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A few codes were not properly updated for changes of extent map. This
may be the causes of "no csum found for inode" issue.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that delayed allocation accounting works, i_blocks accounting is changed
to only modify i_blocks when extents inserted or removed.
The fillattr call is changed to include the delayed allocation byte count
in the i_blocks result.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When freeing root block of a tree, btrfs_free_extent' parameter
'ref_generation' is from root block itseft. When freeing non-root
block, 'ref_generation' is from its parent. so when converting a
non-root block to root block, we must guarantee its generation is
equal to its parent's generation.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This makes searches for backrefs and backref insertion much more efficient
when there are many backrefs for a single extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When truncating a inline extent, btrfs_drop_extents doesn't properly
handle the case "key.offset > inline_limit". This bug can only happen
when max line size is larger than 8K.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The end_bio routines are changed to take a pointer to the extent state
struct, and the state tree is walked in order to set/clear appropriate
bits as IO completes. This greatly reduces the number of rbtree searches
done by the end_bio handlers, and reduces lock contention.
The extent_io releasepage function is changed to avoid expensive searches
for locked state.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.
The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.
A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.
lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
test_range_bit doesn't properly handle the case: there's a hole at the
end of the range and there's no other extent_state after the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_free_objectid may return a used objectid due to arithmetic
underflow. This bug may happen when parameter 'root' is tree root, so
it may cause serious problems when creating snapshot or sub-volume.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using ilookup5 during data=ordered writeback could deadlock on I_LOCK. This
saves a pointer to the inode instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds readonly inode flag support. A file with this flag
can't be modified, but can be deleted.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While shrinking the FS, the allocation functions need to make sure
they don't try to allocate bytes past the end of the FS.
nodatacow needed an extra check to force cows when the existing extents are
past the end of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is very difficult to create a consistent snapshot of the btree when
other writers may update the btree before the commit is done.
This changes the snapshot creation to happen during the commit, while
no other updates are possible.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The shrinking code used btrfs_next_leaf to find the next item, but
this does not cow the blocks it touches. This fix calls search_slot after
finding the next item to do appropriate cow and balancing.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The patch fixes the overlapping extent issue in shrink_extent_tree.
It checks whether there is an overlapping extent by using
find_previous_extent. If there is an overlapping extent, it setups
key.objectid and cur_byte properly.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is intended to prevent accidentally filling the drive. A determined
user can still make things oops.
It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed the offset into the extent was incorrectly being added to the
extent start before trying to find it in the extent allocation tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.
In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
One of my old patches introduces a new bug to
btrfs_drop_extents(changeset 275). Inline extents are not truncated
properly when "extent_end == end", it can trigger the BUG_ON at
file.c:600. I hope I don't introduce new bug this time.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Hello everybody,
compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.
Signed-off-by: Christian Hesse <mail@earthworm.de>
--
Regards,
Chris
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/x-diff; charset="iso-8859-1";
name="btrfs-section_mismatches.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="btrfs-section_mismatches.patch"
Signed-off-by: Chris Mason <chris.mason@oracle.com>