Commit Graph

132 Commits

Author SHA1 Message Date
Theodore Ts'o 9d0be50230 ext4: Calculate metadata requirements more accurately
In the past, ext4_calc_metadata_amount(), and its sub-functions
ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount()
badly over-estimated the number of metadata blocks that might be
required for delayed allocation blocks.  This didn't matter as much
when functions which managed the reserved metadata blocks were more
aggressive about dropping reserved metadata blocks as delayed
allocation blocks were written, but unfortunately they were too
aggressive.  This was fixed in commit 0637c6f, but as a result the
over-estimation by ext4_calc_metadata_amount() would lead to reserving
2-3 times the number of pending delayed allocation blocks as
potentially required metadata blocks.  So if there are 1 megabytes of
blocks which have been not yet been allocation, up to 3 megabytes of
space would get reserved out of the user's quota and from the file
system free space pool until all of the inode's data blocks have been
allocated.

This commit addresses this problem by much more accurately estimating
the number of metadata blocks that will be required.  It will still
somewhat over-estimate the number of blocks needed, since it must make
a worst case estimate not knowing which physical blocks will be
needed, but it is much more accurate than before.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 02:41:30 -05:00
Aneesh Kumar K.V 515f41c33a ext4: Ensure zeroout blocks have no dirty metadata
This fixes a bug (found by Curt Wohlgemuth) in which new blocks
returned from an extent created with ext4_ext_zeroout() can have dirty
metadata still associated with them.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-29 23:39:06 -05:00
Surbhi Palande 034fb4c95f ext4: replace BUG() with return -EIO in ext4_ext_get_blocks
This patch fixes the Kernel BZ #14286.  When the address of an extent
corresponding to a valid block is corrupted, a -EIO should be reported
instead of a BUG().  This situation should not normally not occur
except in the case of a corrupted filesystem.  If however it does,
then the system should not panic directly but depending on the mount
time options appropriate action should be taken. If the mount options
so permit, the I/O should be gracefully aborted by returning a -EIO.

http://bugzilla.kernel.org/show_bug.cgi?id=14286

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-14 09:53:52 -05:00
Theodore Ts'o fab3a549e2 ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
Fix the following potential circular locking dependency between
mm->mmap_sem and ei->i_data_sem:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.32-04115-gec044c5 #37
    -------------------------------------------------------
    ureadahead/1855 is trying to acquire lock:
     (&mm->mmap_sem){++++++}, at: [<ffffffff81107224>] might_fault+0x5c/0xac

    but task is already holding lock:
     (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&ei->i_data_sem){++++..}:
           [<ffffffff81099bfa>] __lock_acquire+0xb67/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81516633>] down_read+0x51/0x84
           [<ffffffff811a2414>] ext4_get_blocks+0x50/0x2a5
           [<ffffffff811a3453>] ext4_get_block+0xab/0xef
           [<ffffffff81154f39>] do_mpage_readpage+0x198/0x48d
           [<ffffffff81155360>] mpage_readpages+0xd0/0x114
           [<ffffffff811a104b>] ext4_readpages+0x1d/0x1f
           [<ffffffff810f8644>] __do_page_cache_readahead+0x12f/0x1bc
           [<ffffffff810f86f2>] ra_submit+0x21/0x25
           [<ffffffff810f0cfd>] filemap_fault+0x19f/0x32c
           [<ffffffff81107b97>] __do_fault+0x55/0x3a2
           [<ffffffff81109db0>] handle_mm_fault+0x327/0x734
           [<ffffffff8151aaa9>] do_page_fault+0x292/0x2aa
           [<ffffffff81518205>] page_fault+0x25/0x30
           [<ffffffff812a34d8>] clear_user+0x38/0x3c
           [<ffffffff81167e16>] padzero+0x20/0x31
           [<ffffffff81168b47>] load_elf_binary+0x8bc/0x17ed
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff81166d64>] load_script+0x1b8/0x1cc
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff8113255f>] do_execve+0x1ce/0x2cf
           [<ffffffff81027494>] sys_execve+0x43/0x5a
           [<ffffffff8102918a>] stub_execve+0x6a/0xc0

    -> #0 (&mm->mmap_sem){++++++}:
           [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81107251>] might_fault+0x89/0xac
           [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
           [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
           [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
           [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
           [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
           [<ffffffff811392ca>] sys_ioctl+0x56/0x79
           [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    1 lock held by ureadahead/1855:
     #0:  (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    stack backtrace:
    Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
    Call Trace:
     [<ffffffff81098c70>] print_circular_bug+0xa8/0xb7
     [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
     [<ffffffff8102f229>] ? sched_clock+0x9/0xd
     [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81107251>] might_fault+0x89/0xac
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81124b44>] ? __kmalloc+0x13b/0x18c
     [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
     [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
     [<ffffffff811bca0b>] ? ext4_ext_fiemap_cb+0x0/0x157
     [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
     [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
     [<ffffffff8129f6d0>] ? __up_read+0x8d/0x95
     [<ffffffff81517fb5>] ? retint_swapgs+0x13/0x1b
     [<ffffffff811392ca>] sys_ioctl+0x56/0x79
     [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:30:02 -05:00
Jan Kara b436b9bef8 ext4: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 23:51:10 -05:00
Dmitry Monakhov 5aca07eb7d ext4: quota macros cleanup
Currently all quota block reservation macros contains hard-coded "2"
aka MAXQUOTAS value. This is no good because in some places it is not
obvious to understand what does this digit represent. Let's introduce
new macro with self descriptive name.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:42:15 -05:00
Theodore Ts'o e6362609b6 ext4: call ext4_forget() from ext4_free_blocks()
Add the facility for ext4_forget() to be called from
ext4_free_blocks().  This simplifies the code in a large number of
places, and centralizes most of the work of calling ext4_forget() into
a single place.

Also fix a bug in the extents migration code; it wasn't calling
ext4_forget() when releasing the indirect blocks during the
conversion.  As a result, if the system cashed during or shortly after
the extents migration, and the released indirect blocks get reused as
data blocks, the journal replay would corrupt the data blocks.  With
this new patch, fixing this bug was as simple as adding the
EXT4_FREE_BLOCKS_FORGET flags to the call to ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-11-23 07:17:05 -05:00
Theodore Ts'o 5068969686 ext4: make sure directory and symlink blocks are revoked
When an inode gets unlinked, the functions ext4_clear_blocks() and
ext4_remove_blocks() call ext4_forget() for all the buffer heads
corresponding to the deleted inode's data blocks.  If the inode is a
directory or a symlink, the is_metadata parameter must be non-zero so
ext4_forget() will revoke them via jbd2_journal_revoke().  Otherwise,
if these blocks are reused for a data file, and the system crashes
before a journal checkpoint, the journal replay could end up
corrupting these data blocks.

Thanks to Curt Wohlgemuth for pointing out potential problems in this
area.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:17:34 -05:00
Mingming ba230c3f6d ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/O
To prepare for a direct I/O write, we need to split the unwritten
extents before submitting the I/O.  When no extents needed to be
split, ext4_split_unwritten_extents() was incorrectly returning 0
instead of the size of uninitialized extents. This bug caused the
wrong return value sent back to VFS code when it gets called from
async IO path, leading to an unnecessary fall back to buffered IO.

This bug also hid the fact that the check to see whether or not a
split would be necessary was incorrect; we can only skip splitting the
extent if the write completely covers the uninitialized extent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-06 04:01:23 -05:00
Mingming 5f5249507e ext4: skip conversion of uninit extents after direct IO if there isn't any
At the end of direct I/O operation, ext4_ext_direct_IO() always called
ext4_convert_unwritten_extents(), regardless of whether there were any
unwritten extents involved in the I/O or not.

This commit adds a state flag so that ext4_ext_direct_IO() only calls
ext4_convert_unwritten_extents() when necessary.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-10 10:48:04 -05:00
Mingming 109f556519 ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extents
After a direct I/O request covering an uninitalized extent (i.e.,
created using the fallocate system call) or a hole in a file, ext4
will convert the uninitialized extent so it is marked as initialized
by calling ext4_convert_unwritten_extents().  This function returns
zero on success.

This return value was getting returned by ext4_direct_IO(); however
the file system's direct_IO function is supposed to return the number
of bytes read or written on a success.  By returning zero, it confused
the direct I/O code into falling back to buffered I/O unnecessarily.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-10 10:48:08 -05:00
Theodore Ts'o 90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
Mingming Cao 8d5d02e6b1 ext4: async direct IO for holes and fallocate support
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.

But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.

Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue.  When fsync() is called, it will go
through the list and do the conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-28 15:48:29 -04:00
Mingming Cao 0031462b5b ext4: Split uninitialized extents for direct I/O
When writing into an unitialized extent via direct I/O, and the direct
I/O doesn't exactly cover the unitialized extent, split the extent
into uninitialized and initialized extents before submitting the I/O.
This avoids needing to deal with an ENOSPC error in the end_io
callback that gets used for direct I/O.

When the IO is complete, the written extent will be marked as initialized.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:49:08 -04:00
Theodore Ts'o 55ad63bf3a ext4: fix extent sanity checking code with AGGRESSIVE_TEST
The extents sanity-checking code depends on the ext4_ext_space_*()
functions returning the maximum alloable size for eh_max; however,
when the debugging #ifdef AGGRESSIVE_TEST is enabled to test the
extent tree handling code, this prevents a normally created ext4
filesystem from being mounted with the errors:

Aug 26 15:43:50 bsd086 kernel: [   96.070277] EXT4-fs error (device sda8): ext4_ext_check_inode: bad header/extent in inode #8: too large eh_max - magic f30a, entries 1, max 4(3), depth 0(0)
Aug 26 15:43:50 bsd086 kernel: [   96.070526] EXT4-fs (sda8): no journal found

Bug reported by Akira Fujita.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-28 10:40:33 -04:00
Jan Kara 487caeef9f ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks()
During truncate we are sometimes forced to start a new transaction as
the amount of blocks to be journaled is both quite large and hard to
predict. So far we restarted a transaction while holding i_data_sem
and that violates lock ordering because i_data_sem ranks below a
transaction start (and it can lead to a real deadlock with
ext4_get_blocks() mapping blocks in some page while having a
transaction open).

We fix the problem by dropping the i_data_sem before restarting the
transaction and acquire it afterwards. It's slightly subtle that this
works:

1) By the time ext4_truncate() is called, all the page cache for the
truncated part of the file is dropped so get_block() should not be
called on it (we only have to invalidate extent cache after we
reacquire i_data_sem because some extent from not-truncated part could
extend also into the part we are going to truncate).

2) Writes, migrate or defrag hold i_mutex so they are stopped for all
the time of the truncate.

This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 22:17:20 -04:00
Mingming 553f900893 ext4: Show unwritten extent flag in ext4_ext_show_leaf()
ext4_ext_show_leaf() will display the leaf extents when extent
debugging is enabled.

Printing out the unwritten bit is useful for debugging unwritten
extent, allow us to see the unwritten extents vs written extents,
after the unwritten extents are splitted or converted.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-18 13:34:55 -04:00
Mingming 84fe3bef59 ext4: Compile warning fix when EXT_DEBUG enabled
When EXT_DEBUG is enabled I received the following compile warning on
PPC64:

  CC [M]  fs/ext4/inode.o
  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
  CC [M]  fs/ext4/migrate.o

The patch fixes compile warning.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Index: linux-2.6.31-rc4/fs/ext4/extents.c
===================================================================
2009-09-01 08:44:37 -04:00
Aneesh Kumar K.V 5887e98b60 ext4: Calculate required journal credits for inserting an extent properly
When we have space in the extent tree leaf node we should be able to
insert the extent with much less journal credits. The code was doing
proper calculation but missed a return statement.

Reported-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:12:04 -04:00
Akira Fujita 748de6736c ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
The EXT4_IOC_MOVE_EXT exchanges the blocks between orig_fd and donor_fd,
and then write the file data of orig_fd to donor_fd.
ext4_mext_move_extent() is the main fucntion of ext4 online defrag,
and this patch includes all functions related to ext4 online defrag.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 19:24:03 -04:00
Aneesh Kumar K.V a41f207169 ext4: Avoid corrupting the uninitialized bit in the extent during truncate
The unitialized bit was not properly getting preserved in in an extent
which is partially truncated because the it was geting set to the
value of the first extent to be removed or truncated as part of the
truncate operation, and if there are multiple extents are getting
removed or modified as part of the truncate operation, it is only the
last extent which will might be partially truncated, and its
uninitalized bit is not necessarily the same as the first extent to be
truncated.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-10 14:22:55 -04:00
Jan Kara 03f5d8bcf0 ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
seems to be a relict from some old days and setting disksize in this
function does not make much sense.  Currently it was set only by
ext4_getblk().  Since the parameter has some effect only if create ==
1, it is easy to check by grepping through the sources that the three
callers which end up calling ext4_getblk() with create == 1
(ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 00:17:05 -04:00
Theodore Ts'o 0568c51893 ext4: down i_data_sem only for read when walking tree for fiemap
Not sure why I put this in as down_write originally; all we are
doing is walking the tree, nothing will change under us and
concurrent reads should be no problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:31:23 -04:00
Theodore Ts'o 6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Theodore Ts'o c217705733 ext4: Define a new set of flags for ext4_get_blocks()
The functions ext4_get_blocks(), ext4_ext_get_blocks(), and
ext4_ind_get_blocks() used an ad-hoc set of integer variables used as
boolean flags passed in as arguments.  Use a single flags parameter
and a setandard set of bitfield flags instead.  This saves space on
the call stack, and it also makes the code a bit more understandable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:58:52 -04:00
Theodore Ts'o 12b7ac1768 ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()
Another function rename for clarity's sake.  The _wrap prefix simply
confuses people, and didn't add much people trying to follow the code
paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:57:44 -04:00
Theodore Ts'o eefd7f03b8 ext4: fix the length returned by fiemap for an unallocated extent
If the file's blocks have not yet been allocated because of delayed
allocation, the length of the extent returned by fiemap is incorrect.
This commit fixes this bug.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 19:05:37 -04:00
Eric Sandeen c9877b205f ext4: fix for fiemap last-block test
Carl Henrik Lunde reported and debugged this; the test for the
last allocated block was comparing bytes to blocks in this test:

	if (logical + length - 1 == EXT_MAX_BLOCK ||
	    ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK)
		flags |= FIEMAP_EXTENT_LAST;

so any extent which ended right at 4G was stopping the extent
walk.  Just replacing these values with the extent block &
length should fix it.

Also give blksize_bits a saner type, and reverse the order 
of the tests to make the more likely case tested first.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 23:32:06 -04:00
Aneesh Kumar K.V 29fa89d088 ext4: Mark the unwritten buffer_head as mapped during write_begin
Setting BH_Unwritten buffer_heads as BH_Mapped avoids multiple
(unnecessary) calls to get_block() during the call to the write(2)
system call.  Setting BH_Unwritten buffer heads as BH_Mapped requires
that the writepages() functions can handle BH_Unwritten buffer_heads.

After this commit, things work as follows:

ext4_ext_get_block() returns unmapped, unwritten, buffer head when
called with create = 0 for prealloc space. This makes sure we handle
the read path and non-delayed allocation case correctly.  Even though
the buffer head is marked unmapped we have valid b_blocknr and b_bdev
values in the buffer_head.

ext4_da_get_block_prep() called for block resrevation will now return
mapped, unwritten, new buffer_head for prealloc space. This avoids
multiple calls to get_block() for write to same offset. By making such
buffers as BH_New, we also assure that sub-block zeroing of buffered
writes happens correctly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 16:30:27 -04:00
Aneesh Kumar K.V 79ffab3439 ext4: Properly initialize the buffer_head state
These struct buffer_heads are allocated on the stack (and hence are
initialized with stack garbage).  They are only used to call a
get_blocks() function, so that's mostly OK, but b_state must be
initialized to be 0 so we don't have any unexpected BH_* flags set by
accident, such as BH_Unwritten or BH_Delay.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 15:13:42 -04:00
Theodore Ts'o 2ec0ae3ace ext4: Fix race in ext4_inode_info.i_cached_extent
If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time.  This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development.  Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-05-15 09:07:28 -04:00
Aneesh Kumar K.V 9c1ee184a3 ext4: Fix sub-block zeroing for writes into preallocated extents
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:36:58 -04:00
Linus Torvalds a4277bf122 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix potential inode allocation soft lockup in Orlov allocator
  ext4: Make the extent validity check more paranoid
  jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records
  jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
  ext4: really print the find_group_flex fallback warning only once
2009-04-24 08:37:40 -07:00
Theodore Ts'o e84a26ce17 ext4: Make the extent validity check more paranoid
Instead of just checking that the extent block number is greater or
equal than s_first_data_block, make sure it it is not pointing into
the block group descriptors, since that is clearly wrong.  This helps
prevent filesystem from getting very badly corrupted in case an extent
block is corrupted.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-22 20:52:25 -04:00
Nikanth Karthikesan 226e7dabf5 ext4: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_NOIO implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Thiemo Nagel e44543b83b ext4: Fix off-by-one-error in ext4_valid_extent_idx()
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-04 23:30:44 -04:00
Aneesh Kumar K.V 7a262f7c69 ext4: Validate extent details only when read from the disk
Make sure we validate extent details only when read from the disk.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-27 16:39:58 -04:00
Aneesh Kumar K.V 56b19868ac ext4: Add checks to validate extent entries.
This patch adds checks to validate the extent entries along with extent
headers, to avoid crashes caused by corrupt filesystems.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-12 09:51:20 -04:00
Theodore Ts'o a4912123b6 ext4: New inode/block allocation algorithms for flex_bg filesystems
The find_group_flex() inode allocator is now only used if the
filesystem is mounted using the "oldalloc" mount option.  It is
replaced with the original Orlov allocator that has been updated for
flex_bg filesystems (it should behave the same way if flex_bg is
disabled).  The inode allocator now functions by taking into account
each flex_bg group, instead of each block group, when deciding whether
or not it's time to allocate a new directory into a fresh flex_bg.

The block allocator has also been changed so that the first block
group in each flex_bg is preferred for use for storing directory
blocks.  This keeps directory blocks close together, which is good for
speeding up e2fsck since large directories are more likely to look
like this:

debugfs:  stat /home/tytso/Maildir/cur
Inode: 1844562   Type: directory    Mode:  0700   Flags: 0x81000
Generation: 1132745781    Version: 0x00000000:0000ad71
User: 15806   Group: 15806   Size: 1060864
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 2072
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
 atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
 mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
Size of extra inode fields: 28
BLOCKS:
(0):7348651, (1-258):7348654-7348911
TOTAL: 259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-12 12:18:34 -04:00
Eric Sandeen 395a87bfef ext4: fix header check in ext4_ext_search_right() for deep extent trees.
The ext4_ext_search_right() function is confusing; it uses a
"depth" variable which is 0 at the root and maximum at the leaves, 
but the on-disk metadata uses a "depth" (actually eh_depth) which
is opposite: maximum at the root, and 0 at the leaves.

The ext4_ext_check_header() function is given a depth and checks
the header agaisnt that depth; it expects the on-disk semantics,
but we are giving it the opposite in the while loop in this 
function.  We should be giving it the on-disk notion of "depth"
which we can get from (p_depth - depth) - and if you look, the last
(more commonly hit) call to ext4_ext_check_header() does just this.

Sending in the wrong depth results in (incorrect) messages
about corruption:

EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340,
max 340(0), depth 1(2)

http://bugzilla.kernel.org/show_bug.cgi?id=12821

Reported-by: David Dindorp <ddi@dubex.dk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-10 18:18:47 -04:00
Thadeu Lima de Souza Cascardo 9fd9784c91 ext4: Fix building with EXT4FS_DEBUG
When bg_free_blocks_count was renamed to bg_free_blocks_count_lo in
560671a0, its uses under EXT4FS_DEBUG were not changed to the helper
ext4_free_blks_count.

Another commit, 498e5f24, also did not change everything needed under
EXT4FS_DEBUG, thus making it spill some warnings related to printing
format.

This commit fixes both issues and makes ext4 build again when
EXT4FS_DEBUG is enabled.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-26 19:26:26 -05:00
Linus Torvalds 2150edc6c5 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
  jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
  ext4: Remove "extents" mount option
  block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
  ext4: Make printk's consistently prefixed with "EXT4-fs: "
  ext4: Add sanity checks for the superblock before mounting the filesystem
  ext4: Add mount option to set kjournald's I/O priority
  jbd2: Submit writes to the journal using WRITE_SYNC
  jbd2: Add pid and journal device name to the "kjournald2 starting" message
  ext4: Add markers for better debuggability
  ext4: Remove code to create the journal inode
  ext4: provide function to release metadata pages under memory pressure
  ext3: provide function to release metadata pages under memory pressure
  add releasepage hooks to block devices which can be used by file systems
  ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
  ext4: Init the complete page while building buddy cache
  ext4: Don't allow new groups to be added during block allocation
  ext4: mark the blocks/inode bitmap beyond end of group as used
  ext4: Use new buffer_head flag to check uninit group bitmaps initialization
  ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
  ext4: code cleanup
  ...
2009-01-08 17:14:59 -08:00
Coly Li 73ac36ea14 fix similar typos to successfull
When I review ocfs2 code, find there are 2 typos to "successfull".  After
doing grep "successfull " in kernel tree, 22 typos found totally -- great
minds always think alike :)

This patch fixes all the similar typos. Thanks for Randy's ack and comments.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:15 -08:00
Theodore Ts'o 83982b6f47 ext4: Remove "extents" mount option
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based file even though the extents feature flag
was not eabled.  The simplest thing to do is to nuke the mount option
entirely.  It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 14:53:16 -05:00
Aneesh Kumar K.V 3a06d778df ext4: sparse fixes
* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-22 15:04:59 -05:00
Theodore Ts'o 498e5f2415 ext4: Change unsigned long to unsigned int
Convert the unsigned longs that are most responsible for bloating the
stack usage on 64-bit systems.

Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-05 00:14:04 -05:00
Theodore Ts'o cde6436004 ext4: Remove i_ext_generation from ext4_inode_info structure
The i_ext_generation was incremented, but never used.  Remove it to
slim down the ext4_inode_info structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-04 18:46:03 -05:00
Frank Mayhar 0390131ba8 ext4: Allow ext4 to run without a journal
A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-07 00:06:22 -05:00
Wu Fengguang 25f1ee3aba ext4: fix build warning
Replace `if' with `goto' to assure gcc that ix has been initialized.

Signed-off-by: Wu Fengguang <wfg@linux.intel.com>
2008-11-25 17:24:23 -05:00
Theodore Ts'o 97df5d155d ext4: remove do_blk_alloc()
The convenience function do_blk_alloc() is a static function with only
one caller, so fold it into ext4_new_meta_blocks() to simplify the
code and to make it easier to understand.

To save more stack space, if count is a null pointer in
ext4_new_meta_blocks() assume that caller wanted a single block (and
if there is an error, no blocks were allocated).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-12-12 12:41:28 -05:00