Commit Graph

2040 Commits

Author SHA1 Message Date
Jan Kara 4eec708d26 ext4: use io_end for multiple bios
Change writeback path to create just one io_end structure for the
extent to which we submit IO and share it among bios writing that
extent. This prevents needless splitting and joining of unwritten
extents when they cannot be submitted as a single bio.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-11 23:56:53 -04:00
Jan Kara 0058f9658c ext4: make ext4_bio_write_page() use BH_Async_Write flags
So far ext4_bio_write_page() attached all the pages to ext4_io_end
structure.  This makes that structure pretty heavy (1 KB for pointers
+ 16 bytes per page attached to the bio).  Also later we would like to
share ext4_io_end structure among several bios in case IO to a single
extent needs to be split among several bios and pointing to pages from
ext4_io_end makes this complex.

We remove page pointers from ext4_io_end and use pointers from bio
itself instead.  This isn't as easy when blocksize < pagesize because
then we can have several bios in flight for a single page and we have
to be careful when to call end_page_writeback().  However this is a
known problem already solved by block_write_full_page() /
end_buffer_async_write() so we mimic its behavior here.  We mark
buffers going to disk with BH_Async_Write flag and in
ext4_bio_end_io() we check whether there are any buffers with
BH_Async_Write flag left.  If there are not, we can call
end_page_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-04-11 23:48:32 -04:00
Lukas Czerner e1091b157c ext4: Use kstrtoul() instead of parse_strtoul()
In parse_strtoul() we're still using deprecated simple_strtoul().  Remove
parse_strtoul() altogether and replace it with kstrtoul()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 23:37:19 -04:00
Dmitry Monakhov 7e8b12c60a ext4: defragmentation code cleanup
- grab_cache_page_write_begin() may not wait on page's writeback since
  (1d1d1a7672). But it is still reasonable to wait on page's writeback
  here in order to be on the safe side.

- Fix miss typo: pass 'length' instead of 'end' to __block_write_begin()
  https://bugzilla.kernel.org/show_bug.cgi?id=56241

TESTCASE: git://oss.sgi.com/xfs/cmds/xfstests.git
MKFS_OPTIONS="-b1024" ; ./check ext4/304

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Akira Fujita <a-fujita.rs.jp.nec.com>
2013-04-11 23:24:58 -04:00
Lukas Czerner 43e50f5086 ext4: do not convert to indirect with bigalloc enabled
With bigalloc feature enabled we do not support indirect addressing at all
so we have to prevent extent addressing to indirect addressing
conversion in this case. The problem has been introduced with the commit
"ext4: support simple conversion of extent-mapped inodes to use i_blocks"

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 10:54:46 -04:00
Lukas Czerner 0d14b098ce ext4: move ext4_ind_migrate() into migrate.c
Move ext4_ind_migrate() into migrate.c file since it makes much more
sense and ext4_ext_migrate() is there as well.

Also fix tiny style problem - add spaces around "=" in "i=0".

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-10 23:32:52 -04:00
Theodore Ts'o d6a771056b ext4: fix miscellaneous big endian warnings
None of these result in any bug, but they makes sparse complain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 23:59:55 -04:00
Dmitry Monakhov 171a7f21a7 ext4: fix big-endian bug in metadata checksum calculations
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-09 23:56:48 -04:00
Dmitry Monakhov 0b65349ebc ext4: fix big-endian bug in extent migration code
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-09 23:56:44 -04:00
Dmitri Monakho 8c8e0ca622 ext4: fix usless declarations
This patch should fix sparse complains about shadow declatations.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 22:48:36 -04:00
Lukas Czerner 27dd438542 ext4: introduce reserved space
Currently in ENOSPC condition when writing into unwritten space, or
punching a hole, we might need to split the extent and grow extent tree.
However since we can not allocate any new metadata blocks we'll have to
zero out unwritten part of extent or punched out part of extent, or in
the worst case return ENOSPC even though use actually does not allocate
any space.

Also in delalloc path we do reserve metadata and data blocks for the
time we're going to write out, however metadata block reservation is
very tricky especially since we expect that logical connectivity implies
physical connectivity, however that might not be the case and hence we
might end up allocating more metadata blocks than previously reserved.
So in future, metadata reservation checks should be removed since we can
not assure that we do not under reserve.

And this is where reserved space comes into the picture. When mounting
the file system we slice off a little bit of the file system space (2%
or 4096 clusters, whichever is smaller) which can be then used for the
cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.

The number of reserved clusters can be set via sysfs, however it can
never be bigger than number of free clusters in the file system.

Note that this patch fixes the failure of xfstest 274 as expected.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09 22:11:22 -04:00
Al Viro d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Jan Kara f45a5ef91b ext4: improve credit estimate for EXT4_SINGLEDATA_TRANS_BLOCKS
Estimate of 27 credits for allocation of a block in extent based inode
is unnecessarily high. We can easily argue 20 is enough.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 12:39:26 -04:00
Andrey Sidorov eabe0444df ext4: speed-up releasing blocks on commit
Improve mb_free_blocks speed by clearing entire range at once instead
of iterating over each bit. Freeing block-by-block also makes buddy
bitmap subtree flip twice making most of the work a no-op. Very few
bits in buddy bitmap require change, e.g. freeing entire group is a 1
bit flip only.  As a result, releasing blocks of 60G file now takes
5ms instead of 2.7s.  This is especially good for non-preemptive
kernels as there is no rescheduling during release.

Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 12:22:29 -04:00
Eric Whitney 5c1ff33640 ext4: fix free space estimate in ext4_nonda_switch()
Values stored in s_freeclusters_counter and s_dirtyclusters_counter
are both in cluster units.  Remove the cluster to block conversion
applied to s_freeclusters_counter causing an inflated estimate of
free space because s_dirtyclusters_counter is not similarly
converted.  Rename free_blocks and dirty_blocks to better reflect
the units these variables contain to avoid future confusion.  This
fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc
file systems.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:27:31 -04:00
Jan Kara bcb1385096 ext4: fix deadlock with quota feature
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
accounted even for quota files. Thus we could recurse back to quota code
when adding new blocks to quota file which can easily deadlock. Mark
hidden quota files properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:21:41 -04:00
Dmitry Monakhov e8238f9a83 ext4: fix incorrect lock ordering for ext4_ind_migrate
existing locking ordering: journal-> i_data_sem, but
ext4_ind_migrate() grab locks in opposite order which may result in
deadlock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-08 13:02:25 -04:00
Dr. Tilmann Bubeck 393d1d1d76 ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT
Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
associated attributes (like i_blocks, i_size, i_flags, ...) from the
specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
typically used to store a boot loader in a secure part of the
filesystem, where it can't be changed by a normal user by accident.
The data blocks of the previous boot loader will be associated with
the given inode.

This usercode program is a simple example of the usage:

int main(int argc, char *argv[])
{
  int fd;
  int err;

  if ( argc != 2 ) {
    printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
    exit(1);
  }

  fd = open(argv[1], O_WRONLY);
  if ( fd < 0 ) {
    perror("open");
    exit(1);
  }

  err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
  if ( err < 0 ) {
    perror("ioctl");
    exit(1);
  }

  close(fd);
  exit(0);
}

[ Modified by Theodore Ts'o to fix a number of bugs in the original code.]

Signed-off-by: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-08 12:54:05 -04:00
Lukas Czerner f78ee70db4 ext4: print more info in ext4_print_free_blocks()
Additionally print i_allocated_meta_blocks information as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 23:33:30 -04:00
Lukas Czerner be8981be6b ext4: try to prepend extent to the existing one
Currently when inserting extent in ext4_ext_insert_extent() we would
only try to to see if we can append new extent to the found extent. If
we can not, then we proceed with adding new extent into the extent tree,
but then possibly merging it back again.

We can avoid this situation by trying to append and prepend new extent
to the existing ones. However since the new extent can be on either
sides of the existing extent, we have to pick the right extent to try to
append/prepend to.

This patch adds the conditions to pick the right extent to
append/prepend to and adds the actual prepending condition as well. This
will also eliminate the need to use "reserved" block for possibly
growing extent tree.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 23:33:28 -04:00
Lukas Czerner bc2d9db48c ext4: Transfer initialized block to right neighbor if possible
Currently when converting extent to initialized we attempt to transfer
initialized block to the left neighbour if possible when certain
criteria are met. However we do not attempt to do the same for the
right neighbor.

This commit adds the possibility to transfer initialized block to the
right neighbour if:

1. We're not converting the whole extent
2. Both extents are stored in the same extent tree node
3. Right neighbor is initialized
4. Right neighbor is logically abutting the current one
5. Right neighbor is physically abutting the current one
6. Right neighbor would not overflow the length limit

This is basically the same logic as with transferring to the left. This
will gain us some performance benefits since it is faster than inserting
extent and then merging it.

It would also prevent some situation in delalloc patch when we might run
out of metadata reservation. This is due to the fact that we would
attempt to split the extent first (possibly allocating new metadata
block) even though we did not counted for that because it can (and will)
be merged again. This commit fix that scenario, because we no longer
need to split the extent in such case.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 23:33:27 -04:00
Lukas Czerner bd86298e60 ext4: introduce ext4_get_group_number()
Currently on many places in ext4 we're using
ext4_get_group_no_and_offset() even though we're only interested in
knowing the block group of the particular block, not the offset within
the block group so we can use more efficient way to compute block
group.

This patch introduces ext4_get_group_number() which computes block
group for a given block much more efficiently. Use this function
instead of ext4_get_group_no_and_offset() everywhere where we're only
interested in knowing the block group.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 23:32:34 -04:00
Lukas Czerner 689110098c ext4: make ext4_block_in_group() much more efficient
Currently in when getting the block group number for a particular
block in ext4_block_in_group() we're using
ext4_get_group_no_and_offset() which uses do_div() to get the block
group and the remainer which is offset within the group.

We don't need all of that in ext4_block_in_group() as we only need to
figure out the group number.

This commit changes ext4_block_in_group() to calculate group number
directly. This shows as a big improvement with regards to cpu
utilization. Measuring fallocate -l 15T on fresh file system with perf
showed that 23% of cpu time was spend in the
ext4_get_group_no_and_offset(). With this change it completely
disappears from the list only bumping the occurrence of
ext4_init_block_bitmap() which is the biggest user of
ext4_block_in_group() by 4%. As the result of this change on my system
the fallocate call was approx. 10% faster.

However since there is '-g' option in mkfs which allow us setting
different groups size (mostly for developers) I've introduced new per
file system flag whether we have a standard block group size or
not. The flag is used to determine whether we can use the bit shift
optimization or not.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:12:52 -04:00
Dmitry Monakhov a75ae78f08 ext4: unregister es_shrinker if mount failed
Otherwise destroyed ext_sb_info will be part of global shinker list
and result in the following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <ffff88011d5cbcd8>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-03 22:10:52 -04:00
Dmitry Monakhov 5d3ee20855 ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
2013-04-03 22:08:52 -04:00
Theodore Ts'o 996bb9fddd ext4: support simple conversion of extent-mapped inodes to use i_blocks
In order to make it simpler to test the code which support
i_blocks/indirect-mapped inodes, support the conversion of inodes
which are less than 12 blocks and which are contained in no more than
a single extent.

The primary intended use of this code is to converting freshly created
zero-length files and empty directories.

Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
check that prevents the clearing of the extent flag.  A simple patch
which allows "chattr -e <file>" to work will be checked into the
e2fsprogs git repository.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:04:52 -04:00
Theodore Ts'o d76a3a7711 ext4/jbd2: don't wait (forever) for stale tid caused by wraparound
In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, that the tid number space might wrap,
causing tid_geq()'s calculations to fail.

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: George Barnett <gbarnett@atlassian.com>
Cc: stable@vger.kernel.org
2013-04-03 22:02:52 -04:00
Theodore Ts'o b10a44c369 ext4: add might_sleep() annotations
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 22:00:52 -04:00
Theodore Ts'o 19b5ef6157 ext4: add mutex_is_locked() assertion to ext4_truncate()
[ Added fixup from Lukáš Czerner which only checks the assertion when
  the inode is not new and is not being freed. ]

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 21:58:52 -04:00
Theodore Ts'o 819c4920b7 ext4: refactor truncate code
Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
ext4_truncate().  This saves over 60 lines of code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:47:17 -04:00
Theodore Ts'o 26a4c0c6cc ext4: refactor punch hole code
Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
into ext4_punch_hole().  This saves over 150 lines of code.

This also fixes a potential bug when the punch_hole() code is racing
against indirect-to-extents or extents-to-indirect migation.  We are
currently using i_mutex to protect against changes to the inode flag;
specifically, the append-only, immutable, and extents inode flags.  So
we need to take i_mutex before deciding whether to use the
extents-specific or indirect-specific punch_hole code.

Also, there was a missing call to ext4_inode_block_unlocked_dio() in
the indirect punch codepath.  This was added in commit 02d262dffc
to block DIO readers racing against the punch operation in the
codepath for extent-mapped inodes, but it was missing for
indirect-block mapped inodes.  One of the advantages of refactoring
the code is that it makes such oversights much less likely.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:45:17 -04:00
Theodore Ts'o 781f143ea0 ext4: fold ext4_alloc_blocks() in ext4_alloc_branch()
The older code was far more complicated than it needed to be because
of how we spliced in the ext4's new multiblock allocator into ext3's
indirect block code.  By folding ext4_alloc_blocks() into
ext4_alloc_branch(), we make the code far more understable, shave off
over 130 lines of code and half a kilobyte of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 12:43:17 -04:00
Zheng Liu eed4333f08 ext4: fold ext4_generic_write_end() into ext4_write_end()
After collapsing the handling of data ordered and data writeback
codepath, ext4_generic_write_end() has only one caller,
ext4_write_end().  So we fold it into ext4_write_end().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:41:17 -04:00
Theodore Ts'o 74d553aad7 ext4: collapse handling of data=ordered and data=writeback codepaths
The only difference between how we handle data=ordered and
data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
code duplication by factoring out redundant the code paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 12:39:17 -04:00
Zheng Liu 8cde7ad17e ext4: fix big-endian bugs which could cause fs corruptions
When an extent was zeroed out, we forgot to do convert from cpu to le16.
It could make us hit a BUG_ON when we try to write dirty pages out.  So
fix it.

[ Also fix a bug found by Dmitry Monakhov where we were missing
  le32_to_cpu() calls in the new indirect punch hole code.

  There are a number of other big endian warnings found by static code
  analyzers, but we'll wait for the next merge window to fix them all
  up.  These fixes are designed to be Obviously Correct by code
  inspection, and easy to demonstrate that it won't make any
  difference (and hence, won't introduce any bugs) on little endian
  architectures such as x86.  --tytso ]

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-04-03 12:37:17 -04:00
Linus Torvalds d3c926264a Fix a number of regression and other bugs in ext4, most of which were
relatively obscure cornercases or races that were found using
 regression tests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRSm5lAAoJENNvdpvBGATwZW8QAN7jMn7IaVCTXXblqgqba4uN
 KvLGRgK7R/n1rIhdHoxJHumwRQLTppVzjDCc8ePnWhdypzMZNuzUvs+OoCFdkDsW
 qf3CmL/p/R1oSiSzzFIs/7wGp7xBZ0l0BWZMFWd9EUg9cqoMBDA6KzcMF95fOtas
 KsjRL+BThacVldS7jyKFwE4BrpXd0Z5V9qZ6wjQPPoBx8sXF4iYA+CZVo5FUKBs8
 6I82LS1/PIYCe3IOSpCgyKXQqRzAYJANv1ndken5wW8jWT2R58e360OwZEVcpIN9
 /caov+F5OKfk4iOGq3b+vwRplNhAI2S6C4vhMbmS2GPWE8Fnr8gubyqNAIIs5R/y
 3zYHdqZESfuEF7K3QoAepiJhi3YIoRxXC1FxD7uxx7VBRhW2w8Ij5hlXhuSoh24M
 MUiXgCeIxQb+ZfUx0OHV++LSOHVccU4y7Z0X+LpXQa6tEMBuSgK6yCKsGkyr8APN
 gPMupTptgyUE3tFaCjqc7QKtmoeRAMSvzfqEyV6DlblIOe+3f/RJzRO222Xc4kxq
 D9t2tOuPoXsR+ivtS5pEcrZkE4Y2hkJbJzb7XXvfoETixYsuX6VkiPK/D68S9eRe
 VelqTM2lHPJi/3Wkle0p4pzWpEq70D8qZVp4TKLHMJCTQKpwUfopm5lvln87lc7w
 4JDORIx/ed1u8MMTJlmG
 =X3vc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of regression and other bugs in ext4, most of which were
  relatively obscure cornercases or races that were found using
  regression tests."

* tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: fix data=journal fast mount/umount hang
  ext4: fix ext4_evict_inode() racing against workqueue processing code
  ext4: fix memory leakage in mext_check_coverage
  ext4: use s_extent_max_zeroout_kb value as number of kb
  ext4: use atomic64_t for the per-flexbg free_clusters count
  jbd2: fix use after free in jbd2_journal_dirty_metadata()
  ext4: reserve metadata block for every delayed write
  ext4: update reserved space after the 'correction'
  ext4: do not use yield()
  ext4: remove unused variable in ext4_free_blocks()
  ext4: fix WARN_ON from ext4_releasepage()
  ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
  ext4: update extent status tree after an extent is zeroed out
  ext4: fix wrong m_len value after unwritten extent conversion
  ext4: add self-testing infrastructure to do a sanity check
  ext4: avoid a potential overflow in ext4_es_can_be_merged()
  ext4: invalidate extent status tree during extent migration
  ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
  ext4: add warning to ext4_convert_unwritten_extents_endio
  ext4: disable merging of uninitialized extents
  ...
2013-03-21 17:56:10 -07:00
Theodore Ts'o 2b405bfa84 ext4: fix data=journal fast mount/umount hang
In data=journal mode, if we unmount the file system before a
transaction has a chance to complete, when the journal inode is being
evicted, we can end up calling into jbd2_log_wait_commit() for the
last transaction, after the journalling machinery has been shut down.

Arguably we should adjust ext4_should_journal_data() to return FALSE
for the journal inode, but the only place it matters is
ext4_evict_inode(), and so to save a bit of CPU time, and to make the
patch much more obviously correct by inspection(tm), we'll fix it by
explicitly not trying to waiting for a journal commit when we are
evicting the journal inode, since it's guaranteed to never succeed in
this case.

This can be easily replicated via: 

     mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb

------------[ cut here ]------------
WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
Hardware name: Bochs
JBD2: bad log_start_commit: 3005630206 3005630206 0 0
Modules linked in:
Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
Call Trace:
 [<c015c0ef>] warn_slowpath_common+0x68/0x7d
 [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd
 [<c015c177>] warn_slowpath_fmt+0x2b/0x2f
 [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd
 [<c02b8075>] jbd2_log_start_commit+0x24/0x34
 [<c0279ed5>] ext4_evict_inode+0x71/0x2e3
 [<c021f0ec>] evict+0x94/0x135
 [<c021f9aa>] iput+0x10a/0x110
 [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce
 [<c0175284>] ? bit_waitqueue+0x50/0x50
 [<c028d23f>] ext4_put_super+0x52/0x294
 [<c020efe3>] generic_shutdown_super+0x48/0xb4
 [<c020f071>] kill_block_super+0x22/0x60
 [<c020f3e0>] deactivate_locked_super+0x22/0x49
 [<c020f5d6>] deactivate_super+0x30/0x33
 [<c0222795>] mntput_no_expire+0x107/0x10c
 [<c02233a7>] sys_umount+0x2cf/0x2e0
 [<c02233ca>] sys_oldumount+0x12/0x14
 [<c08096b8>] syscall_call+0x7/0xb
---[ end trace 6a954cc790501c1f ]---
jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2013-03-20 09:42:11 -04:00
Theodore Ts'o 1ada47d946 ext4: fix ext4_evict_inode() racing against workqueue processing code
Commit 84c17543ab (ext4: move work from io_end to inode) triggered a
regression when running xfstest #270 when the file system is mounted
with dioread_nolock.

The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
this guarantees that last io_end structure has been freed, but it does
not guarantee that the workqueue structure, which was moved into the
inode by commit 84c17543ab, is actually finished.  Once
ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
will allow ext4_ioend_wait() to return on CPU #2, at which point the
evict_inode() codepath can race against the workqueue code on CPU #1
accessing EXT4_I(inode)->i_unwritten_work to find the next item of
work to do.

Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
will be renamed ext4_ioend_shutdown(), since it is only used by
ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
until after truncate_inode_pages() and filemap_write_and_wait() are
called, to make sure all dirty pages have been written back and
flushed from the page cache first.

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
*pdpt = 0000000030bc3001 *pde = 0000000000000000 
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in:
Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
EIP is at cwq_activate_delayed_work+0x3b/0x7e
EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
Stack:
 f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
 f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
 f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
Call Trace:
 [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
 [<c01def1d>] process_one_work+0x5d8/0x637
 [<c040a10b>] ? ext4_end_bio+0x300/0x300
 [<c01e3105>] worker_thread+0x249/0x3ef
 [<c01ea317>] kthread+0xd8/0xeb
 [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
 [<c023a370>] ? trace_hardirqs_on+0x27/0x37
 [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
 [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
 6c c1 00 31 c9 83
EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
CR2: 0000000000000000
---[ end trace a1923229da53d8a4 ]---

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
2013-03-20 09:39:42 -04:00
Dmitry Monakhov 0e401101db ext4: fix memory leakage in mext_check_coverage
Regression was introduced by following commit 8c854473
TESTCASE (git://oss.sgi.com/xfs/cmds/xfstests.git):
#while true;do ./check 301 || break ;done

Also fix potential memory leakage in get_ext_path() once
ext4_ext_find_extent() have failed.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-18 11:40:19 -04:00
Eric W. Biederman fa7614ddd6 fs: Readd the fs module aliases.
I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module.  It turns out I was wrong.  At least mkinitcpio
in Arch linux uses these aliases.

So readd the preexising aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does.  So at some point we may be delete these aliases without
problems.  However that day is not today.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-12 18:55:21 -07:00
Lukas Czerner 4f42f80a8f ext4: use s_extent_max_zeroout_kb value as number of kb
Currently when converting extent to initialized, we have to decide
whether to zeroout part/all of the uninitialized extent in order to
avoid extent tree growing rapidly.

The decision is made by comparing the size of the extent with the
configurable value s_extent_max_zeroout_kb which is in kibibytes units.

However when converting it to number of blocks we currently use it as it
was in bytes. This is obviously bug and it will result in ext4 _never_
zeroout extents, but rather always split and convert parts to
initialized while leaving the rest uninitialized in default setting.

Fix this by using s_extent_max_zeroout_kb as kibibytes.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-12 12:40:04 -04:00
Theodore Ts'o 90ba983f68 ext4: use atomic64_t for the per-flexbg free_clusters count
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-11 23:39:59 -04:00
Lukas Czerner 386ad67c9a ext4: reserve metadata block for every delayed write
Currently we only reserve space (data+metadata) in delayed allocation if
we're allocating from new cluster (which is always in non-bigalloc file
system) which is ok for data blocks, because we reserve the whole cluster.

However we have to reserve metadata for every delayed block we're going
to write because every block could potentially require metedata block
when we need to grow the extent tree.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-03-10 22:50:00 -04:00
Lukas Czerner 232ec8720d ext4: update reserved space after the 'correction'
Currently in ext4_ext_map_blocks() in delayed allocation writeback
we would update the reservation and after that check whether we claimed
cluster outside of the range of the allocation and if so, we'll give the
block back to the reservation pool.

However this also means that if the number of reserved data block
dropped to zero before the correction, we would release all the metadata
reservation as well, however we might still need it because the we're
not done with the delayed allocation and there might be more blocks to
come. This will result in error messages such as:

EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
with reserved 1 data blocks)

This will only happen on bigalloc file system and it can be easily
reproduced using fiemap-tester from xfstests like this:

./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file

Or using xfstests such as 225.

Fix this by doing the correction first and updating the reservation
after that so that we do not accidentally decrease
i_reserved_data_blocks to zero.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:46:30 -04:00
Lukas Czerner bb8b20ed94 ext4: do not use yield()
Using yield() is strongly discouraged (see sched/core.c) especially
since we can just use cond_resched().

Replace all use of yield() with cond_resched().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:28:09 -04:00
Lukas Czerner e3d85c3660 ext4: remove unused variable in ext4_free_blocks()
Remove unused variable 'freed' in ext4_free_blocks().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:21:49 -04:00
Jan Kara e1c36595be ext4: fix WARN_ON from ext4_releasepage()
ext4_releasepage() warns when it is passed a page with PageChecked set.
However this can correctly happen when invalidate_inode_pages2_range()
invalidates pages - and we should fail the release in that case. Since
the page was dirty anyway, it won't be discarded and no harm has
happened but it's good to be safe. Also remove bogus page_has_buffers()
check - we are guaranteed page has buffers in this function.

Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Tested-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-03-10 22:19:00 -04:00
Zheng Liu 3a2256702e ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
This commit fixes a wrong return value of the number of the allocated
blocks in ext4_split_extent.  When the length of blocks we want to
allocate is greater than the length of the current extent, we return a
wrong number.  Let's see what happens in the following case when we
call ext4_split_extent().

  map: [48, 72]
  ex:  [32, 64, u]

'ex' will be split into two parts:
  ex1: [32, 47, u]
  ex2: [48, 64, w]

'map->m_len' is returned from this function, and the value is 24.  But
the real length is 16.  So it should be fixed.

Meanwhile in this commit we use right length of the allocated blocks
when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
is called.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2013-03-10 21:20:23 -04:00
Zheng Liu adb2355104 ext4: update extent status tree after an extent is zeroed out
When we try to split an extent, this extent could be zeroed out and mark
as initialized.  But we don't know this in ext4_map_blocks because it
only returns a length of allocated extent.  Meanwhile we will mark this
extent as uninitialized because we only check m_flags.

This commit update extent status tree when we try to split an unwritten
extent.  We don't need to worry about the status of this extent because
we always mark it as initialized.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 21:13:05 -04:00
Zheng Liu cdee78433c ext4: fix wrong m_len value after unwritten extent conversion
The ext4_ext_handle_uninitialized_extents() function was assuming the
return value of ext4_ext_map_blocks() is equal to map->m_len.  This
incorrect assumption was harmless until we started use status tree as
a extent cache because we need to update status tree according to
'm_len' value.

Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
conversion.  It shouldn't cause a bug because we update status tree
according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.

After applied this commit, the following error message from self-testing
infrastructure disappears.

    ...
    kernel: ES len assertation failed for inode: 230 retval 1 !=
    map->m_len 3 in ext4_map_blocks (allocation)
    ...

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 21:08:52 -04:00
Dmitry Monakhov 921f266bc6 ext4: add self-testing infrastructure to do a sanity check
This commit adds a self-testing infrastructure like extent tree does to
do a sanity check for extent status tree.  After status tree is as a
extent cache, we'd better to make sure that it caches right result.

After applied this commit, we will get a lot of messages when we run
xfstests as below.

...
kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
3 in ext4_map_blocks (allocation)
...
kernel: ES cache assertation failed for inode: 230 es_cached ex
[974/2/4781/20] != found ex [974/1/4781/1000]
...
kernel: ES insert assertation failed for inode: 635 ex_status
[0/45/21388/w] != es_status [44/1/21432/u]
...

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 21:01:03 -04:00
Zheng Liu bd384364c1 ext4: avoid a potential overflow in ext4_es_can_be_merged()
Check the length of an extent to avoid a potential overflow in
ext4_es_can_be_merged().

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10 20:48:59 -04:00
Dmitry Monakhov 6ca470d7b5 ext4: invalidate extent status tree during extent migration
mext_replace_branches() will change inode's extents layout so
we have to drop corresponding cache.

TESTCASE:  301'th xfstest was not yet accepted to official xfstest's branch
and can be found here: 7b7efeee30

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:50:47 -05:00
Jan Kara de99fcce1d ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
Now that we don't merge uninitialized extents anymore,
ext4_fallocate() is free to operate on the inode while there are still
some extent conversions pending - it won't disturb them in any way.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-04 00:43:32 -05:00
Dmitry Monakhov ff95ec22cd ext4: add warning to ext4_convert_unwritten_extents_endio
Splitting extents inside endio is a bad thing, but unfortunately it is
still possible.  In fact we are pretty close to the moment when all
related issues will be fixed.  Let's warn developer if it still the
case.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:41:05 -05:00
Dmitry Monakhov ec22ba8edb ext4: disable merging of uninitialized extents
Derived from Jan's patch:http://permalink.gmane.org/gmane.comp.file-systems.ext4/36470

Merging of uninitialized extents creates all sorts of interesting race
possibilities when writeback / DIO races with fallocate. Thus
ext4_convert_unwritten_extents_endio() has to deal with a case where
extent to be converted needs to be split out first. That isn't nice
for two reasons:

1) It may need allocation of extent tree block so ENOSPC is possible.
2) It complicates end_io handling code

So we disable merging of uninitialized extents which allows us to simplify
the code. Extents will get merged after they are converted to initialized
ones.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:36:06 -05:00
Dmitry Monakhov 357b66fdc8 ext4: ext4_split_extent should take care of extent zeroout
When ext4_split_extent_at() ends up doing zeroout & conversion to
initialized instead of split & conversion, ext4_split_extent() gets
confused and can wrongly mark the extent back as uninitialized
resulting in end IO code getting confused from large unwritten extents
and may result in data loss.

The example of problematic behavior is:
			    lblk len              lblk len
  ext4_split_extent() (ex=[1000,30,uninit], map=[1010,10])
    ext4_split_extent_at() (split [1000,30,uninit] at 1020)
      ext4_ext_insert_extent() -> ENOSPC
      ext4_ext_zeroout()
	 -> extent [1000,30] is now initialized
    ext4_split_extent_at() (split [1000,30,init] at 1010,
			     MARK_UNINIT1 | MARK_UNINIT2)
      -> extent is split and parts marked as uninitialized

Fix the problem by rechecking extent type after the first
ext4_split_extent_at() returns. None of split_flags can not be applied
to initialized extent so this patch also add BUG_ON to prevent similar
issues in future.

TESTCASE: b8a55eb5ce

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-03-04 00:34:34 -05:00
Eric W. Biederman 7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
Linus Torvalds 56a79b7b02 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull  more VFS bits from Al Viro:
 "Unfortunately, it looks like xattr series will have to wait until the
  next cycle ;-/

  This pile contains 9p cleanups and fixes (races in v9fs_fid_add()
  etc), fixup for nommu breakage in shmem.c, several cleanups and a bit
  more file_inode() work"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  constify path_get/path_put and fs_struct.c stuff
  fix nommu breakage in shmem.c
  cache the value of file_inode() in struct file
  9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry
  9p: make sure ->lookup() adds fid to the right dentry
  9p: untangle ->lookup() a bit
  9p: double iput() in ->lookup() if d_materialise_unique() fails
  9p: v9fs_fid_add() can't fail now
  v9fs: get rid of v9fs_dentry
  9p: turn fid->dlist into hlist
  9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine
  more file_inode() open-coded instances
  selinux: opened file can't have NULL or negative ->f_path.dentry

(In the meantime, the hlist traversal macros have changed, so this
required a semantic conflict fixup for the newly hlistified fid->dlist)
2013-03-03 13:23:03 -08:00
Linus Torvalds a7c1120d2d Various bug fixes for ext4. The most important is a fix for the new
extent cache's slab shrinker which can cause significant, user-visible
 pauses when the system is under memory pressure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRMpM3AAoJENNvdpvBGATwiXgP/3eSg3C+M0ZUeL6lH3aXRxQO
 PHxUL/Di5cfFs3GX4DksVzsD1KkTIz8B424AhdahrrgGh1jTx4/J23OrEdu9nK24
 JGU5hmowoCyG8PZG1kGMbX6EYcblYTx+O2tX/RInnRExm9ajkfxb0S1g0Vl340qw
 58WTSWfl2+J/3RnJ9TYX/qNVeCJdxLH3GkpFbvQbLGyylfM9hsUD5MZMAR1bpOJF
 U2vNdK3n65W0AtKhLo7TYnoJ4ll2PoFRvffS0rqhEpIAcRxpVsNThFJLBcOQ1a79
 6cCN5uhrJOlL5jLN/fYCViU1+03y7itCMJmtSpuyV8DtUGjf4r1tzlvWGeiSmpB9
 NprZ/MgO1ROnzO/gzPM2s4nWWeGZiGaf7vMDyScIDtqF1ckfHN17jqazuSJcybN8
 U83O9+KyhHkvr/+zqlySXiBX2MUSUdSE37CsMC7R+mAz7C46yjXEPuG8pLkLCWiG
 gjMD30D1f6+h+K646WN497+Crxl1CurEH+ON7k158cNvVNlX1FfFHUprRHeNUXkV
 tEKjiCUCf5WjNeFEc93nC/nDi4OIISD25N9LyHzp2CcV/XXRjpsrNPBFDAZjwgiK
 YVUQIwocVUVlRaACzrM9sDFtSELqNzy/GLuERITu1Mb2R4sMXIyvvJkjc+EuQS0F
 XVQ3BU5ypWyxJGrSGCPd
 =+vcC
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most important is a fix for the new
  extent cache's slab shrinker which can cause significant, user-visible
  pauses when the system is under memory pressure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  ext4: use percpu counter for extent cache count
  ext4: optimize ext4_es_shrink()
2013-03-02 19:33:21 -08:00
Jan Kara 9b2ff35753 ext4: enable quotas before orphan cleanup
When using quota feature we need to enable quotas before orphan cleanup
so that changes happening during it are properly reflected in quota
accounting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 18:22:38 -05:00
Jan Kara 262b4662f4 ext4: don't allow quota mount options when quota feature enabled
So far we silently ignored when quota mount options were set while quota
feature was enabled.  But this can create confusion in userspace when
mount options are set but silently ignored and also creates opportunities
for bugs when we don't properly test all quota types.  Actually
ext4_mark_dquot_dirty() forgets to test for quota feature so it was
dependent on journaled quota options being set.  OTOH ext4_orphan_cleanup()
tries to enable journaled quota when quota options are specified which is
wrong when quota feature is enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 17:57:08 -05:00
Zheng Liu d4e4395491 ext4: fix a warning from sparse check for ext4_dir_llseek
ext4_dir_llseek is only used as a callback function, and no one calls
it directly.  So make it as a static function in order to remove a
warning message from sparse check.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 17:24:05 -05:00
Lukas Czerner 810da240f2 ext4: convert number of blocks to clusters properly
We're using macro EXT4_B2C() to convert number of blocks to number of
clusters for bigalloc file systems.  However, we should be using
EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-02 17:18:58 -05:00
Wei Yongjun 3e36a16375 ext4: fix possible memory leak in ext4_remount()
'orig_data' is malloced in ext4_remount() and should be freed
before leaving from the error handling cases, otherwise it will
cause memory leak.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-02 17:13:55 -05:00
Theodore Ts'o 1ac6466f25 ext4: use percpu counter for extent cache count
Use a percpu counter rather than atomic types for shrinker accounting.
There's no need for ultimate accuracy in the shrinker, so this
should come a little more cheaply.  The percpu struct is somewhat
large, but there was a big gap before the cache-aligned
s_es_lru_lock anyway, and it fits nicely in there.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 10:27:46 -05:00
Theodore Ts'o 246307745c ext4: optimize ext4_es_shrink()
When the system is under memory pressure, ext4_es_srhink() will get
called very often.  So optimize returning the number of items in the
file system's extent status cache by keeping a per-filesystem count,
instead of calculating it each time by scanning all of the inodes in
the extent status cache.

Also rename the slab used for the extent status cache to be
"ext4_extent_status" so it's obviousl the slab in question is created
by ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2013-02-28 23:58:56 -05:00
Linus Torvalds de1a2262b0 2 writeback fixes
- fix negative (setpoint - dirty) in 32bit archs
 - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJRLrFaAAoJECvKgwp+S8JaV2IP/jo34e3Ht0gvIfxz9rh05dvR
 LqBmSAXXJQYgxUKUjYECuyLahIciniKYZp/fS6s5myOPAayiirB70rC1W85Kz8Sm
 uR1wDvG0g1AyK39kJas+WZw2fJFicthSSp29jhTH0upbEcMX+/tzsHTsJRH1WqI0
 rtV8wHVxDu4+njz44hZIVxmJ9S7XZCuw8D6NfbyobmAqOm35j0VJ7uzQOxrNoJDe
 lvnwEGXfSU9KTfOIxt4K0d+lovXT6IRfN0qfdgcrWwxx9QJ/cU5F5b6cjdN9BsEF
 oq2UKSihbU55PdgUk6DfMJ3t7AXS/u2/P5a8PNfoNL9ovKQMJMHPXXDtxXmwCvcI
 aaYbULbwojMWZyrijViJpkftVKKtM/96X/jyCsof96UhJdah8c9wM44k1LDRBYXi
 WbQbD+doUII+pEmxUxF3Chrk/Yi3T5q2IWiVsixUEGewrSChOSqMIXOcSpgz97lL
 eGmNHgC/rn5TdDx8J3u0V+1+QYCvNxC25GG4E2+9QhU+mecLKt+IG1Dhn35xUjq1
 kjgfrNWJC6zxlIq7owk8pTI7DxiV/iMqogR5mMDz0umrPrid/J/xb6zxuAcnk3WU
 j0clNu7gzIYB8NjxBskO3Fg2AWKJxSohpu+r9wcjmjf0T5uEUmLwpI0i4tdDlYNw
 IvcmOpF1I2Ja5TrW8HWw
 =j9Sn
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback fixes from Wu Fengguang:
 "Two writeback fixes

   - fix negative (setpoint - dirty) in 32bit archs

   - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  Negative (setpoint-dirty) in bdi_position_ratio()
  vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
2013-02-28 13:21:44 -08:00
Al Viro 6131ffaa1f more file_inode() open-coded instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27 16:59:05 -05:00
Linus Torvalds 7bf2fbcdf5 This fixes a real brown paper bag bug which causes ext4 to choke on
file systems larger than 512GB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRLmSkAAoJENNvdpvBGATwhf8P/1OqyhnveyeGPGHF3gvufx6W
 QLwVdIAWLIuLmltmgXhah/IugYOYv2msIFkmtpH/exx9mmXSfRpScZ1QQfPoIuW8
 ovZDljbkDUGyYiZPvHBKnek+8EK8zwIq2TQYz1E87rvFqCg97EonwzjeUo6dcggv
 toQvHFBjCAKGUI9qGiTW9Xthqh+6MUYGw9Wh3CtUkOhNMqhyc2U0SxMuqoc32qVZ
 Sj8Yn4Sba5CRnMSqZbFEJh3B6JtDrsw7tTe/W0Oqu6X+z/ro9ky8Gg7OpxTyWzHI
 HyJKYPDwkS+Bj8n4oH3Q8EKs6mA5rzFrwwH0vNA/GErRdga2SEHxoaLhnP8LGhDi
 GrdZGOfOIAC6qwYaNA9G+uk3Zt73N+BurIgbUMExOLRz51ViTw8ep9j0mNvYoZLZ
 S977/SO4axtv34Gmsi3oi3zuy+zmHPRQnLpWo3bghPo0jIuSVl225zqIkkVf8jaE
 zpRUNvycymaCSEMc2daLXhJDKOO1Ll46wwsSZ7ppZqRBMGV4jp2c3I7sG5GNcB4Z
 TmxKoIVhXlCs1sD3cLXNpEwdpSA4NWDLFtfqNPyssyGrzafVIMH1cYUzAvP36Ymx
 mLFtSt5pnEnUF/C8Kb3Yowsh0kOdnkVmadgKdDa05fEmiwLd0oxGFIqed9zhhIcI
 vKEiSNphSdwxXPbGF7ZF
 =cuPi
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 regression fix from Theodore Ts'o:
 "This fixes a real brown paper bag bug which causes ext4 to choke on
  file systems larger than 512GB."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix extent status tree regression for file systems > 512GB
2013-02-27 12:22:30 -08:00
Theodore Ts'o 8e919d1304 ext4: fix extent status tree regression for file systems > 512GB
This fixes a regression introduced by commit f7fec032aa.  The
problem was that the extents status flags caused us to mask out block
numbers smaller than 2**28 blocks.  Since we didn't test with file
systems smaller than 512GB, we didn't notice this during the
development cycle.

A typical failure looks like this:

EXT4-fs error (device sdb1): htree_dirblock_to_tree:919: inode #172235804: block
152052301: comm ls: bad entry in directory: rec_len is smaller than minimal -
offset=0(0), inode=0, rec_len=0, name_len=0

... where 'debugfs -R "stat <172235804>" /dev/sdb1' reports that the
inode has block number 688923213.  When viewed in hex, block number
152052301 (from the syslog) is 0x910224D, while block number 688923213
is 0x2910224D.  Note the missing "0x20000000" in the block number.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Verified-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Dave Jones <davej@redhat.com>
Verified-by: Dave Jones <davej@redhat.com>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-27 14:54:37 -05:00
Linus Torvalds d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Linus Torvalds 6515925b82 The one new feature added in this patch series is the ability to use
the "punch hole" functionality for inodes that are not using extent
 maps.
 
 In the bug fix category, we fixed some races in the AIO and fstrim
 code, and some potential NULL pointer dereferences and memory leaks in
 error handling code paths.
 
 In the optimization category, we fixed a performance regression in the
 jbd2 layer introduced by commit d9b0193 (introduced in v3.0) which
 shows up in the AIM7 benchmark.  We also further optimized jbd2 by
 minimize the amount of time that transaction handles are held active.
 
 This patch series also features some additional enhancement of the
 extent status tree, which is now used to cache extent information in a
 more efficient/compact form than what we use on-disk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRLRs7AAoJENNvdpvBGATwNb8QAML+TjGtHlJ1coDUzGT2Cq9R
 yREAzI1N/+Phiohy3O0JNx55uPvYEMx6+zi+JCNSs1/gnf/OWruESTXssRbBv3Yd
 WxfOiCIaK8BbOEGZlMwGsFDCzVNKfvHxRrmyeHtcyUONKLFQUmBcE/woVPHcsvlE
 ya/zGnD2e58NaGwS643bqfvTrVt/azH0U0osNCNwfZepZmboEXK8fzT9b3Auh+1Q
 EI28m0GSRp0V0cgwOEN54EhTtocyS30GN8sbC1K5cFHK8tGLhyVwnvIonyFDI5/D
 GOkEPeRb7v2FwGpAilQ/V0jT++E//7zzyMFwvIY1U6b1dzBFCaJUuLMO1R8xoaoa
 c/Qd3AFIt1anS66qZAnW3m5rRyJgU2YA3VrKJj4q0jPKCh+k3+EqVfNTOB8BPLmC
 oCI/4ApUyHeYDdcErFjW4VDJ5N0debPP4yjma3uUtdM7RvQvMdQECnkAjIDCcGKe
 bMc7dtI9jdUYDCPGDeOjdrvk623QpE7J4Pf6iSQ5WxA4f2QmOQ8uIuGe8CPQSVtQ
 bUYjkthtWX2cX2/kHVvSYx6FzAjkgwmxCpAaiCXtGploxJIDjlWkiTXibkRYPLp4
 jBmQPK8ct8bl98k/i3mdybZnJU2TxWLA45hub0zBYs0aSgi8HzFyd+y8DiCKRS0S
 2sANbrsKG6TCzZ6C6ods
 =KSV1
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Theodore Ts'o:
 "The one new feature added in this patch series is the ability to use
  the "punch hole" functionality for inodes that are not using extent
  maps.

  In the bug fix category, we fixed some races in the AIO and fstrim
  code, and some potential NULL pointer dereferences and memory leaks in
  error handling code paths.

  In the optimization category, we fixed a performance regression in the
  jbd2 layer introduced by commit d9b01934d5 ("jbd: fix fsync() tid
  wraparound bug", introduced in v3.0) which shows up in the AIM7
  benchmark.  We also further optimized jbd2 by minimize the amount of
  time that transaction handles are held active.

  This patch series also features some additional enhancement of the
  extent status tree, which is now used to cache extent information in a
  more efficient/compact form than what we use on-disk."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (65 commits)
  ext4: fix free clusters calculation in bigalloc filesystem
  ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()
  ext4: fix xattr block allocation/release with bigalloc
  ext4: reclaim extents from extent status tree
  ext4: adjust some functions for reclaiming extents from extent status tree
  ext4: remove single extent cache
  ext4: lookup block mapping in extent status tree
  ext4: track all extent status in extent status tree
  ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
  ext4: rename and improbe ext4_es_find_extent()
  ext4: add physical block and status member into extent status tree
  ext4: refine extent status tree
  ext4: use ERR_PTR() abstraction for ext4_append()
  ext4: refactor code to read directory blocks into ext4_read_dirblock()
  ext4: add debugging context for warning in ext4_da_update_reserve_space()
  ext4: use KERN_WARNING for warning messages
  jbd2: use module parameters instead of debugfs for jbd_debug
  ext4: use module parameters instead of debugfs for mballoc_debug
  ext4: start handle at the last possible moment when creating inodes
  ext4: fix the number of credits needed for acl ops with inline data
  ...
2013-02-26 14:52:45 -08:00
Linus Torvalds bbbd27e694 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, ext3, udf updates from Jan Kara:
 "Several UDF fixes, a support for UDF extent cache, and couple of ext2
  and ext3 cleanups and minor fixes"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  Ext2: remove the static function release_blocks to optimize the kernel
  Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
  Ext2: remove the overhead check about sb in the function ext2_new_blocks
  udf: Remove unused s_extLength from udf_bitmap
  udf: Make s_block_bitmap standard array
  udf: Fix bitmap overflow on large filesystems with small block size
  udf: add extent cache support in case of file reading
  udf: Write LVID to disk after opening / closing
  Ext3: return ENOMEM rather than EIO if sb_getblk fails
  Ext2: return ENOMEM rather than EIO if sb_getblk fails
  Ext3: use unlikely to improve the efficiency of the kernel
  Ext2: use unlikely to improve the efficiency of the kernel
  Ext3: add necessary check in case IO error happens
  Ext2: free memory allocated and forget buffer head when io error happens
  ext3: Fix memory leak when quota options are specified multiple times
  ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
2013-02-26 14:51:52 -08:00
Al Viro 496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Lukas Czerner 304e220f08 ext4: fix free clusters calculation in bigalloc filesystem
ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.

Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
off by one. However r_blocks_count should always be a multiple of the
cluster ratio so doing a plain bit shift should be enough here. We
avoid using EXT4_B2C() because it's confusing.

As a result of the first problem number of free clusters is much bigger
than it should have been and ext4_has_free_clusters() would return 1 even
if there is really not enough free clusters available.

Fix this by removing the EXT4_C2B() conversion of free clusters and
using bit shift when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-22 15:27:52 -05:00
Eryu Guan d438147211 ext4: no need to remove extent if len is 0 in ext4_es_remove_extent()
len is 0 means no extent needs to be removed, so return immediately.
Otherwise it could trigger the following BUG_ON() in
ext4_es_remove_extent()

	end = lblk + len - 1;
	BUG_ON(end < lblk);

This could be reproduced by a simple truncate(1) command by an
unprivileged user

	truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile

The same is true for __es_insert_extent().

Patched kernel passed xfstests regression test.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-02-22 15:27:47 -05:00
Darrick J. Wong 1d1d1a7672 mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait.  Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function.  This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283

Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634

Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this
patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:19 -08:00
Lukas Czerner 1231b3a1eb ext4: fix xattr block allocation/release with bigalloc
Currently when new xattr block is created or released we we would call
dquot_free_block() or dquot_alloc_block() respectively, among the else
decrementing or incrementing the number of blocks assigned to the
inode by one block.

This however does not work for bigalloc file system because we always
allocate/free the whole cluster so we have to count with that in
dquot_free_block() and dquot_alloc_block() as well.

Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
blocks to the dquot_alloc/free functions to fix the problem.

The problem has been revealed by xfstests #117 (and possibly others).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2013-02-18 12:12:07 -05:00
Zheng Liu 74cd15cd02 ext4: reclaim extents from extent status tree
Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.

Here we maintain a lru list in super_block.  When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list.  The inode will be dropped from this list when it is
cleared.  In the inode, a counter is added to count the number of
cached objects in extent status tree.  Here only written/unwritten/hole
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it.  The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.

In this commit we use normal shrinker framework to reclaim memory from
the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
to count the number of reclaimable extents.  ext4_es_shrink() tries to
reclaim written/unwritten/hole extents from extent status tree.  The
inode that has been shrunk is moved to the tail of lru list.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:32:55 -05:00
Zheng Liu bdedbb7b8d ext4: adjust some functions for reclaiming extents from extent status tree
This commit changes some interfaces in extent status tree because we
need to use inode to count the cached objects in a extent status tree.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:32:02 -05:00
Zheng Liu 69eb33dc24 ext4: remove single extent cache
Single extent cache could be removed because we have extent status tree
as a extent cache, and it would be better.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:31:07 -05:00
Zheng Liu d100eef244 ext4: lookup block mapping in extent status tree
After tracking all extent status, we already have a extent cache in
memory.  Every time we want to lookup a block mapping, we can first
try to lookup it in extent status tree to avoid a potential disk I/O.

A new function called ext4_es_lookup_extent is defined to finish this
work.  When we try to lookup a block mapping, we always call
ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
first try to lookup a block mapping in extent status tree.

A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
in order not to put a hole into extent status tree because this hole
will be converted to delayed extent in the tree immediately.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:29:59 -05:00
Zheng Liu f7fec032aa ext4: track all extent status in extent status tree
By recording the phycisal block and status, extent status tree is able
to track the status of every extents.  When we call _map_blocks
functions to lookup an extent or create a new written/unwritten/delayed
extent, this extent will be inserted into extent status tree.

We don't load all extents from disk in alloc_inode() because it costs
too much memory, and if a file is opened and closed frequently it will
takes too much time to load all extent information.  So currently when
we create/lookup an extent, this extent will be inserted into extent
status tree.  Hence, the extent status tree may not comprehensively
contain all of the extents found in the file.

Here a condition we need to take care is that an extent might contains
unwritten and delayed status simultaneously because an extent is delayed
allocated and could be allocated by fallocate.  At this time we need to
keep delayed status because later we need to update delayed reservation
space using it.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:28:47 -05:00
Zheng Liu a25a4e1a5d ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
because in later commit ext4_map_blocks needs to use this flag to
determine the extent status.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:28:04 -05:00
Zheng Liu be401363ac ext4: rename and improbe ext4_es_find_extent()
This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
and improve this function.  First, we split input and output parameter.
Second, this function never return the first block of the next delayed
extent after 'es'.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:27:26 -05:00
Zheng Liu fdc0212e86 ext4: add physical block and status member into extent status tree
This commit adds two members in extent_status structure to let it record
physical block and extent status.  Here es_pblk is used to record both
of them because physical block only has 48 bits.  So extent status could
be stashed into it so that we can save some memory.  Now written,
unwritten, delayed and hole are defined as status.

Due to new member is added into extent status tree, all interfaces need
to be adjusted.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:26:51 -05:00
Zheng Liu 06b0c88621 ext4: refine extent status tree
This commit refines the extent status tree code.

1) A prefix 'es_' is added to to the extent status tree structure
members.

2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.

3) Rename extent_status_end() to ext4_es_end()

4) ext4_es_can_be_merged() is define to check whether two extents can
be merged or not.

5) Update and clarified comments.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18 00:26:51 -05:00
Theodore Ts'o 0f70b40613 ext4: use ERR_PTR() abstraction for ext4_append()
Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
pointer to an integer for the error code, as a code cleanup.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-15 03:35:57 -05:00
Theodore Ts'o dc6982ff4d ext4: refactor code to read directory blocks into ext4_read_dirblock()
The code to read in directory blocks and verify their metadata
checksums was replicated in ten different places across
fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
those replicated sites.  In some cases, ext4_error() was called with a
training newline.  In others, in particularly in empty_dir(), it was
possible to call ext4_dirent_csum_verify() on an index block, which
would trigger false warnings requesting the system adminsitrator to
run e2fsck.

By refactoring the code, we make the code more readable, as well as
shrinking the compiled object file by over 700 bytes and 50 lines of
code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 23:59:26 -05:00
Theodore Ts'o 01a523eb51 ext4: add debugging context for warning in ext4_da_update_reserve_space()
Print some additional debugging context to hopefully help to debug a
warning which is getting triggered by xfstests #74.

Also remove extraneous newlines from when printk's were converted to
ext4_warning() and ext4_msg().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 15:51:58 -05:00
Theodore Ts'o 8de5c325b4 ext4: use KERN_WARNING for warning messages
Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE.  Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14 15:11:41 -05:00
Theodore Ts'o a0b30c1229 ext4: use module parameters instead of debugfs for mballoc_debug
There are multiple reasons to move away from debugfs.  First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the ext4 module.

Secondly, as a module paramter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.

Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:28:20 -05:00
Theodore Ts'o 1139575a92 ext4: start handle at the last possible moment when creating inodes
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated.  In order to do this,
we need to start the handle in the ext4_new_inode().  So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:27:09 -05:00
Theodore Ts'o 95eaefbdec ext4: fix the number of credits needed for acl ops with inline data
Operations which modify extended attributes may need extra journal
credits if inline data is used, since there is a chance that some
extended attributes may need to get pushed to an external attribute
block.

Changes to reflect this was made in xattr.c, but they were missed in
fs/ext4/acl.c.  To fix this, abstract the calculation of the number of
credits needed for xattr operations to an inline function defined in
ext4_jbd2.h, and use it in acl.c and xattr.c.

Also move the function declarations used in inline.c from xattr.h
(where they are non-obviously hidden, and caused problems since
ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
them to ext4.h.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:23:03 -05:00
Theodore Ts'o 64044abf05 ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()
The ext4_unlink() and ext4_rmdir() don't actually release the blocks
associated with the file/directory.  This gets done in a separate jbd2
handle called via ext4_evict_inode().  Thus, we don't need to reserve
lots of journal credits for the truncate.

Note that using too many journal credits is non-optimal because it can
leading to the journal transmit getting closed too early, before it is
strictly necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:06:24 -05:00
Theodore Ts'o 4b217630d0 ext4: fix the number of credits needed for ext4_ext_migrate()
The migration ioctl creates a temporary inode.  Since this inode is
never linked to a directory, we don't need to reserve journal credits
required for modifying the directory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 12:50:27 -05:00
Theodore Ts'o 8dcfaad244 ext4: start handle at the last possible moment in ext4_rmdir()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:45:11 -05:00
Theodore Ts'o 931b68649d ext4: start handle at the last possible moment in ext4_unlink()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:43:39 -05:00
Theodore Ts'o 47564bfb95 ext4: grab page before starting transaction handle in write_begin()
The grab_cache_page_write_begin() function can potentially sleep for a
long time, since it may need to do memory allocation which can block
if the system is under significant memory pressure, and because it may
be blocked on page writeback.  If it does take a long time to grab the
page, it's better that we not hold an active jbd2 handle.

So grab a handle on the page first, and _then_ start the transaction
handle.

This commit fixes the following long transaction handle hold time:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:24:14 -05:00
Theodore Ts'o 9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Theodore Ts'o 722887ddc8 ext4: move the jbd2 wrapper functions out of super.c
Move the jbd2 wrapper functions which start and stop handles out of
super.c, where they don't really logically belong, and into
ext4_jbd2.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 13:00:31 -05:00
Theodore Ts'o 40ae348762 ext4: optimize mballoc for large allocations
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group.  That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.

For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.

To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10?  Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group.  If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.

To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
2013-02-04 15:08:40 -05:00
Theodore Ts'o 8dc0aa8cf0 ext4: check incompatible mount options while mounting ext2/3
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:38:39 -05:00
Jan Kara e33e60eaed ext4: print error when argument of inode_readahead_blk is invalid
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:14:31 -05:00
Jan Kara 5f3633e36b ext4: make mount option parsing loop more logical
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:09:36 -05:00
Jan Kara 0efb3b2300 ext4: move several mount options to standard handling loop
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 22:52:19 -05:00
Cong Ding 0e79537d30 ext4: reduce one "if" comparison in ext4_dirhash()
It is unnecessary to check i<4 after the loop; just do it before the
break.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 22:33:21 -05:00
Niu Yawei f116700971 ext4: fix race in ext4_mb_add_n_trim()
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-01 21:31:27 -05:00
Akria Fujita 87e698734b ext4: fix smatch warning in move_extent.c's mext_replace_branches()
Commit 2147b1a6a4 resulted in a new smatch warning:

> fs/ext4/move_extent.c:693 mext_replace_branches()
> 	 warn: variable dereferenced before check 'dext' (see line 683)

Fix this by adding a check to make sure dext is non-NULL before we
derefrence it.

Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com>
[ modified by tytso to make sure an ext4_error is called ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:52:46 -05:00
Julia Lawall 524c19ebc9 ext4: use WARN in ext4_alloc_blocks
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:07:21 -05:00
Jan Kara 091e26dfc1 ext4: fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-29 22:48:17 -05:00
Guo Chao b1deefc99e ext4: remove unnecessary NULL pointer check
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:41:02 -05:00
Guo Chao 41be871f74 ext4: remove useless assignment in dx_probe()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:33:28 -05:00
Guo Chao 2bbbee2a68 ext4: remove unused variable in add_dirent_to_buf()
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:26:44 -05:00
Guo Chao d5ac777305 ext4: release buffer when checksum failed
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:23:24 -05:00
Lukas Czerner b06acd38a4 ext4: remove explicit WARN_ON when ext4_map_blocks() fails
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.

Also use ext4_warning() instead of ext4_msg() and printk().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:21:12 -05:00
Lukas Czerner cfa7275482 ext4: remove unused variable flags
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-01-28 21:14:11 -05:00
Jan Kara fe386132f6 ext4: fix ext4_writepage() to achieve data=ordered guarantees
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:06:42 -05:00
Jan Kara 8a850c3fb8 ext4: Make ext4_bio_writepage() handle unprepared buffers
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:

a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
   pages that may have only some of dirty buffers mapped.

So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 20:53:28 -05:00
Jan Kara b6a8e62f8b ext4: simplify mpage_add_bh_to_extent()
The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 13:06:48 -05:00
Jan Kara f8bec37037 ext4: dirty page has always buffers attached
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 12:55:08 -05:00
Jan Kara 002bd7fa3a ext4: simplify list handling in ext4_do_flush_completed_IO()
The function splices i_completed_io_list to its private list
first.  From that moment on we don't need any lock for working with
io_end structures because all io_end structure on the list are only
our own. So we can remove the other two lists in the function and free
io_end immediately after we are done with it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:49:15 -05:00
Jan Kara 84c17543ab ext4: move work from io_end to inode
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself.  This also allows for a small simplification in
processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:43:46 -05:00
Jan Kara fe089c77f1 ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()
We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:38:49 -05:00
Jan Kara 1ae48a6354 ext4: use redirty_page_for_writepage() in ext4_bio_write_page()
When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.

Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:32:54 -05:00
Jan Kara 36ade451a5 ext4: Always use ext4_bio_write_page() for writeout
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:30:52 -05:00
Zheng Liu 8bad6fc813 ext4: add punching hole support for non-extent-mapped files
This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:21:37 -05:00
Chen Gang 03dafb5f59 ext4: fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24 23:24:58 -05:00
Theodore Ts'o 72ba74508b ext4: release sysfs kobject when failing to enable quotas on mount
In addition, print the error returned from ext4_enable_quotas()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-01-24 23:24:54 -05:00
Guo Chao 306a74920b ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
This macro, initially introduced by ext2 in v0.99.15, does not
have any users from the beginning. It has been removed in later
ext2 version but still remains in the code of ext3, ext4, ocfs2.
Remove this macro there.

Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Zheng Liu aaddea812c ext4: add tracepoint in punching hole
This patch adds a tracepoint in ext4_punch_hole.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-16 20:21:26 -05:00
Theodore Ts'o 7f5118629f ext4: trigger the lazy inode table initialization after resize
After we have finished extending the file system, we need to trigger a
the lazy inode table thread to zero out the inode tables.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-13 08:41:45 -05:00
Eryu Guan 15b49132fc ext4: check bh in ext4_read_block_bitmap()
Validate the bh pointer before using it, since
ext4_read_block_bitmap_nowait() might return NULL.

I've seen this in fsfuzz testing.

 EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0
 ...
 Call Trace:
  [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60
  [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80
  [<ffffffff811d0d36>] ? __getblk+0x36/0x70
  [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210
  [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140
  [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0
  [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100
  [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0
  [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230
  [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490
  [<ffffffff811b8cd7>] evict+0xa7/0x1c0
  [<ffffffff811b8ed3>] iput_final+0xe3/0x170
  [<ffffffff811b8f9e>] iput+0x3e/0x50
  [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90
  [<ffffffff81231d0b>] ext4_create+0xeb/0x170
  [<ffffffff811aae9c>] vfs_create+0xac/0xd0
  [<ffffffff811ac845>] lookup_open+0x185/0x1c0
  [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170
  [<ffffffff811acb54>] do_last+0x2d4/0x7a0
  [<ffffffff811af743>] path_openat+0xb3/0x480
  [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0
  [<ffffffff811afc49>] do_filp_open+0x49/0xa0
  [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150
  [<ffffffff8119da28>] do_sys_open+0x108/0x1f0
  [<ffffffff8119db51>] sys_open+0x21/0x30
  [<ffffffff81618959>] system_call_fastpath+0x16/0x1b

Also fix comment for ext4_read_block_bitmap_nowait()

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:33:25 -05:00
Wang Shilong aebf02430d ext4: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12 16:28:47 -05:00
Theodore Ts'o 860d21e2c5 ext4: return ENOMEM if sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:19:36 -05:00
Miao Xie 10ee27a06c vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
with down_read_trylock() because

- If ->s_umount is write locked, then the sb is not idle. That is
  writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

- writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
  writeback, it may bring us deadlock problem when doing umount. In order to
  fix the problem, ext4 and btrfs implemented their own writeback functions
  instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant
  code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().

The name of these two functions is cumbersome, so rename them to
try_to_writeback_inodes_sb(_nr).

This idea came from Christoph Hellwig.
Some code is from the patch of Kamal Mostafa.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-12 10:47:43 +08:00
Guo Chao fef0ebdb22 ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
This fixes a buffer cache leak when creating a directory, introduced
in commit a774f9c20.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:40:25 -05:00
Guo Chao 0ecaef0644 ext4: release buffer in failed path in dx_probe()
If checksum fails, we should also release the buffer
read from previous iteration.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>-
Cc: stable@vger.kernel.org
--
 fs/ext4/namei.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2013-01-06 23:38:47 -05:00
Valerie Aurora 96465efee1 ext4: fix configuration dependencies for ext4 ACLs and security labels
Commit "ext4: Remove CONFIG_EXT4_FS_XATTR" removed the configuration
dependencies for ext4 xattrs from the ext4 ACLs and security labels
configuration options, but did not replace them with a dependency on
ext4 itself.  Add back the dependency on ext4 so the options only show
up if ext4 is enabled.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:38:44 -05:00
Linus Torvalds 5439ca6b8f Various bug fixes for ext4. Perhaps the most serious bug fixed is one
which could cause file system corruptions when performing file punch
 operations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ374OAAoJENNvdpvBGATwEGAP/jKUwjQhBZiF0k9dg1kQ5eTz
 bdli4fy1vxrEMIOym8IZa4nBQJVCkArwRgjc28gCBD6k9u6X3GPa26vUydsoPfP6
 odPdc9c9HtsbYQGuaq1SohID5HfjxHewTcUmCs4X4SpGcSurUcT7eQYWqSuIxFHR
 0nKk8NO4EcWh2uqIoGPrc8QpSdor0DXXYYjZmHCeVLH1n6PyoMsnrFMfO9KqMLUL
 vNR54CX9n1GRTfAfJNkNzcwfs8IfNkDUyv5hFpDh15tLltogU0TqnlAl3vSeZGSx
 vVfhwHmQTK/bJyC3YaoRZqq9CQJVk2f/OTBpJDFY/USaapuitJd6vqbmh7NiRNAN
 LaKmFt99MPfwyjEhIA7+J0LCTraAxc536q43oWWK5dAJhWI7DW0lbHARVeQTixNy
 KJ1Lp0pmmz1mX8/lugOnK1SPBF525kTaoiz2bWqg4oQgn7mBzUlgj+EV22/6Rq83
 TpKOKstl4BiZi8t5AhmFiwqtknCDiT5vUKQNy2kuM/oXtPJID/lM/TJbR5viYD3l
 AH3Ef7xj61CynFZ0oBeraGwtXc2BHJpJdWz+8uj0/VhFfC+uNUYapSLFwyiAVZKO
 xxaItT3ylfKpa0AWK6HBc2SLuL72SCHAPks06YKFtSyHtr5C8SCcafxU2DSOSi7K
 VrhkcH6STa77Br7a1ORt
 =9R/D
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  Perhaps the most serious bug fixed is one
  which could cause file system corruptions when performing file punch
  operations."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid hang when mounting non-journal filesystems with orphan list
  ext4: lock i_mutex when truncating orphan inodes
  ext4: do not try to write superblock on ro remount w/o journal
  ext4: include journal blocks in df overhead calcs
  ext4: remove unaligned AIO warning printk
  ext4: fix an incorrect comment about i_mutex
  ext4: fix deadlock in journal_unmap_buffer()
  ext4: split off ext4_journalled_invalidatepage()
  jbd2: fix assertion failure in jbd2_journal_flush()
  ext4: check dioread_nolock on remount
  ext4: fix extent tree corruption caused by hole punch
2013-01-02 09:57:34 -08:00
Theodore Ts'o 0e9a9a1ad6 ext4: avoid hang when mounting non-journal filesystems with orphan list
When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree.  If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack.  (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:50 -05:00
Theodore Ts'o 721e3eba21 ext4: lock i_mutex when truncating orphan inodes
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is
called without i_mutex being taken.  It had previously not been taken
during orphan cleanup since races weren't possible at that point in
the mount process, but as a result of this c278531d39, we will now see
a kernel WARN_ON in this case.  Take the i_mutex in
ext4_orphan_cleanup() to suppress this warning.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:48 -05:00
Michael Tokarev d096ad0f79 ext4: do not try to write superblock on ro remount w/o journal
When a journal-less ext4 filesystem is mounted on a read-only block
device (blockdev --setro will do), each remount (for other, unrelated,
flags, like suid=>nosuid etc) results in a series of scary messages
from kernel telling about I/O errors on the device.

This is becauese of the following code ext4_remount():

       if (sbi->s_journal == NULL)
                ext4_commit_super(sb, 1);

at the end of remount procedure, which forces writing (flushing) of
a superblock regardless whenever it is dirty or not, if the filesystem
is readonly or not, and whenever the device itself is readonly or not.

We only need call ext4_commit_super when the file system had been
previously mounted read/write.

Thanks to Eric Sandeen for help in diagnosing this issue.

Signed-off-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-25 14:08:16 -05:00
Eric Sandeen 0875a2b448 ext4: include journal blocks in df overhead calcs
To more accurately calculate overhead for "bsd" style
df reporting, we should count the journal blocks as
overhead as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Eric Whitney <enwlinux@gmail.com>
2012-12-25 13:56:01 -05:00
Eric Sandeen a28a9178e8 ext4: remove unaligned AIO warning printk
Although I put this in, I now think it was a bad decision.  For most
users, there is very little to be done in this case.  They get the
message, once per day, with no real context or proposed action.  TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.

Just nuke it.  Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:33:13 -05:00
Andy Lutomirski ad96f71155 ext4: fix an incorrect comment about i_mutex
i_mutex is not held when ->sync_file is called.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:31:52 -05:00
Jan Kara 53e872681f ext4: fix deadlock in journal_unmap_buffer()
We cannot wait for transaction commit in journal_unmap_buffer()
because we hold page lock which ranks below transaction start.  We
solve the issue by bailing out of journal_unmap_buffer() and
jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
for waiting for transaction commit to finish and try invalidation
again. Since the issue can happen only for page stradding i_size, it
is simple enough to manually call jbd2_journal_invalidatepage() for
such page from ext4_setattr(), check the return value and wait if
necessary.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:29:52 -05:00
Jan Kara 4520fb3c36 ext4: split off ext4_journalled_invalidatepage()
In data=journal mode we don't need delalloc or DIO handling in invalidatepage
and similarly in other modes we don't need the journal handling. So split
invalidatepage implementations.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:28:54 -05:00
Jan Kara 261cb20cb2 ext4: check dioread_nolock on remount
Currently we allow enabling dioread_nolock mount option on remount for
filesystems where blocksize < PAGE_CACHE_SIZE.  This isn't really
supported so fix the bug by moving the check for blocksize !=
PAGE_CACHE_SIZE into parse_options(). Change the original PAGE_SIZE to
PAGE_CACHE_SIZE along the way because that's what we are really
interested in.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2012-12-20 00:07:18 -05:00