Commit Graph

1541 Commits

Author SHA1 Message Date
Linus Torvalds b2034d474b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
  fs: add documentation on fallocate hole punching
  Gfs2: fail if we try to use hole punch
  Btrfs: fail if we try to use hole punch
  Ext4: fail if we try to use hole punch
  Ocfs2: handle hole punching via fallocate properly
  XFS: handle hole punching via fallocate properly
  fs: add hole punching to fallocate
  vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
  fix signedness mess in rw_verify_area() on 64bit architectures
  fs: fix kernel-doc for dcache::prepend_path
  fs: fix kernel-doc for dcache::d_validate
  sanitize ecryptfs ->mount()
  switch afs
  move internal-only parts of ncpfs headers to fs/ncpfs
  switch ncpfs
  switch 9p
  pass default dentry_operations to mount_pseudo()
  switch hostfs
  switch affs
  switch configfs
  ...
2011-01-13 10:27:28 -08:00
Linus Torvalds 008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Andrew Morton 6db26ffc91 fs/ext4/inode.c: use pr_warn_ratelimited()
pr_warning_ratelimited() doesn't exist.

Also include printk.h, which defines these things.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 08:03:05 -08:00
Josef Bacik d6dc8462f4 Ext4: fail if we try to use hole punch
Ext4 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12 20:16:44 -05:00
Jan Kara f00c9e44ad quota: Fix deadlock during path resolution
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-12 19:14:55 +01:00
Linus Torvalds e9688f6aca Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: fix trimming starting with block 0 with small blocksize
  ext4: revert buggy trim overflow patch
  ext4: don't pass entire map to check_eofblocks_fl
  ext4: fix memory leak in ext4_free_branches
  ext4: remove ext4_mb_return_to_preallocation()
  ext4: flush the i_completed_io_list during ext4_truncate
  ext4: add error checking to calls to ext4_handle_dirty_metadata()
  ext4: fix trimming of a single group
  ext4: fix uninitialized variable in ext4_register_li_request
  ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
  ext4: drop i_state_flags on architectures with 64-bit longs
  ext4: reorder ext4_inode_info structure elements to remove unneeded padding
  ext4: drop ec_type from the ext4_ext_cache structure
  ext4: use ext4_lblk_t instead of sector_t for logical blocks
  ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
  ext4: fix 32bit overflow in ext4_ext_find_goal()
  ext4: add more error checks to ext4_mkdir()
  ext4: ext4_ext_migrate should use NULL not 0
  ext4: Use ext4_error_file() to print the pathname to the corrupted inode
  ext4: use IS_ERR() to check for errors in ext4_error_file
  ...
2011-01-11 14:37:31 -08:00
Jan Kara 0f0a25bf51 ext4: fix trimming starting with block 0 with small blocksize
When s_first_data_block is not zero (which happens e.g. when block size is 1KB)
and trim ioctl is called to start trimming from block 0, the math in
ext4_get_group_no_and_offset() overflows. The overall result is that ioctl
returns EINVAL which is kind of unexpected and we probably don't want
userspace tools to bother with internal details of filesystem structure.
So just silently increase starting offset (and shorten length) when starting
block is below s_first_data_block.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-11 15:16:31 -05:00
Theodore Ts'o 0a2179b169 ext4: revert buggy trim overflow patch
This reverts commit 4f531501e44: ext4: fix possible overflow in
ext4_trim_fs()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-11 14:42:29 -05:00
Eric Sandeen d002ebf1d8 ext4: don't pass entire map to check_eofblocks_fl
Since check_eofblocks_fl() only uses the m_lblk portion of the map
structure, we may as well pass that directly, rather than passing the
entire map, which IMHO obfuscates what parameters check_eofblocks_fl()
cares about.  Not a big deal, but seems tidier and less confusing, to
me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 13:03:35 -05:00
Theodore Ts'o 1c5b9e9065 ext4: fix memory leak in ext4_free_branches
Commit 40389687 moved a call to ext4_forget() out of
ext4_free_branches and let ext4_free_blocks() handle calling
bforget().  But that change unfortunately did not replace the call to
ext4_forget() with brelse(), which was needed to drop the in-use count
of the indirect block's buffer head, which lead to a memory leak when
deleting files that used indirect blocks.  Fix this.

Thanks to Hugh Dickins for pointing this out.

Cc: stable@kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:51:28 -05:00
Theodore Ts'o a5196f8cdf ext4: remove ext4_mb_return_to_preallocation()
This function was never implemented, except for a BUG_ON which was
tripping when ext4 is run without a journal.  The problem is that
although the comment asserts that "truncate (which is the only way to
free block) discards all preallocations", ext4_free_blocks() is also
called in various error recovery paths when blocks have been
allocated, but for various reasons, we were not able to use those data
blocks (for example, because we ran out of memory while trying to
manipulate the extent tree, or some other similar situation).

In addition to the fact that this function isn't implemented except
for the incorrect BUG_ON, the single caller of this function,
ext4_free_blocks(), doesn't use it all if the journal is enabled.

So remove the (stub) function entirely for now.  If we decide it's
better to add it back, it's only going to be useful with a relatively
large number of code changes anyway.

Google-Bug-Id: 3236408

Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:47:07 -05:00
Jiaying Zhang 3889fd57ea ext4: flush the i_completed_io_list during ext4_truncate
Ted first found the bug when running 2.6.36 kernel with dioread_nolock
mount option that xfstests #13 complained about wrong file size during fsck.
However, the bug exists in the older kernels as well although it is
somehow harder to trigger.

The problem is that ext4_end_io_work() can happen after we have truncated an
inode to a smaller size. Then when ext4_end_io_work() calls 
ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
been truncated, so the inode size becomes inconsistent with the allocated
blocks. 

The following patch flushes the i_completed_io_list during truncate to reduce 
the risk that some pending end_io requests are executed later and convert 
already truncated blocks to initialized. 

Note that although the fix helps reduce the problem a lot there may still 
be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
held, it can race with an ongoing write request so that the io_end request is
processed later when the corresponding blocks have been truncated.

Ted and I have discussed the problem offline and we saw a few ways to fix
the race completely:

a) We guarantee that i_mutex lock and i_alloc_sem write lock are both hold 
whenever vmtruncate() is called. The i_mutex lock prevents any new write
requests from entering writeback and the i_alloc_sem prevents the race
from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
is called from do_truncate(), which is probably the most common case.
However, there are places where we may call vmtruncate() without holding
either i_mutex or i_alloc_sem. I would like to ask for other people's
opinions on what locks are expected to be held before calling vmtruncate().
There seems a disagreement among the callers of that function.

b) We change the ext4 write path so that we change the extent tree to contain 
the newly allocated blocks and update i_size both at the same time --- when 
the write of the data blocks is completed.

c) We add some additional locking to synchronize vmtruncate() and 
ext4_end_io_work(). This approach may have performance implications so we
need to be careful.

All of the above proposals may require more substantial changes, so
we may consider to take the following patch as a bandaid.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:47:05 -05:00
Theodore Ts'o b40971426a ext4: add error checking to calls to ext4_handle_dirty_metadata()
Call ext4_std_error() in various places when we can't bail out
cleanly, so the file system can be marked as in error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:46:59 -05:00
Jan Kara ca6e909f9b ext4: fix trimming of a single group
When ext4_trim_fs() is called to trim a part of a single group, the
logic will wrongly set last block of the interval to 'len' instead
of 'first_block + len'. Thus a shorter interval is possibly trimmed.
Fix it.

CC: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:39 -05:00
Andrew Morton 6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o 8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o 353eb83c14 ext4: drop i_state_flags on architectures with 64-bit longs
We can store the dynamic inode state flags in the high bits of
EXT4_I(inode)->i_flags, and eliminate i_state_flags.  This saves 8
bytes from the size of ext4_inode_info structure, which when
multiplied by the number of the number of in the inode cache, can save
a lot of memory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:18:25 -05:00
Theodore Ts'o 8a2005d3f8 ext4: reorder ext4_inode_info structure elements to remove unneeded padding
By reordering the elements in the ext4_inode_info structure, we can
reduce the padding needed on an x86_64 system by 16 bytes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:42 -05:00
Theodore Ts'o b05e6ae58a ext4: drop ec_type from the ext4_ext_cache structure
We can encode the ec_type information by using ee_len == 0 to denote
EXT4_EXT_CACHE_NO, ee_start == 0 to denote EXT4_EXT_CACHE_GAP, and if
neither is true, then the cache type must be EXT4_EXT_CACHE_EXTENT.
This allows us to reduce the size of ext4_ext_inode by another 8
bytes.  (ec_type is 4 bytes, plus another 4 bytes of padding)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:26 -05:00
Theodore Ts'o 01f49d0b9d ext4: use ext4_lblk_t instead of sector_t for logical blocks
This fixes a number of places where we used sector_t instead of
ext4_lblk_t for logical blocks, which for ext4 are still 32-bit data
types.  No point wasting space in the ext4_inode_info structure, and
requiring 64-bit arithmetic on 32-bit systems, when it isn't
necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:13:03 -05:00
Theodore Ts'o f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Kazuya Mio ad4fb9cafe ext4: fix 32bit overflow in ext4_ext_find_goal()
ext4_ext_find_goal() returns an ideal physical block number that the block
allocator tries to allocate first. However, if a required file offset is
smaller than the existing extent's one, ext4_ext_find_goal() returns
a wrong block number because it may overflow at
"block - le32_to_cpu(ex->ee_block)". This patch fixes the problem.

ext4_ext_find_goal() will also return a wrong block number in case
a file offset of the existing extent is too big. In this case,
the ideal physical block number is fixed in ext4_mb_initialize_context(),
so it's no problem.

reproduce:
# dd if=/dev/zero of=/mnt/mp1/tmp bs=127M count=1 oflag=sync
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 seek=1 oflag=sync
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0     128    67456             128 eof
/mnt/mp1/file: 2 extents found
# rm -rf /mnt/mp1/tmp
# echo $((512*4096)) > /sys/fs/ext4/loop0/mb_stream_req
# dd if=/dev/zero of=/mnt/mp1/file bs=512K count=1 oflag=sync conv=notrunc

result (linux-2.6.37-rc2 + ext4 patch queue):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    33280             128 
   1     128    67456    33407    128 eof
/mnt/mp1/file: 2 extents found

result(apply this patch):
# filefrag -v /mnt/mp1/file
Filesystem type is: ef53
File size of /mnt/mp1/file is 1048576 (256 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0    66560             128 
   1     128    67456    66687    128 eof
/mnt/mp1/file: 2 extents found

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:28 -05:00
Namhyung Kim dabd991f9d ext4: add more error checks to ext4_mkdir()
Check return value of ext4_journal_get_write_access,
ext4_journal_dirty_metadata and ext4_mark_inode_dirty. Move brelse()
under 'out_stop' to release bh properly in case of journal error.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:16 -05:00
Eric Paris f1dffc4c54 ext4: ext4_ext_migrate should use NULL not 0
ext4_ext_migrate() calls ext4_new_inode() and passes 0 instead of a pointer
to a struct qstr.  This patch uses NULL, to make it obvious to the caller
that this was a pointer.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:11:00 -05:00
Theodore Ts'o f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Dan Carpenter 13195184a8 ext4: test the correct variable in ext4_init_pageio()
This is a copy and paste error.  The intent was to check
"io_page_cachep".  We tested "io_page_cachep" earlier.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:44 -05:00
Wang Sheng-Hui 6e9510b0e0 ext2,ext3,ext4: clarify comment for extN_xattr_set_handle
Signed-off-by: Wang Sheng-Hui <crosslonelyover@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:30 -05:00
Theodore Ts'o eaeef86718 ext4: clean up ext4_xattr_list()'s error code checking and return strategy
Any time you see code that tries to add error codes together, you
should want to claw your eyes out...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:07 -05:00
Lukas Czerner 9325963667 ext4: remove warning message from ext4_issue_discard helper
ext4_issue_discard is supposed to be helper for calling discard, however
in case that underlying device does not support discard it prints out
the warning message and clears the DISCARD t_mount_opt flag. Since it
can be (and is) used by others, it should not do anything and let the
caller to handle the error case.

This commit removes warning message and flag setting from
ext4_issue_discard and use it just in place where it is really needed
(release_blocks_on_commit). FITRIM ioctl should not set any flags nor it
should print out warning messages, so get rid of the warning as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:09:59 -05:00
Lukas Czerner 4f531501e4 ext4: fix possible overflow in ext4_trim_fs()
When determining last group through ext4_get_group_no_and_offset() the
result may be wrong in cases when range->start and range-len are too
big, because it may overflow when summing up those two numbers.

Fix that by checking range->len and limit its value to
ext4_blocks_count(). This commit was tested by myself with expected
result.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-01-10 12:04:55 -05:00
Nick Piggin 73598611ad ext2,3,4: provide simple rcu-walk ACL implementation
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:30 +11:00
Nick Piggin b74c79e993 fs: provide rcu-walk aware permission i_ops
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:29 +11:00
Nick Piggin fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Theodore Ts'o 8a7411a243 ext4: fix on-line resizing regression
https://bugzilla.kernel.org/show_bug.cgi?id=25352

This regression was caused by commit a31437b85: "ext4: use
sb_issue_zeroout in setup_new_group_blocks", by accidentally dropping
the code which reserved the block group descriptor and inode table
blocks.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-23 15:00:54 -05:00
Jiri Kosina 4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
Theodore Ts'o b72143ab3e ext4: Add error checking to kmem_cache_alloc() call in ext4_free_blocks()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-20 07:26:59 -05:00
Joe Perches 0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches 94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Eric Sandeen af0b44a197 ext4: zero out nanosecond timestamps for small inodes
When nanosecond timestamp resolution isn't supported on an ext4
partition (inode size = 128), stat() appears to be returning
uninitialized garbage in the nanosecond component of timestamps.

EXT4_INODE_GET_XTIME should zero out tv_nsec when EXT4_FITS_IN_INODE
evaluates to false.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:10:31 -05:00
Theodore Ts'o cad3f00763 ext4: optimize ext4_check_dir_entry() with unlikely() annotations
This function gets called a lot for large directories, and the answer
is almost always "no, no, there's no problem".  This means using
unlikely() is a good thing.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:07:02 -05:00
Jesper Juhl b17b35ec13 ext4: use kmem_cache_zalloc() in ext4_init_io_end()
Use advantage of kmem_cache_zalloc() to remove a memset() call in
ext4_init_io_end() and save a few bytes.

Before:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3016       0     624    3640     e38 fs/ext4/page-io.o
After:
 [jj@dragon linux-2.6]$ size fs/ext4/page-io.o
    text    data     bss     dec     hex filename
    3000       0     624    3624     e28 fs/ext4/page-io.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:41:55 -05:00
Tobias Klauser 6ca7b13dea ext4: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 21:38:46 -05:00
Theodore Ts'o a8901d3487 ext4: Use pr_warning_ratelimited() instead of printk_ratelimit()
printk_ratelimit() is deprecated since it is a global instead of a
per-printk ratelimit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-17 10:40:47 -05:00
Theodore Ts'o 225db7d35c ext4: Fix up comments in inode.c
This fixes up some broken argument descriptions that Namhyung Kim had
originally submitted for ext3.  This fixes the comments that were
still applicable in ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-16 16:38:26 -05:00
Theodore Ts'o a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o 673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Aaro Koskinen 6d5c3aa84b ext4: fix typo which broke '..' detection in ext4_find_entry()
There should be a check for the NUL character instead of '0'.

Fortunately the only thing that cares about this is NFS serving, which
is why we didn't notice this in the merge window testing.

Reported-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 21:45:31 -05:00
Theodore Ts'o 1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Lukas Czerner e681c047e4 ext4: Add EXT4_IOC_TRIM ioctl to handle batched discard
Filesystem independent ioctl was rejected as not common enough to be in
core vfs ioctl. Since we still need to access to this functionality this
commit adds ext4 specific ioctl EXT4_IOC_TRIM to dispatch
ext4_trim_fs().

It takes fstrim_range structure as an argument. fstrim_range is definec in
the include/linux/fs.h and its definition is as follows.

struct fstrim_range {
	__u64 start;
	__u64 len;
	__u64 minlen;
}

start	- first Byte to trim
len	- number of Bytes to trim from start
minlen	- minimum extent length to trim, free extents shorter than this
  number of Bytes will be ignored. This will be rounded up to fs
  block size.

After the FITRIM is done, the number of actually discarded Bytes is stored
in fstrim_range.len to give the user better insight on how much storage
space has been really released for wear-leveling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:47:07 -05:00
Lukas Czerner 93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Darrick J. Wong 5a9ae68a34 ext4: ext4_fill_super shouldn't return 0 on corruption
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret.  However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail.  This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.

A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 09:56:44 -05:00
Dan Carpenter f4c8cc652d ext4: missing unlock in ext4_clear_request_list()
If the the li_request_list was empty then it returned with the lock
held.  Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:25 -05:00
Markus Trippelsdorf 08da1193d2 ext4: fix setting random pages PageUptodate
ext4_end_bio calls put_page and kmem_cache_free before calling
SetPageUpdate(). This can result in setting the PageUptodate bit on
random pages and causes the following BUG:

 BUG: Bad page state in process rm  pfn:52e54
 page:ffffea0001222260 count:0 mapcount:0 mapping:          (null) index:0x0
 arch kernel: page flags: 0x4000000000000008(uptodate)

Fix the problem by moving put_io_page() after the SetPageUpdate() call.

Thanks to Hugh Dickins for analyzing this problem.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:06 -05:00
Wu Fengguang 380cf090f4 ext4: fix redirty_page_for_writepage() typo in comment
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-15 14:14:52 +01:00
Tejun Heo d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Linus Torvalds a7bcf21e60 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Add new ext4 inode tracepoints
  ext4: Don't call sb_issue_discard() in ext4_free_blocks()
  ext4: do not try to grab the s_umount semaphore in ext4_quota_off
  ext4: fix potential race when freeing ext4_io_page structures
  ext4: handle writeback of inodes which are being freed
  ext4: initialize the percpu counters before replaying the journal
  ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
  ext4: fix lazyinit hang after removing request
2010-11-08 11:54:53 -08:00
Theodore Ts'o 7ff9c073dd ext4: Add new ext4 inode tracepoints
Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
ext4_begin_ordered_truncate()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:51:33 -05:00
Theodore Ts'o b56ff9d397 ext4: Don't call sb_issue_discard() in ext4_free_blocks()
Commit 5c521830cf (ext4: Support discard requests when running in
no-journal mode) attempts to add sb_issue_discard() for data blocks
(in data=writeback mode) and in no-journal mode.  Unfortunately, this
no longer works, because in commit dd3932eddf (block: remove
BLKDEV_IFL_WAIT), sb_issue_discard() only presents a synchronous
interface, and there are times when we call ext4_free_blocks() when we
are are holding a spinlock, or are otherwise in an atomic context.

For now, I've removed the call to sb_issue_discard() to prevent a
deadlock or (if spinlock debugging is enabled) failures like this:

BUG: scheduling while atomic: rc.sysinit/1376/0x00000002
Pid: 1376, comm: rc.sysinit Not tainted 2.6.36-ARCH #1
Call Trace:
[<ffffffff810397ce>] __schedule_bug+0x5e/0x70
[<ffffffff81403110>] schedule+0x950/0xa70
[<ffffffff81060bad>] ? insert_work+0x7d/0x90
[<ffffffff81060fbd>] ? queue_work_on+0x1d/0x30
[<ffffffff81061127>] ? queue_work+0x37/0x60
[<ffffffff8140377d>] schedule_timeout+0x21d/0x360
[<ffffffff812031c3>] ? generic_make_request+0x2c3/0x540
[<ffffffff81402680>] wait_for_common+0xc0/0x150
[<ffffffff81041490>] ? default_wake_function+0x0/0x10
[<ffffffff812034bc>] ? submit_bio+0x7c/0x100
[<ffffffff810680a0>] ? wake_bit_function+0x0/0x40
[<ffffffff814027b8>] wait_for_completion+0x18/0x20
[<ffffffff8120a969>] blkdev_issue_discard+0x1b9/0x210
[<ffffffff811ba03e>] ext4_free_blocks+0x68e/0xb60
[<ffffffff811b1650>] ? __ext4_handle_dirty_metadata+0x110/0x120
[<ffffffff811b098c>] ext4_ext_truncate+0x8cc/0xa70
[<ffffffff810d713e>] ? pagevec_lookup+0x1e/0x30
[<ffffffff81191618>] ext4_truncate+0x178/0x5d0
[<ffffffff810eacbb>] ? unmap_mapping_range+0xab/0x280
[<ffffffff810d8976>] vmtruncate+0x56/0x70
[<ffffffff811925cb>] ext4_setattr+0x14b/0x460
[<ffffffff811319e4>] notify_change+0x194/0x380
[<ffffffff81117f80>] do_truncate+0x60/0x90
[<ffffffff811e08fa>] ? security_inode_permission+0x1a/0x20
[<ffffffff811eaec1>] ? tomoyo_path_truncate+0x11/0x20
[<ffffffff81127539>] do_last+0x5d9/0x770
[<ffffffff811278bd>] do_filp_open+0x1ed/0x680
[<ffffffff8140644f>] ? page_fault+0x1f/0x30
[<ffffffff81132bfc>] ? alloc_fd+0xec/0x140
[<ffffffff81118db1>] do_sys_open+0x61/0x120
[<ffffffff81118e8b>] sys_open+0x1b/0x20
[<ffffffff81002e6b>] system_call_fastpath+0x16/0x1b

https://bugzilla.kernel.org/show_bug.cgi?id=22302

Reported-by: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: jiayingz@google.com
2010-11-08 13:49:33 -05:00
Dmitry Monakhov 87009d86dc ext4: do not try to grab the s_umount semaphore in ext4_quota_off
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2010-11-08 13:47:33 -05:00
Theodore Ts'o 83668e7141 ext4: fix potential race when freeing ext4_io_page structures
Use an atomic_t and make sure we don't free the structure while we
might still be submitting I/O for that page.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:45:33 -05:00
Theodore Ts'o f7ad6d2e92 ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename).  In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed.  If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.

To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed.  That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).

Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.

  kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
  Call Trace:
   [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
   [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
   [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
   [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
   [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
   [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
   [<ffffffff81087910>] do_writepages+0x1c/0x25
   [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
   [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
   [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
   [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
   [<ffffffff810c14a3>] evict+0x22/0x92
   [<ffffffff810c1a3d>] iput+0x212/0x249
   [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
   [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
   [<ffffffff810be613>] dput+0x13a/0x147
   [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
   [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
   [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
   [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
   [<ffffffff810b99c6>] sys_rename+0x16/0x18
   [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
2010-11-08 13:43:33 -05:00
Theodore Ts'o ce7e010aef ext4: initialize the percpu counters before replaying the journal
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-03 12:03:21 -04:00
Theodore Ts'o b2c78cd09b ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
Newer GCC's reported the following build warning:

   fs/ext4/super.c: In function 'ext4_lazyinit_thread':
   fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function

Fix it by removing the need for the ret variable in the first place.

Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:19:30 -04:00
Lukas Czerner f4245bd4eb ext4: fix lazyinit hang after removing request
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:07:17 -04:00
Theodore Ts'o eb8abb927a ext4: Remove useless spinlock in ext4_getattr()
Linus noted, and complained to me, that doing while lots of "git diff"'s
of kernel sources, these spinlocks were responsible for 27% of the
spinlock cost on his two-processor system as reported by perf.

Git was doing lots of parallel stats, and this was putting a lot of
pressure on ext4_getattr().  A spinlock to protect a single
memory-to-memory copy is pointless, so remove it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-02 10:38:30 -04:00
Uwe Kleine-König b595076a18 tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:38:34 -04:00
Al Viro 152a083666 new helper: mount_bdev()
... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:13 -04:00
Linus Torvalds 671f837a04 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: BUG_ON fix: check if page has buffers before calling page_buffers()
2010-10-28 15:46:57 -07:00
Theodore Ts'o b1142e8fec ext4: BUG_ON fix: check if page has buffers before calling page_buffers()
We need to make check if a page does not have buffes by checking
page_has_buffers(page) before calling page_buffers(page) in
ext4_writepage().  Otherwise page_buffers() could throw a BUG_ON.

Thanks also to Markus Trippelsdorf and Avinash Kurup who also reported
the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Sedat Dilek <sedat.dilek@googlemail.com>
Tested-by: Sedat Dilek <sedat.dilek@googlemail.com>
2010-10-28 17:33:57 -04:00
Ingo Molnar 19ef20143f ext4: fix compile with CONFIG_EXT4_FS_XATTR disabled
Commit 5dabfc78dc ("ext4: rename {exit,init}_ext4_*() to
ext4_{exit,init}_*()") causes

  fs/ext4/super.c:4776: error: implicit declaration of function ‘ext4_init_xattr’

when CONFIG_EXT4_FS_XATTR is disabled.

It renamed init_ext4_xattr to ext4_init_xattr but forgot to update the
dummy definition in fs/ext4/xattr.h.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-28 09:29:17 -07:00
Theodore Ts'o a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Dmitry Monakhov 3d287de3b8 ext4: optimize orphan_list handling for ext4_setattr
Surprisingly chown() on ext4 is not SMP scalable operation. 
Due to unconditional orphan_del(NULL, inode) in ext4_setattr()
result in significant performance overhead because of global orphan
mutex, especially in no-journal mode (where orphan_add() is noop).
It is possible to skip explicit orphan_del if possible.
Results of fchown() micro-benchmark in no-journal mode
while (1) {
   iteration++;
   fchown(fd, uid, gid);
   fchown(fd, uid + 1, gid + 1)
}
measured: iterations per millisecond
| nr_tasks | w/o patch | with patch |
|        1 |       142 |        185 |
|        4 |       109 |        642 |

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:46 -04:00
Nicolas Kaiser beed5ecbaa ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:42 -04:00
Kazuya Mio a6371b636f ext4: fix compile error in ext4_fallocate()
When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got
the following compile error.

  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function 'ext4_fallocate':
fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function)
fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once
fs/ext4/extents.c:3772: error: for each function it appears in.)
make[2]: *** [fs/ext4/extents.o] Error 1

The patch fixes this problem.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Eric Sandeen eee4adc709 ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static
These functions are only used within fs/ext4/mballoc.c, so move them
so they are used after they are defined, and then make them be static.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o 61d08673de ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()
Fix a namespace leak from fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:15 -04:00
Theodore Ts'o 4a873a472b ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static
Fix a namespace leak by moving the function to the file where it is
used and making it static.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o bf89d16f6e ext4: rename {ext,idx}_pblock and inline small extent functions
Cleanup namespace leaks from fs/ext4 and the inline trivial functions
ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the
code size actually shrinks when we make these functions inline,
they're so trivial.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o 1f109d5a17 ext4: make various ext4 functions be static
These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o 5dabfc78dc ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o 7f93cff90f ext4: fix kernel oops if the journal superblock has a non-zero j_errno
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen 72f84e6560 ext4: update writeback_index based on last page scanned
As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.

As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout.  If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.

Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen 5b41d92437 ext4: implement writeback livelock avoidance using page tagging
This is analogous to Jan Kara's commit,
f446daaea9
mm: implement writeback livelock avoidance using page tagging

but since we forked write_cache_pages, we need to reimplement
it there (and in ext4_da_writepages, since range_cyclic handling
was moved to there)

If you start a large buffered IO to a file, and then set
fsync after it, you'll find that fsync does not complete
until the other IO stops.

If you continue re-dirtying the file (say, putting dd
with conv=notrunc in a loop), when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.

This then leads to problems with our writeback_index
update, since it advances it by pages written, and
essentially sets writeback_index off the end of the
file...

With the following patch, we only sync as much as was
dirty at the time of the sync.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Eric Sandeen bbd08344e3 ext4: tidy up a void argument in inode.c
This doesn't fix anything at all, it just removes a vestige
of prior use from __mpage_da_writepage()

__mpage_da_writepage() had a *void argument leftover from
its previous life as a callback; make it reflect the actual type.

Fixing this up makes it slightly more obvious to read, and 
enables proper typechecking.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner 27ee40df2b ext4: add batched_discard into ext4 feature list
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner 7360d1731e ext4: Add batched discard support for ext4
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner 77ca6cdf0a ext4: Use return value from sb_issue_discard()
Use return value from sb_issue_discard() as return value in
ext4_issue_discard(). Since sb_issue_discard() may result in more
serious errors than just -EOPNOTSUPP it is worth to inform user of this
function about them to handle error cases properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Namhyung Kim 877836905d ext4: Check return value of sb_getblk() and friends
Fail block allocation if sb_getblk() returns NULL. In that case,
sb_find_get_block() also likely to fail so that it should skip
calling ext4_forget().

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:11 -04:00
Theodore Ts'o bd2d0210cf ext4: use bio layer instead of buffer layer in mpage_da_submit_io
Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o 1de3e3df91 ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()
This massively simplifies the ext4_da_writepages() code path by
completely removing mpage_put_bnr_bhs(), which is almost 100 lines of
code iterating over a set of pages using pagevec_lookup(), and folds
that functionality into mpage_da_submit_io()'s existing
pagevec_lookup() loop.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o 3ecdb3a193 ext4: inline walk_page_buffers() into mpage_da_submit_io
Expand the call:

  if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
                        ext4_bh_delay_or_unwritten))
	goto redirty_page

into mpage_da_submit_io().

This will allow us to merge in mpage_put_bnr_to_bhs() in the next
patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Theodore Ts'o cb20d51883 ext4: inline ext4_writepage() into mpage_da_submit_io()
As a prepratory step to switching to bio_submit, inline
ext4_writepage() into mpage_da_submit() and then simplify things a
bit.  This makes it clearer what mpage_da_submit needs to do.

Also, move the ClearPageChecked(page) call into
__ext4_journalled_writepage(), as a minor bit of cleanup refactoring.

This also allows us to pull i_size_read() and
ext4_should_journal_data() out of the loop, which should be a very
minor CPU savings.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o a42afc5f56 ext4: simplify ext4_writepage()
The actual code in ext4_writepage() is unnecessarily convoluted.
Simplify it so it is easier to understand, but otherwise logically
equivalent.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o 5a87b7a5da ext4: call mpage_da_submit_io() from mpage_da_map_blocks()
Eventually we need to completely reorganize the ext4 writepage
callpath, but for now, we simplify things a little by calling
mpage_da_submit_io() from mpage_da_map_blocks(), since all of the
places where we call mpage_da_map_blocks() it is followed up by a call
to mpage_da_submit_io().

We're also a wee bit better with respect to error handling, but there
are still a number of issues where it's not clear what the right thing
is to do with ext4 functions deep in the writeback codepath fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o 16828088f9 ext4: use KMEM_CACHE instead of kmem_cache_create
Also remove the SLAB_RECLAIM_ACCOUNT flag from the system zone kmem
cache.  This slab tends to be fairly static, so it shouldn't be marked
as likely to have free pages that can be reclaimed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:09 -04:00
Theodore Ts'o 7845c04975 ext4: use search_dirblock() in ext4_dx_find_entry()
Use the search_dirblock() in ext4_dx_find_entry().  It makes the code
easier to read, and it takes advantage of common code.  It also saves
100 bytes or so of text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
2010-10-27 21:30:08 -04:00
Theodore Ts'o 8941ec8bb6 ext4: avoid uninitialized memory references in ext3_htree_next_block()
If the first block of htree directory is missing '.' or '..' but is
otherwise a valid directory, and we do a lookup for '.' or '..', it's
possible to dereference an uninitialized memory pointer in
ext4_htree_next_block().

We avoid this by moving the special case from ext4_dx_find_entry() to
ext4_find_entry(); this also means we can optimize ext4_find_entry()
slightly when NFS looks up "..".

Thanks to Brad Spengler for pointing a Clang warning that led me to
look more closely at this code.  The warning was harmless, but it was
useful in pointing out code that was too ugly to live.  This warning was
also reported by Roman Borisov.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Spengler <spender@grsecurity.net>
2010-10-27 21:30:08 -04:00
Eric Sandeen 640e939656 ext4: remove unused ext4_sb_info members
Not that these take up a lot of room, but the structure is long enough
as it is, and there's no need to confuse people with these various
undocumented & unused structure members...

Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:08 -04:00
Eric Sandeen c999af2b34 ext4: queue conversion after adding to inode's completed IO list
By queuing the io end on the unwritten workqueue before adding it
to our inode's list of completed IOs, I think we run the risk
of the work getting completed, and the IO freed, before we try
to add it to the inode's i_completed_io_list.

It should be safe to add it to the inode's list of completed
IOs, and -then- queue it for completion, I think.

Thanks to Dave Chinner for pointing out the race.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Eric Sandeen 3e1e5f5016 ext4: don't use ext4_allocation_contexts for tracing
Many tracepoints were populating an ext4_allocation_context
to pass in, but this requires a slab allocation even when
tracepoints are off.  In fact, 4 of 5 of these allocations
were only for tracing.  In addition, we were only using a
small fraction of the 144 bytes of this structure for this
purpose.

We can do away with all these alloc/frees of the ac and
simply pass in the bits we care about, instead.

I tested this by turning on tracing and running through
xfstests on x86_64.  I did not actually do anything with
the trace output, however.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Toshiyuki Okajima 0c9169ccad ext4: fix potential infinite loop in ext4_da_writepages()
On linux-2.6.36-rc2, if we execute the following script, we can hang
the system when the /bin/sync command is executed:

========================================================================
#!/bin/sh

echo -n "HANG UP TEST: "
/bin/dd if=/dev/zero of=/tmp/img bs=1k count=1 seek=1M 2> /dev/null
/sbin/mkfs.ext4 -Fq /tmp/img
/bin/mount -o loop -t ext4 /tmp/img /mnt
/bin/dd if=/dev/zero of=/mnt/file bs=1 count=1 \
seek=$((16*1024*1024*1024*1024-4096)) 2> /dev/null
/bin/sync
/bin/umount /mnt
echo "DONE"
exit 0
========================================================================

We can see the following backtrace if we get the kdump when this
hangup occurs:

======================================================================
kthread()
=> bdi_writeback_thread()
   => wb_do_writeback()
      => wb_writeback()
         => writeback_inodes_wb()
            => writeback_sb_inodes()
               => writeback_single_inode()
                  => ext4_da_writepages()  ---+ 
                                ^ infinite    |
                                |   loop      |
                                +-------------+
======================================================================

The reason why this hangup happens is described as follows:
1) We write the last extent block of the file whose size is the filesystem 
   maximum size.
2) "BH_Delay" flag is set on the buffer_head of its block.
3) - the member, "m_lblk" of struct mpage_da_data is 4294967295 (UINT_MAX)
   - the member, "m_len" of struct mpage_da_data is 1
  mpage_put_bnr_to_bhs() which is called via ext4_da_writepages()
  cannot clear "BH_Delay" flag of the buffer_head because the type of
  m_lblk is ext4_lblk_t and then m_lblk + m_len is overflow.

  Therefore an infinite loop occurs because ext4_da_writepages()
  cannot write the page (which corresponds to the block) since
  "BH_Delay" flag isn't cleared.
----------------------------------------------------------------------
static void mpage_put_bnr_to_bhs(struct mpage_da_data *mpd,
				struct ext4_map_blocks *map)
{
...
	int blocks = map->m_len;
...
		do {
			// cur_logical = 4294967295
			// map->m_lblk = 4294967295
			// blocks = 1
			// *** map->m_lblk + blocks == 0 (OVERFLOW!) ***
			// (cur_logical >= map->m_lblk + blocks) => true
			if (cur_logical >= map->m_lblk + blocks)
				break;
----------------------------------------------------------------------

NOTE: Mounting with the nodelalloc option will avoid this codepath,
and thus, avoid this hang

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:07 -04:00
Toshiyuki Okajima e0d10bfa91 ext4: improve llseek error handling for overly large seek offsets
The llseek system call should return EINVAL if passed a seek offset
which results in a write error.  What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.


If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be 
written (write systemcall) is different from the maximum size which can be 
sought (lseek systemcall).

For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:

#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>

Table. the max file size which we can write or seek
       at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag|                               |                               |
|      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTETNS_FL        |
|case       \|                               |                               |
+------------+-------------------------------+-------------------------------+
| #1         |   write:      2194719883264   | write:       --------------   |
|            |   seek:       2199023251456   | seek:        --------------   |
+------------+-------------------------------+-------------------------------+
| #2         |   write:      4402345721856   | write:       17592186044415   |
|            |   seek:      17592186044415   | seek:        17592186044415   |
+------------+-------------------------------+-------------------------------+

The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped 
maxbytes).  Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)

Therefore we create ext4 llseek function which uses 2 maxbytes.

The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters 
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
2010-10-27 21:30:06 -04:00
Maciej Żenczykowski c41303ced6 ext4: don't update sb journal_devnum when RO dev
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner 2407518de6 ext4: use sb_issue_zeroout in ext4_ext_zeroout
Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
own approach to zero out extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner a31437b85a ext4: use sb_issue_zeroout in setup_new_group_blocks
Use sb_issue_zeroout to zero out inode table and descriptor table
blocks instead of old approach which involves journaling.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner 857ac889cc ext4: add interface to advertise ext4 features in sysfs
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner bfff68738f ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Sergey Senozhatsky a1c6c5698d ext4: fix NULL pointer dereference in print_daily_error_info()
Fix NULL pointer dereference in print_daily_error_info, when   
called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error 
reporting timer in ext4_put_super.

Google-Bug-Id: 3017663

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Lukas Czerner 53fdcf992d ext4: don't hold spinlock while calling ext4_issue_discard()
We can't hold the block group spinlock because we ext4_issue_discard()
calls wait and hence can get rescheduled.

Google-Bug-Id: 3017678

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Lukas Czerner 5829870982 ext4: check for negative error code from sb_issue_discard
sb_issue_discard() is returning negative error code, so check for
-EOPNOTSUPP.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Eric Sandeen b443e7339a ext4: don't bump up LONG_MAX nr_to_write by a factor of 8
I'm uneasy with lots of stuff going on in ext4_da_writepages(),
but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
making anything better, so avoid the multiplier in that case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Eric Sandeen 659c6009ca ext4: stop looping in ext4_num_dirty_pages when max_pages reached
Today we simply break out of the inner loop when we have accumulated
max_pages; this keeps scanning forwad and doing pagevec_lookup_tag()
in the while (!done) loop, this does potentially a lot of work
with no net effect.

When we have accumulated max_pages, just clean up and return.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:03 -04:00
Curt Wohlgemuth fb1813f4a8 ext4: use dedicated slab caches for group_info structures
ext4_group_info structures are currently allocated with kmalloc().
With a typical 4K block size, these are 136 bytes each -- meaning
they'll each consume a 256-byte slab object.  On a system with many
ext4 large partitions, that's a lot of wasted kernel slab space.
(E.g., a single 1TB partition will have about 8000 block groups, using
about 2MB of slab, of which nearly 1MB is wasted.)

This patch creates an array of slab pointers created as needed --
depending on the superblock block size -- and uses these slabs to
allocate the group info objects.

Google-Bug-Id: 2980809

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:29:12 -04:00
Theodore Ts'o 58590b06d7 ext4: fix EOFBLOCKS_FL handling
It turns out we have several problems with how EOFBLOCKS_FL is
handled.  First of all, there was a fencepost error where we were not
clearing the EOFBLOCKS_FL when fill in the last uninitialized block,
but rather when we allocate the next block _after_ the uninitalized
block.  Secondly we were not testing to see if we needed to clear the
EOFBLOCKS_FL when writing to the file O_DIRECT or when were converting
an uninitialized block (which is the most common case).

Google-Bug-Id: 2928259

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:23:12 -04:00
Christoph Hellwig 85fe4025c6 fs: do not assign default i_ino in new_inode
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Al Viro 7de9c6ee3e new helper: ihold()
Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Christoph Hellwig ebdec241d5 fs: kill block_prepare_write
__block_write_begin and block_prepare_write are identical except for slightly
different calling conventions.  Convert all callers to the __block_write_begin
calling conventions and drop block_prepare_write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:18:20 -04:00
Linus Torvalds a2887097f2 Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
  xen-blkfront: disable barrier/flush write support
  Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
  block: remove BLKDEV_IFL_WAIT
  aic7xxx_old: removed unused 'req' variable
  block: remove the BH_Eopnotsupp flag
  block: remove the BLKDEV_IFL_BARRIER flag
  block: remove the WRITE_BARRIER flag
  swap: do not send discards as barriers
  fat: do not send discards as barriers
  ext4: do not send discards as barriers
  jbd2: replace barriers with explicit flush / FUA usage
  jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
  jbd: replace barriers with explicit flush / FUA usage
  nilfs2: replace barriers with explicit flush / FUA usage
  reiserfs: replace barriers with explicit flush / FUA usage
  gfs2: replace barriers with explicit flush / FUA usage
  btrfs: replace barriers with explicit flush / FUA usage
  xfs: replace barriers with explicit flush / FUA usage
  block: pass gfp_mask and flags to sb_issue_discard
  dm: convey that all flushes are processed as empty
  ...
2010-10-22 17:07:18 -07:00
Linus Torvalds 79f14b7c56 Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
  BKL: remove BKL from freevxfs
  BKL: remove BKL from qnx4
  autofs4: Only declare function when CONFIG_COMPAT is defined
  autofs: Only declare function when CONFIG_COMPAT is defined
  ncpfs: Lock socket in ncpfs while setting its callbacks
  fs/locks.c: prepare for BKL removal
  BKL: Remove BKL from ncpfs
  BKL: Remove BKL from OCFS2
  BKL: Remove BKL from squashfs
  BKL: Remove BKL from jffs2
  BKL: Remove BKL from ecryptfs
  BKL: Remove BKL from afs
  BKL: Remove BKL from USB gadgetfs
  BKL: Remove BKL from autofs4
  BKL: Remove BKL from isofs
  BKL: Remove BKL from fat
  BKL: Remove BKL from ext2 filesystem
  BKL: Remove BKL from do_new_mount()
  BKL: Remove BKL from cgroup
  BKL: Remove BKL from NTFS
  ...
2010-10-22 10:52:01 -07:00
Jan Blunck f2143c4e2e BKL: Remove BKL from ext4 filesystem
The BKL is still used in ext4_put_super(), ext4_fill_super() and
ext4_remount(). All three calles are protected against concurrent calls by
the s_umount rw semaphore of struct super_block.

Therefore the BKL is protecting nothing in this case.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:38 +02:00
Jan Blunck db71922217 BKL: Explicitly add BKL around get_sb/fill_super
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
       don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-04 21:10:10 +02:00
Christoph Hellwig dd3932eddf block: remove BLKDEV_IFL_WAIT
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16 20:52:58 +02:00
Patrick J. LoPresti 30ca22c70e ext3/ext4: Factor out disk addressability check
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.

An identical check already appears in ext3 and ext4.  This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.

[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 08:41:42 -07:00
Christoph Hellwig 61002f7db3 ext4: do not send discards as barriers
ext4 already uses synchronous discards, no need to add I/O barriers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Christoph Hellwig 2cf6d26a35 block: pass gfp_mask and flags to sb_issue_discard
We'll need to get rid of the BLKDEV_IFL_BARRIER flag, and to facilitate
that and to make the interface less confusing pass all flags explicitly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:38 +02:00
Linus Torvalds 5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Andreas Gruenbacher 2aec7c5232 mbcache: Remove unused features
The mbcache code was written to support a variable number of indexes,
but all the existing users use exactly one index.  Simplify to code to
support only that case.

There are also no users of the cache entry free operation, and none of
the users keep extra data in cache entries.  Remove those features as
well.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:45 -04:00
Al Viro 0930fcc1ee convert ext4 to ->evict_inode()
pretty much brute-force...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:30 -04:00
Christoph Hellwig 1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00
Christoph Hellwig 6e1db88d53 introduce __block_write_begin
Split up the block_write_begin implementation - __block_write_begin is a new
trivial wrapper for block_prepare_write that always takes an already
allocated page and can be either called from block_write_begin or filesystem
code that already has a page allocated.  Remove the handling of already
allocated pages from block_write_begin after switching all callers that
do it to __block_write_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:32 -04:00
Christoph Hellwig eafdc7d190 sort out blockdev_direct_IO variants
Move the call to vmtruncate to get rid of accessive blocks to the callers
in prepearation of the new truncate calling sequence.  This was only done
for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
its _newtrunc variant while at it as just opencoding the two additional
paramters is shorted than the name suffix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:29 -04:00
Linus Torvalds 09dc942c2a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Adding error check after calling ext4_mb_regular_allocator()
  ext4: Fix dirtying of journalled buffers in data=journal mode
  ext4: re-inline ext4_rec_len_(to|from)_disk functions
  jbd2: Remove t_handle_lock from start_this_handle()
  jbd2: Change j_state_lock to be a rwlock_t
  jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
  ext4: Add mount options in superblock
  ext4: force block allocation on quota_off
  ext4: fix freeze deadlock under IO
  ext4: drop inode from orphan list if ext4_delete_inode() fails
  ext4: check to make make sure bd_dev is set before dereferencing it
  jbd2: Make barrier messages less scary
  ext4: don't print scary messages for allocation failures post-abort
  ext4: fix EFBIG edge case when writing to large non-extent file
  ext4: fix ext4_get_blocks references
  ext4: Always journal quota file modifications
  ext4: Fix potential memory leak in ext4_fill_super
  ext4: Don't error out the fs if the user tries to make a file too big
  ext4: allocate stripe-multiple IOs on stripe boundaries
  ext4: move aio completion after unwritten extent conversion
  ...

Fix up conflicts in fs/ext4/inode.c as per Ted.

Fix up xfs conflicts as per earlier xfs merge.
2010-08-07 13:03:53 -07:00
Aditya Kali 6c7a120ac6 ext4: Adding error check after calling ext4_mb_regular_allocator()
If the bitmap block on disk is bad, ext4_mb_load_buddy() returns an
error. This error is returned to the caller,
ext4_mb_regular_allocator() and then to ext4_mb_new_blocks().  But
ext4_mb_new_blocks() did not check for the return value of
ext4_mb_regular_allocator() and would repeatedly try to load the
bitmap block. The fix simply catches the return value and exits out of
the 'repeat' loop after cleanup.

We also take the opportunity to clean up the error handling in
ext4_mb_new_blocks().

Google-Bug-Id: 2853530

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-05 16:22:24 -04:00
Jan Kara 56d35a4cd1 ext4: Fix dirtying of journalled buffers in data=journal mode
In data=journal mode, we still use block_write_begin() to prepare
page for writing. This function can occasionally mark buffer dirty
which violates journalling assumptions - when a buffer is part of
a transaction, it should be dirty and a buffer can be already part
of a forget list of some transaction when block_write_begin()
gets called. This violation of journalling assumptions then results
in "JBD: Spotted dirty metadata buffer..." warnings.

In fact, temporary dirtying the buffer while the page is still locked
does not really cause problems to the journalling because we won't write
the buffer until the page gets unlocked. So we just have to make sure
to clear dirty bits before unlocking the page.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-08-05 14:41:42 -04:00
Eric Sandeen 0cfc9255a1 ext4: re-inline ext4_rec_len_(to|from)_disk functions
commit 3d0518f4, "ext4: New rec_len encoding for very
large blocksizes" made several changes to this path, but from
a perf perspective, un-inlining ext4_rec_len_from_disk() seems
most significant.  This function is called from ext4_check_dir_entry(),
which on a file-creation workload is called extremely often.

I tested this with bonnie:

# bonnie++ -u root -s 0 -f -x 200 -d /mnt/test -n 32

(this does 200 iterations) and got this for the file creations:

ext4 stock:   Average =  21206.8 files/s
ext4 inlined: Average =  22346.7 files/s  (+5%)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-05 01:46:37 -04:00
Jiri Kosina d790d4d583 Merge branch 'master' into for-next 2010-08-04 15:14:38 +02:00
Uwe Kleine-König 73b2c7165b fix comment typo "choosed" -> "chosen"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-08-04 15:08:39 +02:00
Theodore Ts'o a931da6ac9 jbd2: Change j_state_lock to be a rwlock_t
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores.  So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:35:12 -04:00
Theodore Ts'o 8b67f04ab9 ext4: Add mount options in superblock
Allow mount options to be stored in the superblock.  Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 23:14:20 -04:00
Dmitry Monakhov ca0e05e4b1 ext4: force block allocation on quota_off
Perform full sync procedure so that any delayed allocation blocks are
allocated so quota will be consistent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:48:36 -04:00
Eric Sandeen 437f88cc03 ext4: fix freeze deadlock under IO
Commit 6b0310fbf0 caused a regression resulting in deadlocks
when freezing a filesystem which had active IO; the vfs_check_frozen
level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
through.  Duh.

Changing the test to FREEZE_TRANS should let the normal freeze
syncing get through the fs, but still block any transactions from
starting once the fs is completely frozen.

I tested this by running fsstress in the background while periodically
snapshotting the fs and running fsck on the result.  I ran into
occasional deadlocks, but different ones.  I think this is a
fine fix for the problem at hand, and the other deadlocky things
will need more investigation.

Reported-by: Phillip Susi <psusi@cfl.rr.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:33:29 -04:00
Theodore Ts'o 4538821993 ext4: drop inode from orphan list if ext4_delete_inode() fails
There were some error paths in ext4_delete_inode() which was not
dropping the inode from the orphan list.  This could lead to a BUG_ON
on umount when the orphan list is discovered to be non-empty.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-29 15:06:10 -04:00
Theodore Ts'o f613dfcb33 ext4: check to make make sure bd_dev is set before dereferencing it
There are some drivers which may not set bdev->bd_dev.  So make sure
it is non-NULL before dereferencing it.

Google-Bug-Id: 1773557

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Eric Sandeen e3570639c8 ext4: don't print scary messages for allocation failures post-abort
I often get emails containing the "This should not happen!!" message,
conveniently trimmed to remove things like:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 13 c9 70 00 00 28 00
end_request: I/O error, dev sda, sector 51628400
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only

I don't think there is any value to the verbosity if the reason is
due to a filesystem abort; it just obfuscates the root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Toshiyuki Okajima d889dc8382 ext4: fix EFBIG edge case when writing to large non-extent file
By running the following reproducer, we can confirm that the write 
system call returns with 0 when it should return the error EFBIG.

#!/bin/sh

/bin/dd if=/dev/zero of=./img bs=1k count=1 seek=1024k > /dev/null 2>&1
/sbin/mkfs.ext3 -Fq ./img
/bin/mount -o loop -t ext4 ./img /mnt
/bin/touch /mnt/file
strace /bin/dd if=/dev/zero of=/mnt/file conv=notrunc bs=1k count=1 seek=$((2194719883264/1024)) 2>&1 | /bin/egrep "write.* 1024\) = "
/bin/umount /mnt
exit

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
2010-07-27 11:56:07 -04:00
Eric Sandeen 79e8303677 ext4: fix ext4_get_blocks references
ext4_get_blocks got renamed to ext4_map_blocks, but left stale
comments and a prototype littered around.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Jan Kara 62d2b5f2dc ext4: Always journal quota file modifications
When journaled quota options are not specified, we do writes
to quota files just in data=ordered mode. This actually causes
warnings from JBD2 about dirty journaled buffer because ext4_getblk
unconditionally treats a block allocated by it as metadata. Since
quota actually is filesystem metadata, the easiest way to get rid
of the warning is to always treat quota writes as metadata...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Cyrill Gorcunov dcc7dae3cb ext4: Fix potential memory leak in ext4_fill_super
Under heavy memory pressure we may hit out of memory
situation and as result kstrdup'ed options will not be
freed. Fix it.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Theodore Ts'o 0c095c7f11 ext4: Don't error out the fs if the user tries to make a file too big
If the user attempts to make a non-extent-mapped file to be too large,
return EFBIG, but don't call ext4_std_err() which will end up marking
the file system as containing an error.

Thanks to Toshiyuki Okajima-san at Fujitsu for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Eric Sandeen 506bf2d821 ext4: allocate stripe-multiple IOs on stripe boundaries
For some reason, today mballoc only allocates IOs which are exactly
stripe-sized on a stripe boundary.  If you have a multiple (say, a
128k IO on a 64k stripe) you may end up unaligned.

It seems to me that a simple change to align stripe-multiple IOs
on stripe boundaries would be a very good idea, unless this breaks
some other mballoc heuristic for some reason...

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
jiayingz@google.com (Jiaying Zhang) 5b3ff237be ext4: move aio completion after unwritten extent conversion
This patch is to be applied upon Christoph's "direct-io: move aio_complete
into ->end_io" patch. It adds iocb and result fields to struct ext4_io_end_t,
so that we can call aio_complete from  ext4_end_io_nolock() after the extent
conversion has finished.

I have verified with Christoph's aio-dio test that used to fail after a few
runs on an original kernel but now succeeds on the patched kernel.

See http://thread.gmane.org/gmane.comp.file-systems.ext4/19659 for details.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Christoph Hellwig 552ef8024f direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated. 

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Jiaying Zhang 5c521830cf ext4: Support discard requests when running in no-journal mode
Issue discard request in ext4_free_blocks() when ext4 has no journal and
is mounted with discard option.
    
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Amir G 4038968738 ext4: Fix block bitmap inconsistencies after a crash when deleting files
We have experienced bitmap inconsistencies after crash during file
delete under heavy load.  The crash is not file system related and I
the following patch in ext4_free_branches() fixes the recovery
problem.

If the transaction is restarted and there is a crash before the new
transaction is committed, then after recovery, the blocks that this
indirect block points to have been freed, but the indirect block
itself has not been freed and may still point to some of the free
blocks (because of the ext4_forget()).

So ext4_forget() should be called inside ext4_free_blocks() to avoid
this problem.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Joe Perches a271fe8527 ext4: Remove unnecessary casts of private_data
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o e5880d76ae ext4: fix potential NULL dereference while tracing
The allocation_context pointer can be NULL.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o 89eeddf033 ext4: Define s_jnl_backup_type in superblock
This has been in use by e2fsprogs for a while; define it to keep the
super block fields in sync.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o 66e61a9e95 ext4: Once a day, printk file system error information to dmesg
This allows us to grab any file system error messages by scraping
/var/log/messages.  This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o 1c13d5c087 ext4: Save error information to the superblock for analysis
Save number of file system errors, and the time function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:03 -04:00
Theodore Ts'o c398eda0e4 ext4: Pass line numbers to ext4_error() and friends
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:40 -04:00
Theodore Ts'o 60fd4da34d ext4: Cleanup ext4_check_dir_entry so __func__ is now implicit
Also start passing the line number to ext4_check_dir since we're going
to need it in upcoming patch.
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:54:40 -04:00
Christoph Hellwig 40e2e97316 direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:09:02 -05:00
Theodore Ts'o 90c7201b97 ext4: Pass line number to ext4_journal_abort_handle()
This allows the error messages to include the line number

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 14:53:24 -04:00
Theodore Ts'o e29136f80e ext4: Enhance ext4_grp_locked_error() to take block and function numbers
Also use a macro definition so that __func__ and __LINE__ is implicit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 12:54:28 -04:00
Theodore Ts'o c67d859e39 ext4: clean up ext4_abort() so __func__ is now implicit
Use a macro definition for ext4_abort() to clean up the .c files a wee
bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:07:07 -04:00
Theodore Ts'o 4a9cdec73f ext4: Add new superblock fields reserved for the Next3 snapshot feature
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:00:23 -04:00
Jiri Kosina f1bbbb6912 Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
Uwe Kleine-König 421f91d21a fix typos concerning "initiali[zs]e"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-06-16 18:05:05 +02:00
Jan Kara c6ac12a615 ext4: update ctime when changing the file's permission by setfacl
ext4 didn't update the ctime of the file when its permission was
changed.

Steps to reproduce:
 # touch aaa
 # stat -c %Z aaa
 1275289822
 # setfacl -m  'u::x,g::x,o::x' aaa
 # stat -c %Z aaa
 1275289822                         <- unchanged

But, according to the spec of the ctime, ext4 must update it.

Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-15 12:19:59 -04:00
Christoph Hellwig 206f7ab4f4 ext4: remove vestiges of nobh support
The nobh option was only supported for writeback mode, but given that all
write paths actually create buffer heads it effectively was a no-op already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 14:42:49 -04:00
Andi Kleen 5a0790c2c4 ext4: remove initialized but not read variables
No real bugs found, just removed some dead code.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 13:28:03 -04:00
Theodore Ts'o 07a038245b ext4: Convert more i_flags references to use accessor functions
These changes are not ones which are likely to result in races, but
they should be fixed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 09:54:48 -04:00
Theodore Ts'o a0375156ca ext4: Clean up s_dirt handling
We don't need to set s_dirt in most of the ext4 code when journaling
is enabled.  In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted.  As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt.  There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.

This patch will optimize out some unneeded disk writes when using ext4
with a journal.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-11 23:14:04 -04:00
Dmitry Monakhov 84a8dce271 ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags
A few functions were still modifying i_flags in a racy manner.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-05 11:51:27 -04:00
Theodore Ts'o 1f5a81e41f ext4: Make sure the MOVE_EXT ioctl can't overwrite append-only files
Dan Roseberg has reported a problem with the MOVE_EXT ioctl.  If the
donor file is an append-only file, we should not allow the operation
to proceed, lest we end up overwriting the contents of an append-only
file.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>
2010-06-02 22:04:39 -04:00
Linus Torvalds d28619f156 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Convert quota statistics to generic percpu_counter
  ext3 uses rb_node = NULL; to zero rb_root.
  quota: Fixup dquot_transfer
  reiserfs: Fix resuming of quotas on remount read-write
  pohmelfs: Remove dead quota code
  ufs: Remove dead quota code
  udf: Remove dead quota code
  quota: rename default quotactl methods to dquot_
  quota: explicitly set ->dq_op and ->s_qcop
  quota: drop remount argument to ->quota_on and ->quota_off
  quota: move unmount handling into the filesystem
  quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
  quota: move remount handling into the filesystem
  ocfs2: Fix use after free on remount read-only

Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
2010-05-30 09:11:11 -07:00
Christoph Hellwig 1b061d9247 rename the generic fsync implementations
We don't name our generic fsync implementations very well currently.
The no-op implementation for in-memory filesystems currently is called
simple_sync_file which doesn't make too much sense to start with,
the the generic one for simple filesystems is called simple_fsync
which can lead to some confusion.

This patch renames the generic file fsync method to generic_file_fsync
to match the other generic_file_* routines it is supposed to be used
with, and the no-op implementation to noop_fsync to make it obvious
what to expect.  In addition add some documentation for both methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:06:06 -04:00
Christoph Hellwig 7ea8085910 drop unused dentry argument to ->fsync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:05:02 -04:00
Linus Torvalds e4ce30f377 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Make fsync sync new parent directories in no-journal mode
  ext4: Drop whitespace at end of lines
  ext4: Fix compat EXT4_IOC_ADD_GROUP
  ext4: Conditionally define compat ioctl numbers
  tracing: Convert more ext4 events to DEFINE_EVENT
  ext4: Add new tracepoints to track mballoc's buddy bitmap loads
  ext4: Add a missing trace hook
  ext4: restart ext4_ext_remove_space() after transaction restart
  ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
  ext4: Avoid crashing on NULL ptr dereference on a filesystem error
  ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
  ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
  ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
  ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
  ext4: Use our own write_cache_pages()
  ext4: Show journal_checksum option
  ext4: Fix for ext4_mb_collect_stats()
  ext4: check for a good block group before loading buddy pages
  ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
  ext4: Remove extraneous newlines in ext4_msg() calls
  ...

Fixed up trivial conflict in fs/ext4/fsync.c
2010-05-27 10:26:37 -07:00
Christoph Hellwig 287a80958c quota: rename default quotactl methods to dquot_
Follow the dquot_* style used elsewhere in dquot.c.

[Jan Kara: Fixed up missing conversion of ext2]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:17 +02:00
Christoph Hellwig 307ae18a56 quota: drop remount argument to ->quota_on and ->quota_off
Remount handling has fully moved into the filesystem, so all this is
superflous now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig e0ccfd959c quota: move unmount handling into the filesystem
Currently the VFS calls into the quotactl interface for unmounting
filesystems.  This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unount.  Instead move the responsibily of the unmount handling
into the filesystem to be consistent with all other dquot handling.

Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem.  But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig 0f0dd62fdd quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly.  Also rename vfs_quota_disable to
dquot_disable while we're at it.

[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:40 +02:00
Christoph Hellwig c79d967de3 quota: move remount handling into the filesystem
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw.  Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself.  This gets rid of overloading
the quotactl calls and allows to unify the VFS and XFS codepaths in
that area later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:39 +02:00
Linus Torvalds e8bebe2f71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 19:37:45 -07:00
Dmitry Monakhov b10b852090 ext4: replace inode uid,gid,mode init with helper
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:24 -04:00
Stephen Hemminger 11e2752807 ext4: constify xattr_handler
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:19 -04:00
Jens Axboe ee9a3607fb Merge branch 'master' into for-2.6.35
Conflicts:
	fs/ext3/fsync.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:27:26 +02:00
Dmitry Monakhov 12755627bd quota: unify quota init condition in setattr
Quota must being initialized if size or uid/git changes requested.
But initialization performed in two different places:
in case of i_size file system is responsible for dquot init
, but in case of uid/gid init will be called internally in
dquot_transfer().
This ambiguity makes code harder to understand.
Let's move this logic to one common helper function.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-21 19:30:45 +02:00
Frank Mayhar 14ece1028b ext4: Make fsync sync new parent directories in no-journal mode
Add a new ext4 state to tell us when a file has been newly created; use
that state in ext4_sync_file in no-journal mode to tell us when we need
to sync the parent directory as well as the inode and data itself.  This
fixes a problem in which a panic or power failure may lose the entire
file even when using fsync, since the parent directory entry is lost.

Addresses-Google-Bug: #2480057

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 08:00:00 -04:00
Theodore Ts'o 60e6679e28 ext4: Drop whitespace at end of lines
This patch was generated using:

#!/usr/bin/perl -i
while (<>) {
    s/[ 	]+$//;
    print;
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 07:00:00 -04:00
Ben Hutchings 4d92dc0f00 ext4: Fix compat EXT4_IOC_ADD_GROUP
struct ext4_new_group_input needs to be converted because u64 has
only 32-bit alignment on some 32-bit architectures, notably i386.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 06:00:00 -04:00
Ben Hutchings 899ad0cea6 ext4: Conditionally define compat ioctl numbers
It is unnecessary, and in general impossible, to define the compat
ioctl numbers except when building the filesystem with CONFIG_COMPAT
defined.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 05:00:00 -04:00
Theodore Ts'o f307333e14 ext4: Add new tracepoints to track mballoc's buddy bitmap loads
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 03:00:00 -04:00
Li Zefan 5a58ec8766 ext4: Add a missing trace hook
Commit f8ec9d6837 added a
trace event ext4_da_release_space, but didn't add some
corresponding trace hook.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 02:00:00 -04:00
Dmitry Monakhov 0617b83fa2 ext4: restart ext4_ext_remove_space() after transaction restart
If i_data_sem was internally dropped due to transaction restart, it is
necessary to restart path look-up because extents tree was possibly
modified by ext4_get_block().

https://bugzilla.kernel.org/show_bug.cgi?id=15827

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
2010-05-17 01:00:00 -04:00
Theodore Ts'o 786ec7915e ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted
Dimitry Monakhov discovered an edge case where it was possible for the
EXT4_EOFBLOCKS_FL flag could get cleared unnecessarily.  This is true;
I have a test case that can be exercised via downloading and
decompressing the file:

wget ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-testcases/eofblocks-fl-test-case.img.bz2
bunzip2 eofblocks-fl-test-case.img
dd if=/dev/zero of=eofblocks-fl-test-case.img bs=1k seek=17925 bs=1k count=1 conv=notrunc

However, triggering it in real life is highly unlikely since it
requires an extremely fragmented sparse file with a hole in exactly
the right place in the extent tree.  (It actually took quite a bit of
work to generate this test case.)  Still, it's nice to get even
extreme corner cases to be correct, so this patch makes sure that we
don't clear the EXT4_EOFBLOCKS_FL incorrectly even in this corner
case.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 00:00:00 -04:00
Theodore Ts'o f70f362b4a ext4: Avoid crashing on NULL ptr dereference on a filesystem error
If the EOFBLOCK_FL flag is set when it should not be and the inode is
zero length, then eh_entries is zero, and ex is NULL, so dereferencing
ex to print ex->ee_block causes a kernel OOPS in
ext4_ext_map_blocks().

On top of that, the error message which is printed isn't very helpful.
So we fix this by printing something more explanatory which doesn't
involve trying to print ex->ee_block.

Addresses-Google-Bug: #2655740

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 23:00:00 -04:00
Dmitry Monakhov 12e9b89200 ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 22:00:00 -04:00
Theodore Ts'o 24676da469 ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE()
EXT4_ERROR_INODE() tends to provide better error information and in a
more consistent format.  Some errors were not even identifying the inode
or directory which was corrupted, which made them not very useful.

Addresses-Google-Bug: #2507977

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 21:00:00 -04:00
Theodore Ts'o 2ed886852a ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks()
This saves a huge amount of stack space by avoiding unnecesary struct
buffer_head's from being allocated on the stack.

In addition, to make the code easier to understand, collapse and
refactor ext4_get_block(), ext4_get_block_write(),
noalloc_get_block_write(), into a single function.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 20:00:00 -04:00
Theodore Ts'o e35fd6609b ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks()
Jack up ext4_get_blocks() and add a new function, ext4_map_blocks()
which uses a much smaller structure, struct ext4_map_blocks which is
20 bytes, as opposed to a struct buffer_head, which nearly 5 times
bigger on an x86_64 machine.  By switching things to use
ext4_map_blocks(), we can save stack space by using ext4_map_blocks()
since we can avoid allocating a struct buffer_head on the stack.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 19:00:00 -04:00
Theodore Ts'o 8e48dcfbd7 ext4: Use our own write_cache_pages()
Make a copy of write_cache_pages() for the benefit of
ext4_da_writepages().  This allows us to simplify the code some, and
will allow us to further customize the code in future patches.

There are some nasty hacks in write_cache_pages(), which Linus has
(correctly) characterized as vile.  I've just copied it into
write_cache_pages_da(), without trying to clean those bits up lest I
break something in the ext4's delalloc implementation, which is a bit
fragile right now.  This will allow Dave Chinner to clean up
write_cache_pages() in mm/page-writeback.c, without worrying about
breaking ext4.  Eventually write_cache_pages_da() will go away when I
rewrite ext4's delayed allocation and create a general
ext4_writepages() which is used for all of ext4's writeback.  Until
now this is the lowest risk way to clean up the core
write_cache_pages() function.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
2010-05-16 18:00:00 -04:00
Jan Kara 39a4bade8c ext4: Show journal_checksum option
We failed to show journal_checksum option in /proc/mounts. Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 17:00:00 -04:00
Curt Wohlgemuth 291dae472a ext4: Fix for ext4_mb_collect_stats()
Fix ext4_mb_collect_stats() to use the correct test for s_bal_success; it
should be testing "best-extent.fe_len >= orig-extent.fe_len" , not
"orig-extent.fe_len >= goal-extent.fe_len" .

Signed-off-by: Curt Wohlgemuth <curtw@google.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 16:00:00 -04:00
Curt Wohlgemuth 8a57d9d61a ext4: check for a good block group before loading buddy pages
This adds a new field in ext4_group_info to cache the largest available
block range in a block group; and don't load the buddy pages until *after*
we've done a sanity check on the block group.

With large allocation requests (e.g., fallocate(), 8MiB) and relatively full
partitions, it's easy to have no block groups with a block extent large
enough to satisfy the input request length.  This currently causes the loop
during cr == 0 in ext4_mb_regular_allocator() to load the buddy bitmap pages
for EVERY block group.  That can be a lot of pages.  The patch below allows
us to call ext4_mb_good_group() BEFORE we load the buddy pages (although we
have check again after we lock the block group).

Addresses-Google-Bug: #2578108
Addresses-Google-Bug: #2704453

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 15:00:00 -04:00
Nikanth Karthikesan 6d19c42b7c ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate
Currently using posix_fallocate one can bypass an RLIMIT_FSIZE limit
and create a file larger than the limit. Add a check for that.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 14:00:00 -04:00
Curt Wohlgemuth fbe845ddf3 ext4: Remove extraneous newlines in ext4_msg() calls
Addresses-Google-Bug: #2562325

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 13:00:00 -04:00
Curt Wohlgemuth d4c402d9fd ext4: Print mount options in when mounting and add a remount message
This adds a "re-mounted" message to ext4_remount(), and both it and
the mount message in ext4_fill_super() now have the original mount
options data string.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 12:00:00 -04:00
Eric Sandeen 72b8ab9dde ext4: don't use quota reservation for speculative metadata
Because we can badly over-reserve metadata when we
calculate worst-case, it complicates things for quota, since
we must reserve and then claim later, retry on EDQUOT, etc.
Quota is also a generally smaller pool than fs free blocks,
so this over-reservation hurts more, and more often.

I'm of the opinion that it's not the worst thing to allow
metadata to push a user slightly over quota.  This simplifies
the code and avoids the false quota rejections that result
from worst-case speculation.

This patch stops the speculative quota-charging for
worst-case metadata requirements, and just charges quota
when the blocks are allocated at writeout.  It also is
able to remove the try-again loop on EDQUOT.

This patch has been tested indirectly by running the xfstests
suite with a hack to mount & enable quota prior to the test.

I also did a more specific test of fragmenting freespace
and then doing a large delalloc write under quota; quota
stopped me at the right amount of file IO, and then the
writeout generated enough metadata (due to the fragmentation)
that it put me slightly over quota, as expected.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 11:00:00 -04:00
Dmitry Monakhov 84061e07c5 ext4: init statistics after journal recovery
Currently block/inode/dir counters initialized before journal was
recovered. In fact after journal recovery this info will probably
change. And freeblocks it critical for correct delalloc mode
accounting.

https://bugzilla.kernel.org/show_bug.cgi?id=15768

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 08:00:00 -04:00
Dmitry Monakhov d17413c08c ext4: clean up inode bitmaps manipulation in ext4_free_inode
- Reorganize locking scheme to batch two atomic operation in to one.
  This also allow us to state what healthy group must obey following rule
  ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
  inode bitmaps.
- Move non-group members update out of group_lock.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 07:00:00 -04:00
Dmitry Monakhov 21ca087a38 ext4: Do not zero out uninitialized extents beyond i_size
The extents code will sometimes zero out blocks and mark them as
initialized instead of splitting an extent into several smaller ones.
This optimization however, causes problems if the extent is beyond
i_size because fsck will complain if there are uninitialized blocks
after i_size as this can not be distinguished from an inode that has
an incorrect i_size field.

https://bugzilla.kernel.org/show_bug.cgi?id=15742

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 06:00:00 -04:00
Eric Sandeen c445e3e0a5 ext4: don't scan/accumulate more pages than mballoc will allocate
There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.

At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.

Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.

However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.

This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.

Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch.  Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.

Eventually we need to change mpage_da_map_pages() also submit its I/O
to the block layer, subsuming mpage_da_submit_io(), and then change it
call ext4_get_blocks() multiple times.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 04:00:00 -04:00
Eric Sandeen a30eec2a86 ext4: stop issuing discards if not supported by device
Turn off issuance of discard requests if the device does
not support it - similar to the action we take for barriers.
This will save a little computation time if a non-discardable
device is mounted with -o discard, and also makes it obvious
that it's not doing what was asked at mount time ...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 03:00:00 -04:00
Eric Sandeen 6b0310fbf0 ext4: don't return to userspace after freezing the fs with a mutex held
ext4_freeze() used jbd2_journal_lock_updates() which takes
the j_barrier mutex, and then returns to userspace.  The
kernel does not like this:

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
 #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0

Use vfs_check_frozen() added to ext4_journal_start_sb() and
ext4_force_commit() instead.

Addresses-Red-Hat-Bugzilla: #568503

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 02:00:00 -04:00
Dmitry Monakhov 256a453546 ext4: symlink must be handled via filesystem specific operation
generic setattr implementation is no longer responsible for
quota transfer so synlinks must be handled via ext4_setattr.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 02:00:00 -04:00
Eric Sandeen 42007efd56 ext4: check s_log_groups_per_flex in online resize code
If groups_per_flex < 2, sbi->s_flex_groups[] doesn't get filled out,
and every other access to this first tests s_log_groups_per_flex;
same thing needs to happen in resize or we'll wander off into
a null pointer when doing an online resize of the file system.

Thanks to Christoph Biedl, who came up with the trivial testcase:

# truncate --size 128M fsfile
# mkfs.ext3 -F fsfile
# tune2fs -O extents,uninit_bg,dir_index,flex_bg,huge_file,dir_nlink,extra_isize fsfile
# e2fsck -yDf -C0 fsfile
# truncate --size 132M fsfile
# losetup /dev/loop0 fsfile
# mount /dev/loop0 mnt
# resize2fs -p /dev/loop0

	https://bugzilla.kernel.org/show_bug.cgi?id=13549

Reported-by: Alessandro Polverini <alex@nibbles.it>
Test-case-by: Christoph Biedl  <bugzilla.kernel.bpeb@manchmal.in-ulm.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 01:00:00 -04:00
Dmitry Monakhov 35121c9860 ext4: fix quota accounting in case of fallocate
allocated_meta_data is already included in 'used' variable.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 00:00:00 -04:00
Christian Borntraeger b684b2ee94 ext4: allow defrag (EXT4_IOC_MOVE_EXT) in 32bit compat mode
I have an x86_64 kernel with i386 userspace. e4defrag fails on the
EXT4_IOC_MOVE_EXT ioctl because it is not wired up for the compat
case. It seems that struct move_extent is compat save, only types
with fixed widths are used:
{
        __u32 reserved;         /* should be zero */
        __u32 donor_fd;         /* donor file descriptor */
        __u64 orig_start;       /* logical start offset in block for orig */
        __u64 donor_start;      /* logical start offset in block for donor */
        __u64 len;              /* block length to be moved */
        __u64 moved_len;        /* moved block length */
};

Lets just wire up EXT4_IOC_MOVE_EXT for the compat case.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
CC: Akira Fujita <a-fujita@rs.jp.nec.com>
2010-05-15 00:00:00 -04:00
Jing Zhang e39e07fdfd ext4: rename ext4_mb_release_desc() to ext4_mb_unload_buddy()
This function cleans up after ext4_mb_load_buddy(), so the renaming
makes the code clearer.

Signed-off-by: Jing Zhang <zj.barak@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-14 00:00:00 -04:00
Jing Zhang 62e823a2cb ext4: Remove unnecessary call to ext4_get_group_desc() in mballoc
Signed-off-by: Jing Zhang <zj.barak@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-13 00:00:00 -04:00
Jing Zhang b720303df7 ext4: fix memory leaks in error path handling of ext4_ext_zeroout()
When EIO occurs after bio is submitted, there is no memory free
operation for bio, which results in memory leakage. And there is also
no check against bio_alloc() for bio.

Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jing Zhang <zj.barak@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-12 00:00:00 -04:00
Steven Liu c26d0bad3d ext4: Fix coding style in fs/ext4/move_extent.c
Making sure ee_block is initialized to zero to prevent gcc from
kvetching.  It's harmless (although it's not obvious that it's
harmless) from code inspection:

fs/ext4/move_extent.c:478: warning: 'start_ext.ee_block' may be used
uninitialized in this function

Thanks to Stefan Richter for first bringing this to the attention of
linux-ext4@vger.kernel.org.

Signed-off-by: LiuQi <lingjiujianke@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
2010-05-11 00:00:00 -04:00
Dmitry Monakhov 0671e70465 ext4: check missed return value in ext4_sync_file()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-10 00:00:00 -04:00
Jens Axboe 7407cf355f Merge branch 'master' into for-2.6.35
Conflicts:
	fs/block_dev.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-29 09:36:24 +02:00
Dmitry Monakhov fbd9b09a17 blkdev: generalize flags for blkdev_issue_fn functions
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-28 19:47:36 +02:00
Linus Torvalds 202f2bb070 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Issue the discard operation *before* releasing the blocks to be reused
  ext4: Fix buffer head leaks after calls to ext4_get_inode_loc()
  ext4: Fix possible lost inode write in no journal mode
2010-04-25 10:01:51 -07:00
Theodore Ts'o b90f687018 ext4: Issue the discard operation *before* releasing the blocks to be reused
Otherwise, we can end up having data corruption because the blocks
could get reused and then discarded!

https://bugzilla.kernel.org/show_bug.cgi?id=15579

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-04-20 16:51:59 -04:00
Curt Wohlgemuth fd2dd9fbaf ext4: Fix buffer head leaks after calls to ext4_get_inode_loc()
Calls to ext4_get_inode_loc() returns with a reference to a buffer
head in iloc->bh.  The callers of this function in ext4_write_inode()
when in no journal mode and in ext4_xattr_fiemap() don't release the
buffer head after using it.

Addresses-Google-Bug: #2548165

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-04-03 17:44:16 -04:00
Curt Wohlgemuth 8b472d739b ext4: Fix possible lost inode write in no journal mode
In the no-journal case, ext4_write_inode() will fetch the bh and call
sync_dirty_buffer() on it.  However, if the bh has already been
written and the bh reclaimed for some other purpose, AND if the inode
is the only one in the inode table block in use, then
ext4_get_inode_loc() will not read the inode table block from disk,
but as an optimization, fill the block with zero's assuming that its
caller will copy in the on-disk version of the inode.  This is not
done by ext4_write_inode(), so the contents of the inode can simply
get lost.  The fix is to use __ext4_get_inode_loc() with in_mem set to
0, instead of ext4_get_inode_loc().  Long term the API needs to be
fixed so it's obvious why latter is not safe.

Addresses-Google-Bug: #2526446

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-04-03 16:45:06 -04:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Linus Torvalds 39f1cd635c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
  ext4: Don't use delayed allocation by default when used instead of ext3
  ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
  ext4: Fix estimate of # of blocks needed to write indirect-mapped files
2010-03-25 14:10:53 -07:00
Eric Sandeen c4caae2518 ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
When used_dirs was introduced for the flex_groups struct, it looks
like the accounting was not put into place properly, in some places
manipulating free_inodes rather than used_dirs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-23 21:32:00 -04:00
Jan Kara ba69f9ab7d ext4: Don't use delayed allocation by default when used instead of ext3
When ext4 driver is used to mount a filesystem instead of the ext3 file
system driver (through CONFIG_EXT4_USE_FOR_EXT23), do not enable delayed
allocation by default since some ext3 users and application writers have
developed unfortunate expectations about the safety of writing files on
systems subject to sudden and violent death without using fsync().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:18:37 -04:00
Theodore Ts'o 37f328eb60 ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
Oops.  (Blush.)

Thanks to Sedat Dilek for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:06:41 -04:00
Jan Kara d330a5befb ext4: Fix estimate of # of blocks needed to write indirect-mapped files
http://bugzilla.kernel.org/show_bug.cgi?id=15420

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-14 18:17:54 -04:00
Linus Torvalds c32da02342 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
  doc: fix typo in comment explaining rb_tree usage
  Remove fs/ntfs/ChangeLog
  doc: fix console doc typo
  doc: cpuset: Update the cpuset flag file
  Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
  Remove drivers/parport/ChangeLog
  Remove drivers/char/ChangeLog
  doc: typo - Table 1-2 should refer to "status", not "statm"
  tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
  No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
  devres/irq: Fix devm_irq_match comment
  Remove reference to kthread_create_on_cpu
  tree-wide: Assorted spelling fixes
  tree-wide: fix 'lenght' typo in comments and code
  drm/kms: fix spelling in error message
  doc: capitalization and other minor fixes in pnp doc
  devres: typo fix s/dev/devm/
  Remove redundant trailing semicolons from macros
  fix typo "definetly" -> "definitely" in comment
  tree-wide: s/widht/width/g typo in comments
  ...

Fix trivial conflict in Documentation/laptops/00-INDEX
2010-03-12 16:04:50 -08:00
Jiri Kosina 318ae2edc3 Merge branch 'for-next' into for-linus
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
2010-03-08 16:55:37 +01:00
Emese Revfy 52cf25d0ab Driver core: Constify struct sysfs_ops in struct kobj_type
Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared
   (referenced) by many other structure instances
   at runtime

 * detects/prevents accidental (but not intentional)
   modification attempts on archs that enforce
   read-only kernel data at runtime

 * potentially better optimized code as the compiler
   can assume that the const data cannot be changed

 * the compiler/linker move const data into .rodata
   and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:49 -08:00
Linus Torvalds e213e26ab3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
  quota: stop using QUOTA_OK / NO_QUOTA
  dquot: cleanup dquot initialize routine
  dquot: move dquot initialization responsibility into the filesystem
  dquot: cleanup dquot drop routine
  dquot: move dquot drop responsibility into the filesystem
  dquot: cleanup dquot transfer routine
  dquot: move dquot transfer responsibility into the filesystem
  dquot: cleanup inode allocation / freeing routines
  dquot: cleanup space allocation / freeing routines
  ext3: add writepage sanity checks
  ext3: Truncate allocated blocks if direct IO write fails to update i_size
  quota: Properly invalidate caches even for filesystems with blocksize < pagesize
  quota: generalize quota transfer interface
  quota: sb_quota state flags cleanup
  jbd: Delay discarding buffers in journal_unmap_buffer
  ext3: quota_write cross block boundary behaviour
  quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
  quota: split out compat_sys_quotactl support from quota.c
  quota: split out netlink notification support from quota.c
  quota: remove invalid optimization from quota_sync_all
  ...

Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
2010-03-05 13:20:53 -08:00
Linus Torvalds 9467c4fdd6 Merge branch 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'write_inode2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  pass writeback_control to ->write_inode
  make sure data is on disk before calling ->write_inode
2010-03-05 11:53:53 -08:00
Linus Torvalds 1f63b9c15b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits)
  ext4: fix up rb_root initializations to use RB_ROOT
  ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
  ext4: Fix the NULL reference in double_down_write_data_sem()
  ext4: Fix insertion point of extent in mext_insert_across_blocks()
  ext4: consolidate in_range() definitions
  ext4: cleanup to use ext4_grp_offs_to_block()
  ext4: cleanup to use ext4_group_first_block_no()
  ext4: Release page references acquired in ext4_da_block_invalidatepages
  ext4: Fix ext4_quota_write cross block boundary behaviour
  ext4: Convert BUG_ON checks to use ext4_error() instead
  ext4: Use direct_IO_no_locking in ext4 dio read
  ext4: use ext4_get_block_write in buffer write
  ext4: mechanical rename some of the direct I/O get_block's identifiers
  ext4: make "offset" consistent in ext4_check_dir_entry()
  ext4: Handle non empty on-disk orphan link
  ext4: explicitly remove inode from orphan list after failed direct io
  ext4: fix error handling in migrate
  ext4: deprecate obsoleted mount options
  ext4: Fix fencepost error in chosing choosing group vs file preallocation.
  jbd2: clean up an assertion in jbd2_journal_commit_transaction()
  ...
2010-03-05 10:47:00 -08:00
Christoph Hellwig a9185b41a4 pass writeback_control to ->write_inode
This gives the filesystem more information about the writeback that
is happening.  Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-05 13:25:52 -05:00
Venkatesh Pallipadi 64e290ec69 ext4: fix up rb_root initializations to use RB_ROOT
ext4 uses rb_node = NULL; to zero rb_root at few places.  Using
RB_ROOT as the initializer is more portable in case the underlying
implementation of rbtrees changes in the future.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Paris <eparis@redhat.com>
2010-03-04 22:25:21 -05:00
Christoph Hellwig 871a293155 dquot: cleanup dquot initialize routine
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig 907f4554e2 dquot: move dquot initialization responsibility into the filesystem
Currently various places in the VFS call vfs_dq_init directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the initialization.   For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.

For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.

For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.

Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig 9f75475802 dquot: cleanup dquot drop routine
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig 257ba15ced dquot: move dquot drop responsibility into the filesystem
Currently clear_inode calls vfs_dq_drop directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the drop inside the ->clear_inode
superblock operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig b43fa8284d dquot: cleanup dquot transfer routine
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig 63936ddaa1 dquot: cleanup inode allocation / freeing routines
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.

Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Christoph Hellwig 5dd4056db8 dquot: cleanup space allocation / freeing routines
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.

Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not.  Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Linus Torvalds 0f2cc4ecd8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04 08:15:33 -08:00
Akira Fujita c437b27335 ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
a) Fix sparse warning in ext4_ioctl()
b) Remove unneeded variable in mext_leaf_block()
c) Fix spelling typo in mext_check_arguments()

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:39:24 -05:00
Akira Fujita 7247c0caa2 ext4: Fix the NULL reference in double_down_write_data_sem()
If EXT4_IOC_MOVE_EXT ioctl is called with NULL donor_fd, fget() in
ext4_ioctl() gets inappropriate file structure for donor; so we need
to do this check earlier, before calling double_down_write_data_sem().

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:34:58 -05:00
Akira Fujita 5fd5249aa3 ext4: Fix insertion point of extent in mext_insert_across_blocks()
If the leaf node has 2 extent space or fewer and EXT4_IOC_MOVE_EXT
ioctl is called with the file offset where after the 2nd extent
covers, mext_insert_across_blocks() always tries to insert extent into
the first extent.  As a result, the file gets corrupted because of
wrong extent order.  The patch fixes this problem.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 00:31:06 -05:00
Akinobu Mita 731eb1a03a ext4: consolidate in_range() definitions
There are duplicate macro definitions of in_range() in mballoc.h and
balloc.c.  This consolidates these two definitions into ext4.h, and
changes extents.c to use in_range() as well.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:55:01 -05:00
Akinobu Mita bda00de7e8 ext4: cleanup to use ext4_grp_offs_to_block()
More cleanup to convert open-coded calculations of the first block
number of a free extent to use ext4_grp_offs_to_block() instead.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:25 -05:00
Akinobu Mita 5661bd6861 ext4: cleanup to use ext4_group_first_block_no()
This is a cleanup and simplification patch which takes some open-coded
calculations to calculate the first block number of a group and
converts them to use the (already defined) ext4_group_first_block_no()
function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@sun.com>
2010-03-03 23:53:39 -05:00
Jan Kara 9b1d0998d2 ext4: Release page references acquired in ext4_da_block_invalidatepages
We forget to release page references we acquire in
ext4_da_block_invalidatepages.  Luckily, this function gets called only if we
are not able to allocate blocks for delay-allocated data so that function
should better never be called.

Also cleanup handling of index variable.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-03 16:19:32 -05:00
Al Viro 3899167dbd Get rid of mnt_mountpoint abuses in ext4
path to mnt/mnt->mnt_root is no worse than that to
mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no
pinning the sucker down (mnt is not going away and
mnt->mnt_root won't change)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03 14:07:54 -05:00
Dmitry Monakhov 67eeb5685d ext4: Fix ext4_quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext4_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal
credits reservation, and later a BUG_ON.  As soon this never happen
the boundary cross loop is NOOP.  In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 08:08:51 -05:00
Frank Mayhar 273df556b6 ext4: Convert BUG_ON checks to use ext4_error() instead
Convert a bunch of BUG_ONs to emit a ext4_error() message and return
EIO.  This is a first pass and most notably does _not_ cover
mballoc.c, which is a morass of void functions.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 11:46:09 -05:00
Jiaying Zhang b7adc1f363 ext4: Use direct_IO_no_locking in ext4 dio read
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:26:36 -05:00
Jiaying Zhang 744692dc05 ext4: use ext4_get_block_write in buffer write
Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 16:14:02 -05:00
Jiaying Zhang c7064ef13b ext4: mechanical rename some of the direct I/O get_block's identifiers
This commit renames some of the direct I/O's block allocation flags,
variables, and functions introduced in Mingming's "Direct IO for holes
and fallocate" patches so that they can be used by ext4's buffered
write path as well.  Also changed the related function comments
accordingly to cover both direct write and buffered write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:28:44 -05:00
Toshiyuki Okajima b8b8afe236 ext4: make "offset" consistent in ext4_check_dir_entry()
The callers of ext4_check_dir_entry() usually pass in the "file
offset" (ext4_readdir, htree_dirblock_to_tree, search_dirblock,
ext4_dx_find_entry, empty_dir), but a few callers (add_dirent_to_buf,
ext4_delete_entry) only pass in the buffer offset.

To accomodate those last two (which would be hard to fix otherwise),
this patch changes ext4_check_dir_entry() to print the physical block
number and the relative offset as well as the passed-in offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 00:21:35 -05:00
Dmitry Monakhov 6e3617e579 ext4: Handle non empty on-disk orphan link
In case of truncate errors we explicitly remove inode from in-core
orphan list via orphan_del(NULL, inode) without modifying the on-disk list.

But later on, the same inode may be inserted in the orphan list again
which will result the on-disk linked list getting corrupted.  If inode
i_dtime contains valid value, then skip on-disk list modification.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:29:39 -05:00
Dmitry Monakhov da1dafca84 ext4: explicitly remove inode from orphan list after failed direct io
Otherwise non-empty orphan list will be triggered on umount.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:15:02 -05:00
Dmitry Monakhov f39490bcd1 ext4: fix error handling in migrate
Set i_nlink to zero for temporary inode from very beginning.
otherwise we may fail to start new journal handle and this
inode will be unreferenced but with i_nlink == 1
Since we hold inode reference it can not be pruned.

Also add missed journal_start retval check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:14:36 -05:00
Dmitry Monakhov 437ca0fda3 ext4: deprecate obsoleted mount options
Declare following list of mount options as deprecated:
 - bsddf, miniddf
 - grpid, bsdgroups, nogrpid, sysvgroups

Declare following list of default mount options as deprecated:
 - bsdgroups

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 22:29:21 -05:00
Tao Ma cc483f102c ext4: Fix fencepost error in chosing choosing group vs file preallocation.
The ext4 multiblock allocator decides whether to use group or file
preallocation based on the file size.  When the file size reaches
s_mb_stream_request (default is 16 blocks), it changes to use a
file-specific preallocation. This is cool, but it has a tiny problem.

See a simple script:
mkfs.ext4 -b 1024 /dev/sda8 1000000
mount -t ext4 -o nodelalloc /dev/sda8 /mnt/ext4
for((i=0;i<5;i++))
do
cat /mnt/4096>>/mnt/ext4/a	#4096 is a file with 4096 characters.
cat /mnt/4096>>/mnt/ext4/b
done
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1

And you get
BLOCKS:
(0-14):8705-8719, (15):2356, (16-19):8465-8468

So there are 3 extents, a bit strange for the lonely 15th logical
block.  As we write to the 16 blocks, we choose file preallocation in
ext4_mb_group_or_file, but in ext4_mb_normalize_request, we meet with
the 16*1024 range, so no preallocation will be carried. file b then
reserves the space after '2356', so when when write 16, we start from
another part.

This patch just change the check in ext4_mb_group_or_file, so
that for the lonely 15 we will still use group preallocation.
After the patch, we will get:
debuge4fs -R 'stat a' /dev/sda8|grep BLOCKS -A 1
BLOCKS:
(0-15):8705-8720, (16-19):8465-8468

Looks more sane. Thanks.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 19:06:35 -05:00
Dmitry Monakhov 56c50f11f4 ext4: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:28:41 -05:00
Dmitry Monakhov 482a74258f ext4: mount flags manipulation cleanup
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 11:35:32 -05:00
Jiaying Zhang c8d46e41bc ext4: Add flag to files with blocks intentionally past EOF
fallocate() may potentially instantiate blocks past EOF, depending
on the flags used when it is called.

e2fsck currently has a test for blocks past i_size, and it
sometimes trips up - noticeably on xfstests 013 which runs fsstress.

This patch from Jiayang does fix it up - it (along with
e2fsprogs updates and other patches recently from Aneesh) has
survived many fsstress runs in a row.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 09:52:53 -05:00
Tejun Heo 003cb608a2 percpu: add __percpu sparse annotations to fs
Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-02-17 11:17:38 +09:00
Curt Wohlgemuth 73b50c1c92 ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode
Calls to ext4_handle_dirty_metadata should only pass in an inode
pointer for inode-specific metadata, and not for shared metadata
blocks such as inode table blocks, block group descriptors, the
superblock, etc.

The BUG_ON can get tripped when updating a special device (such as a
block device) that is opened (so that i_mapping is set in
fs/block_dev.c) and the file system is mounted in no journal mode.

Addresses-Google-Bug: #2404870

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-16 15:06:29 -05:00
Leonard Michlmayr aca92ff6f5 ext4: correctly calculate number of blocks for fiemap
ext4_fiemap() rounds the length of the requested range down to
blocksize, which is is not the true number of blocks that cover the
requested region.  This problem is especially impressive if the user
requests only the first byte of a file: not a single extent will be
reported.

We fix this by calculating the last block of the region and then
subtract to find the number of blocks in the extents.

Signed-off-by: Leonard Michlmayr <leonard.michlmayr@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 17:07:28 -05:00
Roel Kluin 9aaab0589b ext4: add missing error checking to ext4_expand_extra_isize_ea()
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
2010-02-15 14:26:16 -05:00
Eric Sandeen 12062dddda ext4: move __func__ into a macro for ext4_warning, ext4_error
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 14:19:27 -05:00
Daniel Mack 3ad2f3fbb9 tree-wide: Assorted spelling fixes
In particular, several occurances of funny versions of 'success',
'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address',
'beginning', 'desirable', 'separate' and 'necessary' are fixed.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-09 11:13:56 +01:00
Daniel Mack 1537a3638c tree-wide: fix 'lenght' typo in comments and code
Some misspelled occurences of 'octet' and some comments were also fixed
as I was on it.

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:45 +01:00
Thadeu Lima de Souza Cascardo d6b198bc8a fix ext3/ext4 comment typo compain -> complain
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:36 +01:00
Theodore Ts'o f710b4b96b ext4: Reserve INCOMPAT_EA_INODE and INCOMPAT_DIRDATA feature codepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-25 03:31:32 -05:00
Theodore Ts'o 19f5fb7ad6 ext4: Use bitops to read/modify EXT4_I(inode)->i_state
At several places we modify EXT4_I(inode)->i_state without holding
i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage,
ext4_do_update_inode, ...). These modifications are racy and we can
lose updates to i_state. So convert handling of i_state to use bitops
which are atomic.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-24 14:34:07 -05:00
Aneesh Kumar K.V 1296cc85c2 ext4: Drop EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE flag
We should update reserve space if it is delalloc buffer
and that is indicated by EXT4_GET_BLOCKS_DELALLOC_RESERVE flag.
So use EXT4_GET_BLOCKS_DELALLOC_RESERVE in place of
EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2010-01-15 01:27:59 -05:00
Aneesh Kumar K.V 5f634d064c ext4: Fix quota accounting error with fallocate
When we fallocate a region of the file which we had recently written,
and which is still in the page cache marked as delayed allocated blocks
we need to make sure we don't do the quota update on writepage path.
This is because the needed quota updated would have already be done
by fallocate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2010-01-25 04:00:31 -05:00
Aneesh Kumar K.V 1db913823c ext4: Handle -EDQUOT error on write
We need to release the journal before we do a write_inode.  Otherwise
we could deadlock.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2010-01-22 17:06:20 -05:00
Theodore Ts'o 9d0be50230 ext4: Calculate metadata requirements more accurately
In the past, ext4_calc_metadata_amount(), and its sub-functions
ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount()
badly over-estimated the number of metadata blocks that might be
required for delayed allocation blocks.  This didn't matter as much
when functions which managed the reserved metadata blocks were more
aggressive about dropping reserved metadata blocks as delayed
allocation blocks were written, but unfortunately they were too
aggressive.  This was fixed in commit 0637c6f, but as a result the
over-estimation by ext4_calc_metadata_amount() would lead to reserving
2-3 times the number of pending delayed allocation blocks as
potentially required metadata blocks.  So if there are 1 megabytes of
blocks which have been not yet been allocation, up to 3 megabytes of
space would get reserved out of the user's quota and from the file
system free space pool until all of the inode's data blocks have been
allocated.

This commit addresses this problem by much more accurately estimating
the number of metadata blocks that will be required.  It will still
somewhat over-estimate the number of blocks needed, since it must make
a worst case estimate not knowing which physical blocks will be
needed, but it is much more accurate than before.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 02:41:30 -05:00
Theodore Ts'o ee5f4d9cdf ext4: Fix accounting of reserved metadata blocks
Commit 0637c6f had a typo which caused the reserved metadata blocks to
not be released correctly.   Fix this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 02:36:15 -05:00
Theodore Ts'o 0637c6f413 ext4: Patch up how we claim metadata blocks for quota purposes
As reported in Kernel Bugzilla #14936, commit d21cd8f triggered a BUG
in the function ext4_da_update_reserve_space() found in
fs/ext4/inode.c.  The root cause of this BUG() was caused by the fact
that ext4_calc_metadata_amount() can severely over-estimate how many
metadata blocks will be needed, especially when using direct
block-mapped files.

In addition, it can also badly *under* estimate how much space is
needed, since ext4_calc_metadata_amount() assumes that the blocks are
contiguous, and this is not always true.  If the application is
writing blocks to a sparse file, the number of metadata blocks
necessary can be severly underestimated by the functions
ext4_da_reserve_space(), ext4_da_update_reserve_space() and
ext4_da_release_space().  This was the cause of the dq_claim_space
reports found on kerneloops.org.

Unfortunately, doing this right means that we need to massively
over-estimate the amount of free space needed.  So in some cases we
may need to force the inode to be written to disk asynchronously in
to avoid spurious quota failures.

http://bugzilla.kernel.org/show_bug.cgi?id=14936

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-30 14:20:45 -05:00
Aneesh Kumar K.V 515f41c33a ext4: Ensure zeroout blocks have no dirty metadata
This fixes a bug (found by Curt Wohlgemuth) in which new blocks
returned from an extent created with ext4_ext_zeroout() can have dirty
metadata still associated with them.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-29 23:39:06 -05:00
Richard Kennedy 2faf2e19dd ext4: return correct wbc.nr_to_write in ext4_da_writepages
When ext4_da_writepages increases the nr_to_write in writeback_control
then it must always re-base the return value.  Originally there was a
(misguided) attempt prevent wbc.nr_to_write from going negative.  In
fact, it's necessary to allow nr_to_write to be negative so that
wb_writeback() can correctly calculate how many pages were actually
written.  

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-25 15:46:07 -05:00
Eric Sandeen c8afb44682 ext4: flush delalloc blocks when space is low
Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
    echo -n > $SCRATCH_MNT/$i
    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when data is allocated that worst-case reservation is not
usually needed.

When freespace is low, kicking off an async writeback will start
converting that worst-case space usage into something more realistic,
almost always freeing up space to continue.

This resolves the testcase for me, and survives all 4 generic
ENOSPC tests in xfstests.

We'll still need a hard synchronous sync to squeeze out the last bit,
but this fixes things up to a large degree.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:58:12 -05:00
Julia Lawall d3533d72e7 ext4: Eliminate potential double free on error path
b_entry_name and buffer are initially NULL, are initialized within a loop
to the result of calling kmalloc, and are freed at the bottom of this loop.
The loop contains gotos to cleanup, which also frees b_entry_name and
buffer.  Some of these gotos are before the reinitializations of
b_entry_name and buffer.  To maintain the invariant that b_entry_name and
buffer are NULL at the top of the loop, and thus acceptable arguments to
kfree, these variables are now set to NULL after the kfrees.

This seems to be the simplest solution.  A more complicated solution
would be to introduce more labels in the error handling code at the end of
the function.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
identifier E;
expression E1;
iterator I;
statement S;
@@

*kfree(E);
... when != E = E1
    when != I(E,...) S
    when != &E
*kfree(E);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:52:31 -05:00
Andrew Morton a6b43e3825 ext4: fix unsigned long long printk warning in super.c
sparc64 allmodconfig:

fs/ext4/super.c: In function `lifetime_write_kbytes_show':
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:48:08 -05:00
Theodore Ts'o cc3e1bea5d ext4, jbd2: Add barriers for file systems with exernal journals
This is a bit complicated because we are trying to optimize when we
send barriers to the fs data disk.  We could just throw in an extra
barrier to the data disk whenever we send a barrier to the journal
disk, but that's not always strictly necessary.

We only need to send a barrier during a commit when there are data
blocks which are must be written out due to an inode written in
ordered mode, or if fsync() depends on the commit to force data blocks
to disk.  Finally, before we drop transactions from the beginning of
the journal during a checkpoint operation, we need to guarantee that
any blocks that were flushed out to the data disk are firmly on the
rust platter before we drop the transaction from the journal.

Thanks to Oleg Drokin for pointing out this flaw in ext3/ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 06:52:08 -05:00
Surbhi Palande 034fb4c95f ext4: replace BUG() with return -EIO in ext4_ext_get_blocks
This patch fixes the Kernel BZ #14286.  When the address of an extent
corresponding to a valid block is corrupted, a -EIO should be reported
instead of a BUG().  This situation should not normally not occur
except in the case of a corrupted filesystem.  If however it does,
then the system should not panic directly but depending on the mount
time options appropriate action should be taken. If the mount options
so permit, the I/O should be gracefully aborted by returning a -EIO.

http://bugzilla.kernel.org/show_bug.cgi?id=14286

Signed-off-by: Surbhi Palande <surbhi.palande@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-14 09:53:52 -05:00
Theodore Ts'o 51b7e3c9fb ext4: add module aliases for ext2 and ext3
Add module aliases for ext2 and ext3 when CONFIG_EXT4_USE_FOR_EXT23 is
set.  This makes the existing user-space stuff like mkinitrd working
as is.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-21 10:56:09 -05:00
David Howells 84c6647303 ext4: Don't ask about supporting ext2/3 in ext4 if ext4 is not configured
Don't offer to build ext2/3 support into ext4 if ext4 itself is not
configured on.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-21 10:54:09 -05:00
Huang Weiyi 149feb00d7 ext4: remove unused #include <linux/version.h>
Remove unused #include <linux/version.h>('s) in
  fs/ext4/block_validity.c
  fs/ext4/mballoc.h

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-14 09:24:20 -05:00
Dmitry Monakhov 39bc680a81 ext4: fix sleep inside spinlock issue with quota and dealloc (#14739)
Unlock i_block_reservation_lock before vfs_dq_reserve_block().
This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=14739

CC: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:44:12 +01:00
Dmitry Monakhov d21cd8f163 ext4: Fix potential quota deadlock
We have to delay vfs_dq_claim_space() until allocation context destruction.
Currently we have following call-trace:
ext4_mb_new_blocks()
  /* task is already holding ac->alloc_semp */
 ->ext4_mb_mark_diskspace_used
    ->vfs_dq_claim_space()  /*  acquire dqptr_sem here. Possible deadlock */
 ->ext4_mb_release_context() /* drop ac->alloc_semp here */

Let's move quota claiming to ext4_da_update_reserve_space()

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.32-rc7 #18
 -------------------------------------------------------
 write-truncate-/3465 is trying to acquire lock:
  (&s->s_dquot.dqptr_sem){++++..}, at: [<c025e73b>] dquot_claim_space+0x3b/0x1b0

 but task is already holding lock:
  (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #3 (&meta_group_info[i]->alloc_sem){++++..}:
        [<c017d04b>] __lock_acquire+0xd7b/0x1260
        [<c017d5ea>] lock_acquire+0xba/0xd0
        [<c0527191>] down_read+0x51/0x90
        [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370
        [<c02d0c1c>] ext4_mb_free_blocks+0x46c/0x870
        [<c029c9d3>] ext4_free_blocks+0x73/0x130
        [<c02c8cfc>] ext4_ext_truncate+0x76c/0x8d0
        [<c02a8087>] ext4_truncate+0x187/0x5e0
        [<c01e0f7b>] vmtruncate+0x6b/0x70
        [<c022ec02>] inode_setattr+0x62/0x190
        [<c02a2d7a>] ext4_setattr+0x25a/0x370
        [<c022ee81>] notify_change+0x151/0x340
        [<c021349d>] do_truncate+0x6d/0xa0
        [<c0221034>] may_open+0x1d4/0x200
        [<c022412b>] do_filp_open+0x1eb/0x910
        [<c021244d>] do_sys_open+0x6d/0x140
        [<c021258e>] sys_open+0x2e/0x40
        [<c0103100>] sysenter_do_call+0x12/0x32

 -> #2 (&ei->i_data_sem){++++..}:
        [<c017d04b>] __lock_acquire+0xd7b/0x1260
        [<c017d5ea>] lock_acquire+0xba/0xd0
        [<c0527191>] down_read+0x51/0x90
        [<c02a5787>] ext4_get_blocks+0x47/0x450
        [<c02a74c1>] ext4_getblk+0x61/0x1d0
        [<c02a7a7f>] ext4_bread+0x1f/0xa0
        [<c02bcddc>] ext4_quota_write+0x12c/0x310
        [<c0262d23>] qtree_write_dquot+0x93/0x120
        [<c0261708>] v2_write_dquot+0x28/0x30
        [<c025d3fb>] dquot_commit+0xab/0xf0
        [<c02be977>] ext4_write_dquot+0x77/0x90
        [<c02be9bf>] ext4_mark_dquot_dirty+0x2f/0x50
        [<c025e321>] dquot_alloc_inode+0x101/0x180
        [<c029fec2>] ext4_new_inode+0x602/0xf00
        [<c02ad789>] ext4_create+0x89/0x150
        [<c0221ff2>] vfs_create+0xa2/0xc0
        [<c02246e7>] do_filp_open+0x7a7/0x910
        [<c021244d>] do_sys_open+0x6d/0x140
        [<c021258e>] sys_open+0x2e/0x40
        [<c0103100>] sysenter_do_call+0x12/0x32

 -> #1 (&sb->s_type->i_mutex_key#7/4){+.+...}:
        [<c017d04b>] __lock_acquire+0xd7b/0x1260
        [<c017d5ea>] lock_acquire+0xba/0xd0
        [<c0526505>] mutex_lock_nested+0x65/0x2d0
        [<c0260c9d>] vfs_load_quota_inode+0x4bd/0x5a0
        [<c02610af>] vfs_quota_on_path+0x5f/0x70
        [<c02bc812>] ext4_quota_on+0x112/0x190
        [<c026345a>] sys_quotactl+0x44a/0x8a0
        [<c0103100>] sysenter_do_call+0x12/0x32

 -> #0 (&s->s_dquot.dqptr_sem){++++..}:
        [<c017d361>] __lock_acquire+0x1091/0x1260
        [<c017d5ea>] lock_acquire+0xba/0xd0
        [<c0527191>] down_read+0x51/0x90
        [<c025e73b>] dquot_claim_space+0x3b/0x1b0
        [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
        [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
        [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
        [<c02a5966>] ext4_get_blocks+0x226/0x450
        [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
        [<c02a6ed6>] ext4_da_writepages+0x506/0x790
        [<c01de272>] do_writepages+0x22/0x50
        [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
        [<c01d7b9b>] filemap_flush+0x2b/0x30
        [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
        [<c029e595>] ext4_release_file+0x75/0xb0
        [<c0216b59>] __fput+0xf9/0x210
        [<c0216c97>] fput+0x27/0x30
        [<c02122dc>] filp_close+0x4c/0x80
        [<c014510e>] put_files_struct+0x6e/0xd0
        [<c01451b7>] exit_files+0x47/0x60
        [<c0146a24>] do_exit+0x144/0x710
        [<c0147028>] do_group_exit+0x38/0xa0
        [<c0159abc>] get_signal_to_deliver+0x2ac/0x410
        [<c0102849>] do_notify_resume+0xb9/0x890
        [<c01032d2>] work_notifysig+0x13/0x21

 other info that might help us debug this:

 3 locks held by write-truncate-/3465:
  #0:  (jbd2_handle){+.+...}, at: [<c02e1f8f>] start_this_handle+0x38f/0x5c0
  #1:  (&ei->i_data_sem){++++..}, at: [<c02a57f6>] ext4_get_blocks+0xb6/0x450
  #2:  (&meta_group_info[i]->alloc_sem){++++..}, at: [<c02ce962>] ext4_mb_load_buddy+0xb2/0x370

 stack backtrace:
 Pid: 3465, comm: write-truncate- Not tainted 2.6.32-rc7 #18
 Call Trace:
  [<c0524cb3>] ? printk+0x1d/0x22
  [<c017ac9a>] print_circular_bug+0xca/0xd0
  [<c017d361>] __lock_acquire+0x1091/0x1260
  [<c016bca2>] ? sched_clock_local+0xd2/0x170
  [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
  [<c017d5ea>] lock_acquire+0xba/0xd0
  [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
  [<c0527191>] down_read+0x51/0x90
  [<c025e73b>] ? dquot_claim_space+0x3b/0x1b0
  [<c025e73b>] dquot_claim_space+0x3b/0x1b0
  [<c02cb95f>] ext4_mb_mark_diskspace_used+0x36f/0x380
  [<c02d210a>] ext4_mb_new_blocks+0x34a/0x530
  [<c02c601d>] ? ext4_ext_find_extent+0x25d/0x280
  [<c02c83fb>] ext4_ext_get_blocks+0x122b/0x13c0
  [<c016bca2>] ? sched_clock_local+0xd2/0x170
  [<c016be60>] ? sched_clock_cpu+0x120/0x160
  [<c016beef>] ? cpu_clock+0x4f/0x60
  [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
  [<c052712c>] ? down_write+0x8c/0xa0
  [<c02a5966>] ext4_get_blocks+0x226/0x450
  [<c016be60>] ? sched_clock_cpu+0x120/0x160
  [<c016beef>] ? cpu_clock+0x4f/0x60
  [<c017908b>] ? trace_hardirqs_off+0xb/0x10
  [<c02a5ff3>] mpage_da_map_blocks+0xc3/0xaa0
  [<c01d69cc>] ? find_get_pages_tag+0x16c/0x180
  [<c01d6860>] ? find_get_pages_tag+0x0/0x180
  [<c02a73bd>] ? __mpage_da_writepage+0x16d/0x1a0
  [<c01dfc4e>] ? pagevec_lookup_tag+0x2e/0x40
  [<c01ddf1b>] ? write_cache_pages+0xdb/0x3d0
  [<c02a7250>] ? __mpage_da_writepage+0x0/0x1a0
  [<c02a6ed6>] ext4_da_writepages+0x506/0x790
  [<c016beef>] ? cpu_clock+0x4f/0x60
  [<c016bca2>] ? sched_clock_local+0xd2/0x170
  [<c016be60>] ? sched_clock_cpu+0x120/0x160
  [<c016be60>] ? sched_clock_cpu+0x120/0x160
  [<c02a69d0>] ? ext4_da_writepages+0x0/0x790
  [<c01de272>] do_writepages+0x22/0x50
  [<c01d766d>] __filemap_fdatawrite_range+0x6d/0x80
  [<c01d7b9b>] filemap_flush+0x2b/0x30
  [<c02a40ac>] ext4_alloc_da_blocks+0x5c/0x60
  [<c029e595>] ext4_release_file+0x75/0xb0
  [<c0216b59>] __fput+0xf9/0x210
  [<c0216c97>] fput+0x27/0x30
  [<c02122dc>] filp_close+0x4c/0x80
  [<c014510e>] put_files_struct+0x6e/0xd0
  [<c01451b7>] exit_files+0x47/0x60
  [<c0146a24>] do_exit+0x144/0x710
  [<c017b163>] ? lock_release_holdtime+0x33/0x210
  [<c0528137>] ? _spin_unlock_irq+0x27/0x30
  [<c0147028>] do_group_exit+0x38/0xa0
  [<c017babb>] ? trace_hardirqs_on+0xb/0x10
  [<c0159abc>] get_signal_to_deliver+0x2ac/0x410
  [<c0102849>] do_notify_resume+0xb9/0x890
  [<c0178fd0>] ? trace_hardirqs_off_caller+0x20/0xd0
  [<c017b163>] ? lock_release_holdtime+0x33/0x210
  [<c0165b50>] ? autoremove_wake_function+0x0/0x50
  [<c017ba54>] ? trace_hardirqs_on_caller+0x134/0x190
  [<c017babb>] ? trace_hardirqs_on+0xb/0x10
  [<c0300ba4>] ? security_file_permission+0x14/0x20
  [<c0215761>] ? vfs_write+0x131/0x190
  [<c0214f50>] ? do_sync_write+0x0/0x120
  [<c0103115>] ? sysenter_do_call+0x27/0x32
  [<c01032d2>] work_notifysig+0x13/0x21

CC: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:44:12 +01:00
Dmitry Monakhov a9e7f44720 ext4: Convert to generic reserved quota's space management.
This patch also fixes write vs chown race condition.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:33:55 +01:00
Linus Torvalds b6e3224fb2 Revert "task_struct: make journal_info conditional"
This reverts commit e4c570c4cb, as
requested by Alexey:

 "I think I gave a good enough arguments to not merge it.
  To iterate:
   * patch makes impossible to start using ext3 on EXT3_FS=n kernels
     without reboot.
   * this is done only for one pointer on task_struct"

  None of config options which define task_struct are tristate directly
  or effectively."

Requested-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-17 13:23:24 -08:00
Christoph Hellwig 431547b3c4 sanitize xattr handler prototypes
Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods.  This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute.  With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.

Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.

[with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16 12:16:49 -05:00
André Goddard Rosa e7d2860b69 tree-wide: convert open calls to remove spaces to skip_spaces() lib function
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
   text    data     bss     dec     hex filename
  64688     584     592   65864   10148 (TOTALS-BEFORE)
  64641     584     592   65817   10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
    drivers/leds/led-class.c
    drivers/leds/ledtrig-timer.c
    drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str &&  isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:32 -08:00
Hiroshi Shimamoto e4c570c4cb task_struct: make journal_info conditional
journal_info in task_struct is used in journaling file system only.  So
introduce CONFIG_FS_JOURNAL_INFO and make it conditional.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:27 -08:00
Linus Torvalds d0316554d3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
  m68k: rename global variable vmalloc_end to m68k_vmalloc_end
  percpu: add missing per_cpu_ptr_to_phys() definition for UP
  percpu: Fix kdump failure if booted with percpu_alloc=page
  percpu: make misc percpu symbols unique
  percpu: make percpu symbols in ia64 unique
  percpu: make percpu symbols in powerpc unique
  percpu: make percpu symbols in x86 unique
  percpu: make percpu symbols in xen unique
  percpu: make percpu symbols in cpufreq unique
  percpu: make percpu symbols in oprofile unique
  percpu: make percpu symbols in tracer unique
  percpu: make percpu symbols under kernel/ and mm/ unique
  percpu: remove some sparse warnings
  percpu: make alloc_percpu() handle array types
  vmalloc: fix use of non-existent percpu variable in put_cpu_var()
  this_cpu: Use this_cpu_xx in trace_functions_graph.c
  this_cpu: Use this_cpu_xx for ftrace
  this_cpu: Use this_cpu_xx in nmi handling
  this_cpu: Use this_cpu operations in RCU
  this_cpu: Use this_cpu ops for VM statistics
  ...

Fix up trivial (famous last words) global per-cpu naming conflicts in
	arch/x86/kvm/svm.c
	mm/slab.c
2009-12-14 09:58:24 -08:00
Linus Torvalds 3126c136bc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
  ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
  ext3: Fix data / filesystem corruption when write fails to copy data
  ext4: Support for 64-bit quota format
  ext3: Support for vfsv1 quota format
  quota: Implement quota format with 64-bit space and inode limits
  quota: Move definition of QFMT_OCFS2 to linux/quota.h
  ext2: fix comment in ext2_find_entry about return values
  ext3: Unify log messages in ext3
  ext2: clear uptodate flag on super block I/O error
  ext2: Unify log messages in ext2
  ext3: make "norecovery" an alias for "noload"
  ext3: Don't update the superblock in ext3_statfs()
  ext3: journal all modifications in ext3_xattr_set_handle
  ext2: Explicitly assign values to on-disk enum of filetypes
  quota: Fix WARN_ON in lookup_one_len
  const: struct quota_format_ops
  ubifs: remove manual O_SYNC handling
  afs: remove manual O_SYNC handling
  kill wait_on_page_writeback_range
  vfs: Implement proper O_SYNC semantics
  ...
2009-12-11 15:31:13 -08:00
Linus Torvalds 4515c3069d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
  ext4: Do not override ext2 or ext3 if built they are built as modules
  jbd2: Export jbd2_log_start_commit to fix ext4 build
  ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
  ext4: Wait for proper transaction commit on fsync
  ext4: fix incorrect block reservation on quota transfer.
  ext4: quota macros cleanup
  ext4: ext4_get_reserved_space() must return bytes instead of blocks
  ext4: remove blocks from inode prealloc list on failure
  ext4: wait for log to commit when umounting
  ext4: Avoid data / filesystem corruption when write fails to copy data
  ext4: Use ext4 file system driver for ext2/ext3 file system mounts
  ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
  jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
  ext4: remove unused parameter wbc from __ext4_journalled_writepage()
  ext4: remove encountered_congestion trace
  ext4: move_extent_per_page() cleanup
  ext4: initialize moved_len before calling ext4_move_extents()
  ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
  ext4: use ext4_data_block_valid() in ext4_free_blocks()
  ...
2009-12-10 09:33:29 -08:00
Jan Kara 5a20bdfcdc ext4: Support for 64-bit quota format
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Theodore Ts'o fab3a549e2 ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
Fix the following potential circular locking dependency between
mm->mmap_sem and ei->i_data_sem:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.32-04115-gec044c5 #37
    -------------------------------------------------------
    ureadahead/1855 is trying to acquire lock:
     (&mm->mmap_sem){++++++}, at: [<ffffffff81107224>] might_fault+0x5c/0xac

    but task is already holding lock:
     (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&ei->i_data_sem){++++..}:
           [<ffffffff81099bfa>] __lock_acquire+0xb67/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81516633>] down_read+0x51/0x84
           [<ffffffff811a2414>] ext4_get_blocks+0x50/0x2a5
           [<ffffffff811a3453>] ext4_get_block+0xab/0xef
           [<ffffffff81154f39>] do_mpage_readpage+0x198/0x48d
           [<ffffffff81155360>] mpage_readpages+0xd0/0x114
           [<ffffffff811a104b>] ext4_readpages+0x1d/0x1f
           [<ffffffff810f8644>] __do_page_cache_readahead+0x12f/0x1bc
           [<ffffffff810f86f2>] ra_submit+0x21/0x25
           [<ffffffff810f0cfd>] filemap_fault+0x19f/0x32c
           [<ffffffff81107b97>] __do_fault+0x55/0x3a2
           [<ffffffff81109db0>] handle_mm_fault+0x327/0x734
           [<ffffffff8151aaa9>] do_page_fault+0x292/0x2aa
           [<ffffffff81518205>] page_fault+0x25/0x30
           [<ffffffff812a34d8>] clear_user+0x38/0x3c
           [<ffffffff81167e16>] padzero+0x20/0x31
           [<ffffffff81168b47>] load_elf_binary+0x8bc/0x17ed
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff81166d64>] load_script+0x1b8/0x1cc
           [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
           [<ffffffff8113255f>] do_execve+0x1ce/0x2cf
           [<ffffffff81027494>] sys_execve+0x43/0x5a
           [<ffffffff8102918a>] stub_execve+0x6a/0xc0

    -> #0 (&mm->mmap_sem){++++++}:
           [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81107251>] might_fault+0x89/0xac
           [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
           [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
           [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
           [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
           [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
           [<ffffffff811392ca>] sys_ioctl+0x56/0x79
           [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

    other info that might help us debug this:

    1 lock held by ureadahead/1855:
     #0:  (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159

    stack backtrace:
    Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
    Call Trace:
     [<ffffffff81098c70>] print_circular_bug+0xa8/0xb7
     [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
     [<ffffffff8102f229>] ? sched_clock+0x9/0xd
     [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81107251>] might_fault+0x89/0xac
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff81124b44>] ? __kmalloc+0x13b/0x18c
     [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
     [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
     [<ffffffff811bca0b>] ? ext4_ext_fiemap_cb+0x0/0x157
     [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
     [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
     [<ffffffff81107224>] ? might_fault+0x5c/0xac
     [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
     [<ffffffff8129f6d0>] ? __up_read+0x8d/0x95
     [<ffffffff81517fb5>] ? retint_swapgs+0x13/0x1b
     [<ffffffff811392ca>] sys_ioctl+0x56/0x79
     [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:30:02 -05:00
Theodore Ts'o a214238d3b ext4: Do not override ext2 or ext3 if built they are built as modules
The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
ext2 or ext3 file systems if the those file system drivers are
configured to be built as mdoules.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:09:58 -05:00
Jiri Kosina d014d04386 Merge branch 'for-next' into for-linus
Conflicts:

	kernel/irq/chip.c
2009-12-07 18:36:35 +01:00
Theodore Ts'o f8ec9d6837 ext4: Add new tracepoints to debug delayed allocation space functions
Add tracepoints for ext4_da_reserve_space(),
ext4_da_update_reserve_space(), and ext4_da_release_space().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 01:00:21 -05:00
Theodore Ts'o 1f2acb6017 ext4: Add block validity check when truncating indirect block mapped inodes
Add checks to ext4_free_branches() to make sure a block number found
in an indirect block are valid before trying to free it.  If a bad
block number is found, stop freeing the indirect block immediately,
since the file system is corrupt and we will need to run fsck anyway.
This also avoids spamming the logs, and specifically avoids
driver-level "attempt to access beyond end of device" errors obscure
what is really going on.

If you get *really*, *really*, *really* unlucky, without this patch, a
supposed indirect block containing garbage might contain a reference
to a primary block group descriptor, in which case
ext4_free_branches() could end up zero'ing out a block group
descriptor block, and if then one of the block bitmaps for a block
group described by that bg descriptor block is not in memory, and is
read in by ext4_read_block_bitmap().  This function calls
ext4_valid_block_bitmap(), which assumes that bg_inode_table() was
validated at mount time and hasn't been modified since.  Since this
assumption is no longer valid, it's possible for the value
(ext4_inode_table(sb, desc) - group_first_block) to go negative, which
will cause ext4_find_next_zero_bit() to trigger a kernel GPF.

Addresses-Google-Bug: #2220436

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-22 17:40:42 -05:00
Eric Sandeen 15121c18a2 ext4: Fix optional-arg mount options
We have 2 mount options, "barrier" and "auto_da_alloc" which may or
may not take a 1/0 argument.  This causes the ext4 superblock mount
code to subtract uninitialized pointers and pass the result to
kmalloc, which results in very noisy failures.

Per Ted's suggestion, initialize the args struct so that
we know whether match_token() found an argument for the
option, and skip match_int() if not.

Also, return error (0) from parse_options if we thought
we found an argument, but match_int() Fails.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 20:17:55 -05:00
Eric Sandeen a1de02dccf ext4: fix async i/o writes beyond 4GB to a sparse file
The "offset" member in ext4_io_end holds bytes, not blocks, so
ext4_lblk_t is wrong - and too small (u32).

This caused the async i/o writes to sparse files beyond 4GB to fail
when they wrapped around to 0.

Also fix up the type of arguments to ext4_convert_unwritten_extents(),
it gets ssize_t from ext4_end_aio_dio_nolock() and
ext4_ext_direct_IO().

Reported-by: Giel de Nijs <giel@vectorwise.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2010-02-04 23:58:38 -05:00
Akira Fujita 4a58579b9e ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
This patch fixes three problems in the handling of the
EXT4_IOC_MOVE_EXT ioctl:

1. In current EXT4_IOC_MOVE_EXT, there are read access mode checks for
original and donor files, but they allow the illegal write access to
donor file, since donor file is overwritten by original file data.  To
fix this problem, change access mode checks of original (r->r/w) and
donor (r->w) files.

2.  Disallow the use of donor files that have a setuid or setgid bits.

3.  Call mnt_want_write() and mnt_drop_write() before and after
ext4_move_extents() calling to get write access to a mount.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-06 23:38:31 -05:00
Jan Kara b436b9bef8 ext4: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 23:51:10 -05:00
Dmitry Monakhov 194074acac ext4: fix incorrect block reservation on quota transfer.
Inside ->setattr() call both ATTR_UID and ATTR_GID may be valid
This means that we may end-up with transferring all quotas. Add
we have to reserve QUOTA_DEL_BLOCKS for all quotas, as we do in
case of QUOTA_INIT_BLOCKS.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:42:28 -05:00
Dmitry Monakhov 5aca07eb7d ext4: quota macros cleanup
Currently all quota block reservation macros contains hard-coded "2"
aka MAXQUOTAS value. This is no good because in some places it is not
obvious to understand what does this digit represent. Let's introduce
new macro with self descriptive name.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:42:15 -05:00
Dmitry Monakhov 8aa6790f87 ext4: ext4_get_reserved_space() must return bytes instead of blocks
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:41:52 -05:00
Curt Wohlgemuth b844167edc ext4: remove blocks from inode prealloc list on failure
This fixes a leak of blocks in an inode prealloc list if device failures
cause ext4_mb_mark_diskspace_used() to fail.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 22:18:25 -05:00
Josef Bacik d4edac314e ext4: wait for log to commit when umounting
There is a potential race when a transaction is committing right when
the file system is being umounting.  This could reduce in a race
because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
before the commit code calls a callback so the mballoc code can
release freed blocks in the transaction, resulting in a panic trying
to access the freed s_group_info.

The fix is to wait for the transaction to finish committing before we
shutdown the multiblock allocator.  

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:48:58 -05:00
Jan Kara b9a4207d5e ext4: Avoid data / filesystem corruption when write fails to copy data
When ext4_write_begin fails after allocating some blocks or
generic_perform_write fails to copy data to write, we truncate blocks
already instantiated beyond i_size.  Although these blocks were never
inside i_size, we have to truncate the pagecache of these blocks so
that corresponding buffers get unmapped.  Otherwise subsequent
__block_prepare_write (called because we are retrying the write) will
find the buffers mapped, not call ->get_block, and thus the page will
be backed by already freed blocks leading to filesystem and data
corruption.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:24:33 -05:00
Theodore Ts'o 24b584240a ext4: Use ext4 file system driver for ext2/ext3 file system mounts
Add a new config option, CONFIG_EXT4_USE_FOR_EXT23 which if enabled,
will cause ext4 to be used for either ext2 or ext3 file system mounts
when ext2 or ext3 is not enabled in the configuration.

This allows minimalist kernel fanatics to drop to file system drivers
from their compiled kernel with out losing functionality.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-07 14:08:51 -05:00
Roel Kluin c09eef305d ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-07 10:38:16 -05:00
André Goddard Rosa af901ca181 tree-wide: fix assorted typos all over the place
That is "success", "unknown", "through", "performance", "[re|un]mapping"
, "access", "default", "reasonable", "[con]currently", "temperature"
, "channel", "[un]used", "application", "example","hierarchy", "therefore"
, "[over|under]flow", "contiguous", "threshold", "enough" and others.

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:55 +01:00
Uwe Kleine-König bf48aabb89 tree-wide: fix typos "offest" -> "offset"
This patch was generated by

	git grep -E -i -l 'offest' | xargs -r perl -p -i -e 's/offest/offset/'

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-04 15:39:50 +01:00
Wu Fengguang 3f0ca30985 ext4: remove unused parameter wbc from __ext4_journalled_writepage()
CC: Jan Kara <jack@suse.cz> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 11:15:44 -05:00
Akira Fujita ac48b0a1d0 ext4: move_extent_per_page() cleanup
Integrate duplicate lines (acquire/release semaphore and invalidate
extent cache in move_extent_per_page()) into mext_replace_branches(),
to reduce source and object code size.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:31:56 -05:00
Kazuya Mio 446aaa6e7e ext4: initialize moved_len before calling ext4_move_extents()
The move_extent.moved_len is used to pass back the number of exchanged
blocks count to user space.  Currently the caller must clear this
field; but we spend more code space checking for this requirement than
simply zeroing the field ourselves, so let's just make life easier for
everyone all around.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:28:48 -05:00
Akira Fujita 94d7c16cbb ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
At the beginning of ext4_move_extent(), we call
ext4_discard_preallocations() to discard inode PAs of orig and donor
inodes.  But in the following case, blocks can be double freed, so
move ext4_discard_preallocations() to the end of ext4_move_extents().

1. Discard inode PAs of orig and donor inodes with
   ext4_discard_preallocations() in ext4_move_extents().

   orig : [ DATA1 ]
   donor: [ DATA2 ]

2. While data blocks are exchanging between orig and donor inodes, new
   inode PAs is created to orig by other process's block allocation.
   (Since there are semaphore gaps in ext4_move_extents().)  And new
   inode PAs is used partially (2-1).

   2-1 Create new inode PAs to orig inode
   orig : [ DATA1 | used PA1 | free PA1 ]
   donor: [ DATA2 ]

3. Donor inode which has old orig inode's blocks is deleted after
   EXT4_IOC_MOVE_EXT finished (3-1, 3-2).  So the block bitmap
   corresponds to old orig inode's blocks are freed.

   3-1 After EXT4_IOC_MOVE_EXT finished
   orig : [ DATA2 |  free PA1 ]
   donor: [ DATA1 |  used PA1 ]

   3-2 Delete donor inode
   orig : [ DATA2 |  free PA1 ]
   donor: [ FREE SPACE(DATA1) | FREE SPACE(used PA1) ]

4. The double-free of blocks is occurred, when close() is called to
   orig inode.  Because ext4_discard_preallocations() for orig inode
   frees used PA1 and free PA1, though used PA1 is already freed in 3.

   4-1 Double-free of blocks is occurred
   orig : [ DATA2 |  FREE SPACE(free PA1) ]
   donor: [ FREE SPACE(DATA1) | DOUBLE FREE(used PA1) ]

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 10:19:57 -05:00
Theodore Ts'o 9084d47197 ext4: use ext4_data_block_valid() in ext4_free_blocks()
The block validity framework does a more comprehensive set of checks,
and it saves object code space to use the ext4_data_block_valid() than
the limited open-coded version that had been in ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:48:42 -05:00
Theodore Ts'o 1585d8d89a ext4: add check for wraparound in ext4_data_block_valid()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:48:34 -05:00
Theodore Ts'o e6362609b6 ext4: call ext4_forget() from ext4_free_blocks()
Add the facility for ext4_forget() to be called from
ext4_free_blocks().  This simplifies the code in a large number of
places, and centralizes most of the work of calling ext4_forget() into
a single place.

Also fix a bug in the extents migration code; it wasn't calling
ext4_forget() when releasing the indirect blocks during the
conversion.  As a result, if the system cashed during or shortly after
the extents migration, and the released indirect blocks get reused as
data blocks, the journal replay would corrupt the data blocks.  With
this new patch, fixing this bug was as simple as adding the
EXT4_FREE_BLOCKS_FORGET flags to the call to ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-11-23 07:17:05 -05:00
Theodore Ts'o 4433871130 ext4: fold ext4_free_blocks() and ext4_mb_free_blocks()
ext4_mb_free_blocks() is only called by ext4_free_blocks(), and the
latter function doesn't really do much.  So merge the two functions
together, such that ext4_free_blocks() is now found in
fs/ext4/mballoc.c.  This saves about 200 bytes of compiled text space.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 07:44:56 -05:00
Theodore Ts'o b7e57e7c2a ext4: fold ext4_journal_forget() into ext4_forget()
Convert the last two callers of ext4_journal_forget() to use
ext4_forget() instead, and then fold ext4_journal_forget() into
ext4_forget().  This reduces are code complexity and shortens our call
stack.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:13 -05:00
Theodore Ts'o e4684b3fbb ext4: fold ext4_journal_revoke() into ext4_forget()
The only caller of ext4_journal_revoke() is ext4_forget(), so we can
fold ext4_journal_revoke() into ext4_forget() to simplify the code and
shorten the call stack.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-24 11:05:59 -05:00
Theodore Ts'o d6797d14b1 ext4: move ext4_forget() to ext4_jbd2.c
The ext4_forget() function better belongs in ext4_jbd2.c.  This will
allow us to do some cleanup of the ext4_journal_revoke() and
ext4_journal_forget() functions, as well as giving us better error
reporting since we can report the caller of ext4_forget() when things
go wrong.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 20:52:12 -05:00
Eric Sandeen e3bb52ae2b ext4: make "norecovery" an alias for "noload"
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:28:50 -05:00
Eric Sandeen 5328e63531 ext4: make trim/discard optional (and off by default)
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues.  Make
this mount-time optional, and default it to off until we know
that things are working out OK.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:25:42 -05:00
Jan Kara 2bba702d4f ext4: fix error handling in ext4_ind_get_blocks()
When an error happened in ext4_splice_branch we failed to notice that
in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
by checking for error properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:48 -05:00
Theodore Ts'o 6b17d902fd ext4: avoid issuing unnecessary barriers
We don't to issue an I/O barrier on an error or if we force commit
because we are doing data journaling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2009-11-23 07:24:57 -05:00
Theodore Ts'o 1032988c71 ext4: fix block validity checks so they work correctly with meta_bg
The block validity checks used by ext4_data_block_valid() wasn't
correctly written to check file systems with the meta_bg feature.  Fix
this.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:29:56 -05:00
Theodore Ts'o 8dadb198cb ext4: fix uninit block bitmap initialization when s_meta_first_bg is non-zero
The number of old-style block group descriptor blocks is
s_meta_first_bg when the meta_bg feature flag is set.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:38 -05:00
Theodore Ts'o 3f8fb9490e ext4: don't update the superblock in ext4_statfs()
commit a71ce8c6c9 updated ext4_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:52 -05:00
Eric Sandeen 86ebfd08a1 ext4: journal all modifications in ext4_xattr_set_handle
ext4_xattr_set_handle() was zeroing out an inode outside
of journaling constraints; this is one of the accesses that
was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Reviewed-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:30:52 -05:00
Julia Lawall 30c6e07a92 ext4: fix i_flags access in ext4_da_writepages_trans_blocks()
We need to be testing the i_flags field in the ext4 specific portion
of the inode, instead of the (confusingly aliased) i_flags field in
the generic struct inode.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-15 15:30:58 -05:00
Theodore Ts'o 5068969686 ext4: make sure directory and symlink blocks are revoked
When an inode gets unlinked, the functions ext4_clear_blocks() and
ext4_remove_blocks() call ext4_forget() for all the buffer heads
corresponding to the deleted inode's data blocks.  If the inode is a
directory or a symlink, the is_metadata parameter must be non-zero so
ext4_forget() will revoke them via jbd2_journal_revoke().  Otherwise,
if these blocks are reused for a data file, and the system crashes
before a journal checkpoint, the journal replay could end up
corrupting these data blocks.

Thanks to Curt Wohlgemuth for pointing out potential problems in this
area.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:17:34 -05:00
Theodore Ts'o beac2da756 ext4: add tracepoint for ext4_forget()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:25:08 -05:00
Theodore Ts'o cf40db137c ext4: remove failed journal checksum check
Now that we are checking for failed journal checksums in the jbd2
layer, we don't need to check in the ext4 mount path --- since a
checksum fail will result in ext4_load_journal() returning an error,
causing the file system to refuse to be mounted until e2fsck can deal
with the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:01 -05:00
Theodore Ts'o 567f3e9a70 ext4: plug a buffer_head leak in an error path of ext4_iget()
One of the invalid error paths in ext4_iget() forgot to brelse() the
inode buffer head.  Fix it by adding a brelse() in the common error
return path, which also simplifies function.

Thanks to Andi Kleen <ak@linux.intel.com> reporting the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-14 08:19:05 -05:00
Akira Fujita 92c28159dc ext4: fix spelling typos in move_extent.c
Fix a few spelling typos in move_extent.c

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.co.jp>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:50 -05:00
Akira Fujita 49bd22bc4d ext4: fix possible recursive locking warning in EXT4_IOC_MOVE_EXT
If CONFIG_PROVE_LOCKING is enabled, the double_down_write_data_sem()
will trigger a false-positive warning of a recursive lock.  Since we
take i_data_sem for the two inodes ordered by their inode numbers,
this isn't a problem.  Use of down_write_nested() will notify the lock
dependency checker machinery that there is no problem here.

This problem was reported by Brian Rogers:

	http://marc.info/?l=linux-ext4&m=125115356928011&w=1

Reported-by: Brian Rogers <brian@xyzw.org>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:41 -05:00
Akira Fujita fc04cb49a8 ext4: fix lock order problem in ext4_move_extents()
ext4_move_extents() checks the logical block contiguousness
of original file with ext4_find_extent() and mext_next_extent().
Therefore the extent which ext4_ext_path structure indicates
must not be changed between above functions.

But in current implementation, there is no i_data_sem protection
between ext4_ext_find_extent() and mext_next_extent().  So the extent
which ext4_ext_path structure indicates may be overwritten by
delalloc.  As a result, ext4_move_extents() will exchange wrong blocks
between original and donor files.  I change the place where
acquire/release i_data_sem to solve this problem.

Moreover, I changed move_extent_per_page() to start transaction first,
and then acquire i_data_sem.  Without this change, there is a
possibility of the deadlock between mmap() and ext4_move_extents():

* NOTE: "A", "B" and "C" mean different processes

A-1: ext4_ext_move_extents() acquires i_data_sem of two inodes.

B:   do_page_fault() starts the transaction (T),
     and then tries to acquire i_data_sem.
     But process "A" is already holding it, so it is kept waiting.

C:   While "A" and "B" running, kjournald2 tries to commit transaction (T)
     but it is under updating, so kjournald2 waits for it.

A-2: Call ext4_journal_start with holding i_data_sem,
     but transaction (T) is locked.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:24:43 -05:00
Akira Fujita f868a48d06 ext4: fix the returned block count if EXT4_IOC_MOVE_EXT fails
If the EXT4_IOC_MOVE_EXT ioctl fails, the number of blocks that were
exchanged before the failure should be returned to the userspace
caller.  Unfortunately, currently if the block size is not the same as
the page size, the returned block count that is returned is the
page-aligned block count instead of the actual block count.  This
commit addresses this bug.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-23 07:25:48 -05:00
Theodore Ts'o 503358ae01 ext4: avoid divide by zero when trying to mount a corrupted file system
If s_log_groups_per_flex is greater than 31, then groups_per_flex will
will overflow and cause a divide by zero error.  This can cause kernel
BUG if such a file system is mounted.

Thanks to Nageswara R Sastry for analyzing the failure and providing
an initial patch.

http://bugzilla.kernel.org/show_bug.cgi?id=14287

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:46 -05:00
Theodore Ts'o 2de770a406 ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPC
Previously add_dirent_to_buf() did not free its passed-in buffer head
in the case of ENOSPC, since in some cases the caller still needed it.
However, this led to potential buffer head leaks since not all callers
dealt with this correctly.  Fix this by making simplifying the freeing
convention; now add_dirent_to_buf() *never* frees the passed-in buffer
head, and leaves that to the responsibility of its caller.  This makes
things cleaner and easier to prove that the code is neither leaking
buffer heads or calling brelse() one time too many.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: stable@kernel.org
2009-11-23 07:25:49 -05:00
Theodore Ts'o 1e424a3483 ext4: partial revert to fix double brelse WARNING()
This is a partial revert of commit 6487a9d (only the changes made to
fs/ext4/namei.c), since it is causing the following brelse()
double-free warning when running fsstress on a file system with 1k
blocksize and we run into a block allocation failure while converting
a single-block directory to a multi-block hash-tree indexed directory.

WARNING: at fs/buffer.c:1197 __brelse+0x2e/0x33()
Hardware name: 
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 2226, comm: jbd2/sdd-8 Not tainted 2.6.32-rc6-00577-g0003f55 #101
Call Trace:
 [<c01587fb>] warn_slowpath_common+0x65/0x95
 [<c0158869>] warn_slowpath_fmt+0x29/0x2c
 [<c021168e>] __brelse+0x2e/0x33
 [<c0288a9f>] jbd2_journal_refile_buffer+0x67/0x6c
 [<c028a9ed>] jbd2_journal_commit_transaction+0x319/0x14d8
 [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
 [<c0175bcc>] ? sched_clock_cpu+0x12a/0x13e
 [<c017f6b4>] ? trace_hardirqs_off+0xb/0xd
 [<c0175c1f>] ? cpu_clock+0x3f/0x5b
 [<c017f6ec>] ? lock_release_holdtime+0x36/0x137
 [<c0664ad0>] ? _spin_unlock_irqrestore+0x44/0x51
 [<c0180af3>] ? trace_hardirqs_on_caller+0x103/0x124
 [<c0180b1f>] ? trace_hardirqs_on+0xb/0xd
 [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
 [<c0290d1c>] kjournald2+0x11a/0x310
 [<c017118e>] ? autoremove_wake_function+0x0/0x38
 [<c0290c02>] ? kjournald2+0x0/0x310
 [<c0170ee6>] kthread+0x66/0x6b
 [<c0170e80>] ? kthread+0x0/0x6b
 [<c01251b3>] kernel_thread_helper+0x7/0x10
---[ end trace 5579351b86af61e3 ]---

Commit 6487a9d was an attempt some buffer head leaks in an ENOSPC
error path, but in some cases it actually results in an excess ENOSPC,
as shown above.  Fixing this means cleaning up who is responsible for
releasing the buffer heads from the callee to the caller of
add_dirent_to_buf().

Since that's a relatively complex change, and we're late in the rcX
development cycle, I'm reverting this now, and holding back a more
complete fix until after 2.6.32 ships.  We've lived with this
buffer_head leak on ENOSPC in ext3 and ext4 for a very long time; a
few more months won't kill us.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>
2009-11-08 15:45:44 -05:00
Mingming ba230c3f6d ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/O
To prepare for a direct I/O write, we need to split the unwritten
extents before submitting the I/O.  When no extents needed to be
split, ext4_split_unwritten_extents() was incorrectly returning 0
instead of the size of uninitialized extents. This bug caused the
wrong return value sent back to VFS code when it gets called from
async IO path, leading to an unnecessary fall back to buffered IO.

This bug also hid the fact that the check to see whether or not a
split would be necessary was incorrect; we can only skip splitting the
extent if the write completely covers the uninitialized extent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-06 04:01:23 -05:00
Mingming 4b70df1816 ext4: code clean up for dio fallocate handling
The ext4_debug() call in ext4_end_io_dio() should be moved after the
check to make sure that io_end is non-NULL.

The comment above ext4_get_block_dio_write() ("Maximum number of
blocks...") is a duplicate; the original and correct comment is above
the #define DIO_MAX_BLOCKS up above.

Based on review comments from Curt Wohlgemuth.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-03 14:44:54 -05:00
Mingming 5f5249507e ext4: skip conversion of uninit extents after direct IO if there isn't any
At the end of direct I/O operation, ext4_ext_direct_IO() always called
ext4_convert_unwritten_extents(), regardless of whether there were any
unwritten extents involved in the I/O or not.

This commit adds a state flag so that ext4_ext_direct_IO() only calls
ext4_convert_unwritten_extents() when necessary.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-10 10:48:04 -05:00
Mingming 109f556519 ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extents
After a direct I/O request covering an uninitalized extent (i.e.,
created using the fallocate system call) or a hole in a file, ext4
will convert the uninitialized extent so it is marked as initialized
by calling ext4_convert_unwritten_extents().  This function returns
zero on success.

This return value was getting returned by ext4_direct_IO(); however
the file system's direct_IO function is supposed to return the number
of bytes read or written on a success.  By returning zero, it confused
the direct I/O code into falling back to buffered I/O unnecessarily.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-10 10:48:08 -05:00
Aneesh Kumar K.V fa5d11133b ext4: discard preallocation when restarting a transaction during truncate
When restart a transaction during a truncate operation, we drop and
reacquire i_data_sem.  After reacquiring i_data_sem, we need to
discard any inode-based preallocation that might have been grabbed
while we released i_data_sem (for example, if pdflush is allocating
blocks and racing against the truncate).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-02 18:50:49 -05:00
Linus Torvalds d4da6c9ccf Revert "ext4: Remove journal_checksum mount option and enable it by default"
This reverts commit d0646f7b63, as
requested by Eric Sandeen.

It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.

Quoth Eric:

   "My hand-wavy hunch about what is happening is that we're finding a
    bad checksum on the last partially-written transaction, which is
    not surprising, but if we have a wrapped log and we're doing the
    initial scan for head/tail, and we abort scanning on that bad
    checksum, then we are essentially running an unrecovered filesystem.

    But that's hand-wavy and I need to go look at the code.

    We lived without journal checksums on by default until now, and at
    this point they're doing more harm than good, so we should revert
    the default-changing commit until we can fix it and do some good
    power-fail testing with the fixes in place."

See

	http://bugzilla.kernel.org/show_bug.cgi?id=14354

for all the gory details.

Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-02 10:15:27 -08:00
Linus Torvalds 9117703fab Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  [PATCH] ext4: retry failed direct IO allocations
  ext4: Fix build warning in ext4_dirty_inode()
  ext4: drop ext4dev compat
  ext4: fix a BUG_ON crash by checking that page has buffers attached to it
2009-10-03 11:24:19 -07:00
Christoph Lameter ca0c9584b1 this_cpu: Straight transformations
Use this_cpu_ptr and __this_cpu_ptr in locations where straight
transformations are possible because per_cpu_ptr is used with
either smp_processor_id() or raw_smp_processor_id().

cc: David Howells <dhowells@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
cc: Ingo Molnar <mingo@elte.hu>
cc: Rusty Russell <rusty@rustcorp.com.au>
cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2009-10-03 19:48:22 +09:00
Eric Sandeen fbbf694566 [PATCH] ext4: retry failed direct IO allocations
On a 256M filesystem, doing this in a loop:

        xfs_io -F -f -d -c 'pwrite 0 64m' test
        rm -f test

eventually leads to ENOSPC.  (the xfs_io command does a
64m direct IO write to the file "test")

As with other block allocation callers, it looks like we need to
potentially retry the allocations on the initial ENOSPC.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-10-02 21:20:55 -04:00
Curt Wohlgemuth 74072d0a63 ext4: Fix build warning in ext4_dirty_inode()
This fixes the following warning:

fs/ext4/inode.c: In function 'ext4_dirty_inode':
fs/ext4/inode.c:5615: warning: unused variable 'current_handle'

We remove the jbd_debug() statement which does use current_handle, as
it's not terribly important in the grand scheme of things.

Thanks to Stephen Rothwell for pointing this out.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-10-02 21:08:32 -04:00
Eric Sandeen f0e2dfa7f3 ext4: drop ext4dev compat
Kconfig & super.c promised it'd be gone by 2.6.31, so it's
about time to drop it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-10-01 02:21:07 -04:00
Theodore Ts'o 1f94533d9c ext4: fix a BUG_ON crash by checking that page has buffers attached to it
In ext4_num_dirty_pages() we were calling page_buffers() before
checking to see if the page actually had pages attached to it; this
would cause a BUG check crash in the inline function page_buffers().

Thanks to Markus Trippelsdorf for reporting this bug.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 22:57:41 -04:00
Linus Torvalds 9f44fdc518 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix time encoding with extra epoch bits
  ext4: Add a stub for mpage_da_data in the trace header
  jbd2: Use tracepoints for history file
  ext4: Use tracepoints for mb_history trace file
  ext4, jbd2: Drop unneeded printks at mount and unmount time
  ext4: Handle nested ext4_journal_start/stop calls without a journal
  ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
  ext4: Avoid updating the inode table bh twice in no journal mode
  ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
  ext4: async direct IO for holes and fallocate support
  ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
  ext4: Split uninitialized extents for direct I/O
  ext4: release reserved quota when block reservation for delalloc retry
  ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
  ext4: Fix hueristic which avoids group preallocation for closed files
  ext4: Use ext4_msg() for ext4_da_writepage() errors
  ext4: Update documentation about quota mount options
2009-09-30 09:32:30 -07:00
Theodore Ts'o c1fccc0696 ext4: Fix time encoding with extra epoch bits
"Looking at ext4.h, I think the setting of extra time fields forgets to
mask the epoch bits so the epoch part overwrites nsec part. The second
change is only for coherency (2 -> EXT4_EPOCH_BITS)."

Thanks to Damien Guibouret for pointing out this problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 01:13:55 -04:00
Theodore Ts'o 296c355cd6 ext4: Use tracepoints for mb_history trace file
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.  

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:42 -04:00
Theodore Ts'o 90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
Curt Wohlgemuth d3d1faf6a7 ext4: Handle nested ext4_journal_start/stop calls without a journal
This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 11:01:03 -04:00
Curt Wohlgemuth f3dc272fd5 ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
This patch a problem that ext4_dirty_inode() was not calling
ext4_mark_inode_dirty() if the current_handle is not valid, which it
is the case in no journal mode.

It also removes a test for non-matching transaction which can never
happen.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 16:06:01 -04:00
Frank Mayhar 830156c79b ext4: Avoid updating the inode table bh twice in no journal mode
This is a cleanup of commit 91ac6f4.  Since ext4_mark_inode_dirty()
has already called ext4_mark_iloc_dirty(), which in turn calls
ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
call ext4_do_update_inode() in no journal mode.  Indeed, it would be
duplicated work.

Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 10:07:47 -04:00
Theodore Ts'o f3ce8064b3 ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
Move the check to make sure the original and donor inodes are
different earlier, to avoid a potential deadlock by trying to lock the
same inode twice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:58:29 -04:00
Mingming Cao 8d5d02e6b1 ext4: async direct IO for holes and fallocate support
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.

But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.

Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue.  When fsync() is called, it will go
through the list and do the conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-28 15:48:29 -04:00
Mingming Cao 4c0425ff68 ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
Currently the DIO VFS code passes create = 0 when writing to the
middle of file.  It does this to avoid block allocation for holes, so
as not to expose stale data out when there is a parallel buffered read
(which does not hold the i_mutex lock).  Direct I/O writes into holes
falls back to buffered IO for this reason.

Since preallocated extents are treated as holes when doing a
get_block() look up (buffer is not mapped), direct IO over fallocate
also falls back to buffered IO.  Thus ext4 actually silently falls
back to buffered IO in above two cases, which is undesirable.

To fix this, this patch creates unitialized extents when a direct I/O
write into holes in sparse files, and registering an end_io callback which
converts the uninitialized extent to an initialized extent after the
I/O is completed.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:48:41 -04:00
Mingming Cao 0031462b5b ext4: Split uninitialized extents for direct I/O
When writing into an unitialized extent via direct I/O, and the direct
I/O doesn't exactly cover the unitialized extent, split the extent
into uninitialized and initialized extents before submitting the I/O.
This avoids needing to deal with an ENOSPC error in the end_io
callback that gets used for direct I/O.

When the IO is complete, the written extent will be marked as initialized.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:49:08 -04:00
Mingming Cao 9f0ccfd8e0 ext4: release reserved quota when block reservation for delalloc retry
ext4_da_reserve_space() can reserve quota blocks multiple times if
ext4_claim_free_blocks() fail and we retry the allocation. We should
release the quota reservation before restarting.

Bug found by Jan Kara.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:49:52 -04:00
Theodore Ts'o 55138e0bc2 ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small.  This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time.  So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode.  We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 13:31:31 -04:00
Theodore Ts'o 7178057730 ext4: Fix hueristic which avoids group preallocation for closed files
The hueristic was designed to avoid using locality group preallocation
when writing the last segment of a closed file.  Fix it by move
setting size to the maximum of size and isize until after we check
whether size == isize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 00:06:20 -04:00
Alexey Dobriyan f0f37e2f77 const: mark struct vm_struct_operations
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code

But leave TTM code alone, something is fishy there with global vm_ops
being used.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-27 11:39:25 -07:00
Theodore Ts'o 1693918e0b ext4: Use ext4_msg() for ext4_da_writepage() errors
This allows the user to see what filesystem was involved with a
particular ext4_da_writepage() error.  Also, use KERN_CRIT which is
more appropriate than KERN_EMERG.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-26 17:43:59 -04:00
Linus Torvalds db16826367 Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
  HWPOISON: Enable error_remove_page on btrfs
  HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  HWPOISON: Add madvise() based injector for hardware poisoned pages v4
  HWPOISON: Enable error_remove_page for NFS
  HWPOISON: Enable .remove_error_page for migration aware file systems
  HWPOISON: The high level memory error handler in the VM v7
  HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
  HWPOISON: shmem: call set_page_dirty() with locked page
  HWPOISON: Define a new error_remove_page address space op for async truncation
  HWPOISON: Add invalidate_inode_page
  HWPOISON: Refactor truncate to allow direct truncating of page v2
  HWPOISON: check and isolate corrupted free pages v2
  HWPOISON: Handle hardware poisoned pages in try_to_unmap
  HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  HWPOISON: Add poison check to page fault handling
  HWPOISON: Add basic support for poisoned pages in fault handler v3
  HWPOISON: Add new SIGBUS error codes for hardware poison signals
  HWPOISON: Add support for poison swap entries v2
  HWPOISON: Export some rmap vma locking to outside world
  ...
2009-09-24 07:53:22 -07:00
Linus Torvalds 342ff1a1b5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  trivial: fix typo in aic7xxx comment
  trivial: fix comment typo in drivers/ata/pata_hpt37x.c
  trivial: typo in kernel-parameters.txt
  trivial: fix typo in tracing documentation
  trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
  trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
  trivial: remove unnecessary semicolons
  trivial: Fix duplicated word "options" in comment
  trivial: kbuild: remove extraneous blank line after declaration of usage()
  trivial: improve help text for mm debug config options
  trivial: doc: hpfall: accept disk device to unload as argument
  trivial: doc: hpfall: reduce risk that hpfall can do harm
  trivial: SubmittingPatches: Fix reference to renumbered step
  trivial: fix typos "man[ae]g?ment" -> "management"
  trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
  trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
  trivial: fix missing printk space in amd_k7_smp_check
  trivial: fix typo s/ketymap/keymap/ in comment
  trivial: fix typo "to to" in multiple files
  trivial: fix typos in comments s/DGBU/DBGU/
  ...
2009-09-22 07:51:45 -07:00
Alexey Dobriyan 0d54b217a2 const: make struct super_block::s_qcop const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan 61e225dc34 const: make struct super_block::dq_op const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Anand Gadiyar fd589a8f0a trivial: fix typo "to to" in multiple files
Signed-off-by: Anand Gadiyar <gadiyar@ti.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-21 15:14:55 +02:00
Eric Sandeen 0a80e9867d ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
There's no reason to redefine the maximum allowable offset
in an extent-based file just for defrag; 
EXT_MAX_BLOCK already does this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 11:55:58 -04:00
Theodore Ts'o 5534fb5bb3 ext4: Fix the alloc on close after a truncate hueristic
In an attempt to avoid doing an unneeded flush after opening a
(previously non-existent) file with O_CREAT|O_TRUNC, the code only
triggered the hueristic if ei->disksize was non-zero.  Turns out that
the VFS doesn't call ->truncate() if the file doesn't exist, and
ei->disksize is always zero even if the file previously existed.  So
remove the test, since it isn't necessary and in fact disabled the
hueristic.

Thanks to Clemens Eisserer that he was seeing problems with files
written using kwrite and eclipse after sudden crashes caused by a
buggy Intel video driver.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 09:34:16 -04:00
Theodore Ts'o fb40ba0d98 ext4: Add a tracepoint for ext4_alloc_da_blocks()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 19:30:40 -04:00
Theodore Ts'o 1b9c12f44c ext4: store EXT4_EXT_MIGRATE in i_state instead of i_flags
EXT4_EXT_MIGRATE is only intended to be used for an in-memory flag,
and the hex value assigned to it collides with FS_DIRECTIO_FL (which
is also stored in i_flags).  There's no reason for the
EXT4_EXT_MIGRATE bit to be stored in i_flags, so we switch it to use
i_state instead.

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 08:32:22 -04:00
Eric Sandeen fb0a387dcd ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.

This should address RH Bug 519471, ext4 bitmap allocator 
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes.  Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:45:10 -04:00
Akira Fujita c40ce3c9ea ext4: Fix different block exchange issue in EXT4_IOC_MOVE_EXT
If logical block offset of original file which is passed to
EXT4_IOC_MOVE_EXT is different from donor file's,
a calculation error occurs in ext4_calc_swap_extents(),
therefore wrong block is exchanged between original file and donor file.
As a result, we hit ext4_error() in check_block_validity().
To detect the logical offset difference in EXT4_IOC_MOVE_EXT,
add checks to mext_calc_swap_extents() and handle it as error,
since data exchange must be done between the same blocks in EXT4_IOC_MOVE_EXT.

Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:25:39 -04:00
Akira Fujita 347fa6f1c7 ext4: Add null extent check to ext_get_path
There is the possibility that path structure which is taken
by ext4_ext_find_extent() indicates null extents.
Because during data block exchanging in ext4_move_extents(),
constitution of an extent tree may be changed.
As a solution, the patch adds null extent check
to ext_get_path().

Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:25:07 -04:00
Akira Fujita 2147b1a6a4 ext4: Replace BUG_ON() with ext4_error() in move_extents.c
Replace BUG_ON calls with a call to ext4_error()
to print an error message if EXT4_IOC_MOVE_EXT failed
with some kind of reasons.  This will help to debug.
Ted pointed this out, thanks.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 13:46:35 -04:00
Akira Fujita e8505970af ext4: Replace get_ext_path macro with an inline funciton
Replace get_ext_path macro with an inline function,
since this macro looks like a function call but its arguments
get modified. Ted pointed this out, thanks.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 13:46:38 -04:00
Andi Kleen aa261f549d HWPOISON: Enable .remove_error_page for migration aware file systems
Enable removing of corrupted pages through truncation
for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
These should cover most server needs.

I chose the set of migration aware file systems for this
for now, assuming they have been especially audited.
But in general it should be safe for all file systems
on the data area that support read/write and truncate.

Caveat: the hardware error handler does not take i_mutex
for now before calling the truncate function. Is that ok?

Cc: tytso@mit.edu
Cc: hch@infradead.org
Cc: mfasheh@suse.com
Cc: aia21@cantab.net
Cc: hugh.dickins@tiscali.co.uk
Cc: swhiteho@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2009-09-16 11:50:16 +02:00
Theodore Ts'o 3661d28615 ext4: Fix include/trace/events/ext4.h to work with Systemtap
Using relative pathnames in #include statements interacts badly with
SystemTap, since the fs/ext4/*.h header files are not packaged up as
part of a distribution kernel's header files.  Since systemtap doesn't
use TP_fast_assign(), we can use a blind structure definition and then
make sure the needed header files are defined before the ext4 source
files #include the trace/events/ext4.h header file.

https://bugzilla.redhat.com/show_bug.cgi?id=512478

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-14 22:59:50 -04:00
Jan Kara 0d34ec62e1 ext4: Remove syncing logic from ext4_file_write
The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.

CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:16 +02:00
Theodore Ts'o 7ad9bb651f ext4: Fix initalization of s_flex_groups
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg.  By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.  

The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.

Thanks to Damien Guibouret for pointing out this problem!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 16:51:28 -04:00
Andreas Schlick 1f7bebb9e9 ext4: Always set dx_node's fake_dirent explicitly.
When ext4_dx_add_entry() has to split an index node, it has to ensure that
name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck
won't recognise it as an intermediate htree node and consider the htree to
be corrupted.

Signed-off-by: Andreas Schlick <schlick@lavabit.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 23:16:07 -04:00
Theodore Ts'o 71290b368a ext4: Don't update superblock write time when filesystem is read-only
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 17:31:04 -04:00
Aneesh Kumar K.V 08c3a81338 ext4: Clarify the locking details in mballoc
We don't need to take the alloc_sem lock when we are adding new
groups, since mballoc won't see the new group added until we bump
sbi->s_groups_count.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2009-09-09 23:50:17 -04:00
Aneesh Kumar K.V f41c075053 ext4: check for need init flag in ext4_mb_load_buddy
We should check for need init flag with the group's alloc_sem held, to
make sure while we are loading the buddy cache and holding a reference
to it, a file system resize can't add new blocks to same group.

The patch also drops the need init flag check in
ext4_mb_regular_allocator() because doing the check without holding
alloc_sem is racy.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2009-09-09 23:34:50 -04:00
Aneesh Kumar K.V b6a758ec3a ext4: move ext4_mb_init_group() function earlier in the mballoc.c
This moves the function around so that it can be called from
ext4_mb_load_buddy().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 23:47:46 -04:00
Frank Mayhar 91ac6f4331 ext4: Make non-journal fsync work properly
Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode:  If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter.  If that parameter is nonzero _and_ we're
not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
instead of ext4_handle_dirty_metadata().

This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync().  It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 22:33:47 -04:00
Theodore Ts'o fe188c0e08 ext4: Assure that metadata blocks are written during fsync in no journal mode
When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list via mark_buffer_dirty_inode() so that
ext4_sync_file() --- which is called to service fsync() and
fdatasync() system calls --- can write out the inode's metadata blocks
by calling sync_mapping_buffers().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-12 13:41:55 -04:00
Theodore Ts'o c7acb4c166 ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}()
When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be dropped from the
current transaction and/or revoked.  This is done by calling the
functions ext4_journal_forget() and ext4_journal_revoke(), which call
jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.

Since the jbd2_journal_forget() and jbd2_journal_revoke() call
bforget(), if ext4 is not using a journal, ext4_journal_forget() and
ext4_journal_revoke() must call bforget() to avoid a dirty metadata
block overwriting a block after it has been reallocated and reused for
another inode's data block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 21:32:41 -04:00
Linus Torvalds 1d5ccd1c42 ext[234]: move over to 'check_acl' permission model
Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.

Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-08 11:09:04 -07:00
Theodore Ts'o 80e42468d6 ext4: print more sysadmin-friendly message in check_block_validity()
Drop the WARN_ON(1), as he stack trace is not appropriate, since it is
triggered by file system corruption, and it misleads users into
thinking there is a kernel bug.  In addition, change the message
displayed by ext4_error() to make it clear that this is a file system
corruption problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-08 08:21:26 -04:00
Aneesh Kumar K.V a827eaffff ext4: Take page lock before looking at attached buffer_heads flags
In order to check whether the buffer_heads are mapped we need to hold
page lock. Otherwise a reclaim can cleanup the attached buffer_heads.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-09 22:36:03 -04:00
Akira Fujita 44fc48f704 ext4: Fix small typo for move_extent_per_page()
This function means moving extents every page, so change its name from
move_exgtent_par_page().

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.co.jp>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 23:12:41 -04:00
Akira Fujita 8d6669133d ext4: Return exchanged blocks count to user space in failure
Return exchanged blocks count (moved_len) to user space,
if ext4_move_extents() failed on the way.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 22:46:29 -04:00
Akira Fujita daea696dba ext4: Remove unneeded BUG_ON() in ext4_move_extents()
The ext4_move_extents() functions checks with BUG_ON() whether the
exchanged blocks count accords with request blocks count.  But, if the
target range (orig_start + len) includes sparse block(s), 'moved_len'
(exchanged blocks count) does not agree with 'len' (request blocks
count), since sparse block is not counted in 'moved_len'.  This causes
us to hit the BUG_ON(), even though the function succeeded.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 22:11:55 -04:00
Akira Fujita 70d5d3dcea ext4: Fix wrong comparisons in mext_check_arguments()
The mext_check_arguments() function in move_extents.c has wrong
comparisons.  orig_start which is passed from user-space is block
unit, but i_size of inode is byte unit, therefore the checks do not
work fine.  This mis-check leads to the overflow of 'len' and then
hits BUG_ON() in ext4_move_extents().  The patch fixes this issue.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Greg Freemyer <greg.freemyer@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:28:22 -04:00
Christoph Hellwig 5f3481e9a8 ext4: fix cache flush in ext4_sync_file
We need to flush the write cache unconditionally in ->fsync, otherwise
writes into already allocated blocks can get lost.  Writes into fully
allocated files are very common when using disk images for
virtualization, and without this fix can easily lose data after
an fdatasync, which is the typical implementation for a cache flush on
the virtual drive.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 21:42:42 -04:00
Theodore Ts'o d0646f7b63 ext4: Remove journal_checksum mount option and enable it by default
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 12:50:43 -04:00
Tobias Klauser 7f1346a9de ext4: Declare seq_operations and file_operations structures as const
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 09:28:54 -04:00
Theodore Ts'o b3a3ca8ca0 ext4: Add new tracepoint: trace_ext4_da_write_pages()
Add a new tracepoint which shows the pages that will be written using
write_cache_pages() by ext4_da_writepages().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-31 23:13:11 -04:00
Theodore Ts'o de89de6e0c ext4: Restore wbc->range_start in ext4_da_writepages()
To solve a lock inversion problem, we implement part of the
range_cyclic algorithm in ext4_da_writepages().  (See commit 2acf2c26
for more details.)

As part of that change wbc->range_start was modified by ext4's
writepages function, which causes its callers to get confused since
they aren't expecting the filesystem to modify it.  The simplest fix
is to save and restore wbc->range_start in ext4_da_writepages.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-31 17:00:59 -04:00
Theodore Ts'o b05ab1dc37 ext4: Limit number of links that can be created by ext4_link()
In ext4_link we need to check using EXT4_LINK_MAX, and not
EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of
regular files, and not directories.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-29 21:08:08 -04:00
Aneesh Kumar K.V 2c94eb86c6 ext4: Allow rename to create more than EXT4_LINK_MAX subdirectories
Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new
parent directory without running into the EXT4_LINK_MAX limit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-28 21:43:15 -04:00
Theodore Ts'o 55ad63bf3a ext4: fix extent sanity checking code with AGGRESSIVE_TEST
The extents sanity-checking code depends on the ext4_ext_space_*()
functions returning the maximum alloable size for eh_max; however,
when the debugging #ifdef AGGRESSIVE_TEST is enabled to test the
extent tree handling code, this prevents a normally created ext4
filesystem from being mounted with the errors:

Aug 26 15:43:50 bsd086 kernel: [   96.070277] EXT4-fs error (device sda8): ext4_ext_check_inode: bad header/extent in inode #8: too large eh_max - magic f30a, entries 1, max 4(3), depth 0(0)
Aug 26 15:43:50 bsd086 kernel: [   96.070526] EXT4-fs (sda8): no journal found

Bug reported by Akira Fujita.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-28 10:40:33 -04:00
Eric Sandeen a36b44988c ext4: use ext4_grpblk_t more extensively
unsigned  short is potentially too small to track blocks within
a group; today it is safe due to restrictions in e2fsprogs but
we have _lo / _hi bits for group blocks with the intent to go
up to 32 bits, so clean this up now.

There are many more places where we use unsigned/int/unsigned int
to contain a group block but this should at least fix all the
short types.

I added a few comments to the struct ext4_group_info definition
as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-25 22:36:45 -04:00
Eric Sandeen 1927805e65 ext4: use variables not types in sizeofs() for allocations
Precursor to changing some types; to keep things in sync, it 
seems better to allocate/memset based on the size of the 
variables we are using rather than on some disconnected 
basic type like "unsigned short"

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2009-08-25 22:36:25 -04:00
Aneesh Kumar K.V a8526e84ac ext4: Add missing unlock_new_inode() call in extent migration code
We need to unlock the new inode before iput.  This patch fixes the
following warning when calling chattr +e to migrate a file to use
extents.  It also fixes problems in when e4defrag attempts to
defragment an inode.

[  470.400044] ------------[ cut here ]------------
[  470.400065] WARNING: at fs/inode.c:1210 generic_delete_inode+0x65/0x16a()
[  470.400072] Hardware name: N/A
.....
...
[  470.400353] Pid: 4451, comm: chattr Not tainted 2.6.31-rc7-red-debug #4
[  470.400359] Call Trace:
[  470.400372]  [<ffffffff81037771>] warn_slowpath_common+0x77/0x8f
[  470.400385]  [<ffffffff81037798>] warn_slowpath_null+0xf/0x11
[  470.400395]  [<ffffffff810b7f28>] generic_delete_inode+0x65/0x16a
[  470.400405]  [<ffffffff810b8044>] generic_drop_inode+0x17/0x1bd
[  470.400413]  [<ffffffff810b7083>] iput+0x61/0x65
[  470.400455]  [<ffffffffa003b229>] ext4_ext_migrate+0x5eb/0x66a [ext4]
[  470.400492]  [<ffffffffa002b1f8>] ext4_ioctl+0x340/0x756 [ext4]
[  470.400507]  [<ffffffff810b1a91>] vfs_ioctl+0x1d/0x82
[  470.400517]  [<ffffffff810b1ff0>] do_vfs_ioctl+0x483/0x4c9
[  470.400527]  [<ffffffff81059c30>] ? trace_hardirqs_on+0xd/0xf
[  470.400537]  [<ffffffff810b2087>] sys_ioctl+0x51/0x74
[  470.400549]  [<ffffffff8100ba6b>] system_call_fastpath+0x16/0x1b
[  470.400557] ---[ end trace ab85723542352dac ]---

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-25 22:36:05 -04:00
Eric Sandeen a13fb1a453 ext4: Add feature set check helper for mount & remount paths
A user reported that although his root ext4 filesystem was mounting
fine, other filesystems would not mount, with the:

"Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF"

error on his 32-bit box built without CONFIG_LBDAF.  This is because
the test at mount time for this situation was not being re-checked
on remount, and the normal boot process makes an ro->rw transition,
so this was being missed.

Refactor to make a common helper function to test the filesystem
features against the type of mount request (RO vs. RW) so that we 
stay consistent.

Addresses Red-Hat-Bugzilla: #517650

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-18 00:20:23 -04:00
Eric Sandeen 38877f4e8d simplify some logic in ext4_mb_normalize_request
While reading through some of the mballoc code it seems that a couple
spots in the size normalization function could be streamlined.

The test for non-overlapping PAs can be or'd for the start & end
conditions, and the tests for adjacent PAs can be else-if'd - 
it's essentially independently testing:

	if (A + B <= C)
		...
	if (A > C)
		...

These cannot both be true so it seems like the else-if might
be slightly more efficient and/or informative.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 23:55:24 -04:00
Eric Sandeen 0373130d5b ext4: open-code ext4_mb_update_group_info
ext4_mb_update_group_info is only called in one place, and it's
extremely simple.  There's no reason to have it in a separate function
in a separate file as far as I can tell, it just obfuscates what's
really going on.

Perhaps it was intended to keep the grp->bb_* manipulation local to
mballoc.c but we're already accessing other grp-> fields in balloc.c
directly so this seems ok.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 23:51:29 -04:00
Eric Sandeen bf43d84b18 ext4: reject too-large filesystems on 32-bit kernels
ext4 will happily mount a > 16T filesystem on a 32-bit box, but
this is not safe; writes to the block device will wrap past 16T
and the page cache can't index past 16T (232 index * 4k pages).

Adding another test to the existing "too many sectors" test
should do the trick.

Add a comment, a relevant return value, and fix the reference
to the CONFIG_LBD(AF) option as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 23:48:51 -04:00
Jan Kara 487caeef9f ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks()
During truncate we are sometimes forced to start a new transaction as
the amount of blocks to be journaled is both quite large and hard to
predict. So far we restarted a transaction while holding i_data_sem
and that violates lock ordering because i_data_sem ranks below a
transaction start (and it can lead to a real deadlock with
ext4_get_blocks() mapping blocks in some page while having a
transaction open).

We fix the problem by dropping the i_data_sem before restarting the
transaction and acquire it afterwards. It's slightly subtle that this
works:

1) By the time ext4_truncate() is called, all the page cache for the
truncated part of the file is dropped so get_block() should not be
called on it (we only have to invalidate extent cache after we
reacquire i_data_sem because some extent from not-truncated part could
extend also into the part we are going to truncate).

2) Writes, migrate or defrag hold i_mutex so they are stopped for all
the time of the truncate.

This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 22:17:20 -04:00
Mingming 553f900893 ext4: Show unwritten extent flag in ext4_ext_show_leaf()
ext4_ext_show_leaf() will display the leaf extents when extent
debugging is enabled.

Printing out the unwritten bit is useful for debugging unwritten
extent, allow us to see the unwritten extents vs written extents,
after the unwritten extents are splitted or converted.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-18 13:34:55 -04:00
Mingming 84fe3bef59 ext4: Compile warning fix when EXT_DEBUG enabled
When EXT_DEBUG is enabled I received the following compile warning on
PPC64:

  CC [M]  fs/ext4/inode.o
  CC [M]  fs/ext4/extents.o
fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
  CC [M]  fs/ext4/migrate.o

The patch fixes compile warning.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Index: linux-2.6.31-rc4/fs/ext4/extents.c
===================================================================
2009-09-01 08:44:37 -04:00
Theodore Ts'o 50797481a7 ext4: Avoid group preallocation for closed files
Currently the group preallocation code tries to find a large (512)
free block from which to do per-cpu group allocation for small files.
The problem with this scheme is that it leaves the filesystem horribly
fragmented.  In the worst case, if the filesystem is unmounted and
remounted (after a system shutdown, for example) we forget the fact
that wee were using a particular (now-partially filled) 512 block
extent.  So the next time we try to allocate space for a small file,
we will find *another* completely free 512 block chunk to allocate
small files.  Given that there are 32,768 blocks in a block group,
after 64 iterations of "mount, write one 4k file in a directory,
unmount", the block group will have 64 files, each separated by 511
blocks, and the block group will no longer have any free 512
completely free chunks of blocks for group preallocation space.

So if we try to allocate blocks for a file that has been closed, such
that we know the final size of the file, and the filesystem is not
busy, avoid using group preallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-18 13:34:02 -04:00
Theodore Ts'o 4ba74d00a2 ext4: Fix bugs in mballoc's stream allocation mode
The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
screwed up.  These fields were getting unconditionally all the time,
set even when stream allocation had not taken place, and if they were
being used when the file was smaller than s_mb_stream_request, which
is when the allocation should _not_ be doing stream allocation.

Fix this by determining whether or not we stream allocation should
take place once, in ext4_mb_group_or_file(), and setting a flag which
gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
This simplifies the code and assures that we are consistently using
(or not using) the stream allocation logic.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-09 22:01:13 -04:00
Theodore Ts'o 0ef90db93a ext4: Display the mballoc flags in mb_history in hex instead of decimal
Displaying the flags in base 16 makes it easier to see which flags
have been set.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-09 16:46:13 -04:00
Theodore Ts'o 6ba495e925 ext4: Add configurable run-time mballoc debugging
Allow mballoc debugging to be enabled at run-time instead of just at
compile time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-18 13:38:55 -04:00
Peng Tao 91cc219ad9 ext4: fix journal ref count in move_extent_par_page
move_extent_par_page calls a_ops->write_begin() to increase journal
handler's reference count. However, if either mext_replace_branches()
or ext4_get_block fails, the increased reference count isn't
decreased. This will cause a later attempt to umount of the fs to hang
forever. The patch addresses the issue by calling ext4_journal_stop()
if page is not NULL (which means a_ops->write_end() isn't invoked).

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-10 23:05:28 -04:00
Roel Kluin c333e073b7 ext4: remove redundant test on unsigned
unsigned i_block cannot be less than 0.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-10 22:47:22 -04:00
Peng Tao 785b4b3a5a ext4: fix build warning when EXT4FS_DEBUG is on
When compiling with EXT4FS_DEBUG on, gcc will complain with following warnings:

linux-2.6/fs/ext4/ialloc.c: In function ‘ext4_count_free_inodes’:
linux-2.6/fs/ext4/ialloc.c:1192: warning: format ‘%lu’ expects type
‘long unsigned int’, but argument 2 has type ‘ext4_group_t’

So add a type cast to suppress it. 

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 21:44:40 -04:00
Linus Torvalds 1cf29683f4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix race between write_metadata_buffer and get_write_access
  ext4: Fix ext4_mb_initialize_context() to initialize all fields
  ext4: fix null handler of ioctls in no journal mode
  ext4: Fix buffer head reference leak in no-journal mode
  ext4: Move __ext4_journalled_writepage() to avoid forward declaration
  ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
  ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
  ext4: Don't look at buffer_heads outside i_size.
  ext4: Fix goal inum check in the inode allocator
  ext4: fix no journal corruption with locale-gen
  ext4: Calculate required journal credits for inserting an extent properly
  ext4: Fix truncation of symlinks after failed write
  jbd2: Fix a race between checkpointing code and journal_get_write_access()
  ext4: Use rcu_barrier() on module unload.
  ext4: naturally align struct ext4_allocation_request
  ext4: mark several more functions in mballoc.c as noinline
  ext4: Fix potential reclaim deadlock when truncating partial block
  jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
  ext4: Fix type warning on 64-bit platforms in tracing events header
2009-07-13 16:39:25 -07:00
Theodore Ts'o 833576b362 ext4: Fix ext4_mb_initialize_context() to initialize all fields
Pavel Roskin pointed out that kmemcheck indicated that
ext4_mb_store_history() was accessing uninitialized values of
ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc
history.  Fix this by initializing the entire structure to all zeros
first.

Also, two fields were getting doubly initialized by the caller of
ext4_mb_initialize_context, so remove them for efficiency's sake.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:45:52 -04:00
Peng Tao ac046f1d61 ext4: fix null handler of ioctls in no journal mode
The EXT4_IOC_GROUP_ADD and EXT4_IOC_GROUP_EXTEND ioctls should not
flush the journal in no_journal mode.  Otherwise, running resize2fs on
a mounted no_journal partition triggers the following error messages:

BUG: unable to handle kernel NULL pointer dereference at 00000014
IP: [<c039d282>] _spin_lock+0x8/0x19
*pde = 00000000 
Oops: 0002 [#1] SMP

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:30:17 -04:00
Curt Wohlgemuth e6b5d30104 ext4: Fix buffer head reference leak in no-journal mode
We found a problem with buffer head reference leaks when using an ext4
partition without a journal.  In particular, calls to ext4_forget() would
not to a brelse() on the input buffer head, which will cause pages they
belong to to not be reclaimable.

Further investigation showed that all places where ext4_journal_forget() and
ext4_journal_revoke() are called are subject to the same problem.  The patch
below changes __ext4_journal_forget/__ext4_journal_revoke to do an explicit
release of the buffer head when the journal handle isn't valid.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:07:20 -04:00
Alexey Dobriyan 405f55712d headers: smp_lock.h redux
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:22:34 -07:00
Akira Fujita 1c71850517 ext4: Fix compile warnings with MB_DEBUG
When MB_DEBUG is enabled, we get some compile warnings because
ext4_group_t is unsigned int.  This patch fixes them.

Signed-off-by Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:04:36 -04:00
Joe Perches 5a4a798937 ext4: Remove unnecessary semicolons in mballoc.c
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:33:08 -04:00
Curt Wohlgemuth 6487a9d3b5 ext4: More buffer head reference leaks
After the patch I posted last week regarding buffer head ref leaks in
no-journal mode, I looked at all the code that uses buffer heads and
searched for more potential leaks.

The patch below fixes the issues I found; these can occur even when a
journal is present.

The change to inode.c fixes a double release if
ext4_journal_get_create_access() fails.

The changes to namei.c are more complicated.  add_dirent_to_buf() will
release the input buffer head EXCEPT when it returns -ENOSPC.  There are
some callers of this routine that don't always do the brelse() in the event
that -ENOSPC is returned.  Unfortunately, to put this fix into ext4_add_entry()
required capturing the return value of make_indexed_dir() and
add_dirent_to_buf().

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 10:54:08 -04:00
Theodore Ts'o 78f1ddbb49 ext4: Avoid null pointer dereference when decoding EROFS w/o a journal
We need to check to make sure a journal is present before checking the
journal flags in ext4_decode_error().

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 23:09:47 -04:00
Manish Katiyar 43b3852029 ext4: Fix typo in ext4/Kconfig
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 21:38:17 -04:00
Aneesh Kumar K.V 024eab4d5b ext4: Fix memory leak fix when mounting an ext4 filesystem
The allocation of the ext4_group_info array was moved to a new
function ext4_mb_add_group_info() in commit 5f21b0e6 so that online
resize would use a common (and correct) codepath.  Unfortunately, the
call to the new ext4_mb_add_group_info() function was added without
removing the code which originally allocated the array.  This caused a
memory leak each time an ext4 filesystem was mounted.

The fix is simple; remove the code that did the original allocation,
since it is no longer needed.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 09:01:04 -04:00
Al Viro 073aaa1b14 helpers for acl caching + switch to those
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).

ubifs/xattr.c needed includes reordered, the rest is a plain switchover.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:07 -04:00
Al Viro d4bfe2f76d switch ext4 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Linus Torvalds 31583d6acf Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix kernel-doc parameter name typo in blk-settings.c:
  block: rename CONFIG_LBD to CONFIG_LBDAF
  block: Fix bounce_pfn setting
  hd: stop defining MAJOR_NR
2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz 90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Theodore Ts'o 210ad6aedb ext4: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
to do POSIX ACL checks on any files not owned by the caller, and it does
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful
about it, especially with locking. That's doubly sad, since the common
case tends to be that there are no ACL's associated with the files in
question.

ext4 already caches the ACL data so that it doesn't have to look it up
over and over again, but it does so by taking the inode->i_lock spinlock
on every lookup. Which is a noticeable overhead even if it's a private
lock, especially on CPU's where the serialization is expensive (eg Intel
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is
unnecessary. Even if somebody else were to be changing the ACL's on
another CPU, we simply don't care - if we've seen a NULL ACL, we might as
well use it.

So just load the ACL speculatively without any locking, and if it was
NULL, just use it. If it's non-NULL (either because we had a cached
entry, or because the cache hasn't been filled in at all), it means that
we'll need to get the lock and re-load it properly.

(This commit was ported from a patch originally authored by Linus for
ext3.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:35 -04:00
Theodore Ts'o 4159175058 ext4: Don't update ctime for non-extent-mapped inodes
The VFS handles updating ctime, so we don't need to update the inode's
ctime in ext4_splace_branch() to update the direct or indirect blocks.
This was harmless when we did this in ext3, but in ext4, thanks to
delayed allocation, updating the ctime in ext4_splice_branch() can
cause the ctime to mysteriously jump when the blocks are finally
allocated.

Thanks to Björn Steinbrink for pointing out this problem on the git
mailing list.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-15 03:41:23 -04:00
Aneesh Kumar K.V 62e086be5d ext4: Move __ext4_journalled_writepage() to avoid forward declaration
In addition, fix two unused variable warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:59:34 -04:00
Aneesh Kumar K.V 43ce1d23b4 ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
This patch fixes the mmap/truncate race that was fixed for delayed
allocation by merging ext4_{journalled,normal,da}_writepage() into
ext4_writepage().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:58:45 -04:00
Aneesh Kumar K.V c364b22c95 ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
It is possible to see buffer_heads which are not mapped in the
writepage callback in the following scneario (where the fs blocksize
is 1k and the page size is 4k):

1) truncate(f, 1024)
2) mmap(f, 0, 4096)
3) a[0] = 'a'
4) truncate(f, 4096)
5) writepage(...)

Now if we get a writepage callback immediately after (4) and before an
attempt to write at any other offset via mmap address (which implies we
are yet to get a pagefault and do a get_block) what we would have is the
page which is dirty have first block allocated and the other three
buffer_heads unmapped.

In the above case the writepage should go ahead and try to write the
first blocks and clear the page_dirty flag. Further attempts to write
to the page will again create a fault and result in allocating blocks
and marking page dirty.  If we don't write any other offset via mmap
address we would still have written the first block to the disk and
rest of the space will be considered as a hole.

So to address this, we change all of the places where we look for
delayed, unmapped, or unwritten buffer heads, and only check for
delayed or unwritten buffer heads instead.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:57:10 -04:00
Theodore Ts'o de9a55b841 ext4: Fix up whitespace issues in fs/ext4/inode.c
This is a pure cleanup patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:45:34 -04:00
Theodore Ts'o 0610b6e999 ext4: Fix 64-bit block type problem on 32-bit platforms
The function ext4_mb_free_blocks() was using an "unsigned long" to
pass a block number; this will cause 64-bit block numbers to get
truncated on x86 and other 32-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2009-06-15 03:45:05 -04:00
Andreas Dilger 11013911da ext4: teach the inode allocator to use a goal inode number
Enhance the inode allocator to take a goal inode number as a
paremeter; if it is specified, it takes precedence over Orlov or
parent directory inode allocation algorithms.

The extents migration function uses the goal inode number so that the
extent trees allocated the migration function use the correct flex_bg.
In the future, the goal inode functionality will also be used to
allocate an adjacent inode for the extended attributes.

Also, for testing purposes the goal inode number can be specified via
/sys/fs/{dev}/inode_goal.  This can be useful for testing inode
allocation beyond 2^32 blocks on very large filesystems.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:45:35 -04:00
Theodore Ts'o f157a4aa98 ext4: Use a hash of the topdir directory name for the Orlov parent group
Instead of using a random number to determine the goal parent grop for
the Orlov top directories, use a hash of the directory name.  This
allows for repeatable results when trying to benchmark filesystem
layout algorithms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:09:42 -04:00
Theodore Ts'o 4ab2f15b7f ext4: move the abort flag from s_mount_opts to s_mount_flags
We're running out of space in the mount options word, and
EXT4_MOUNT_ABORT isn't really a mount option, but a run-time flag.  So
move it to become EXT4_MF_FS_ABORTED in s_mount_flags.

Also remove bogus ext2_fs.h / ext4.h simultaneous #include protection,
which can never happen.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:36 -04:00
Theodore Ts'o bc0b0d6d69 ext4: update the s_last_mounted field in the superblock
This field can be very helpful when a system administrator is trying
to sort through large numbers of block devices or filesystem images.
What is stored in this field can be ambiguous if multiple filesystem
namespaces are in play; what we store in practice is the mountpoint
interpreted by the process's namespace which first opens a file in the
filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:48 -04:00
Theodore Ts'o 7f4520cc62 ext4: change s_mount_opt to be an unsigned int
We can only fit 32 options in s_mount_opt because an unsigned long is
32-bits on a x86 machine.  So use an unsigned int to save space on
64-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:41 -04:00
Akira Fujita 748de6736c ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
The EXT4_IOC_MOVE_EXT exchanges the blocks between orig_fd and donor_fd,
and then write the file data of orig_fd to donor_fd.
ext4_mext_move_extent() is the main fucntion of ext4 online defrag,
and this patch includes all functions related to ext4 online defrag.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 19:24:03 -04:00
Alessio Igor Bogani 337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Christoph Hellwig ebc1ac1645 ->write_super lock_super pushdown
Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

 * bfs, nilfs2 - no other uses of s_lock and have internal locks in
	->write_super
 * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
 * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
 	->write_super
 * xfs - no other uses of s_lock and uses internal lock (buffer lock on
	superblock buffer) to serialize ->write_super.  Also xfs_fs_write_super
	is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Al Viro bbd6851a32 Push lock_super() into the ->remount_fs() of filesystems that care about it
Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own.  Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Christoph Hellwig 6cfd014842 push BKL down into ->put_super
Move BKL into ->put_super from the only caller.  A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment.  Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro a9e220f832 No need to do lock_super() for exclusion in generic_shutdown_super()
We can't run into contention on it.  All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held.  So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it.  The rest of filesystems do not care at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Christoph Hellwig 8c85e12512 remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.

Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.

Exceptions:

 - affs already has identical copy & pasted code at the beginning of
   affs_put_super so no need to do it twice.
 - xfs does the right thing without it and I have changes pending for
   the xfs tree touching this are so I don't really need conflicts
   here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Linus Torvalds c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Aneesh Kumar K.V a41f207169 ext4: Avoid corrupting the uninitialized bit in the extent during truncate
The unitialized bit was not properly getting preserved in in an extent
which is partially truncated because the it was geting set to the
value of the first extent to be removed or truncated as part of the
truncate operation, and if there are multiple extents are getting
removed or modified as part of the truncate operation, it is only the
last extent which will might be partially truncated, and its
uninitalized bit is not necessarily the same as the first extent to be
truncated.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-10 14:22:55 -04:00
Theodore Ts'o 0eab928221 ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's
no need to treat this as a true file truncation, so we shouldn't
activate the replace-via-truncate hueristic.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 09:54:40 -04:00
Toshiyuki Okajima 9aee228607 ext4: fix dx_map_entry to support 256k directory blocks
The dx_map_entry structure doesn't support over 64KB block size by
current usage of its member("offs"). Because "offs" treats an offset
of copies of the ext4_dir_entry_2 structure as is. This member size is
16 bits. But real offset for over 64KB(256KB) block size needs 18
bits. However, real offset keeps 4 byte boundary, so lower 2 bits is
not used.

Therefore, we do the following to fix this limitation:
For "store": 
	we divide the real offset by 4 and then store this result to "offs" 
	member.
For "use":
	we multiply "offs" member by 4 and then use this result 
	as real offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-08 12:41:35 -04:00
Aneesh Kumar K.V f8514083cd ext4: truncate the file properly if we fail to copy data from userspace
In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size.  We should truncate the file in the above
case so that we don't have blocks allocated outside inode->i_size.  Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the
truncate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 00:56:49 -04:00
Aneesh Kumar K.V 1938a150c2 ext4: Avoid leaking blocks after a block allocation failure
We should add inode to the orphan list in the same transaction
as block allocation.  This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 01:00:26 -04:00
Eric Sandeen b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Jan Kara 03f5d8bcf0 ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
seems to be a relict from some old days and setting disksize in this
function does not make much sense.  Currently it was set only by
ext4_getblk().  Since the parameter has some effect only if create ==
1, it is easy to check by grepping through the sources that the three
callers which end up calling ext4_getblk() with create == 1
(ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 00:17:05 -04:00
Aneesh Kumar K.V b767e78a17 ext4: Don't look at buffer_heads outside i_size.
Buffer heads outside i_size will be unmapped. So when we
are doing "walk_page_buffers" limit ourself to i_size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
----
2009-06-04 08:06:06 -04:00
Johann Lombardi e6462869e4 ext4: Fix goal inum check in the inode allocator
The goal inode is specificed by inode number which belongs
to [1; s_inodes_count].

Signed-off-by: Johann Lombardi <johann@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:45:11 -04:00
Theodore Ts'o 5adfee9c17 ext4: fix no journal corruption with locale-gen
If there is no journal, ext4_should_writeback_data() should return
TRUE.  This will fix ext4_set_aops() to set ext4_da_ops in the case of
delayed allocation; otherwise ext4_journaled_aops gets used by
default, which doesn't handle delayed allocation properly.

The advantage of using ext4_should_writeback_data() approach is that
it should handle nobh better as well.

Thanks to Curt Wohlgemuth for investigating this problem, and Aneesh
Kumar for suggesting this approach.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-08 17:11:24 -04:00
Aneesh Kumar K.V 5887e98b60 ext4: Calculate required journal credits for inserting an extent properly
When we have space in the extent tree leaf node we should be able to
insert the extent with much less journal credits. The code was doing
proper calculation but missed a return statement.

Reported-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:12:04 -04:00
Jan Kara ffacfa7a79 ext4: Fix truncation of symlinks after failed write
Contents of long symlinks is written via standard write methods. So
when the write fails, we add inode to orphan list. But symlinks don't
have .truncate method defined so nobody properly removes them from the
on disk orphan list.

Fix this by calling ext4_truncate() directly instead of calling
vmtruncate() (which is saner anyway since we don't need anything
vmtruncate() does except from calling .truncate in these paths).  We
also add inode to orphan list only if ext4_can_truncate() is true
(currently, it can be false for symlinks when there are no blocks
allocated) - otherwise orphan list processing will complain and
ext4_truncate() will not remove inode from on-disk orphan list.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 16:22:22 -04:00
Jesper Dangaard Brouer 3e03f9ca6a ext4: Use rcu_barrier() on module unload.
The ext4 module uses rcu_call() thus it should use rcu_barrier()on
module unload.

The kmem cache ext4_pspace_cachep is sometimes free'ed using
call_rcu() callbacks.  Thus, we must wait for completion of call_rcu()
before doing kmem_cache_destroy().

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:29:27 -04:00
Eric Sandeen 726447d803 ext4: naturally align struct ext4_allocation_request
As Ted noted, the ext4_allocation_request isn't well aligned.  Looking
at it with pahole we're wasting space on 64-bit arches:

struct ext4_allocation_request {
        struct inode *             inode;              /*     0     8 */
        ext4_lblk_t                logical;            /*     8     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               goal;               /*    16     8 */
        ext4_lblk_t                lleft;              /*    24     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               pleft;              /*    32     8 */
        ext4_lblk_t                lright;             /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               pright;             /*    48     8 */
        unsigned int               len;                /*    56     4 */
        unsigned int               flags;              /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 9 */
        /* sum members: 52, holes: 3, sum holes: 12 */
};

Grouping 32-bit members together closes these holes and shrinks the
structure by 12 bytes. which is important since ext4 can get on the
hairy edge of stack overruns.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 10:24:17 -04:00