Commit Graph

13766 Commits

Author SHA1 Message Date
Alan Cox d29a2e9438 vfat: Note the NLS requirement
Close bug #4754. Stop people getting into a situation where they can't
get their FAT filesystems to mount as they expect.

Signed-off-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-17 09:32:11 -07:00
Randy Dunlap b80901bbf5 splice: fix new kernel-doc warnings
splice: fix kernel-doc warnings

  Warning(fs/splice.c:617): bad line:
  Warning(fs/splice.c:722): No description found for parameter 'sd'
  Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-17 07:38:07 -07:00
Jeff Layton 22c9d52bc0 cifs: remove unneeded bcc_ptr update in CIFSTCon
This pointer isn't used again after this point. It's also not updated in
the ascii case, so there's no need to update it here.

Pointed-out-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton 313fecfa69 cifs: add cFYI messages with some of the saved strings from ssetup/tcon
...to make it easier to find problems in this area in the future.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton f083def68f cifs: fix buffer size for tcon->nativeFileSystem field
The buffer for this was resized recently to fix a bug. It's still
possible however that a malicious server could overflow this field
by sending characters in it that are >2 bytes in the local charset.
Double the size of the buffer to account for this possibility.

Also get rid of some really strange and seemingly pointless NULL
termination. It's NULL terminating the string in the source buffer,
but by the time that happens, we've already copied the string.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Jeff Layton 27b87fe52b cifs: fix unicode string area word alignment in session setup
The handling of unicode string area alignment is wrong.
decode_unicode_ssetup improperly assumes that it will always be preceded
by a pad byte. This isn't the case if the string area is already
word-aligned.

This problem, combined with the bad buffer sizing for the serverDomain
string can cause memory corruption. The bad alignment can make it so
that the alignment of the characters is off. This can make them
translate to characters that are greater than 2 bytes each. If this
happens we can overflow the allocation.

Fix this by fixing the alignment in CIFS_SessSetup instead so we can
verify it against the head of the response. Also, clean up the
workaround for improperly terminated strings by checking for a
odd-length unicode buffers and then forcibly terminating them.

Finally, resize the buffer for serverDomain. Now that we've fixed
the alignment, it's probably fine, but a malicious server could
overflow it.

A better solution for handling these strings is still needed, but
this should be a suitable bandaid.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Steve French 88dd47fff4 [CIFS] Fix build break caused by change to new current_umask helper function
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:50 +00:00
Steve French bc8cd4390c [CIFS] Fix sparse warnings
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French a6ce4932fb [CIFS] Add support for posix open during lookup
This patch by utilizing lookup intents, and thus removing a network
roundtrip in the open path, improves performance dramatically on
open (30% or more) to Samba and other servers which support the
cifs posix extensions

Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton d9fb5c091b cifs: no need to use rcu_assign_pointer on immutable keys
cifs: no need to use rcu_assign_pointer on immutable keys

Neither keytype in use by CIFS has an "update" method. This means that
the keys are immutable once instantiated. We don't need to use RCU
to set the payload data pointers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton 5144ebf408 cifs: remove dnotify thread code
cifs: remove dnotify thread code

Al Viro recently removed the dir_notify code from the kernel along with
the CIFS code that used it. We can also get rid of the dnotify thread
as well.

In actuality, it never had anything to do with dir_notify anyway. All
it did was unnecessarily wake up all the tasks waiting on the response
queues every 15s. Previously that happened to prevent tasks from hanging
indefinitely when the server went unresponsive, but we put those to
sleep with proper timeouts now so there's no reason to keep this around.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French 2d6d589d80 [CIFS] remove some build warnings
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Jeff Layton fbec9ab952 cifs: vary timeout on writes past EOF based on offset (try #5)
This is the fourth version of this patch:

The first three generated a compiler warning asking for explicit curly
braces.

The first two didn't handle update the size correctly when writes that
didn't start at the eof were done.

The first patch also didn't update the size correctly when it explicitly
set via truncate().

This patch adds code to track the client's current understanding of the
size of the file on the server separate from the i_size, and then to use
this info to semi-intelligently set the timeout for writes past the EOF.

This helps prevent timeouts when trying to write large, sparse files on
windows servers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:49 +00:00
Steve French d036f50fc2 [CIFS] Fix build break from recent DFS patch when DFS support not enabled
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Igor Mammedov 1bfe73c258 Remote DFS root support.
Allows to mount share on a server that returns -EREMOTE
 at the tree connect stage or at the check on a full path
 accessibility.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Steve French 85a6dac54a [CIFS] Endian convert UniqueId when reporting inode numbers from server files
Jeff made a good point that we should endian convert the UniqueId when we use
it to set i_ino Even though this value is opaque to the client, when comparing
the inode numbers of the same server file from two different clients (one
big endian, one little endian) or when we compare a big endian client's view
of i_ino with what the server thinks - we should get the same value

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Wei Yongjun 74496d365a cifs: remove some pointless conditionals before kfree()
Remove some pointless conditionals before kfree().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
Jeff Layton 0f4d634c59 cifs: flush data on any setattr
We already flush all the dirty pages for an inode before doing
ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
we change the mode so that the file becomes read-only then we may not
be able to write data to it after a reconnect.

Fix this by just going back to flushing all the dirty data on any
setattr call. There are probably some cases that can be optimized out,
but I'm not sure they're worthwhile and we need to consider them more
carefully to make sure that we don't cause regressions if we have
to reconnect before writeback occurs.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-17 01:26:48 +00:00
KOSAKI Motohiro 31b07093c4 proc: mounts_poll() make consistent to mdstat_poll
In recently sysfs_poll discussion, Neil Brown pointed out /proc/mounts
also should be fixed.

SUSv3 says "Regular files shall always poll TRUE for reading and
writing".  see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html

Then, mounts_poll()'s default should be "POLLIN | POLLRDNORM".  it mean
always readable.

In addition, event trigger should use "POLLERR | POLLPRI" instead
POLLERR.  it makes consistent to mdstat_poll() and sysfs_poll(). and,
select(2) can handle POLLPRI easily.


Reported-by: Neil Brown <neilb@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:10 -07:00
KOSAKI Motohiro 1af3557abd sysfs: sysfs poll keep the poll rule of regular file.
Currently, following test programs don't finished.

% ruby -e '
Thread.new { sleep }
File.read("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies")
'

strace expose the reason.

...
open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies", O_RDONLY|O_LARGEFILE) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf9fa6b8) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
_llseek(3, 0, [0], SEEK_CUR)            = 0
select(4, [3], NULL, NULL, NULL)        = 1 (in [3])
read(3, "1400000 1300000 1200000 1100000 1"..., 4096) = 62
select(4, [3], NULL, NULL, NULL


Because Ruby (the scripting language) VM assume select system-call
against regular file don't block.  it because SUSv3 says "Regular files
shall always poll TRUE for reading and writing".  see
http://www.opengroup.org/onlinepubs/009695399/functions/poll.html it
seems valid assumption.

But sysfs_poll() don't keep this rule although sysfs file can read and
write always.

This patch restore proper poll behavior to sysfs.
/sys/block/md*/md/sync_action polling application and another sysfs
updating sensitive application still can use POLLERR and POLLPRI.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:09 -07:00
Alex Chiang d110271e1f sysfs: don't use global workqueue in sysfs_schedule_callback()
A sysfs attribute using sysfs_schedule_callback() to commit suicide
may end up calling device_unregister(), which will eventually call
a driver's ->remove function.

Drivers may call flush_scheduled_work() in their shutdown routines,
in which case lockdep will complain with something like the following:

  =============================================
  [ INFO: possible recursive locking detected ]
  2.6.29-rc8-kk #1
  ---------------------------------------------
  events/4/56 is trying to acquire lock:
  (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0

  but task is already holding lock:
  (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230

  other info that might help us debug this:
  3 locks held by events/4/56:
  #0:  (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
  #1:  (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
  #2:  (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40

  stack backtrace:
  Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
  Call Trace:
  [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
  [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
  [<ffffffff8026f148>] lock_acquire+0x58/0x80
  [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
  [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
  [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
  [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
  [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
  [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
  [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
  [<ffffffff80441da9>] __device_release_driver+0x59/0x90
  [<ffffffff80441edb>] device_release_driver+0x2b/0x40
  [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
  [<ffffffff8043e46b>] device_del+0x12b/0x190
  [<ffffffff8043e4f6>] device_unregister+0x26/0x70
  [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
  [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
  [<ffffffff803c10d9>] remove_callback+0x29/0x40
  [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
  [<ffffffff8025769a>] run_workqueue+0x15a/0x230
  [<ffffffff80257648>] ? run_workqueue+0x108/0x230
  [<ffffffff8025846f>] worker_thread+0x9f/0x100
  [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff802583d0>] ? worker_thread+0x0/0x100
  [<ffffffff8025b89d>] kthread+0x4d/0x80
  [<ffffffff8020d4ba>] child_rip+0xa/0x20
  [<ffffffff8020cebc>] ? restore_args+0x0/0x30
  [<ffffffff8025b850>] ? kthread+0x0/0x80
  [<ffffffff8020d4b0>] ? child_rip+0x0/0x20

Although we know that the device_unregister path will never acquire
a lock that a driver might try to acquire in its ->remove, in general
we should never attempt to flush a workqueue from within the same
workqueue, and lockdep rightly complains.

So as long as sysfs attributes cannot commit suicide directly and we
are stuck with this callback mechanism, put the sysfs callbacks on
their own workqueue instead of the global one.

This has the side benefit that if a suicidal sysfs attribute kicks
off a long chain of ->remove callbacks, we no longer induce a long
delay on the global queue.

This also fixes a missing module_put in the error path introduced
by sysfs-only-allow-one-scheduled-removal-callback-per-kobj.patch.

We never destroy the workqueue, but I'm not sure that's a
problem.

Reported-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Tested-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-04-16 16:17:08 -07:00
Chris Mason 35c80d5f40 Add block_write_full_page_endio for passing endio handler
block_write_full_page doesn't allow the caller to control what happens
when the IO is over.  This adds a new call named block_write_full_page_endio
so the buffer head end_io handler can be provided by the caller.

This will be used by the ext3 data=guarded mode to do i_size updates in
a workqueue based end_io handler.  end_buffer_async_write is also
exported so it can be called to do the dirty work of managing page
writeback for the higher level end_io handler.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Acked-by: Theodore Tso <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-16 07:47:49 -07:00
Linus Torvalds a2c252ebde Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use DEFINE_SPINLOCK
  GFS2: cleanup file_operations mess
  GFS2: Move umount flush rwsem
  GFS2: Fix symlink creation race
  GFS2: Make quotad's waiting interruptible
2009-04-15 09:04:12 -07:00
Nikanth Karthikesan b1fffc9ca6 gfs2: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_NOFS implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Nikanth Karthikesan 226e7dabf5 ext4: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_NOIO implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Nikanth Karthikesan 4d1f9fdb61 dio: Remove code handling bio_alloc failure with __GFP_WAIT
Remove code handling bio_alloc failure with __GFP_WAIT.
GFP_KERNEL implies __GFP_WAIT.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:13 +02:00
Jens Axboe 86c824b943 bio: add documentation to bio_alloc()
Explain that with __GFP_WAIT set it will not fail, and that the caller
must never allocate more than 1 bio at the time.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi 61e0d47c33 splice: add helpers for locking pipe inode
There are lots of sequences like this, especially in splice code:

	if (pipe->inode)
		mutex_lock(&pipe->inode->i_mutex);
	/* do something */
	if (pipe->inode)
		mutex_unlock(&pipe->inode->i_mutex);

so introduce helpers which do the conditional locking and unlocking.
Also replace the inode_double_lock() call with a pipe_double_lock()
helper to avoid spreading the use of this functionality beyond the
pipe code.

This patch is just a cleanup, and should cause no behavioral changes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi f8cc774ce4 splice: remove generic_file_splice_write_nolock()
Remove the now unused generic_file_splice_write_nolock() function.
It's conceptually broken anyway, because splice may need to wait for
pipe events so holding locks across the whole operation is wrong.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi 328eaaba4e ocfs2: fix i_mutex locking in ocfs2_splice_to_file()
Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:12 +02:00
Miklos Szeredi eb443e5a25 splice: fix i_mutex locking in generic_splice_write()
Rearrange locking of i_mutex on destination so it's only held while
buffers are copied with the pipe_to_file() actor, and not while
waiting for more data on the pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Miklos Szeredi 2933970b96 splice: remove i_mutex locking in splice_from_pipe()
splice_from_pipe() is only called from two places:

  - generic_splice_sendpage()
  - splice_write_null()

Neither of these require i_mutex to be taken on the destination inode.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Miklos Szeredi b3c2d2ddd6 splice: split up __splice_from_pipe()
Split up __splice_from_pipe() into four helper functions:

  splice_from_pipe_begin()
  splice_from_pipe_next()
  splice_from_pipe_feed()
  splice_from_pipe_end()

splice_from_pipe_next() will wait (if necessary) for more buffers to
be added to the pipe.  splice_from_pipe_feed() will feed the buffers
to the supplied actor and return when there's no more data available
(or if all of the requested data has been copied).

This is necessary so that implementations can do locking around the
non-waiting splice_from_pipe_feed().

This patch should not cause any change in behavior.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 12:10:11 +02:00
Xu Gang 1328df7252 GFS2: Use DEFINE_SPINLOCK
SPIN_LOCK_UNLOCKED is deprecated, use DEFINE_SPINLOCK instead.
(as suggested in Documentation/spinlocks.txt)

Signed-off-by: Xu Gang <xug@cn.fujitsu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:18:07 +01:00
Christoph Hellwig 10d2198805 GFS2: cleanup file_operations mess
Remove the weird pointer to file_operations mess and replace it with
straight-forward defining of the lockinginstance names to the _nolock
variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:17:18 +01:00
Steven Whitehouse a228df6339 GFS2: Move umount flush rwsem
The rwsem, used only on umount, is in the wrong place in glock.c.
This patch moves it up a bit so that it does not get called under
a spinlock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:16:13 +01:00
Steven Whitehouse 5cf32524de GFS2: Fix symlink creation race
In certain cases symlinks can appear to have zero size if a lookup
on the inode occurs within a certain (very short) time after the
symlink has been created. The symlink is correctly created on disk
but appears to have zero size when stat()ed. This patch closes the
race and prevents incorrect sizes appearing.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:15:38 +01:00
Steven Whitehouse 7fa5d20d1a GFS2: Make quotad's waiting interruptible
So we don't count its D state in the loadavg.

Reported-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-15 10:15:08 +01:00
Jens Axboe 053c525fcf buffer: switch do_emergency_thaw() away from pdflush_operation()
This is (again) a preparatory patch similar to commit
a2a9537ac0. It open codes a simple
async way of executing do_thaw_all() out of context, so we can get
rid of pdflush.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15 08:28:12 +02:00
Linus Torvalds e9de427e40 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix "direct_io" private mmap
  fuse: fix argument type in fuse_get_user_pages()
2009-04-14 10:12:07 -07:00
Linus Torvalds 9fc0178caa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix possible mismatch of sufile counters on recovery
  nilfs2: segment usage file cleanups
  nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error
  nilfs2: simplify handling of active state of segments fix
  nilfs2: remove module version
  nilfs2: fix lockdep recursive locking warning on meta data files
  nilfs2: fix lockdep recursive locking warning on bmap
  nilfs2: return f_fsid for statfs2
2009-04-14 10:10:53 -07:00
Theodore Ts'o 38d726d153 jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 10:10:47 -04:00
Theodore Ts'o 67c457a8c3 jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 07:50:56 -04:00
Chuck Ebbert 6b82f3cb2d ext4: really print the find_group_flex fallback warning only once
Missing braces caused the warning to print more than once.

Signed-Off-By: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-14 07:37:40 -04:00
Jan Kara 316cb4ef3e ext2: fix data corruption for racing writes
If two writers allocating blocks to file race with each other (e.g.
because writepages races with ordinary write or two writepages race with
each other), ext2_getblock() can be called on the same inode in parallel.
Before we are going to allocate new blocks, we have to recheck the block
chain we have obtained so far without holding truncate_mutex.  Otherwise
we could overwrite the indirect block pointer set by the other writer
leading to data loss.

The below test program by Ying is able to reproduce the data loss with ext2
on in BRD in a few minutes if the machine is under memory pressure:

long kMemSize  = 50 << 20;
int kPageSize = 4096;

int main(int argc, char **argv) {
	int status;
	int count = 0;
	int i;
	char *fname = "/mnt/test.mmap";
	char *mem;
	unlink(fname);
	int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
	status = ftruncate(fd, kMemSize);
	mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	// Fill the memory with 1s.
	memset(mem, 1, kMemSize);
	sleep(2);
	for (i = 0; i < kMemSize; i++) {
		int byte_good = mem[i] != 0;
		if (!byte_good && ((i % kPageSize) == 0)) {
			//printf("%d ", i / kPageSize);
			count++;
		}
	}
	munmap(mem, kMemSize);
	close(fd);
	unlink(fname);

	if (count > 0) {
		printf("Running %d bad page\n", count);
		return 1;
	}
	return 0;
}

Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:33 -07:00
Jan Kara 3243387948 jbd: update locking coments
Update information about locking in JBD revoke code.

Reported-by: Lin Tan <tammy000@gmail.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:32 -07:00
Dave Anderson eb2e5f452a hfs: fix memory leak when unmounting
When an HFS filesystem is unmounted, it leaks a 2-page bitmap.  Also,
under extreme memory pressure, it's possible that hfs_releasepage() may
use a tree pointer that has not been initialized, and if so, the release
request should just be rejected.

[akpm@linux-foundation.org: free_pages(0) is legal, remove obvious comment]
Signed-off-by: Dave Anderson <anderson@redhat.com>
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13 15:04:29 -07:00
Linus Torvalds 3c1795cc4b Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: remove xfs_flush_space
  xfs: flush delayed allcoation blocks on ENOSPC in create
  xfs: block callers of xfs_flush_inodes() correctly
  xfs: make inode flush at ENOSPC synchronous
  xfs: use xfs_sync_inodes() for device flushing
  xfs: inform the xfsaild of the push target before sleeping
  xfs: prevent unwritten extent conversion from blocking I/O completion
  xfs: fix double free of inode
  xfs: validate log feature fields correctly
2009-04-13 14:35:13 -07:00
Ryusuke Konishi c85399c2da nilfs2: fix possible mismatch of sufile counters on recovery
On-disk counters ndirtysegs and ncleansegs of sufile, can go wrong
after roll-forward recovery because
nilfs_prepare_segment_for_recovery() function marks segments dirty
without adjusting value of these counters.

This fixes the problem by adding a function to sufile which does the
operation adjusting the counters, and by letting the recovery function
use it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:52 +09:00
Ryusuke Konishi a703018f7b nilfs2: segment usage file cleanups
This will simplify sufile.c by sharing common code which repeatedly
appears in routines updating a segment usage entry; a wrapper function
nilfs_sufile_update() is introduced for the purpose, and counter
modifications are integrated to a new function
nilfs_sufile_mod_counter().

This is a preparation for the successive bugfix patch ("nilfs2: fix
possible mismatch of sufile counters on recovery").

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi 88072faf9a nilfs2: fix wrong accounting and duplicate brelse in nilfs_sufile_set_error
The nilfs_sufile_set_error() function wrongly adjusts the number of
dirty segments instead of the number of clean segments.  In addition,
the function calls brelse() twice for the same buffer head.

This fixes these bugs.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi 3efb55b496 nilfs2: simplify handling of active state of segments fix
This fixes a bug of ("nilfs2: simplify handling of active state of
segments") patch.  The patch did not take account that a base index is
increased in nilfs_sufile_get_suinfo() function if requested entries
go across block boundary on sufile.

Due to this bug, the active flag sometimes appears on wrong segments
and has induced malfunction of garbage collection.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:51 +09:00
Ryusuke Konishi e7a7402c0d nilfs2: remove module version
A MODULE_VERSION() macro has been used in out-of-tree nilfs modules,
but it's needless and not updated in tree.  So, this removes it along
with the version declaration.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:50 +09:00
Ryusuke Konishi c2698e50e3 nilfs2: fix lockdep recursive locking warning on meta data files
This fixes the following false detection of lockdep against nilfs meta
data files:

=============================================
[ INFO: possible recursive locking detected ]
2.6.29 #26
---------------------------------------------
mount.nilfs2/4185 is trying to acquire lock:
 (&mi->mi_sem){----}, at: [<d0c7925b>] nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
 but task is already holding lock:
  (&mi->mi_sem){----}, at: [<d0c72026>] nilfs_count_free_blocks+0x48/0x84 [nilfs2]

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:50 +09:00
Ryusuke Konishi bcb48891b0 nilfs2: fix lockdep recursive locking warning on bmap
The bmap semaphore of DAT file can be held while a bmap of other files
is locked.  This has caused the following false detection of lockdep
check:

mount.nilfs2/4667 is trying to acquire lock:
 (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2]

but task is already holding lock:
 (&bmap->b_sem){..--}, at: [<d0c6c4b4>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2]

This will fix the false detection by distinguishing semaphores of the
DAT and other files.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:49 +09:00
Ryusuke Konishi c306af23e1 nilfs2: return f_fsid for statfs2
This follows the change of Coly Li's series ("fs: return f_fsid for
statfs(2)"), and make nilfs2 return f_fsid info for statfs(2).

Acked-by: Coly Li <coly.li@suse.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-04-13 09:53:49 +09:00
Linus Torvalds 54f93b74cf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: check block device size on mount
  ext4: Fix off-by-one-error in ext4_valid_extent_idx()
  ext4: Fix big-endian problem in __ext4_check_blockref()
2009-04-09 16:42:05 -07:00
Felix Blyakher dc2a5536d6 Merge branch 'master' into for-linus 2009-04-09 14:12:07 -05:00
Stoyan Gaydarov 11ff5f6aff afs: BUG to BUG_ON changes
Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-09 10:41:19 -07:00
Miklos Szeredi 3121bfe763 fuse: fix "direct_io" private mmap
MAP_PRIVATE mmap could return stale data from the cache for
"direct_io" files.  Fix this by flushing the cache on mmap. 

Found with a slightly modified fsx-linux.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-09 17:37:53 +02:00
Miklos Szeredi ce60a2f157 fuse: fix argument type in fuse_get_user_pages()
Fix the following warning:

fs/fuse/file.c: In function 'fuse_direct_io':
fs/fuse/file.c:1002: warning: passing argument 3 of 'fuse_get_user_pages' from incompatible pointer type

This was introduced by commit f4975c67 "fuse: allow kernel to access
"direct_io" files".

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-09 17:37:52 +02:00
Linus Torvalds a7b334de4d Merge branch 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext3: Try to avoid starting a transaction in writepage for data=writepage
  block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG
2009-04-08 17:42:32 -07:00
Nobuhiro Iwamatsu 4c967291fc nommu: fix typo vma->pg_off to vma->vm_pgoff
6260a4b052 ("/proc/pid/maps: don't show
pgoff of pure ANON VMAs" had a typo.

fs/proc/task_nommu.c:138: error: 'struct vm_area_struct' has no member named 'pg_off'
distcc[21484] ERROR: compile fs/proc/task_nommu.c on sprygo/32 failed

Signed-off-by: Nobuhiro Iwamatsu <iwamatsu.nobuhiro@renesas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:44 -07:00
Alexander Beregalov 2b3fffefea befs: fix build on parisc
fs/befs/super.c:85: error: 'PAGE_SIZE' undeclared

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-08 10:21:43 -07:00
Jan Kara 430db323fa ext3: Try to avoid starting a transaction in writepage for data=writepage
This does the same as commit 9e80d40773
(avoid starting a transaction when no block allocation is needed)
but for data=writeback mode of ext3. We also cleanup the data=ordered
case a bit to stick to coding style...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-08 13:15:10 -04:00
Theodore Ts'o 6e34eeddf7 block_write_full_page: switch synchronous writes to use WRITE_SYNC_PLUG
Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
the block device I/O queue between each page that gets flushed out.

Otherwise, when we run sync() or fsync() and we need to write out a
large number of pages, the block device queue will get unplugged
between for every page that is flushed out, which will be a pretty
serious performance regression caused by commit a64c8610.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-08 13:15:09 -04:00
Trond Myklebust 2b2ec7554c NFS: Fix the return value in nfs_page_mkwrite()
Commit c2ec175c39 ("mm: page_mkwrite
change prototype to match fault") exposed a bug in the NFS
implementation of page_mkwrite.  We should be returning 0 on success...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 14:07:03 -07:00
From: Thiemo Nagel 0f2ddca66d ext4: check block device size on mount
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-07 14:07:47 -04:00
Tao Ma 035a571120 ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir.
In ocfs2_expand_inline_dir, we calculate whether we need 1 extra
cluster if we can't store the dx inline the root and save it in
dx_alloc. So add it when we call ocfs2_reserve_clusters.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-07 09:40:17 -07:00
Miklos Szeredi 7bfac9ecf0 splice: fix deadlock in splicing to file
There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():

 - task A calls generic_file_splice_write()
 - this calls inode_double_lock(), which locks i_mutex on both
   pipe->inode and target inode
 - ordering depends on inode pointers, can happen that pipe->inode is
   locked first
 - __splice_from_pipe() needs more data, calls pipe_wait()
 - this releases lock on pipe->inode, goes to interruptible sleep
 - task B calls generic_file_splice_write(), similarly to the first
 - this locks pipe->inode, then tries to lock inode, but that is
   already held by task A
 - task A is interrupted, it tries to lock pipe->inode, but fails, as
   it is already held by task B
 - ABBA deadlock

Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode.  This is OK, pipe inodes and target inodes
form two nonoverlapping sets, generic_file_splice_write() and friends
are not called with a target which is a pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:34:46 -07:00
Ryusuke Konishi 612392307c nilfs2: support nanosecond timestamp
After a review of user's feedback for finding out other compatibility
issues, I found nilfs improperly initializes timestamps in inode;
CURRENT_TIME was used there instead of CURRENT_TIME_SEC even though nilfs
didn't have nanosecond timestamps on disk.  A few users gave us the report
that the tar program sometimes failed to expand symbolic links on nilfs,
and it turned out to be the cause.

Instead of applying the above displacement, I've decided to support
nanosecond timestamps on this occation.  Fortunetaly, a needless 64-bit
field was in the nilfs_inode struct, and I found it's available for this
purpose without impact for the users.

So, this will do the enhancement and resolve the tar problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi e339ad31f5 nilfs2: introduce secondary super block
The former versions didn't have extra super blocks.  This improves the
weak point by introducing another super block at unused region in tail of
the partition.

This doesn't break disk format compatibility; older versions just ingore
the secondary super block, and new versions just recover it if it doesn't
exist.  The partition created by an old mkfs may not have unused region,
but in that case, the secondary super block will not be added.

This doesn't make more redundant copies of the super block; it is a future
work.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi cece552074 nilfs2: simplify handling of active state of segments
will reduce some lines of segment constructor.  Previously, the state was
complexly controlled through a list of segments in order to keep
consistency in meta data of usage state of segments.  Instead, this
presents ``calculated'' active flags to userland cleaner program and stop
maintaining its real flag on disk.

Only by this fake flag, the cleaner cannot exactly know if each segment is
reclaimable or not.  However, the recent extension of nilfs_sustat ioctl
struct (nilfs2-extend-nilfs_sustat-ioctl-struct.patch) can prevent the
cleaner from reclaiming in-use segment wrongly.

So, now I can apply this for simplification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi c96fa464a5 nilfs2: mark minor flag for checkpoint created by internal operation
Nilfs creates checkpoints even for garbage collection or metadata updates
such as checkpoint mode change.  So, user often sees checkpoints created
only by such internal operations.

This is inconvenient in some situations.  For example, application that
monitors checkpoints and changes them to snapshots, will fall into an
infinite loop because it cannot distinguish internally created
checkpoints.

This patch solves this sort of problem by adding a flag to checkpoint for
identification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 458c5b0822 nilfs2: clean up sketch file
The sketch file is a file to mark checkpoints with user data.  It was
experimentally introduced in the original implementation, and now
obsolete.  The file was handled differently with regular files; the file
size got truncated when a checkpoint was created.

This stops the special treatment and will treat it as a regular file.
Most users are not affected because mkfs.nilfs2 no longer makes this file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi e626874685 nilfs2: super block operations fix endian bug
This adds a missing endian conversion of checksum field in the super
block.  This fixes compatibility issue on big endian machines which will
come to surface after supporting recovery of super block.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 1f5abe7e7d nilfs2: replace BUG_ON and BUG calls triggerable from ioctl
Pekka Enberg advised me:
> It would be nice if BUG(), BUG_ON(), and panic() calls would be
> converted to proper error handling using WARN_ON() calls. The BUG()
> call in nilfs_cpfile_delete_checkpoints(), for example, looks to be
> triggerable from user-space via the ioctl() system call.

This will follow the comment and keep them to a minimum.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 2c2e52fc4f nilfs2: extend nilfs_sustat ioctl struct
This adds a new argument to the nilfs_sustat structure.

The extended field allows to delete volatile active state of segments,
which was needed to protect freshly-created segments from garbage
collection but has confused code dealing with segments.  This
extension alleviates the mess and gives room for further
simplifications.

The volatile active flag is not persistent, so it's eliminable on this
occasion without affecting compatibility other than the ioctl change.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 7a9461939a nilfs2: use unlocked_ioctl
Pekka Enberg suggested converting ->ioctl operations to use
->unlocked_ioctl to avoid BKL.

The conversion was verified to be safe, so I will take it on this
occasion.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 8082d36aed nilfs2: remove compat ioctl code
This removes compat code from the nilfs ioctls and applies the same
function for both .ioctl and .compat_ioctl file operations.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi dc498d09be nilfs2: use fixed sized types for ioctl structures
Nilfs ioctl had structures not having fixed sized types such as:

  struct nilfs_argv {
         void *v_base;
         size_t v_nmembs;
         size_t v_size;
         int v_index;
         int v_flags;
  };

Further, some of them are wrongly aligned:

  e.g.

  struct nilfs_cpmode {
        __u64 cm_cno;
        int cm_mode;
  };

The size of wrongly aligned structures varies depending on
architectures, and it breaks the identity of ioctl commands, which
leads to arch dependent errors.

Previously, these are compensated by using compat_ioctl.

This fixes these problems and allows removal of compat ioctl.

Since this will change sizes of those structures, binary compatibility
for the past utilities will once break; new utilities have to be used
instead.  However, it would be helpful to avoid platform dependent
problems in the long term.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi 1088dcf4c3 nilfs2: remove timedwait ioctl command
This removes NILFS_IOCTL_TIMEDWAIT command from ioctl interface along
with the related flags and wait queue.

The command is terrible because it just sleeps in the ioctl.  I prefer
to avoid this by devising means of event polling in userland program.
By reconsidering the userland GC daemon, I found this is possible
without changing behaviour of the daemon and sacrificing efficiency.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi 76068c4ff1 nilfs2: fix buggy behavior seen in enumerating checkpoints
This will fix the weird behavior of lscp command in listing continuously
created checkpoints; the output of lscp is rewinded regularly for the
recent nilfs.  As a result of debugging, a defect was found in
nilfs_cpfile_do_get_cpinfo() function.

Though the function can be repeatedly called to enumerate checkpoints and
it can skip invalid checkpoint entries, the index value was not carried
between successive calls.

The bug has long been present, and came to surface after applying a bugfix
nilfs2-fix-problems-of-memory-allocation-in-ioctl.patch, which increased
frequency of calling the function.  The similar bugfix was already applied
for ``snapshots'' by
nilfs2-fix-gc-failure-on-volumes-keeping-numerous-snapshots.patch.

This fixes the problem by making the index argument bidirectional on the
function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Pekka Enberg 8acfbf0939 nilfs2: clean up indirect function calling conventions
This cleans up the strange indirect function calling convention used in
nilfs to follow the normal kernel coding style.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 7fa10d2001 nilfs2: fix improper return values of nilfs_get_cpinfo ioctl
A few tool developers gave me requests for fixing inconvenient return
value of nilfs_get_cpinfo() ioctl; if the requested mode is NILFS_SNAPSHOT
and the specified start entry is not a snapshot, the ioctl unnaturally
returns one as the number of acquired snapshot item.

In addition, the ioctl function returns an ENOENT error for checkpoints
within blocks deleted by garbage collection.

These behaviors require corrections for programs which enumerate
snapshots.  This resolves the inconvenience by changing the return values
to zero for the above cases.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi b028fcfc4c nilfs2: fix gc failure on volumes keeping numerous snapshots
This resolves the following failure of nilfs2 cleaner daemon:

 nilfs_cleanerd[20670]: cannot clean segments: No such file or directory
 nilfs_cleanerd[20670]: shutdown

When creating thousands of snapshots, the cleaner daemon had rarely died
as above due to an error returned from the kernel code.

After applying the recent patch which fixed memory allocation problems in
ioctl (Message-Id: <20081215.155840.105124170.ryusuke@osrg.net>), the
problem gets more frequent.

It turned out to be a bug of nilfs_ioctl_wrap_copy function and one of its
callback routines to read out information of snapshots; if the
nilfs_ioctl_wrap_copy function divided a large read request into multiple
requests, the second and later requests have failed since a restart
position on snapshot meta data was not properly set forward.

It's a deficiency of the callback interface that cannot pass the restart
position among multiple requests.  This patch fixes the issue by allowing
nilfs_ioctl_wrap_copy and snapshot read functions to exchange a position
argument.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 047180f2d7 nilfs2: insert explanations in gcinode file
The file gcinode.c gives buffer cache functions for on-disk blocks
moved in garbage collection.  Joern Engel has suggested inserting its
explanations in the source file (Message-ID:
<20080917144146.GD8750@logfs.org> and
<20080917224953.GB14644@logfs.org>).

This follows the comment.

Cc: Joern Engel <joern@logfs.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 47420c7998 nilfs2: avoid double error caused by nilfs_transaction_end
Pekka Enberg pointed out that double error handlings found after
nilfs_transaction_end() can be avoided by separating abort operation:

 OK, I don't understand this. The only way nilfs_transaction_end() can
 fail is if we have NILFS_TI_SYNC set and we fail to construct the
 segment. But why do we want to construct a segment if we don't commit?

 I guess what I'm asking is why don't we have a separate
 nilfs_transaction_abort() function that can't fail for the erroneous
 case to avoid this double error value tracking thing?

This does the separation and renames nilfs_transaction_end() to
nilfs_transaction_commit() for clarification.

Since, some calls of these functions were used just for exclusion control
against the segment constructor, they are replaced with semaphore
operations.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi a2e7d2df82 nilfs2: cleanup nilfs_clear_inode
This will remove the following unnecessary locks and cleanup code in
nilfs_clear_inode():

- unnecessary protection using nilfs_transaction_begin() and
  nilfs_transaction_end().

- cleanup code of i_dirty list field which is never chained
  when this function is called.

- spinlock used when releasing i_bh field.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 3358b4aaa8 nilfs2: fix problems of memory allocation in ioctl
This is another patch for fixing the following problems of a memory
copy function in nilfs2 ioctl:

(1) It tries to allocate 128KB size of memory even for small objects.

(2) Though the function repeatedly tries large memory allocations
    while reducing the size, GFP_NOWAIT flag is not specified.
    This increases the possibility of system memory shortage.

(3) During the retries of (2), verbose warnings are printed
    because _GFP_NOWARN flag is not used for the kmalloc calls.

The first patch was still doing large allocations by kmalloc which are
repeatedly tried while reducing the size.

Andi Kleen told me that using copy_from_user for large memory is not
good from the viewpoint of preempt latency:

 On Fri, 12 Dec 2008 21:24:11 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > > In the current interface, each data item is copied twice: one is to
 > > the allocated memory from user space (via copy_from_user), and another
 >
 > For such large copies it is better to use multiple smaller (e.g. 4K)
 > copy user, that gives better real time preempt latencies. Each cfu has a
 > cond_resched(), but only one, not multiple times in the inner loop.

He also advised me that:

 On Sun, 14 Dec 2008 16:13:27 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > Better would be if you could go to PAGE_SIZE. order 0 allocations
 > are typically the fastest / least likely to stall.
 >
 > Also in this case it's a good idea to use __get_free_pages()
 > directly, kmalloc tends to be become less efficient at larger
 > sizes.

For the function in question, the size of buffer memory can be reduced
since the buffer is repeatedly used for a number of small objects.  On
the other hand, it may incur large preempt latencies for larger buffer
because a copy_from_user (and a copy_to_user) was applied only once
each cycle.

With that, this revision uses the order 0 allocations with
__get_free_pages() to fix the original problems.

Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 0c4fb87764 nilfs2: update makefile and Kconfig
This adds a Makefile for the nilfs2 file system, and updates the
makefile and Kconfig file in the file system directory.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Koji Sato 7942b919f7 nilfs2: ioctl operations
This adds userland interface implemented with ioctl.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi a3d93f709e nilfs2: block cache for garbage collection
This adds the cache of on-disk blocks to be moved in garbage
collection.  The disk blocks are held with dummy inodes (called
gcinodes), and this file provides lookup function of the dummy inodes,
and their buffer read function.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 84ef1ecfde nilfs2: another dat for garbage collection
NILFS2 uses another DAT inode during garbage collection to ensure
atomicity and consistency of the DAT in the transient state.  This
twin inode is called GCDAT.

This adds functions to initialize the GCDAT and to switch page caches
and B-tree node caches between these two inodes.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 0f3e1c7f23 nilfs2: recovery functions
This adds recovery function on mount.

Usually the recovery is achieved by just finding the latest super
root.  When logs without checkpoints were appended for data sync
operations after the latest super root, the recovery function will
perform roll forwarding and reconstruct new log(s) with a super root.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi f30bf3e40f nilfs2: fix missed-sync issue for do_sync_mapping_range()
Chris Mason pointed out that there is a missed sync issue in
nilfs_writepages():

On Wed, 17 Dec 2008 21:52:55 -0500, Chris Mason wrote:
> It looks like nilfs_writepage ignores WB_SYNC_NONE, which is used by
> do_sync_mapping_range().

where WB_SYNC_NONE in do_sync_mapping_range() was replaced with
WB_SYNC_ALL by Nick's patch (commit:
ee53a891f4).

This fixes the problem by letting nilfs_writepages() write out the log of
file data within the range if sync_mode is WB_SYNC_ALL.

This involves removal of nilfs_file_aio_write() which was previously
needed to ensure O_SYNC sync writes.

Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 9ff05123e3 nilfs2: segment constructor
This adds the segment constructor (also called log writer).

The segment constructor collects dirty buffers for every dirty inode,
makes summaries of the buffers, assigns disk block addresses to the
buffers, and then submits BIOs for the buffers.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 64b5a32e0b nilfs2: segment buffer
This adds the segment buffer which is used to constuct logs.

[akpm@linux-foundation.org: BIO_RW_SYNC got removed]
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 783f61843e nilfs2: super block operations
This adds super block operations for the nilfs2 file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 8a9d2191e9 nilfs2: operations for the_nilfs core object
This adds functions on the_nilfs object, which keeps shared resources and
states among a read/write mount and snapshots mounts going individually.

the_nilfs is allocated per block device; it is created when user first
mount a snapshot or a read/write mount on the device, then it is reused
for successive mounts.  It will be freed when all mount instances on the
device are detached.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi d25006523d nilfs2: pathname operations
This adds pathname operations, most of which comes from the ext2 file
system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Yoshiji Amagai 2ba466d74e nilfs2: directory entry operations
This adds directory handling functions, most of which comes from the ext2
file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi f183ff4f05 nilfs2: file operations
This adds primitives for regular file handling.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 05fe58fdc1 nilfs2: inode operations
This adds inode level operations of the nilfs2 file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato 6c98cd4ecb nilfs2: segment usage file
This adds a meta data file which stores the allocation state of segments.

[konishi.ryusuke@lab.ntt.co.jp: fix wrong counting of checkpoints and dirty segments]
Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato 2961980972 nilfs2: checkpoint file
This adds a meta data file which holds checkpoint entries in its data
blocks.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 43bfb45ed4 nilfs2: inode map file
This adds a meta data file which stores on-disk inodes in its data blocks.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato a17564f58b nilfs2: disk address translator
This adds the disk address translation file (DAT) whose primary function
is to convert virtual disk block numbers to actual disk block numbers.

The virtual block numbers of NILFS are associated with checkpoint
generation numbers, and this file also provides functions to manage the
lifetime information of each virtual block number.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 5442680fd2 nilfs2: persistent object allocator
This adds common functions to allocate or deallocate entries with bitmaps
on a meta data file.  This feature is used by the DAT and ifile.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 5eb563f5f2 nilfs2: meta data file
This adds the meta data file, which serves common buffer functions to the
DAT, sufile, cpfile, ifile, and so forth.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 0bd49f9446 nilfs2: buffer and page operations
This adds common routines for buffer/page operations used in B-tree
node caches, meta data files, or segment constructor (log writer).

NILFS uses copy functions for buffers and pages due to the following
reasons:

 1) Relocation required for COW
    Since NILFS changes address of on-disk blocks, moving buffers
    in page cache is needed for the buffers which are not addressed
    by a file offset.  If buffer size is smaller than page size,
    this involves partial copy of pages.

 2) Freezing mmapped pages
    NILFS calculates checksums for each log to ensure its validity.
    If page data changes after the checksum calculation, this validity
    check will not work correctly.  To avoid this failure for mmaped
    pages, NILFS freezes their data by copying.

 3) Copy-on-write for DAT pages
    NILFS makes clones of DAT page caches in a copy-on-write manner
    during GC processes, and this ensures atomicity and consistency
    of the DAT in the transient state.

In addition, NILFS uses two obsolete functions, nilfs_mark_buffer_dirty()
and nilfs_clear_page_dirty() respectively.

* nilfs_mark_buffer_dirty() was required to avoid NULL pointer
  dereference faults:

  Since the page cache of B-tree node pages or data page cache of pseudo
  inodes does not have a valid mapping->host, calling mark_buffer_dirty()
  for their buffers causes the fault; it calls __mark_inode_dirty(NULL)
  through __set_page_dirty().

* nilfs_clear_page_dirty() was needed in the two cases:

 1) For B-tree node pages and data pages of the dat/gcdat, NILFS2 clears
    page dirty flags when it copies back pages from the cloned cache
    (gcdat->{i_mapping,i_btnode_cache}) to its original cache
    (dat->{i_mapping,i_btnode_cache}).

 2) Some B-tree operations like insertion or deletion may dispose buffers
    in dirty state, and this needs to cancel the dirty state of their
    pages.  clear_page_dirty_for_io() caused faults because it does not
    clear the dirty tag on the page cache.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi a60be987d4 nilfs2: B-tree node cache
This adds routines for B-tree node buffers.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato 36a580eb48 nilfs2: direct block mapping
This adds block mappings using direct pointers which are stored in the
i_bmap array of inode.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato 17c76b0104 nilfs2: B-tree based block mapping
This adds declarations and functions of NILFS2 B-tree.

Two variants are integrated in the NILFS2 B-tree.  The B-tree for the most
files points to the child nodes or data blocks with virtual block
addresses, whereas the B-tree of the DAT uses actual block addresses.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato bdb265eae0 nilfs2: integrated block mapping
This adds structures and operations for the block mapping (bmap for
short).  NILFS2 uses direct mappings for short files or B-tree based
mappings for longer files.

Every on-disk data block is held with inodes and managed through this
block mapping.  The nilfs_bmap structure and a set of functions here
provide this capability to the NILFS2 inode.

[penberg@cs.helsinki.fi: remove a bunch of bmap wrapper macros]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 65b4643d3b nilfs2: add inode and other major structures
This adds the following common structures of the NILFS2 file system.

* nilfs_inode_info structure:
  gives on-memory inode.

* nilfs_sb_info structure:
  keeps per-mount state and a special inode for the ifile.
  This structure is attached to the super_block structure.

* the_nilfs structure:
  keeps shared state and locks among a read/write mount and snapshot
  mounts.  This keeps special inodes for the sufile, cpfile, dat, and
  another dat inode used during GC (gcdat).  This also has a hash table
  of dummy inodes to cache disk blocks during GC (gcinodes).

* nilfs_transaction_info structure:
  keeps per task state while nilfs is writing logs or doing indivisible
  inode or namespace operations.  This structure is used to identify
  context during log making and store nest level of the lock which
  ensures atomicity of file system operations.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:12 -07:00
Coly Li 8a59f5d252 fs/romfs: return f_fsid for statfs(2)
Make romfs return f_fsid info for statfs(2).

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:10 -07:00
Serge E. Hallyn 909e6d9479 namespaces: move proc_net_get_sb to a generic fs/super.c helper
The mqueuefs filesystem will use this helper as well.  Proc's main get_sb
could also be made to use it, but that will require a bit more rework.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:09 -07:00
KAMEZAWA Hiroyuki 6260a4b052 /proc/pid/maps: don't show pgoff of pure ANON VMAs
Recently, it's argued that what proc/pid/maps shows is ugly when a 32bit
binary runs on 64bit host.

/proc/pid/maps outputs vma's pgoff member but vma->pgoff is of no use
information is the vma is for ANON.  With this patch, /proc/pid/maps shows
just 0 if no file backing store.

[akpm@linux-foundation.org: coding-style fixes]
[kamezawa.hiroyu@jp.fujitsu.com: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mike Waychison <mikew@google.com>
Reported-by: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:03 -07:00
Ingo Molnar f8201abcb2 ramfs: fix double freeing s_fs_info on failed mount
If ramfs mount fails, s_fs_info will be freed twice in ramfs_fill_super()
and ramfs_kill_sb(), leading to kernel oops.

Consolidate and beautify the code.
Make sure s_fs_info and s_root are in known good states.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 07:39:59 -07:00
Trond Myklebust d508afb437 NFS: Fix a double free in nfs_parse_mount_options()
Due to an apparent typo, commit a67d18f89f
(NFS: load the rpc/rdma transport module automatically) lead to the
'proto=' mount option doing a double free, while Opt_mountproto leaks a
string.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 17:19:48 -07:00
Linus Torvalds bbae8bcc49 ext3: make default data ordering mode configurable
This makes the defautl ext3 data ordering mode (when no explicit
ordering is set) configurable, so as to allow people to default to
'data=writeback' and get the resulting latency improvements.

This is a non-issue if a filesystem has been explicitly set to some
ordering (with 'tune2fs').

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 17:16:47 -07:00
Linus Torvalds e0724bf6e4 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix recovery bug
  UBIFS: add R/O compatibility
  UBIFS: fix compiler warnings
  UBIFS: fully sort GCed nodes
  UBIFS: fix commentaries
  UBIFS: introduce a helpful variable
  UBIFS: use KERN_CONT
  UBIFS: fix lprops committing bug
  UBIFS: fix bogus assertion
  UBIFS: fix bug where page is marked uptodate when out of space
  UBIFS: amend key_hash return value
  UBIFS: improve find function interface
  UBIFS: list usage cleanup
  UBIFS: fix dbg_chk_lpt_sz()
2009-04-06 15:00:19 -07:00
Linus Torvalds 22ae77bc7a Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (53 commits)
  [MTD] struct device - replace bus_id with dev_name(), dev_set_name()
  [MTD] [NOR] Fixup for Numonyx M29W128 chips
  [MTD] mtdpart: Make ecc_stats more realistic.
  powerpc/85xx: TQM8548: Update DTS file for multi-chip support
  powerpc: NAND: FSL UPM: document new bindings
  [MTD] [NAND] FSL-UPM: Add wait flags to support board/chip specific delays
  [MTD] [NAND] FSL-UPM: add multi chip support
  [MTD] [NOR] Add device parent info to physmap_of
  [MTD] [NAND] Add support for NAND on the Socrates board
  [MTD] [NAND] Add support for 4KiB pages.
  [MTD] sysfs support should not depend on CONFIG_PROC_FS
  [MTD] [NAND] Add parent info for CAFÉ controller
  [MTD] support driver model updates
  [MTD] driver model updates (part 2)
  [MTD] driver model updates
  [MTD] [NAND] move gen_nand's probe function to .devinit.text
  [MTD] [MAPS] move sa1100 flash's probe function to .devinit.text
  [MTD] fix use after free in register_mtd_blktrans
  [MTD] [MAPS] Drop now unused sharpsl-flash map
  [MTD] ofpart: Check name property to determine partition nodes.
  ...

Manually fix trivial conflict in drivers/mtd/maps/Makefile
2009-04-06 14:56:26 -07:00
Linus Torvalds 12fe32e4f9 Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kmemtrace: trace kfree() calls with NULL or zero-length objects
  kmemtrace: small cleanups
  kmemtrace: restore original tracing data binary format, improve ABI
  kmemtrace: kmemtrace_alloc() must fill type_id
  kmemtrace: use tracepoints
  kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints
  kmemtrace, rcu: fix rcupreempt.c data structure dependencies
  kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
  kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
  kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c
  kmemtrace, squashfs: fix slab.h dependency problem in squasfs
  kmemtrace, befs: fix slab.h dependency problem
  kmemtrace, security: fix linux/key.h header file dependencies
  kmemtrace, fs: fix linux/fdtable.h header file dependencies
  kmemtrace, fs: uninline simple_transaction_set()
  kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h
2009-04-06 13:30:00 -07:00
Linus Torvalds a63856252d Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits)
  nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4
  nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
  nfsd41: Documentation/filesystems/nfs41-server.txt
  nfsd41: CREATE_EXCLUSIVE4_1
  nfsd41: SUPPATTR_EXCLCREAT attribute
  nfsd41: support for 3-word long attribute bitmask
  nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
  nfsd41: pass writable attrs mask to nfsd4_decode_fattr
  nfsd41: provide support for minor version 1 at rpc level
  nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
  nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
  nfsd41: access_valid
  nfsd41: clientid handling
  nfsd41: check encode size for sessions maxresponse cached
  nfsd41: stateid handling
  nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
  nfsd41: destroy_session operation
  nfsd41: non-page DRC for solo sequence responses
  nfsd41: Add a create session replay cache
  nfsd41: create_session operation
  ...
2009-04-06 13:25:56 -07:00
Dave Chinner 8de2bf937a xfs: remove xfs_flush_space
The only thing we need to do now when we get an ENOSPC condition during delayed
allocation reservation is flush all the other inodes with delalloc blocks on
them and retry without EOF preallocation. Remove the unneeded mess that is
xfs_flush_space() and just call xfs_flush_inodes() directly from
xfs_iomap_write_delay().

Also, change the location of the retry label to avoid trying to do EOF
preallocation because we don't want to do that at ENOSPC. This enables us to
remove the BMAPI_SYNC flag as it is no longer used.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:49:12 +02:00
Dave Chinner 153fec43ce xfs: flush delayed allcoation blocks on ENOSPC in create
If we are creating lots of small files, we can fail to get
a reservation for inode create earlier than we should due to
EOF preallocation done during delayed allocation reservation.
Hence on the first reservation ENOSPC failure flush all the
delayed allocation blocks out of the system and retry.

This fixes the last commonly triggered spurious ENOSPC issue
that has been reported.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:48:30 +02:00
Dave Chinner e43afd72d2 xfs: block callers of xfs_flush_inodes() correctly
xfs_flush_inodes() currently uses a magic timeout to wait for
some inodes to be flushed before returning. This isn't
really reliable but used to be the best that could be done
due to deadlock potential of waiting for the entire flush.

Now the inode flush is safe to execute while we hold page
and inode locks, we can wait for all the inodes to flush
synchronously. Convert the wait mechanism to a completion
to do this efficiently. This should remove all remaining
spurious ENOSPC errors from the delayed allocation reservation
path.

This is extracted almost line for line from a larger patch
from Mikulas Patocka.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:47:27 +02:00
Dave Chinner 5825294edd xfs: make inode flush at ENOSPC synchronous
When we are writing to a single file and hit ENOSPC, we trigger a background
flush of the inode and try again.  Because we hold page locks and the iolock,
the flush won't proceed until after we release these locks. This occurs once
we've given up and ENOSPC has been reported. Hence if this one is the only
dirty inode in the system, we'll get an ENOSPC prematurely.

To fix this, remove the async flush from the allocation routines and move
it to the top of the write path where we can do a synchronous flush
and retry the write again. Only retry once as a second ENOSPC indicates
that we really are ENOSPC.

This avoids a page cache deadlock when trying to do this flush synchronously
in the allocation layer that was identified by Mikulas Patocka.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:45:44 +02:00
Dave Chinner a8d770d987 xfs: use xfs_sync_inodes() for device flushing
Currently xfs_device_flush calls sync_blockdev() which is
a no-op for XFS as all it's metadata is held in a different
address to the one sync_blockdev() works on.

Call xfs_sync_inodes() instead to flush all the delayed
allocation blocks out. To do this as efficiently as possible,
do it via two passes - one to do an async flush of all the
dirty blocks and a second to wait for all the IO to complete.
This requires some modification to the xfs-sync_inodes_ag()
flush code to do efficiently.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:44:54 +02:00
Dave Chinner 9d7fef74b2 xfs: inform the xfsaild of the push target before sleeping
When trying to reserve log space, we find the amount of space
we need, then go to sleep waiting for space. When we are
woken, we try to push the tail of the log forward to make
sure we have space available.

Unfortunately, this means that if there is not space available, and
everyone who needs space goes to sleep there is no-one left to push
the tail of the log to make space available. Once we have a thread
waiting for space to become available, the others queue up behind
it in a FIFO, and none of them push the tail of the log.

This can result in everyone going to sleep in xlog_grant_log_space()
if the first sleeper races with the last I/O that moves the tail
of the log forward. With no further I/O tomove the tail of the log,
there is nothing to wake the sleepers and hence all transactions
just stop.

Fix this by making sure the xfsaild will create enough space for the
transaction that is about to sleep by moving the push target far
enough forwards to ensure that that the curent proceeees will have
enough space available when it is woken. That is, we push the
AIL before we go to sleep.

Because we've inserted the log ticket into the queue before we've
pushed and gone to sleep, subsequent transactions will wait behind
this one. Hence we are guaranteed to have space available when we
are woken.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:42:59 +02:00
Dave Chinner c626d174cf xfs: prevent unwritten extent conversion from blocking I/O completion
Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
deadlock situation.

Move unwritten extent completion into it's own workqueue so it
doesn't block I/O completions for normal delayed allocation or
overwrite data.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:42:11 +02:00
Dave Chinner 705db3fd46 xfs: fix double free of inode
If we fail to initialise the VFS inode in inode_init_always(),
it will call ->delete_inode internally resulting in the inode being
freed. Hence we need to delay the call to inode_init_always()
until after the XFS inode is sufficient set up to handle a
call to ->delete_inode, and then if that fails do not touch
the inode again at all as it has been freed.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:40:17 +02:00
Dave Chinner a6cb767e24 xfs: validate log feature fields correctly
If the large log sector size feature bit is set in the
superblock by accident (say disk corruption), the then
fields that are now considered valid are not checked on
production kernels. The checks are present as ASSERT
statements so cause a panic on a debug kernel.

Change this so that the fields are validity checked if
the feature bit is set and abort the log mount if the
fields do not contain valid values.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:39:27 +02:00
Benny Halevy f0ad670d70 nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
Fixes the following compiler error:
fs/nfsd/nfssvc.c: In function 'set_max_drc':
fs/nfsd/nfssvc.c:240: error: 'NFSD_DRC_SIZE_SHIFT' undeclared

CONFIG_NFSD_V4 is not set

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-06 09:17:53 -07:00
Jens Axboe 1aa2a7cc6f block: switch sync_dirty_buffer() over to WRITE_SYNC
We should now have the logic in place to handle this properly
without regressing on the write performance, so re-enable
the sync writes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe aeb6fafb8f block: Add flag for telling the IO schedulers NOT to anticipate more IO
By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.

Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe 4194b1eaf1 jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNC
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe 6c4bac6b33 jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:53 -07:00
Jens Axboe 9cf6b720f8 block: fsync_buffers_list() should use SWRITE_SYNC_PLUG
Then it can submit all the buffers without unplugging for each one.
We will kick off the pending IO if we come across a new address space.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:53 -07:00
Linus Torvalds 714f83d5d9 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
  tracing, net: fix net tree and tracing tree merge interaction
  tracing, powerpc: fix powerpc tree and tracing tree interaction
  ring-buffer: do not remove reader page from list on ring buffer free
  function-graph: allow unregistering twice
  trace: make argument 'mem' of trace_seq_putmem() const
  tracing: add missing 'extern' keywords to trace_output.h
  tracing: provide trace_seq_reserve()
  blktrace: print out BLK_TN_MESSAGE properly
  blktrace: extract duplidate code
  blktrace: fix memory leak when freeing struct blk_io_trace
  blktrace: fix blk_probes_ref chaos
  blktrace: make classic output more classic
  blktrace: fix off-by-one bug
  blktrace: fix the original blktrace
  blktrace: fix a race when creating blk_tree_root in debugfs
  blktrace: fix timestamp in binary output
  tracing, Text Edit Lock: cleanup
  tracing: filter fix for TRACE_EVENT_FORMAT events
  ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
  x86: kretprobe-booster interrupt emulation code fix
  ...

Fix up trivial conflicts in
 arch/parisc/include/asm/ftrace.h
 include/linux/memory.h
 kernel/extable.c
 kernel/module.c
2009-04-05 11:04:19 -07:00
Thiemo Nagel e44543b83b ext4: Fix off-by-one-error in ext4_valid_extent_idx()
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-04 23:30:44 -04:00
Thiemo Nagel f73953c065 ext4: Fix big-endian problem in __ext4_check_blockref()
Commit fe2c8191 introduced a regression on big-endian system, because
the checks to make sure block references in non-extent inodes are
valid failed to use le32_to_cpu().

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-04-07 18:46:47 -04:00
Linus Torvalds 601cc11d05 Make non-compat preadv/pwritev use native register size
Instead of always splitting the file offset into 32-bit 'high' and 'low'
parts, just split them into the largest natural word-size - which in C
terms is 'unsigned long'.

This allows 64-bit architectures to avoid the unnecessary 32-bit
shifting and masking for native format (while the compat interfaces will
obviously always have to do it).

This also changes the order of 'high' and 'low' to be "low first".  Why?
Because when we have it like this, the 64-bit system calls now don't use
the "pos_high" argument at all, and it makes more sense for the native
system call to simply match the user-mode prototype.

This results in a much more natural calling convention, and allows the
compiler to generate much more straightforward code.  On x86-64, we now
generate

        testq   %rcx, %rcx      # pos_l
        js      .L122   #,
        movq    %rcx, -48(%rbp) # pos_l, pos

from the C source

        loff_t pos = pos_from_hilo(pos_h, pos_l);
	...
        if (pos < 0)
                return -EINVAL;

and the 'pos_h' register isn't even touched.  It used to generate code
like

        mov     %r8d, %r8d      # pos_low, pos_low
        salq    $32, %rcx       #, tmp71
        movq    %r8, %rax       # pos_low, pos.386
        orq     %rcx, %rax      # tmp71, pos.386
        js      .L122   #,
        movq    %rax, -48(%rbp) # pos.386, pos

which isn't _that_ horrible, but it does show how the natural word size
is just a more sensible interface (same arguments will hold in the user
level glibc wrapper function, of course, so the kernel side is just half
of the equation!)

Note: in all cases the user code wrapper can again be the same. You can
just do

	#define HALF_BITS (sizeof(unsigned long)*4)
	__syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);

or something like that.  That way the user mode wrapper will also be
nicely passing in a zero (it won't actually have to do the shifts, the
compiler will understand what is going on) for the last argument.

And that is a good idea, even if nobody will necessarily ever care: if
we ever do move to a 128-bit lloff_t, this particular system call might
be left alone.  Of course, that will be the least of our worries if we
really ever need to care, so this may not be worth really caring about.

[ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]

Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-04 14:20:34 -07:00
Benny Halevy 79fb54abd2 nfsd41: CREATE_EXCLUSIVE4_1
Implement the CREATE_EXCLUSIVE4_1 open mode conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

This mode allows the client to atomically create a file
if it doesn't exist while setting some of its attributes.

It must be implemented if the server supports persistent
reply cache and/or pnfs.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Benny Halevy 8c18f2052e nfsd41: SUPPATTR_EXCLCREAT attribute
Return bitmask for supported EXCLUSIVE4_1 create attributes.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Andy Adamson 7e70570647 nfsd41: support for 3-word long attribute bitmask
Also, use client minorversion to generate supported attrs

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Benny Halevy 95ec28cda3 nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
_nfsd4_verify currently skips 3 words from the encoded buffer begining.
With support for 3-word attr bitmaps in nfsd41, nfsd4_encode_fattr
may encode 1, 2, or 3 words, and not always 2 as it used to be, hence
we need to find out where to skip using the encoded bitmap length.

Note: This patch may be applied over pre-nfsd41 nfsd.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:22 -07:00
Benny Halevy c0d6fc8a2d nfsd41: pass writable attrs mask to nfsd4_decode_fattr
In preparation for EXCLUSIVE4_1

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:22 -07:00