Commit Graph

30831 Commits

Author SHA1 Message Date
Namjae Jeon e975082411 f2fs: add compat_ioctl to provide backward compatability
adding compat_ioctl to provide support for backward comptability - 32bit binary
execution on 64bit kernel.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Jaegeuk Kim b7250d2d84 f2fs: fix calculation of max. gc cost in the SSR case
In the SSR case, the max gc cost should be the number of pages in a segment.
Otherwise, f2fs is able to fail getting dirty segments frequently for SSR.

In get_victim_by_default() previously,

while(1) {
   ...
   cost = get_gc_cost(); <- cost is between 0 ~ 512.
   ...
   if (cost == get_max_cost(sbi, &p)) <- max cost is UINT_MAX due to GC_CB type
	continue;

   if (nsearched++ >= MAX_VICTIM_SEARCH)
	break;
}

So, if there are a number of fully valid segments in series, f2fs cannot skip
those segments by comparing the cost and max cost of each segment.

Note that, the cost is the number of valid blocks at the time of the last
checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Jaegeuk Kim 437275272f f2fs: clarify and enhance the f2fs_gc flow
This patch makes clearer the ambiguous f2fs_gc flow as follows.

1. Remove intermediate checkpoint condition during f2fs_gc
 (i.e., should_do_checkpoint() and GC_BLOCKED)

2. Remove unnecessary return values of f2fs_gc because of #1.
 (i.e., GC_NODE, GC_OK, etc)

3. Simplify write_checkpoint() because of #2.

4. Clarify the main f2fs_gc flow.
 o monitor how many freed sections during one iteration of do_garbage_collect().
 o do GC more without checkpoints if we can't get enough free sections.
 o do checkpoint once we've got enough free sections through forground GCs.

5. Adopt thread-logging (Slack-Space-Recycle) scheme more aggressively on data
  log types. See. get_ssr_segement()

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Namjae Jeon b1f1daf8c7 f2fs: optimize the return condition for has_not_enough_free_secs
Instead of evaluating the free_sections and then deciding to return
true/false from that path. We can directly use the evaluation condition
for returning proper value.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Namjae Jeon 5ac206cf4f f2fs: make an accessor to get sections for particular block type
Introduce accessor to get the sections based upon the block type
(node,dents...) and modify the functions : should_do_checkpoint,
has_not_enough_free_secs to use this accessor function to get
the node sections and dent sections.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Namjae Jeon 25718423ea f2fs: mark gc_thread as NULL when thread creation is failed
When gc thread creation is failed, mark gc_thread as NULL to avoid
crash while trying to stop invalid thread in stop_gc_thread->kthread_stop.
Instead make it return from:
	if (!gc_th)
       return;

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Namjae Jeon ec7b1f2dd1 f2fs: name gc task as per the block device
Currently GC task is started for each f2fs formatted/mounted device.
But, when we check the task list, using 'ps', there is no distinguishing
factor between the tasks. So, name the task as per the block device just
like the flusher threads.
Also, remove the macro GC_THREAD_NAME and instead use the name: f2fs_gc
to avoid name length truncation, as the command length is 16
-> TASK_COMM_LEN 16 and example name like:
f2fs_gc_task:8:16 -> this exceeds name length

Before Patch for 2 F2FS formatted partitions:
root  28061  0.0  0.0  0 0 ? S 10:31   0:00 [f2fs_gc_task]
root  28087  0.0  0.0  0 0 ? S 10:32   0:00 [f2fs_gc_task]

After Patch:
root  16756  0.0  0.0  0  0 ?  S  14:57   0:00 [f2fs_gc-8:18]
root  16765  0.0  0.0  0  0 ?  S  14:57   0:00 [f2fs_gc-8:19]

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Changman Lee 48600e44c1 f2fs: remove unnecessary gc option check and balance_fs
1. If f2fs is mounted with background_gc_off option, checking
    BG_GC is not redundant.
 2. f2fs_balance_fs is checked in f2fs_gc, so this is also redundant.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:02 +09:00
Changman Lee 94787d91cb f2fs: remove repeated F2FS_SET_SB_DIRT call
F2FS_SET_SB_DIRT is called in inc_page_count and
it is directly called one more time in the next line.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
majianpeng 14d7e9de05 f2fs: when check superblock failed, try to check another superblock
In f2fs, there are two superblocks. So when the first superblock was
invalidate, it should try to check another.

By Jaegeuk Kim:
 o Remove a white space for coding style
 o Clean up for code readability
 o Fix a typo

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
majianpeng 5c9b469295 f2fs: use F2FS_BLKSIZE to judge bloksize and page_cache_size
In some system PAGE_CACHE_SIZE isn't 4K. So using F2FS_BLKSIZE to judge.

By Jaegeuk Kim:
 o f2fs does not support no other 4KB page cache size.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
majianpeng f83759e283 f2fs: add device name in debugfs
In file status, it can't distinguish between different devices.
So add device name to do this function.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
Changman Lee facb020540 f2fs: stop repeated checking if cp is needed
If it is decided that f2fs should do checkpoint, skip next comparison.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
2013-02-12 07:15:01 +09:00
Jaegeuk Kim d4686d56ec f2fs: avoid balanc_fs during evict_inode
1. Background

Previously, if f2fs tries to move data blocks of an *evicting* inode during the
cleaning process, it stops the process incompletely and then restarts the whole
process, since it needs a locked inode to grab victim data pages in its address
space. In order to get a locked inode, iget_locked() by f2fs_iget() is normally
used, but, it waits if the inode is on freeing.

So, here is a deadlock scenario.
1. f2fs_evict_inode()       <- inode "A"
  2. f2fs_balance_fs()
    3. f2fs_gc()
      4. gc_data_segment()
        5. f2fs_iget()      <- inode "A" too!

If step #1 and #5 treat a same inode "A", step #5 would fall into deadlock since
the inode "A" is on freeing. In order to resolve this, f2fs_iget_nowait() which
skips __wait_on_freeing_inode() was introduced in step #5, and stops f2fs_gc()
to complete f2fs_evict_inode().

1. f2fs_evict_inode()           <- inode "A"
  2. f2fs_balance_fs()
    3. f2fs_gc()
      4. gc_data_segment()
        5. f2fs_iget_nowait()   <- inode "A", then stop f2fs_gc() w/ -ENOENT

2. Problem and Solution

In the above scenario, however, f2fs cannot finish f2fs_evict_inode() only if:
 o there are not enough free sections, and
 o f2fs_gc() tries to move data blocks of the *evicting* inode repeatedly.

So, the final solution is to use f2fs_iget() and remove f2fs_balance_fs() in
f2fs_evict_inode().
The f2fs_evict_inode() actually truncates all the data and node blocks, which
means that it doesn't produce any dirty node pages accordingly.
So, we don't need to do f2fs_balance_fs() in practical.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
Jaegeuk Kim 369a708c2a f2fs: remove the use of page_cache_release
Let's remove the use of page_cache_release() in f2fs, and instead, use
f2fs_put_page(page, 0) which is exactly same but for code readability.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
Namjae Jeon 324ddc702e f2fs: fix typo mistake for data_version description
In f2fs_inode_info structure, the description for data_version
has a typo mistake. It should be latest instead of lastes.
So, correcting that.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
Namjae Jeon a2b52a598a f2fs: reorganize code for ra_node_page
We can remove unneeded label unlock_out, avoid unnecessary jump
and reorganize the returning conditions in this function.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:01 +09:00
Namjae Jeon 3786dfdf4f f2fs: avoid redundant call to has_not_enough_free_secs in f2fs_gc
After doing a write_checkpoint from garbage collection path if there is still
need to do more garbage collection, gc_more label is used to jump and start
the process again. And in that process, first step before getting victim is to
check if there are not enough free sections, which is already done before
doing a jump to gc_more. We can avoid the redundant call to check free
sections, by checking the gc_type flag which will remain FG_GC(value 1) under
this condition.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
Changman Lee d6212a5f18 f2fs: add un/freeze_fs into super_operations
This patch supports ioctl FIFREEZE and FITHAW to snapshot filesystem.
Before calling f2fs_freeze, all writers would be suspended and sync_fs
would be completed. So no f2fs has to do something.
Just background gc operation should be skipped due to generate dirty
nodes and data until unfreeze.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
majianpeng a2617dc686 f2fs: clean up the add_orphan_inode func
For the code
> prev = list_entry(orphan->list.prev, typeof(*prev), list);
if orphan->list.prev == head, it can't get the right prev.
And we can use the parameter 'this' to add.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
Alejandro Martinez Ruiz aa43507f68 f2fs: fix disable_ext_identify option spelling
There is a typo in the ->show_options function for disable_ext_identify.
Fix it to match the spelling from the documentation.

Signed-off-by: Alejandro Martinez Ruiz <alex@nowcomputing.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
Jaegeuk Kim bd43df021a f2fs: cover global locks for reserve_new_block
The fill_zero() from fallocate() calls get_new_data_page() in which calls
reserve_new_block().
The reserve_new_block() should be covered by *DATA_NEW*, one of global locks.
And also, before getting the lock, we should check free sections by calling
f2fs_balance_fs().

If we break this rule, f2fs is able to face with out-of-control free space
management and fall into infinite loop like the following scenario as well.

[f2fs_sync_fs()]             [fallocate()]
 - write_checkpoint()        - fill_zero()
  - block_operations()        - get_new_data_page()
   : grab NODE_NEW             - get_dnode_of_data()
                                : get locked dirty node page
    - sync_node_pages()
                                : try to grab NODE_NEW for data allocation
     : trylock and skip the dirty node page
   : call sync_node_pages() repeatedly in order to flush all the dirty node
     pages!

In order to avoid this, we should grab another global lock such as DATA_NEW
before calling get_new_data_page() in fill_zero().

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
Jaegeuk Kim 577e349514 f2fs: prevent checkpoint once any IO failure is detected
This patch enhances the checkpoint routine to cope with IO errors.

Basically f2fs detects IO errors from end_io_write, and the errors are able to
be occurred during one of data, node, and meta page writes.

In the previous code, when an IO error is occurred during writes, f2fs sets a
flag, CP_ERROR_FLAG, in the raw ckeckpoint buffer which will be written to disk.
Afterwards, write_checkpoint() will check the flag and remount f2fs as a
read-only (ro) mode.

However, even once f2fs is remounted as a ro mode, dirty checkpoint pages are
freely able to be written to disk by flusher or kswapd in background.
In such a case, after cold reboot, f2fs would restore the checkpoint data having
CP_ERROR_FLAG, resulting in disabling write_checkpoint and remounting f2fs as
a ro mode again.

Therefore, let's prevent any checkpoint page (meta) writes once an IO error is
occurred, and remount f2fs as a ro mode right away at that moment.

Reported-by: Oliver Winker <oliver@oli1170.net>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-02-12 07:15:00 +09:00
Changman Lee 7d79e75f64 f2fs: save device node number into f2fs_inode
This patch stores inode->i_rdev into on-disk inode structure.

Alun reported that:
 aspire tmp # mount -t f2fs /dev/sdb mnt
 aspire tmp # mknod mnt/sda1 b 8 1
 aspire tmp # mknod mnt/null c 1 3
 aspire tmp # mknod mnt/console c 5 1
 aspire tmp # ls -l mnt
 total 2
 crw-r--r-- 1 root root 5, 1 Jan 22 18:44 console
 crw-r--r-- 1 root root 1, 3 Jan 22 18:44 null
 brw-r--r-- 1 root root 8, 1 Jan 22 18:44 sda1
 aspire tmp # umount mnt
 aspire tmp # mount -t f2fs /dev/sdb mnt
 aspire tmp # ls -l mnt
 total 2
 crw-r--r-- 1 root root 0, 0 Jan 22 18:44 console
 crw-r--r-- 1 root root 0, 0 Jan 22 18:44 null
 brw-r--r-- 1 root root 0, 0 Jan 22 18:44 sda1

In this report, f2fs lost the major/minor numbers of device files after umount.
The reason was revealed that f2fs does not store the inode->i_rdev to the
on-disk inode data structure.

So, as the other file systems do, f2fs also stores i_rdev into the i_addr fields
in on-disk inode structure without any on-disk layout changes.
Note that, this bug is limited to device files made by mknod().

Reported-and-Tested-by: Alun Jones <alun.linux@ty-penguin.org.uk>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-02-12 07:15:00 +09:00
Fengguang Wu e56a316214 nfsd4: free_stid can be static
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-02-11 16:22:50 -05:00
Trond Myklebust c21443c2c7 NFSv4: Fix a reboot recovery race when opening a file
If the server reboots after it has replied to our OPEN, but before we
call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
will not see a stateid for this open, and so will fail to recover it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:14 -05:00
Trond Myklebust 65b62a29f7 NFSv4: Ensure delegation recall and byte range lock removal don't conflict
Add a mutex to the struct nfs4_state_owner to ensure that delegation
recall doesn't conflict with byte range lock removal.

Note that we nest the new mutex _outside_ the state manager reclaim
protection (nfsi->rwsem) in order to avoid deadlocks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust 37380e4264 NFSv4: Fix up the return values of nfs4_open_delegation_recall
Adjust the return values so that they return EAGAIN to the caller in
cases where we might want to retry the delegation recall after
the state recovery has run.
Note that we can't wait and retry in this routine, because the caller
may be the state manager thread.

If delegation recall fails due to a session or reboot related issue,
also ensure that we mark the stateid as delegated so that
nfs_delegation_claim_opens can find it again later.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust d25be546a8 NFSv4.1: Don't lose locks when a server reboots during delegation return
If the server reboots while we are converting a delegation into
OPEN/LOCK stateids as part of a delegation return, the current code
will simply exit with an error. This causes us to lose both
delegation state and locking state (i.e. locking atomicity).

Deal with this by exposing the delegation stateid during delegation
return, so that we can recover the delegation, and then resume
open/lock recovery.

Note that not having to hold the nfs_inode->rwsem across the
calls to nfs_delegation_claim_opens() also fixes a deadlock against
the NFSv4.1 reboot recovery code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust 9a99af494b NFSv4.1: Prevent deadlocks between state recovery and file locking
We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep.  This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...

Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.

Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust c137afabe3 NFSv4: Allow the state manager to mark an open_owner as being recovered
This patch adds a seqcount_t lock for use by the state manager to
signal that an open owner has been recovered. This mechanism will be
used by the delegation, open and byte range lock code in order to
figure out if they need to replay requests due to collisions with
lock recovery.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:11 -05:00
Rafael J. Wysocki a9834cb205 Merge branch 'acpi-pm'
* acpi-pm: (35 commits)
  ACPI / PM: Handle missing _PSC in acpi_bus_update_power()
  ACPI / PM: Do not power manage devices in unknown initial states
  ACPI / PM: Fix acpi_bus_get_device() check in drivers/acpi/device_pm.c
  ACPI / PM: Fix /proc/acpi/wakeup for devices w/o bus or parent
  ACPI / PM: Fix consistency check for power resources during resume
  ACPI / PM: Expose lists of device power resources to user space
  sysfs: Functions for adding/removing symlinks to/from attribute groups
  ACPI / PM: Expose current status of ACPI power resources
  ACPI / PM: Expose power states of ACPI devices to user space
  ACPI / scan: Prevent device add uevents from racing with user space
  ACPI / PM: Fix device power state value after transitions to D3cold
  ACPI / PM: Use string "D3cold" to represent ACPI_STATE_D3_COLD
  ACPI / PM: Sanitize checks in acpi_power_on_resources()
  ACPI / PM: Always evaluate _PSn after setting power resources
  ACPI / PM: Introduce helper for executing _PSn methods
  ACPI / PM: Make acpi_bus_init_power() more robust
  ACPI / PM: Fix build for unusual combination of Kconfig options
  ACPI / PM: remove leading whitespace from #ifdef
  ACPI / PM: Consolidate suspend-specific and hibernate-specific code
  ACPI / PM: Move device power management functions to device_pm.c
  ...
2013-02-11 13:20:56 +01:00
M. Mohan Kumar b6f4bee02f fs/9p: Fix atomic_open
Return EEXISTS if requested file already exists, without this patch open
call will always succeed even if the file exists and user specified
O_CREAT|O_EXCL.

Following test code can be used to verify this patch. Without this patch
executing following test code on 9p mount will result in printing 'test case
failed' always.

main()
{
        int fd;

        /* first create the file */
        fd = open("./file", O_CREAT|O_WRONLY);
        if (fd < 0) {
                perror("open");
                return -1;
        }
        close(fd);

        /* Now opening same file with O_CREAT|O_EXCL should fail */
        fd = open("./file", O_CREAT|O_EXCL);
        if (fd < 0 && errno == EEXIST)
	        printf("test case pass\n");
        else
	        printf("test case failed\n");
        close(fd);
        return 0;
}

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:59 -06:00
Aneesh Kumar K.V 03f0e02273 fs/9p: Don't use O_TRUNC flag in TOPEN and TLOPEN request
We do the truncate via setattr request, hence don't pass the O_TRUNC flag in
open request. Without this patch we end up sending zero sized write request
to server when we try to truncate. Some servers (VirtFS) were not handling that
properly.

Reported-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:47 -06:00
Al Viro 7ffdea7ea3 locking in fs/9p ->readdir()
... is really excessive.  First of all, ->readdir() is serialized by
file->f_path.dentry->d_inode->i_mutex; playing with file->f_path.dentry->d_lock
is not buying you anything.  Moreover, rdir->mutex is pointless for exactly
the same reason - you'll never see contention on it.

	While we are at it, there's no point in having rdir->buf a pointer -
you have it point just past the end of rdir, so it might as well be a flex
array (and no, it's not a gccism).

	Absolutely untested patch follows:

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-02-10 16:29:33 -06:00
Theodore Ts'o b6e96d0067 jbd2: use module parameters instead of debugfs for jbd_debug
There are multiple reasons to move away from debugfs.  First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the jbd2 module.

Secondly, as a module paramter it can be specified as a boot option if
jbd2 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/jbd2/parameters/jbd2_debug.  So it is more flexible.

Ultimately we want to move away from using jbd_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:29:20 -05:00
Theodore Ts'o a0b30c1229 ext4: use module parameters instead of debugfs for mballoc_debug
There are multiple reasons to move away from debugfs.  First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the ext4 module.

Secondly, as a module paramter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.

Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:28:20 -05:00
Theodore Ts'o 1139575a92 ext4: start handle at the last possible moment when creating inodes
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated.  In order to do this,
we need to start the handle in the ext4_new_inode().  So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:27:09 -05:00
Theodore Ts'o 95eaefbdec ext4: fix the number of credits needed for acl ops with inline data
Operations which modify extended attributes may need extra journal
credits if inline data is used, since there is a chance that some
extended attributes may need to get pushed to an external attribute
block.

Changes to reflect this was made in xattr.c, but they were missed in
fs/ext4/acl.c.  To fix this, abstract the calculation of the number of
credits needed for xattr operations to an inline function defined in
ext4_jbd2.h, and use it in acl.c and xattr.c.

Also move the function declarations used in inline.c from xattr.h
(where they are non-obviously hidden, and caused problems since
ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
them to ext4.h.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:23:03 -05:00
Theodore Ts'o 64044abf05 ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir()
The ext4_unlink() and ext4_rmdir() don't actually release the blocks
associated with the file/directory.  This gets done in a separate jbd2
handle called via ext4_evict_inode().  Thus, we don't need to reserve
lots of journal credits for the truncate.

Note that using too many journal credits is non-optimal because it can
leading to the journal transmit getting closed too early, before it is
strictly necessary.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 15:06:24 -05:00
Theodore Ts'o 4b217630d0 ext4: fix the number of credits needed for ext4_ext_migrate()
The migration ioctl creates a temporary inode.  Since this inode is
never linked to a directory, we don't need to reserve journal credits
required for modifying the directory.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 12:50:27 -05:00
Theodore Ts'o 8dcfaad244 ext4: start handle at the last possible moment in ext4_rmdir()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:45:11 -05:00
Theodore Ts'o 931b68649d ext4: start handle at the last possible moment in ext4_unlink()
Don't start the jbd2 transaction handle until after the directory
entry has been found, to minimize the amount of time that a handle is
held active.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:43:39 -05:00
Theodore Ts'o 47564bfb95 ext4: grab page before starting transaction handle in write_begin()
The grab_cache_page_write_begin() function can potentially sleep for a
long time, since it may need to do memory allocation which can block
if the system is under significant memory pressure, and because it may
be blocked on page writeback.  If it does take a long time to grab the
page, it's better that we not hold an active jbd2 handle.

So grab a handle on the page first, and _then_ start the transaction
handle.

This commit fixes the following long transaction handle hold time:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-09 09:24:14 -05:00
Theodore Ts'o 9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Jeff Layton 01a7decf75 nfsd: keep a checksum of the first 256 bytes of request
Now that we're allowing more DRC entries, it becomes a lot easier to hit
problems with XID collisions. In order to mitigate those, calculate a
checksum of up to the first 256 bytes of each request coming in and store
that in the cache entry, along with the total length of the request.

This initially used crc32, but Chuck Lever and Jim Rees pointed out that
crc32 is probably more heavyweight than we really need for generating
these checksums, and recommended looking at using the same routines that
are used to generate checksums for IP packets.

On an x86_64 KVM guest measurements with ftrace showed ~800ns to use
csum_partial vs ~1750ns for crc32.  The difference probably isn't
terribly significant, but for now we may as well use csum_partial.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Stones-thrown-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-08 16:02:26 -05:00
Theodore Ts'o 722887ddc8 ext4: move the jbd2 wrapper functions out of super.c
Move the jbd2 wrapper functions which start and stop handles out of
super.c, where they don't really logically belong, and into
ext4_jbd2.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 13:00:31 -05:00
Theodore Ts'o 343d9c283c jbd2: add tracepoints which provide per-handle statistics
Handles which stay open a long time are problematic when it comes time
to close down a transaction so it can be committed.  These tracepoints
will help us determine which ones are the problematic ones, and to
validate whether changes makes things better or worse.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 13:00:22 -05:00
Al Viro b7f7a5e0be f2fs: get rid of fake on-stack dentries
those should never be used for a lot of reasons...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-08 02:55:04 -05:00
Al Viro 69f24eac55 f2fs: switch init_inode_metadata() to passing parent and name separately
... sure, it's tempting to just pass dentry.  Except that we don't
_have_ anything resembling a real dentry on one of the paths to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-08 02:55:04 -05:00
Al Viro c004363dd6 f2fs: switch new_inode_page() from dentry to qstr
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-08 02:55:03 -05:00
Al Viro 53dc9a6776 f2fs: init_dent_inode() should take qstr
for one thing, it doesn't (and shouldn't) use anything else from dentry;
for another, on some call chains the dentry is fake and should
be eliminated completely.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-08 02:55:03 -05:00
Linus Torvalds 8d19514fad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've got corner cases for updating i_size that ceph was hitting,
  error handling for quotas when we run out of space, a very subtle
  snapshot deletion race, a crash while removing devices, and one
  deadlock between subvolume creation and the sb_internal code (thanks
  lockdep)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: move d_instantiate outside the transaction during mksubvol
  Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
  Btrfs: fix possible stale data exposure
  Btrfs: fix missing i_size update
  Btrfs: fix race between snapshot deletion and getting inode
  Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
  Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
  Btrfs: do not merge logged extents if we've removed them from the tree
  btrfs: don't try to notify udev about missing devices
2013-02-08 12:06:46 +11:00
Clark Williams 8bd75c77b7 sched/rt: Move rt specific bits into new header file
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07 20:51:08 +01:00
Wang Shilong 712ddc52ff Ext2: remove the static function release_blocks to optimize the kernel
Because the static function 'release_blocks' is only called
when releasing blocks,it will be more simple and efficient to
call the function 'percpu_counter_add' directly.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-07 16:44:56 +01:00
Wang Shilong 8e3dffc651 Ext2: mark inode dirty after the function dquot_free_block_nodirty is called
We should mark inode dirty after the function dquot_free_block_nodirty
is called.Besides,add a check whether it is necessary to call
dquot_free_block_nodirty functon.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-07 16:44:55 +01:00
Alex Elder 311f08acde xfs: memory barrier before wake_up_bit()
In xfs_ifunlock() there is a call to wake_up_bit() after clearing
the flush lock on the xfs inode.  This is not guaranteed to be safe,
as noted in the comments above wake_up_bit() beginning with:

    In order for this to function properly, as it uses
    waitqueue_active() internally, some kind of memory
    barrier must be done prior to calling this.


Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-07 09:39:48 -06:00
Eric Wong 634734b63a fuse: allow control of adaptive readdirplus use
For some filesystems (e.g. GlusterFS), the cost of performing a
normal readdir and readdirplus are identical.  Since adaptively
using readdirplus has no benefit for those systems, give
users/filesystems the option to control adaptive readdirplus use.

v2 of this patch incorporates Miklos's suggestion to simplify the code,
as well as improving consistency of macro names and documentation.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-07 14:25:44 +01:00
Theodore Ts'o 9fff24aa2c jbd2: track request delay statistics
Track the delay between when we first request that the commit begin
and when it actually begins, so we can see how much of a gap exists.
In theory, this should just be the remaining scheduling quantuum of
the thread which requested the commit (assuming it was not a
synchronous operation which triggered the commit request) plus
scheduling overhead; however, it's possible that real time processes
might get in the way of letting the kjournald thread from executing.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-06 22:30:23 -05:00
Chris Mason 1a65e24b0b Btrfs: move d_instantiate outside the transaction during mksubvol
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.

btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.

This commit moves the d_instantiate after the transaction closes.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 12:11:10 -05:00
Jan Schmidt eb6b88d92c Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.

Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.

Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-06 09:24:40 -05:00
Wang Shilong 98783e453c Ext2: remove the overhead check about sb in the function ext2_new_blocks
It can be guranteed that inode->i_sb should not be null in vfs.
So here the check about it is overhead.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-06 13:47:02 +01:00
Chris Mason 24f8ebe918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus 2013-02-05 19:24:44 -05:00
Josef Bacik 59fe4f4197 Btrfs: fix possible stale data exposure
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data.  The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size.  Fix this by
checking if the extent ends past the disk_i_size.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:16 -05:00
Josef Bacik 5d1f40202b Btrfs: fix missing i_size update
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data.  The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update.  So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:14 -05:00
Liu Bo 6f1c36055f Btrfs: fix race between snapshot deletion and getting inode
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().

And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.

Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.

(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0).  So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.

Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.

So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:13 -05:00
Miao Xie 843fcf3573 Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:11 -05:00
Miao Xie 0a3404dcff Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:10 -05:00
Josef Bacik 222c81dc38 Btrfs: do not merge logged extents if we've removed them from the tree
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it.  The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-05 16:09:03 -05:00
Jan Kara 288be96de6 udf: Remove unused s_extLength from udf_bitmap
s_extLength was assigned to but the value was never really used. So
just remove the field.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-05 17:29:53 +01:00
Jan Kara c60305b578 udf: Make s_block_bitmap standard array
struct udf_bitmap has array of buffer pointers attached to it. The code
unnecessarily used s_block_bitmap as a pointer to the array instead of
the standard trick of using 0 length array in the declaration. Change
that to make code more readable and actually shrink the structure by one
pointer.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-05 17:29:52 +01:00
Jan Kara 89b1f39eb4 udf: Fix bitmap overflow on large filesystems with small block size
For large UDF filesystems with 512-byte blocks the number of necessary
bitmap blocks is larger than 2^16 so s_nr_groups in udf_bitmap overflows
(the number will overflow for filesystems larger than 128 GB with
512-byte blocks). That results in ENOSPC errors despite the filesystem
has plenty of free space.

Fix the problem by changing s_nr_groups' type to 'int'. That is enough
even for filesystems 2^32 blocks (UDF maximum) and 512-byte blocksize.

Reported-and-tested-by: v10lator@myway.de
Signed-off-by: Jan Kara <jack@suse.cz>
2013-02-05 17:29:30 +01:00
Chris Mason 0e4e026366 Merge branch 'for-linus' into raid56-experimental
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:04:03 -05:00
Chris Mason 1f0905ec15 Btrfs: remove conflicting check for minimum number of devices in raid56
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:01:42 -05:00
Tomasz Torcz 10e78e3a8a Btrfs: select XOR_BLOCKS in Kconfig
The Btrfs raid56 uses the generic xor helpers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 09:55:30 -05:00
Jeff Layton 5976687a2b sunrpc: move address copy/cmp/convert routines and prototypes from clnt.h to addr.h
These routines are used by server and client code, so having them in a
separate header would be best.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-05 09:41:14 -05:00
J. Bruce Fields 3abdb60712 nfsd4: simplify idr allocation
We don't really need to preallocate at all; just allocate and initialize
everything at once, but leave the sc_type field initially 0 to prevent
finding the stateid till it's fully initialized.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-05 09:41:12 -05:00
majianpeng 2d32b29a1c nfsd: Fix memleak
When free nfs-client, it must free the ->cl_stateids.

Cc: stable@kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-05 09:40:47 -05:00
Ingo Molnar b2c77a57e4 This implements the cputime accounting on full dynticks CPUs.
Typical cputime stats infrastructure relies on the timer tick and
 its periodic polling on the CPU to account the amount of time
 spent by the CPUs and the tasks per high level domains such as
 userspace, kernelspace, guest, ...
 
 Now we are preparing to implement full dynticks capability on
 Linux for Real Time and HPC users who want full CPU isolation.
 This feature requires a cputime accounting that doesn't depend
 on the timer tick.
 
 To implement it, this new cputime infrastructure plugs into
 kernel/user/guest boundaries to take snapshots of cputime and
 flush these to the stats when needed. This performs pretty
 much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
 and cputime snaphots are synchronized between write and read
 side such that the latter can safely retrieve the pending tickless
 cputime of a task and add it to its latest cputime snapshot to
 return the correct result to the user.
 
 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJRBsKnAAoJEIUkVEdQjox3lMgP/2R6DU2f8PyGIao3hne4M3Pu
 L3q+mAG53b24Dy014KeW7gd8yv45fE7wp/rs8CGLte9VzbLkRCDSFQPgBuXVagRj
 tV5nfAuqD0wHTnA+HhBE3l3C2RKAPGIu79rBpnIR/QIPPl8Z3Dby8YgmxEQKDf8G
 j7MEBu2LthSuqEi2ZXemnO5r0oEnQAzAp4TTi/M38k0Fmt59nOGyjLnI+xHYCBMa
 1pnz7j3jjR9NJExGu8iVvbo+jupuQngP8qmkLXHvYnj/TEJNwzO1hHVoSwOpjYpS
 9ycl+T8IKQLbAkBywLtq3Mzde43xt/t8wYyGZ0oAV+Z7MIpz/9YIfDJwqQeqoNbD
 dAdbNjKMbsxCgmrnyqSagfMQg/r3CPZ4vf40TMCaN4gNUJC4Ie+E4kPRKRh59+PB
 Ukthmqujn0f40LAa+HXTUuzafd3b0s/ewH+8FuQ6LAG9b5+WnoN8JTJ5u6+ydokO
 ZleeOowuRZZEg+abQ8Sm2GRm/BzN29gi/npb//I+ZDXWv/+3yccgsiPjCRzCAAaO
 g1RmYryFSRUwHQbGNNypVWVuOLWvrBQ4jqbGO7BBuBByZMSHryKxR6mb+inH3qLE
 xIDM9SdSJisc292OzoFKwVZki4MaXaadJXJduVvqYlZQvXXs7eAa4wo3euhtVITD
 NLQO5OZXE4oIQmDFb0FV
 =1Tzp
 -----END PGP SIGNATURE-----

Merge tag 'full-dynticks-cputime-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core

Pull full-dynticks (user-space execution is undisturbed and
receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready,
from Frederic Weisbecker:

 "This implements the cputime accounting on full dynticks CPUs.

  Typical cputime stats infrastructure relies on the timer tick and
  its periodic polling on the CPU to account the amount of time
  spent by the CPUs and the tasks per high level domains such as
  userspace, kernelspace, guest, ...

  Now we are preparing to implement full dynticks capability on
  Linux for Real Time and HPC users who want full CPU isolation.
  This feature requires a cputime accounting that doesn't depend
  on the timer tick.

  To implement it, this new cputime infrastructure plugs into
  kernel/user/guest boundaries to take snapshots of cputime and
  flush these to the stats when needed. This performs pretty
  much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
  and cputime snaphots are synchronized between write and read
  side such that the latter can safely retrieve the pending tickless
  cputime of a task and add it to its latest cputime snapshot to
  return the correct result to the user."

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-05 13:10:33 +01:00
Linus Torvalds fe547d7714 Merge branch 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fix from David Teigland:
 "Thanks to Jana who reported the problem and was able to test this fix
  so quickly."

This fixes an incorrect size check that triggered for CONFIG_COMPAT
whether the code was actually doing compat or not.  The incorrect write
size check broke userland (clvmd) when maximum resource name lengths are
used.

* 'fix-max-write' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: check the write size from user
2013-02-05 20:50:11 +11:00
Vyacheslav Dubeyko a9bae18954 nilfs2: fix fix very long mount time issue
There exists a situation when GC can work in background alone without
any other filesystem activity during significant time.

The nilfs_clean_segments() method calls nilfs_segctor_construct() that
updates superblocks in the case of NILFS_SC_SUPER_ROOT and
THE_NILFS_DISCONTINUED flags are set.  But when GC is working alone the
nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
As a result, the update of superblocks doesn't occurred all this time
and in the case of SPOR superblocks keep very old values of last super
root placement.

SYMPTOMS:

Trying to mount a NILFS2 volume after SPOR in such environment ends with
very long mounting time (it can achieve about several hours in some
cases).

REPRODUCING PATH:

1. It needs to use external USB HDD, disable automount and doesn't
   make any additional filesystem activity on the NILFS2 volume.

2. Generate temporary file with size about 100 - 500 GB (for example,
   dd if=/dev/zero of=<file_name> bs=1073741824 count=200).  The size of
   file defines duration of GC working.

3. Then it needs to delete file.

4. Start GC manually by means of command "nilfs-clean -p 0".  When you
   start GC by means of such way then, at the end, superblocks is updated
   by once.  So, for simulation of SPOR, it needs to wait sometime (15 -
   40 minutes) and simply switch off USB HDD manually.

5. Switch on USB HDD again and try to mount NILFS2 volume.  As a
   result, NILFS2 volume will mount during very long time.

REPRODUCIBILITY: 100%

FIX:

This patch adds checking that superblocks need to update and set
THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.

Reported-by: Sergey Alexandrov <splavgm@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05 20:38:46 +11:00
Jeff Layton b4e7f2c945 nfsd: register a shrinker for DRC cache entries
Since we dynamically allocate them now, allow the system to call us up
to release them if it gets low on memory. Since these entries aren't
replaceable, only free ones that are expired or that are over the cap.
The the seeks value is set to '1' however to indicate that freeing the
these entries is low-cost.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:13 -05:00
Jeff Layton aca8a23de6 nfsd: add recurring workqueue job to clean the cache
It's not sufficient to only clean the cache when requests come in. What
if we have a flurry of activity and then the server goes idle? Add a
workqueue job that will clean the cache every RC_EXPIRE period.

Care is taken to only run this when we expect to have entries expiring.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:12 -05:00
Jeff Layton 2c6b691c05 nfsd: when updating an entry with RC_NOCACHE, just free it
There's no need to keep entries around that we're declaring RC_NOCACHE.
Ditto if there's a problem with the entry.

With this change too, there's no need to test for RC_UNUSED in the
search function. If the entry's in the hash table then it's either
INPROG or DONE.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:11 -05:00
Jeff Layton 13cc8a78e8 nfsd: remove the cache_disabled flag
With the change to dynamically allocate entries, the cache is never
disabled on the fly. Remove this flag.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:11 -05:00
Jeff Layton 0338dd1572 nfsd: dynamically allocate DRC entries
The existing code keeps a fixed-size cache of 1024 entries. This is much
too small for a busy server, and wastes memory on an idle one.  This
patch changes the code to dynamically allocate and free these cache
entries.

A cap on the number of entries is retained, but it's much larger than
the existing value and now scales with the amount of low memory in the
machine.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:10 -05:00
Jeff Layton 0ee0bf7ee5 nfsd: track the number of DRC entries in the cache
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:09 -05:00
Jeff Layton 56c2548b2d nfsd: always move DRC entries to the end of LRU list when updating timestamp
...otherwise, we end up with the list ordering wrong. Currently, it's
not a problem since we skip RC_INPROG entries, but keeping the ordering
strict will be necessary for a later patch that adds a cache cleaner.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 17:19:09 -05:00
David Teigland d4b0bcf32b dlm: check the write size from user
Return EINVAL from write if the size is larger than
allowed.  Do this before allocating kernel memory for
the bogus size, which could lead to OOM.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Jana Saout <jana@saout.de>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-02-04 15:31:22 -06:00
Theodore Ts'o 40ae348762 ext4: optimize mballoc for large allocations
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group.  That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.

For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.

To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10?  Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group.  If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.

To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
2013-02-04 15:08:40 -05:00
Enke Chen 0415d29102 fuse: send poll events
commit 626cf23660 "poll: add poll_requested_events()..." enabled us to send the
requested events to the filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-04 16:14:32 +01:00
Miklos Szeredi dfca7cebc2 fuse: don't WARN when nlink is zero
drop_nlink() warns if nlink is already zero.  This is triggerable by a buggy
userspace filesystem.  The cure, I think, is worse than the disease so disable
the warning.

Reported-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-04 15:57:42 +01:00
Eric Wong 6a4e922c3d fuse: avoid out-of-scope stack access
The all pointers within fuse_req must point to valid memory once
fuse_force_forget() returns.

This bug appeared in "fuse: implement NFS-like readdirplus support"
and was never in any official Linux release.

I tested the fuse_force_forget() code path by injecting to fake -ENOMEM and
verified the FORGET operation was called properly in userspace.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-04 15:22:23 +01:00
Jeff Layton 2eeb9b2abc nfsd: initialize the exp->ex_uuid field in svc_export_init
commit 885c91f746 in Bruce's tree was causing oopses for me:

general protection fault: 0000 [#1] SMP
Modules linked in: nfsd(OF) nfs_acl(OF) auth_rpcgss(OF) lockd(OF) sunrpc(OF) kvm_amd kvm microcode i2c_piix4 virtio_net virtio_balloon cirrus drm_kms_helper ttm drm virtio_blk i2c_core
CPU 0
Pid: 564, comm: exportfs Tainted: GF          O 3.8.0-0.rc5.git2.1.fc19.x86_64 #1 Bochs Bochs
RIP: 0010:[<ffffffff811b1509>]  [<ffffffff811b1509>] kfree+0x49/0x280
RSP: 0018:ffff88007a3d7c50  EFLAGS: 00010203
RAX: 01adaf8dadadad80 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000001
RDX: ffffffff7fffffff RSI: 0000000000000000 RDI: 6b6b6b6b6b6b6b6b
RBP: ffff88007a3d7c80 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000000
R10: 0000000000000018 R11: 0000000000000000 R12: ffff88006a117b50
R13: ffffffffa01a589c R14: ffff8800631b0f50 R15: 01ad998dadadad80
FS:  00007fcaa3616740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f5d84b6fdd8 CR3: 0000000064db4000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process exportfs (pid: 564, threadinfo ffff88007a3d6000, task ffff88006af28000)
Stack:
 ffff88007a3d7c80 ffff88006a117b68 ffff88006a117b50 0000000000000000
 ffff8800631b0f50 ffff88006a117b50 ffff88007a3d7ca0 ffffffffa01a589c
 ffff880036be1148 ffff88007a3d7cf8 ffff88007a3d7e28 ffffffffa01a6a98
Call Trace:
 [<ffffffffa01a589c>] svc_export_put+0x5c/0x70 [nfsd]
 [<ffffffffa01a6a98>] svc_export_parse+0x328/0x7e0 [nfsd]
 [<ffffffffa016f1c7>] cache_do_downcall+0x57/0x70 [sunrpc]
 [<ffffffffa016f25e>] cache_downcall+0x7e/0x100 [sunrpc]
 [<ffffffffa016f338>] cache_write_procfs+0x58/0x90 [sunrpc]
 [<ffffffffa016f2e0>] ? cache_downcall+0x100/0x100 [sunrpc]
 [<ffffffff8123b0e5>] proc_reg_write+0x75/0xb0
 [<ffffffff811ccecf>] vfs_write+0x9f/0x170
 [<ffffffff811cd089>] sys_write+0x49/0xa0
 [<ffffffff816e0919>] system_call_fastpath+0x16/0x1b
Code: 66 66 66 90 48 83 fb 10 0f 86 c3 00 00 00 48 89 df 49 bf 00 00 00 00 00 ea ff ff e8 f2 12 ea ff 48 c1 e8 0c 48 c1 e0 06 49 01 c7 <49> 8b 07 f6 c4 80 0f 85 1d 02 00 00 49 8b 07 a8 80 0f 84 ee 01
RIP  [<ffffffff811b1509>] kfree+0x49/0x280
 RSP <ffff88007a3d7c50>

I think Majianpeng's patch is correct, but incomplete. In order for it
to be safe to free the ex_uuid unconditionally in svc_export_put, we
need to make sure it's initialized to NULL in the init routine.

Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:24 -05:00
Jeff Layton a4a3ec3291 nfsd: break out hashtable search into separate function
Later, we'll need more than one call site for this, so break it out
into a new function.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:24 -05:00
Jeff Layton d1a0774de6 nfsd: clean up and clarify the cache expiration code
Add a preprocessor constant for the expiry time of cache entries, and
move the test for an expired entry into a function. Note that the current
code does not test for RC_INPROG. It just assumes that it won't take more
than 2 minutes to fill out an in-progress entry.

I'm not sure how valid that assumption is though, so let's just ensure
that we never consider an RC_INPROG entry to be expired.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:23 -05:00
Jeff Layton 25e6b8b0e1 nfsd: remove redundant test from nfsd_reply_cache_free
Entries can only get a c_type of RC_REPLBUFF iff they are
RC_DONE. Therefore the test for RC_DONE isn't necessary here.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:22 -05:00
Jeff Layton f09841fdfa nfsd: add alloc and free functions for DRC entries
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:22 -05:00
Jeff Layton 8a8bc40d9b nfsd: create a dedicated slabcache for DRC entries
Currently we use kmalloc() which wastes a little bit of memory on each
allocation since it's a power of 2 allocator. Since we're allocating a
1024 of these now, and may need even more later, let's create a new
slabcache for them.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:21 -05:00
Jeff Layton 09662d58d5 nfsd: get rid of RC_INTR
The reply cache code never returns this status.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:20 -05:00
Jeff Layton 6dc8889589 nfsd: remove unneeded spinlock in nfsd_cache_update
The locking rules for cache entries say that locking the cache_lock
isn't needed if you're just touching the current entry. Earlier
in this function we set rp->c_state to RC_UNUSED without any locking,
so I believe it's ok to do the same here.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:19 -05:00
Jeff Layton 7b9e8522a6 nfsd: fix IPv6 address handling in the DRC
Currently, it only stores the first 16 bytes of any address. struct
sockaddr_in6 is 28 bytes however, so we're currently ignoring the last
12 bytes of the address.

Expand the c_addr field to a sockaddr_in6, and cast it to a sockaddr_in
as necessary. Also fix the comparitor to use the existing RPC
helpers for this.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-04 09:16:19 -05:00
Adam Thomas 8afd500cb5 UBIFS: fix double free of ubifs_orphan objects
The last orphan in the dnext list has its dnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the dnext list
and frees it immediately instead ignoring it as a second delete. The
orphan is later freed again by erase_deleted.

This change adds an explicit flag to ubifs_orphan indicating whether
it is pending delete.

Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: stable@vger.kernel.org
2013-02-04 12:31:48 +02:00
Adam Thomas 2928f0d0c5 UBIFS: fix use of freed ubifs_orphan objects
The last orphan in the cnext list has its cnext set to NULL. Because
of that, ubifs_delete_orphan assumes that it is not on the cnext list
and frees it immediately instead of adding it to the dnext list. The
freed orphan is later modified by write_orph_node.

This can cause various inconsistencies including directory entries
that cannot be removed and this error:

UBIFS error (pid 20685): layout_cnodes: LPT out of space at LEB 14:129009 needing 17, done_ltab 1, done_lsave 1

This is a regression introduced by
"7074e5eb UBIFS: remove invalid reference to list iterator variable".

This change adds an explicit flag to ubifs_orphan indicating whether
it is pending commit.

Signed-off-by: Adam Thomas <adamthomas1111@gmail.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v3.6+
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2013-02-04 12:31:00 +02:00
Thomas Gleixner 90889a635a Merge branch 'fortglx/3.9/time' of git://git.linaro.org/people/jstultz/linux into timers/core
Trivial conflict in arch/x86/Kconfig

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-04 11:03:03 +01:00
Al Viro 9d94b9e2f3 switch timerfd compat syscalls to COMPAT_SYSCALL_DEFINE
... and move them over to fs/timerfd.c.  Cleaner and easier
that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:25 -05:00
Al Viro f482e1b4a4 switch compat_sys_open* to COMPAT_SYSCALL_DEFINE 2013-02-03 15:09:24 -05:00
Theodore Ts'o 8dc0aa8cf0 ext4: check incompatible mount options while mounting ext2/3
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:38:39 -05:00
Jan Kara e33e60eaed ext4: print error when argument of inode_readahead_blk is invalid
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:14:31 -05:00
Jan Kara 5f3633e36b ext4: make mount option parsing loop more logical
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:09:36 -05:00
Jan Kara 0efb3b2300 ext4: move several mount options to standard handling loop
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 22:52:19 -05:00
Cong Ding 0e79537d30 ext4: reduce one "if" comparison in ext4_dirhash()
It is unnecessary to check i<4 after the loop; just do it before the
break.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 22:33:21 -05:00
Niu Yawei f116700971 ext4: fix race in ext4_mb_add_n_trim()
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-01 21:31:27 -05:00
Akria Fujita 87e698734b ext4: fix smatch warning in move_extent.c's mext_replace_branches()
Commit 2147b1a6a4 resulted in a new smatch warning:

> fs/ext4/move_extent.c:693 mext_replace_branches()
> 	 warn: variable dereferenced before check 'dext' (see line 683)

Fix this by adding a check to make sure dext is non-NULL before we
derefrence it.

Signed-off-by: Akria Fujita <a-fujita@rs.jp.nec.com>
[ modified by tytso to make sure an ext4_error is called ]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:52:46 -05:00
Julia Lawall 524c19ebc9 ext4: use WARN in ext4_alloc_blocks
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-01 20:07:21 -05:00
Jeff Liu a21cd50367 xfs: refactor space log reservation for XFS_TRANS_ATTR_SET
Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:

1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.

2) ((ext * (mp)->m_sb.sb_sectsize) + \
    (ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
    (128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.

This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time.  Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.

In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:56:31 -06:00
Jeff Liu 762c585b18 xfs: make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy()
Make use of XFS_SB_LOG_RES() at xfs_fs_log_dummy().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:55:59 -06:00
Jeff Liu 5166ab0655 xfs: make use of XFS_SB_LOG_RES() at xfs_mount_log_sb()
Make use of XFS_SB_LOG_RES() at xfs_mount_log_sb().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:55:08 -06:00
Jeff Liu e457274b60 xfs: make use of XFS_SB_LOG_RES() at xfs_log_sbcount()
Make use of XFS_SB_LOG_RES() at xfs_log_sbcount().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:47:18 -06:00
Jeff Liu a7bd794a0f xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk
Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:46:35 -06:00
Jeff Liu 762d7ba657 xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time
Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:45:50 -06:00
Jeff Liu a1bd955754 xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time
Convert the calculation of quota off transaction log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:44:29 -06:00
Jeff Liu 4800104438 xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time
The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:43:51 -06:00
Jeff Liu f0f2df94fa xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time
For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:43:11 -06:00
Jeff Liu f910a8c620 xfs: calculate xfs_qm_write_sb_changes() space log reservation at mount time
For the transaction that write the incore superblock changes of quota flags
to disk, it would reserve the same log space to clear/reset quota flags
transaction, hence we can use XFS_TRANS_SBCHANGE_LOG_RES() for it as well.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:42:32 -06:00
Jeff Liu b0c10b983a xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time
The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:40:17 -06:00
Jeff Liu 5b292ae3a9 xfs: make use of xfs_calc_buf_res() in xfs_trans.c
Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:39:29 -06:00
Bob Peterson d2b47cfb26 GFS2: Get a block reservation before resizing a file
This patch allocates a block reservation structure before growing
or shrinking a file. Without this structure, the grow or shink code
can reference the bad pointer.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-02-01 20:37:33 +00:00
Steven Whitehouse 4506a519f2 GFS2: Split glock lru processing into two parts
The intent here is to split the processing of the glock lru
list into two parts, so that the selection of glocks and the
disposal are separate functions. The plan is then, that further
updates can then be made to these functions in the future
to improve the selection of glocks and also the efficiency of
glock disposal.

The new feature which this patch brings is sorting the
glocks to be disposed of into glock number (and thus also
disk block number) order. Not all glocks will need i/o in
order to dispose of them, but some will, and at least we'll
generate mostly disk block order i/o now.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-02-01 20:36:03 +00:00
Jeff Liu 4f3b57832b xfs: add a helper to figure out the space log reservation per item
Add a new helper xfs_calc_buf_res() to calcuate out the transaction space
reservations per item.  xfs_buf_log_overhead() is used to figure out the
extra space for struct xfs_buf_log_format that gets written into the log
for every buffer as well as a log opheader, i.e. struct xlog_op_header.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-02-01 14:35:06 -06:00
Chris Mason bb721703aa Btrfs: reduce CPU contention while waiting for delayed extent operations
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.

It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.

The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.

This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself.  All the extra contention just slows
down the operations and doesn't get things done any faster.

This commit changes things to use a wait queue instead.  As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:25 -05:00
Chris Mason 242e18c7c1 Btrfs: reduce lock contention on extent buffer locks
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree.  On tree roots and
other extent buffers that very cache hot, this can be highly contended.

These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:25 -05:00
Chris Mason 8de972b4fa Btrfs: fix cluster alignment for mount -o ssd
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:24 -05:00
Chris Mason 6ac0f4884e Btrfs: add a plugging callback to raid56 writes
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.

This adds a plugging callback to gather those smaller writes full stripe
writes.

Example on flash:

fio job to do 64K writes in batches of 3 (which makes a full stripe):

With plugging: 450MB/s
Without plugging: 220MB/s

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:24 -05:00
Chris Mason 4ae10b3a13 Btrfs: Add a stripe cache to raid56
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk.  Pages are cached when:

* They are read in during a read/modify/write cycle

* They are written during a read/modify/write cycle

* They are involved in a parity rebuild

Pages are not cached if we're doing a full stripe write.  We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.

This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.

The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.

Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct

Without the stripe cache  -- 2.1MB/s
With the stripe cache 21MB/s

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
David Woodhouse 53b381b3ab Btrfs: RAID5 and RAID6
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio.  This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle.  This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk.  This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
David Woodhouse 64a167011b Btrfs: add rw argument to merge_bio_hook()
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:49:47 -05:00
Eric Sandeen 3c91160808 btrfs: don't try to notify udev about missing devices
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:47:37 -05:00
Trond Myklebust 322b2b9032 Revert "NFS: add nfs_sb_deactive_async to avoid deadlock"
This reverts commit 324d003b0c.

The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-01 10:13:48 -05:00
Linus Torvalds bf6c8a8148 NFS client bugfixe for Linux 3.8
- Error reporting in nfs_xdev_mount incorrectly maps all errors to ENOMEM
 - Fix an NFSv4 refcounting issue
 - Fix a mount failure when the server reboots during NFSv4 trunking discovery
 - NFSv4.1 mounts may need to run the lease recovery thread.
 - Don't silently fail setattr() requests on mountpoints
 - Fix a SUNRPC socket/transport livelock and priority queue issue
 - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRCpS4AAoJEGcL54qWCgDyqucP/2CTv5leu+X5/0PXBOykAIHg
 8oEsTEz7/4IvIxTXzuHDYirMnm/mulfGF6NrdPqfvxpAHqRfVBfLFocfNLMVhQci
 97RmBfEEGM22AToYUubML5bIxr0QllV4s9Vmyh/zGDan52y7zNNlZX+v6aLjZbJB
 Fbolihpcch6lhQEUNAzK0B0ddimDl9lazx/WTmMOD/JrwOqzA4FJC+YxBe88nfzQ
 c6sYyEptBaSirbCOlueqGpv8skB1CLpFJXguXToPXFxpWed6uoGrIwLO7MLUdFpJ
 Xw+j8cuv/wjyYJGVKjhW7kXtwK8T7+u4bT2L883R01XYXr8XfkkLON0dgG1X/unk
 80mLzCO1+qRdoDSQ4b/V4B0nScPRCJuoZpftjCi2uhKewNcxQPMZ2V5/D7pO3uyE
 NdhxByB8D86JfNrIcBcRaxfuiQsurQBDvsDNmWPBZSOmH/dmHqTQGLcIe6N94A0B
 c7KFtXrN2MzOl8S68dUpbhftObq9X0oK2oxFFLWRQoqjrFtiDLU5JilV0bEH2CMo
 gJX7CRrPoJ2wmNToKdPJRWBYDmtMMIThq3vIpCj1FFzX17r5grJpqCmp/TViu/ER
 r8rmJzni+nmHaO1NLJMTCHzLJ8soiKyBq8PZKAbipTJxy+TXI/69jnWzpIQ/pbM/
 JN+0tiCcqmUmXsO+hrCP
 =lWFV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Error reporting in nfs_xdev_mount incorrectly maps all errors to
   ENOMEM

 - Fix an NFSv4 refcounting issue

 - Fix a mount failure when the server reboots during NFSv4 trunking
   discovery

 - NFSv4.1 mounts may need to run the lease recovery thread.

 - Don't silently fail setattr() requests on mountpoints

 - Fix a SUNRPC socket/transport livelock and priority queue issue

 - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.

* tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
  SUNRPC: When changing the queue priority, ensure that we change the owner
  NFS: Don't silently fail setattr() requests on mountpoints
  NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
  NFSv4: Fix NFSv4 trunking discovery
  NFSv4: Fix NFSv4 reference counting for trunked sessions
  NFS: Fix error reporting in nfs_xdev_mount
2013-02-01 08:43:52 +11:00
Feng Shuo 4582a4ab2a FUSE: Adapt readdirplus to application usage patterns
Use the same adaptive readdirplus mechanism as NFS:

http://permalink.gmane.org/gmane.linux.nfs/49299

If the user space implementation wants to disable readdirplus
temporarily, it could just return ENOTSUPP. Then kernel will
recall it with readdir.

Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-31 17:08:11 +01:00
Anatol Pomozov c2132c1bc7 Do not use RCU for current process credentials
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming
that 'task' is some other process but in fact this parameter always
equals to 'current'. Inline this parameter to make it more readable
and remove RCU lock as it is not needed when access current process
credentials.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-31 17:08:10 +01:00
Trond Myklebust c489ee290b NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
usually means that the server is busy handling an unfinished RPC
request. Just sleep for a second and then retry.
We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
value. If the NFS server has outstanding callbacks, we just want to
similarly sleep & retry.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-30 17:45:15 -05:00
Trond Myklebust ab22541782 NFS: Don't silently fail setattr() requests on mountpoints
Ensure that any setattr and getattr requests for junctions and/or
mountpoints are sent to the server. Ever since commit
0ec26fd069 (vfs: automount should ignore LOOKUP_FOLLOW), we have
silently dropped any setattr requests to a server-side mountpoint.
For referrals, we have silently dropped both getattr and setattr
requests.

This patch restores the original behaviour for setattr on mountpoints,
and tries to do the same for referrals, provided that we have a
filehandle...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-30 17:41:04 -05:00
Alex Elder 969e5aa3b0 Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing 2013-01-30 07:54:34 -06:00
Eric Sandeen e7b04ac00e jbd2: don't wake kjournald unnecessarily
Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
transaction has already been requested for commit.

commit deeeaf13 "jbd2: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
[tytso@mit.edu: reworked check to make it clearer]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-30 00:39:28 -05:00
Jan Kara 091e26dfc1 ext4: fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-29 22:48:17 -05:00
Linus Torvalds f96736e1ba xfs: bugfixes for 3.8-rc6
- fix return value when filesystem probe finds no XFS magic, a
   regression introduced in 9802182.
 
 - fix stack switch in __xfs_bmapi_allocate by moving the check for stack
   switch up into xfs_bmapi_write.
 
 - fix oops in _xfs_buf_find by validating that the requested block is
   within the filesystem bounds.
 
 - limit speculative preallocation near ENOSPC.
 
 - fix an unmount hang in xfs_wait_buftarg by freeing the
   xfs_buf_log_item in xfs_buf_item_unlock.
 
 - fix a possible use after free with AIO.
 
 - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
   regression introduced in fb59581404.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRBvmgAAoJENaLyazVq6ZOOacP/RilCPsi41NkJqRx1Rs5aRGE
 UvinrHfAL/tBupS2JVo1niIilBNJG/cI+lLcpV/P5omLBJfpEu0trzZUSxU7S1Vc
 a2M8J0qhmKfBcl70fuCALAxPY52895y+44gufxaH0O5HQDN6tB8n4MMqYGPmS8hz
 Ul/q3MO601hVyBHaoYa7BNGS3YG0TCdFGtWcC5tQaR3v7upTLR2ouZrGQ8CV0BBa
 Ek1xdxLh4D0fRybSL7lUw64W957iyldoLsEg+zQrE9NSfTE8DSqUG+NPWB0wjPce
 ICtmO6TbE5c6q1ScOL3YCC2cmYvjR9mlAHnPy73SqWSIsTqUsVzdibNo+tUJJZ5r
 RZf3u6Uri6uKC6Hl4XEtg4LVnnquKosTXfoiHmn+eh0dhYL7sZG0Ya5we5pH5Tmi
 P6B2DlfUA1fj4Ne4Asx2d7mwOJaZcLHDZoeCs/Haz2Z6kGVEm7ImyAb1h76uNOZo
 l0NFhXJGcOQLyjPtQjl81SGjQmntiIN0Poia3528zjxxGXlBNwAwalkOtdJnk5iN
 IaYRKtvIcrdjFvunasiKZIsV/O9w3/mguXlrSqBDgUsKPUc/cq5vLfsa70jYGc2j
 M6ldJRRqTvSjkVXc/7SXv4GLt/qbUWa92ESzhZQXEABIjZJUnOZpZuLmSVdRUyZk
 +SaGvbMphE0U/4ps3pO/
 =wViN
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 "Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
  filesystem, the stack switch in xfs_bmapi_allocate, a crash in
  _xfs_buf_find, speculative preallocation as the filesystem nears
  ENOSPC, an unmount hang, a race with AIO, and a regression with
  xfs_fsr:

   - fix return value when filesystem probe finds no XFS magic, a
     regression introduced in 9802182.

   - fix stack switch in __xfs_bmapi_allocate by moving the check for
     stack switch up into xfs_bmapi_write.

   - fix oops in _xfs_buf_find by validating that the requested block is
     within the filesystem bounds.

   - limit speculative preallocation near ENOSPC.

   - fix an unmount hang in xfs_wait_buftarg by freeing the
     xfs_buf_log_item in xfs_buf_item_unlock.

   - fix a possible use after free with AIO.

   - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
     regression introduced in commit fb59581404a."

* tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
  xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
  xfs: Fix possible use-after-free with AIO
  xfs: fix shutdown hang on invalid inode during create
  xfs: limit speculative prealloc near ENOSPC thresholds
  xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
  xfs: pull up stack_switch check into xfs_bmapi_write
  xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
2013-01-30 11:59:37 +11:00
majianpeng 885c91f746 nfsd: Fix memleak in svc_export_put
In func svc_export_parse, the uuid which used kmemdup to alloc will be
changed in func export_update.So the later kfree don't free this memory.
And it can't be free in func svc_export_parse because other place still
used.So put this operation in func svc_export_put.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-29 16:50:03 -05:00
Steven Whitehouse 4513899092 GFS2: Use ->writepages for ordered writes
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.

The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.

The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:29:17 +00:00
Steven Whitehouse d564053f07 GFS2: Clean up freeze code
The freeze code has not been looked at a lot recently. Upstream has
moved on, and this is an attempt to catch us back up again. There
is a vfs level interface for the freeze code which can be called
from our (obsolete, but kept for backward compatibility purposes)
sysfs freeze interface. This means freezing this way vs. doing it
from the ioctl should now work in identical fashion.

As a result of this, the freeze function is only called once
and we can drop our own special purpose code for counting the
number of freezes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:29:05 +00:00
Steven Whitehouse c76c4d96bd GFS2: Merge gfs2_attach_bufdata() into trans.c
The locking in gfs2_attach_bufdata() was type specific (data/meta)
which made the function rather confusing. This patch moves the core
of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()

As a result all of the locking related to adding data and metadata to
the journal is now in these two functions. This should help to clarify
what is going on, and give us some opportunities to simplify in
some cases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:44 +00:00
Steven Whitehouse 767f433f34 GFS2: Copy gfs2_trans_add_bh into new data/meta functions
This patch copies the body of gfs2_trans_add_bh into the two newly
added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
then move the .lo_add functions from lops.c into trans.c and call
them directly.

As a result of this, we no longer need to use the .lo_add functions
at all, so that is removed from the log operations structure.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:28 +00:00
Steven Whitehouse 350a9b0a72 GFS2: Split gfs2_trans_add_bh() into two
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.

Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:04 +00:00
Steven Whitehouse 75f2b879ae GFS2: Merge revoke adding functions
This moves the lo_add function for revokes into trans.c, removing
a function call and making the code easier to read.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:27:46 +00:00
Steven Whitehouse 2a00585593 GFS2: Separate LRU scanning from shrinker
This breaks out the LRU scanning function from the shrinker in
preparation for adding other callers to the LRU scanner.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:27:28 +00:00
Jiri Kosina 617677295b Merge branch 'master' into for-next
Conflicts:
	drivers/devfreq/exynos4_bus.c

Sync with Linus' tree to be able to apply patches that are
against newer code (mvneta).
2013-01-29 10:48:30 +01:00
Guo Chao b1deefc99e ext4: remove unnecessary NULL pointer check
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:41:02 -05:00
Guo Chao 41be871f74 ext4: remove useless assignment in dx_probe()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:33:28 -05:00
Guo Chao 2bbbee2a68 ext4: remove unused variable in add_dirent_to_buf()
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:26:44 -05:00
Guo Chao d5ac777305 ext4: release buffer when checksum failed
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:23:24 -05:00
Lukas Czerner b06acd38a4 ext4: remove explicit WARN_ON when ext4_map_blocks() fails
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.

Also use ext4_warning() instead of ext4_msg() and printk().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:21:12 -05:00
Lukas Czerner cfa7275482 ext4: remove unused variable flags
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-01-28 21:14:11 -05:00
Jan Kara fe386132f6 ext4: fix ext4_writepage() to achieve data=ordered guarantees
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:06:42 -05:00
Jan Kara 8a850c3fb8 ext4: Make ext4_bio_writepage() handle unprepared buffers
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:

a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
   pages that may have only some of dirty buffers mapped.

So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 20:53:28 -05:00
Torsten Kaiser 65e3aa77f1 xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and  truncate_pagecache_range() directly.

But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'

Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 16:05:10 -06:00
Torsten Kaiser 2729423cf2 xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and  truncate_pagecache_range() directly.

But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'

Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 13:50:10 -06:00
Jan Kara 4b05d09c18 xfs: Fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:51:22 -06:00
Dave Chinner 9f87832a82 xfs: fix shutdown hang on invalid inode during create
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:

[   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck

The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:

xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock

And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.

The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.

The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:51:12 -06:00
Dave Chinner f2a459565b xfs: limit speculative prealloc near ENOSPC thresholds
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.

Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:50:50 -06:00
Dave Chinner eb178619f9 xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.

In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported.  Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:49:21 -06:00
Brian Foster d26978dd86 xfs: pull up stack_switch check into xfs_bmapi_write
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:48:55 -06:00
Eric Sandeen 1bee12b8c4 xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:48:21 -06:00
Jan Kara b6a8e62f8b ext4: simplify mpage_add_bh_to_extent()
The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 13:06:48 -05:00
Jan Kara f8bec37037 ext4: dirty page has always buffers attached
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 12:55:08 -05:00
Jan Kara 002bd7fa3a ext4: simplify list handling in ext4_do_flush_completed_IO()
The function splices i_completed_io_list to its private list
first.  From that moment on we don't need any lock for working with
io_end structures because all io_end structure on the list are only
our own. So we can remove the other two lists in the function and free
io_end immediately after we are done with it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:49:15 -05:00
Jan Kara 84c17543ab ext4: move work from io_end to inode
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself.  This also allows for a small simplification in
processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:43:46 -05:00
Jan Kara fe089c77f1 ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()
We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:38:49 -05:00
Jan Kara 1ae48a6354 ext4: use redirty_page_for_writepage() in ext4_bio_write_page()
When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.

Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:32:54 -05:00
Jan Kara 36ade451a5 ext4: Always use ext4_bio_write_page() for writeout
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:30:52 -05:00
Zheng Liu 8bad6fc813 ext4: add punching hole support for non-extent-mapped files
This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:21:37 -05:00
David Teigland d4e0bfec9b GFS2: fix skip unlock condition
The recent commit fb6791d100
included the wrong logic.  The lvbptr check was incorrectly
added after the patch was tested.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-28 09:49:15 +00:00
Trond Myklebust 65436ec0c8 NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
We do need to start the lease recovery thread prior to waiting for the
client initialisation to complete in NFSv4.1.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:41 -05:00
Trond Myklebust 202c312dba NFSv4: Fix NFSv4 trunking discovery
If walking the list in nfs4[01]_walk_client_list fails, then the most
likely explanation is that the server dropped the clientid before we
actually managed to confirm it. As long as our nfs_client is the very
last one in the list to be tested, the caller can be assured that this
is the case when the final return value is NFS4ERR_STALE_CLIENTID.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org [>=3.7]
Tested-by: Ben Greear <greearb@candelatech.com>
2013-01-27 15:51:28 -05:00
Trond Myklebust 4ae19c2dd7 NFSv4: Fix NFSv4 reference counting for trunked sessions
The reference counting in nfs4_init_client assumes wongly that it
is safe for nfs4_discover_server_trunking() to return a pointer to a
nfs_client prior to bumping the reference count.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:15 -05:00
Trond Myklebust dee972b967 NFS: Fix error reporting in nfs_xdev_mount
Currently, nfs_xdev_mount converts all errors from clone_server() to
ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
Also ensure that if nfs_fs_mount_common() returns an error, we
don't dprintk(0)...

The regression originated in commit 3d176e3fe4
(NFS: Use nfs_fs_mount_common() for xdev mounts)

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-27 15:51:15 -05:00
Frederic Weisbecker 6fac4829ce cputime: Use accessors to read task cputime stats
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:31 +01:00
Eric W. Biederman b3c6761d9b userns: Allow the userns root to mount ramfs.
There is no backing store to ramfs and file creation
rules are the same as for any other filesystem so
it is semantically safe to allow unprivileged users
to mount it.

The memory control group successfully limits how much
memory ramfs can consume on any system that cares about
a user namespace root using ramfs to exhaust memory
the memory control group can be deployed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-26 22:22:38 -08:00
Eric W. Biederman ec2aa8e8dd userns: Allow the userns root to mount of devpts
- The context in which devpts is mounted has no effect on the creation
  of ptys as the /dev/ptmx interface has been used by unprivileged
  users for many years.

- Only support unprivileged mounts in combination with the newinstance
  option to ensure that mounting of /dev/pts in a user namespace will
  not allow the options of an existing mount of devpts to be modified.

- Create /dev/pts/ptmx as the root user in the user namespace that
  mounts devpts so that it's permissions to be changed.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-26 22:22:21 -08:00
Jan Kara ced55f38d6 xfs: Fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-26 09:43:58 -06:00
Dave Chinner 3b19034d4f xfs: fix shutdown hang on invalid inode during create
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:

[   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck

The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:

xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock

And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.

The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.

The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-26 09:34:38 -06:00
Greg Kroah-Hartman 422d26b6ec Merge 3.8-rc5 into driver-core-next
This resolves a gpio driver merge issue pointed out in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:06:30 -08:00
Greg Kroah-Hartman 9f9cba810f Merge 3.8-rc5 into tty-next
This resolves a number of tty driver merge issues found in linux-next

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 13:27:36 -08:00
Rafael J. Wysocki 0bb8f3d6ae sysfs: Functions for adding/removing symlinks to/from attribute groups
The most convenient way to expose ACPI power resources lists of a
device is to put symbolic links to sysfs directories representing
those resources into special attribute groups in the device's sysfs
directory.  For this purpose, it is necessary to be able to add
symbolic links to attribute groups.

For this reason, add sysfs helper functions for adding/removing
symbolic links to/from attribute groups, sysfs_add_link_to_group()
and sysfs_remove_link_from_group(), respectively.

This change set includes a build fix from David Rientjes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:51:13 +01:00
Linus Torvalds d7df025eb4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "It turns out that we had two crc bugs when running fsx-linux in a
  loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
  all down.  Miao also has a new OOM fix in this v2 pull as well.

  Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
  and resuming a running balance across drives.

  Josef's orphan truncate patch fixes an obscure corruption we'd see
  during xfstests.

  Arne's patches address problems with subvolume quotas.  If the user
  destroys quota groups incorrectly the FS will refuse to mount.

  The rest are smaller fixes and plugs for memory leaks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
  Btrfs: fix repeated delalloc work allocation
  Btrfs: fix wrong max device number for single profile
  Btrfs: fix missed transaction->aborted check
  Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
  Btrfs: put csums on the right ordered extent
  Btrfs: use right range to find checksum for compressed extents
  Btrfs: fix panic when recovering tree log
  Btrfs: do not allow logged extents to be merged or removed
  Btrfs: fix a regression in balance usage filter
  Btrfs: prevent qgroup destroy when there are still relations
  Btrfs: ignore orphan qgroup relations
  Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
  Btrfs: fix unlock order in btrfs_ioctl_rm_dev
  Btrfs: fix unlock order in btrfs_ioctl_resize
  Btrfs: fix "mutually exclusive op is running" error code
  Btrfs: bring back balance pause/resume logic
  btrfs: update timestamps on truncate()
  btrfs: fix btrfs_cont_expand() freeing IS_ERR em
  Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
  Btrfs: fix off-by-one in lseek
  ...
2013-01-25 10:55:21 -08:00
Chen Gang 03dafb5f59 ext4: fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24 23:24:58 -05:00
Theodore Ts'o 72ba74508b ext4: release sysfs kobject when failing to enable quotas on mount
In addition, print the error returned from ext4_enable_quotas()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-01-24 23:24:54 -05:00
Linus Torvalds 66e2d3e8c2 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Two small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
  cifs: fix srcip_matches() for ipv6
2013-01-24 19:15:43 -08:00
Miao Xie 1eafa6c737 Btrfs: fix repeated delalloc work allocation
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.

Fix this problem by splice this delalloc list.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:27 -05:00
Miao Xie c9f01bfe0c Btrfs: fix wrong max device number for single profile
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:26 -05:00
Miao Xie 2cba30f172 Btrfs: fix missed transaction->aborted check
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.

Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.

So we need more check for ->aborted when committing the transaction. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:25 -05:00
Miao Xie 8d25a086eb Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:23 -05:00
Josef Bacik e58dd74bcc Btrfs: put csums on the right ordered extent
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent.  This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent.  Without this we could end up with csums for
bytenrs that don't have extents to cover them yet.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:22 -05:00
Liu Bo 192000dda2 Btrfs: use right range to find checksum for compressed extents
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:17 -05:00
Josef Bacik b0175117b9 Btrfs: fix panic when recovering tree log
A user reported a BUG_ON(ret) that occured during tree log replay.  Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry.  We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON().  The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility.  Thanks,

Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:49 -05:00
Josef Bacik 201a903894 Btrfs: do not allow logged extents to be merged or removed
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:48 -05:00
Dave Chinner 4d559a3bcb xfs: limit speculative prealloc near ENOSPC thresholds
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.

Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-24 11:08:55 -06:00
Dave Chinner 10616b806d xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.

In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported.  Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-24 11:06:41 -06:00
Miklos Szeredi fb05f41f5f fuse: cleanup fuse_direct_io()
Fix the following sparse warnings:

fs/fuse/file.c:1216:43: warning: cast removes address space of expression
fs/fuse/file.c:1216:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1216:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1216:43:    got void *<noident>
fs/fuse/file.c:1241:43: warning: cast removes address space of expression
fs/fuse/file.c:1241:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1241:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1241:43:    got void *<noident>
fs/fuse/file.c:1267:43: warning: cast removes address space of expression
fs/fuse/file.c:1267:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1267:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1267:43:    got void *<noident>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:28 +01:00
Maxim Patlasov 5565a9d884 fuse: optimize __fuse_direct_io()
__fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The
patch calculates 'n' based on iov[] array. This is useful because allocating
FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request
would be waste of memory in case of iov-s of smaller size.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:28 +01:00
Maxim Patlasov 7c190c8b9c fuse: optimize fuse_get_user_pages()
Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as
possible. This is very beneficial in case of iov[] consisting of many
iov-s of relatively small sizes (e.g. PAGE_SIZE).

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov b98d023a24 fuse: pass iov[] to fuse_get_user_pages()
The patch makes preliminary work for the next patch optimizing scatter-gather
direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s
to each fuse request as possible. So, here we only rework all related
call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages().

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov 85f40aec88 fuse: use req->page_descs[] for argpages cases
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.

Having req->page_descs[] filled properly allows to re-work fuse_copy_pages()
to copy page fragments described by req->page_descs[]. This will be useful
for next patches optimizing direct_IO.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov b2430d7567 fuse: add per-page descriptor <offset, length> to fuse_req
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.

Per-request page_offset is removed because anybody who need it can use
req->page_descs[0].offset instead.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov 54b966702d fuse: rework fuse_do_ioctl()
fuse_do_ioctl() already calculates the number of pages it's going to use. It is
stored in 'num_pages' variable. So the patch simply uses it for allocating
fuse_req.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov d07f09f509 fuse: rework fuse_perform_write()
The patch allocates as many page pointers in fuse_req as needed to cover
interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced
to hide this cumbersome arithmetic.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov f8dbdf8182 fuse: rework fuse_readpages()
The patch uses 'nr_pages' argument of fuse_readpages() as heuristics for the
number of page pointers to allocate.

This can be improved further by taking in consideration fc->max_read and gaps
between page indices, but it's not clear whether it's worthy or not.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov 4d53dc99ba fuse: rework fuse_retrieve()
The patch reworks fuse_retrieve() to allocate only so many page pointers
as needed. The core part of the patch is the following calculation:

	num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;

(thanks Miklos for formula). All other changes are mostly shuffling lines.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov b111c8c0e3 fuse: categorize fuse_get_req()
The patch categorizes all fuse_get_req() invocations into two categories:
 - fuse_get_req_nopages(fc) - when caller doesn't care about req->pages
 - fuse_get_req(fc, n) - when caller need n page pointers (n > 0)

Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0)
scattered over code. Now it's clear from the first glance when a caller need
fuse_req with page pointers.

The patch doesn't make any logic changes. In multi-page case, it silly
allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended
by future patches.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Maxim Patlasov 4250c0668e fuse: general infrastructure for pages[] of variable size
The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from
fuse_req. Instead of that, req->pages may now point either to small inline
array or to an array allocated dynamically.

This essentially means that all callers of fuse_request_alloc[_nofs] should
pass the number of pages needed explicitly.

The patch doesn't make any logic changes.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Anand V. Avati 0b05b18381 fuse: implement NFS-like readdirplus support
This patch implements readdirplus support in FUSE, similar to NFS.
The payload returned in the readdirplus call contains
'fuse_entry_out' structure thereby providing all the necessary inputs
for 'faking' a lookup() operation on the spot.

If the dentry and inode already existed (for e.g. in a re-run of ls -l)
then just the inode attributes timeout and dentry timeout are refreshed.

With a simple client->network->server implementation of a FUSE based
filesystem, the following performance observations were made:

Test: Performing a filesystem crawl over 20,000 files with

sh# time ls -lR /mnt

Without readdirplus:
Run 1: 18.1s
Run 2: 16.0s
Run 3: 16.2s

With readdirplus:
Run 1: 4.1s
Run 2: 3.8s
Run 3: 3.8s

The performance improvement is significant as it avoided 20,000 upcalls
calls (lookup). Cache consistency is no worse than what already is.

Signed-off-by: Anand V. Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
J. Bruce Fields ff89be87c7 nfsd4: require version 4 when enabling or disabling minorversion
The current code will allow silly things like:

	echo "+2 +3 +4 +7.1">/proc/fs/nfsd/versions

Reported-by: Fan Chaoting <fanchaoting@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-23 18:25:01 -05:00
Stanislav Kinsbursky bca0ec6511 nfsd: fix unused "nn" variable warning in free_client()
If CONFIG_LOCKDEP is disabled, then there would be a warning like this:

  CC [M]  fs/nfsd/nfs4state.o
fs/nfsd/nfs4state.c: In function ‘free_client’:
fs/nfsd/nfs4state.c:1051:19: warning: unused variable ‘nn’ [-Wunused-variable]

So, let's add "maybe_unused" tag to this variable.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-23 18:17:40 -05:00
Yanchuan Nian 266533c6df nfsd: Don't unlock the state while it's not locked
In the procedure of CREATE_SESSION, the state is locked after
alloc_conn_from_crses(). If the allocation fails, the function
goes to "out_free_session", and then "out" where there is an
unlock function.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-23 18:17:37 -05:00
Yanchuan Nian 74b70dded3 nfsd: Pass correct slot number to nfsd4_put_drc_mem()
In alloc_session(), numslots is the correct slot number used by the session.
But the slot number passed to nfsd4_put_drc_mem() is the one from nfs client.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-23 18:17:36 -05:00
J. Bruce Fields 84822d0b3b nfsd4: simplify nfsd4_encode_fattr interface slightly
It seems slightly simpler to make nfsd4_encode_fattr rather than its
callers responsible for advancing the write pointer on success.

(Also: the count == 0 check in the verify case looks superfluous.
Running out of buffer space is really the only reason fattr encoding
should fail with eresource.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-01-23 18:17:35 -05:00
Cong Ding 10b8c7dff5 fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
When it goes to error through line 144, the memory allocated to *devname is
not freed, and the caller doesn't free it either in line 250. So we free the
memroy of *devname in function cifs_compose_mount_options() when it goes to
error.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
CC: stable <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-22 23:58:16 -06:00
Linus Torvalds 262060ea46 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This contain a bugfix for CUSE and miscellaneous small fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove unused variable in fuse_try_move_page()
  fuse: make fuse_file_fallocate() static
  fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig
  cuse: fix uninitialized variable warnings
  cuse: do not register multiple devices with identical names
  cuse: use mutex as registration lock instead of spinlocks
2013-01-22 11:53:19 -08:00
Linus Torvalds 05c2cf35c3 f2fs fixes for v3.8-rc5
o Support swap file and link generic_file_remap_pages
 o Enhance the bio streaming flow and free section control
 o Major bug fix on recovery routine
 o Minor bug/warning fixes and code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ/fpIAAoJEEAUqH6CSFDS6BAP/jz/JIPoSu4NunWgVH68tzA4
 xx4lsmJQJQG+1241uA9v4MMkcTzxE0QVNTOX3BpUXBhCMc00aPOPWlWjULridLI9
 0vRE6LpG+NntkyKF+M5T1ydGuQoDlEvqoGs6c5p3yaI6PxzbzmBmipsmqXUA8fyu
 280+OWJoAcALEMJiQ8JsHcDmvM9wdQ+BV/j3BNCm4dqBUA4dYPfDzRKUJYfwqiig
 qmVRseJJaekxrQ2lHG/K/WPAXa8aRcV6khP9tv/BPGRMt+I/fli/J4sWEFT6c73B
 +qYmhrf/RLNJ1O13dePlo3URwMu083PL8QN355GAKUJbMaX/UPjEnq0DLbBOkCS2
 KBQI5O1eiFauEE6YU7p7GuvnLeVkukcXSQNVnRrnWzTUA9CMThZZ2mAgb2lz+iEP
 oZWNirRwnxdcTQRXjPyTCtpPTCgJi4GO1WS5s6HeLP8G8Muo4PzDzcEwMX0Mw5ih
 s4n4wpQ+Zp3h53cc/DxdFAK15uM3XYtUfb92kJwqaEG5VmBy6KvliXDRnNXg7WMI
 imCb08c0Fr0M8ZHOd+UCveICcndFj25jkjx8w7PoE1KBbGJkKf+cjpZ3OhOAliux
 sGtH3EZLV6jx1MjV79OpDXQGoVsvktesCbRcaxc7BbqljXYVNiap8QbzjPnmAt6z
 KKN0GU32eolGgK4Zd/iF
 =vJQc
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 o Support swap file and link generic_file_remap_pages
 o Enhance the bio streaming flow and free section control
 o Major bug fix on recovery routine
 o Minor bug/warning fixes and code cleanups

* tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (22 commits)
  f2fs: use _safe() version of list_for_each
  f2fs: add comments of start_bidx_of_node
  f2fs: avoid issuing small bios due to several dirty node pages
  f2fs: support swapfile
  f2fs: add remap_pages as generic_file_remap_pages
  f2fs: add __init to functions in init_f2fs_fs
  f2fs: fix the debugfs entry creation path
  f2fs: add global mutex_lock to protect f2fs_stat_list
  f2fs: remove the blk_plug usage in f2fs_write_data_pages
  f2fs: avoid redundant time update for parent directory in f2fs_delete_entry
  f2fs: remove redundant call to set_blocksize in f2fs_fill_super
  f2fs: move f2fs_balance_fs to punch_hole
  f2fs: add f2fs_balance_fs in several interfaces
  f2fs: revisit the f2fs_gc flow
  f2fs: check return value during recovery
  f2fs: avoid null dereference in f2fs_acl_from_disk
  f2fs: initialize newly allocated dnode structure
  f2fs: update f2fs partition info about SIT/NAT layout
  f2fs: update f2fs document to reflect SIT/NAT layout correctly
  f2fs: remove unneeded INIT_LIST_HEAD at few places
  ...
2013-01-22 10:33:17 -08:00
Namjae Jeon 99600051b0 udf: add extent cache support in case of file reading
This patch implements extent caching in case of file reading.
While reading a file, currently, UDF reads metadata serially
which takes a lot of time depending on the number of extents present
in the file. Caching last accessd extent improves metadata read time.
Instead of reading file metadata from start, now we read from
the cached extent.

This patch considerably improves the time spent by CPU in kernel mode.
For example, while reading a 10.9 GB file using dd:
Time before applying patch:
11677022208 bytes (10.9GB) copied, 1529.748921 seconds, 7.3MB/s
real    25m 29.85s
user    0m 12.41s
sys     15m 34.75s

Time after applying patch:
11677022208 bytes (10.9GB) copied, 1469.338231 seconds, 7.6MB/s
real    24m 29.44s
user    0m 15.73s
sys     3m 27.61s

[JK: Fix bh refcounting issues, simplify initialization]

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Bonggil Bak <bgbak@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-22 10:48:31 +01:00
Dan Carpenter d8b79b2f94 f2fs: use _safe() version of list_for_each
This is calling list_del() inside a loop which is a problem when we try
move to the next item on the list.  I've converted it to use the _safe
version.  And also, as a cleanup, I've converted it to use
list_for_each_entry instead of list_for_each.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:49:00 +09:00
Jaegeuk Kim 9af45ef5ab f2fs: add comments of start_bidx_of_node
The caller of start_bidx_of_node() should give proper node offsets which
point only direct node blocks. Otherwise, it is a caller's bug.
This patch adds comments to make it clear.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:59 +09:00
Jaegeuk Kim a7fdffbd3e f2fs: avoid issuing small bios due to several dirty node pages
If some small bios of dirty node pages are supposed to be issued during the
sequential data writes, there-in well-produced consecutive data bios are able
to be split by the small node bios, resulting in performance degradation.
So, let's collect a number of dirty node pages until reaching a threshold.
And, by default, I set the threshold as 2MB, a segment size.

This improves sequential write performance on i5, 512GB SSD (830 w/ SATA2) as
follows.
Before: 231 MB/s -> After: 255 MB/s

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-01-22 10:48:59 +09:00
Jaegeuk Kim c01e54b770 f2fs: support swapfile
This patch adds f2fs_bmap operation to the data address space.
This enables f2fs to support swapfile.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:58 +09:00
Jaegeuk Kim 692bb55d1a f2fs: add remap_pages as generic_file_remap_pages
This was added for all the file systems before.

See the following commit.

commit id: 0b173bc4da

[PATCH] mm: kill vma flag VM_CAN_NONLINEAR

This patch moves actual ptes filling for non-linear file mappings
into special vma operation: ->remap_pages().

File system must implement this method to get non-linear mappings support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support."

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:58 +09:00
Namjae Jeon 6e6093a8f1 f2fs: add __init to functions in init_f2fs_fs
Add __init to functions in init_f2fs_fs for code consistency.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:38 +09:00
Ilya Dryomov a105bb88f4 Btrfs: fix a regression in balance usage filter
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input.  Bring it back.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:40:27 -05:00
Chris Mason 83bfccb5c0 Merge branch 'mutex-ops@next-for-chris' of git://github.com/idryomov/btrfs-unstable into linus 2013-01-21 20:39:06 -05:00
Chris Mason daf2c08911 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus 2013-01-21 20:26:55 -05:00
Arne Jansen 2cf6870396 Btrfs: prevent qgroup destroy when there are still relations
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Arne Jansen ff24858c65 Btrfs: ignore orphan qgroup relations
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Kees Cook b7e17a10ed fs/ufs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:06 -08:00
Kees Cook f987c90257 fs/nfsd: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook 03edef0fea fs/logfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Joern Engel <joern@logfs.org>
CC: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook cf98c5e568 fs/jffs2: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: David Woodhouse <dwmw2@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook adae07485e fs/hfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook 462f16a557 fs/efs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook 00f3616b25 fs/cifs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Steve French <sfrench@samba.org>
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook 38db331b57 fs/btrfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook d0e09c80f0 fs/bfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook 5c8e0226e7 fs/befs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook 8dbd5d6df3 fs/afs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook 6d7a19fa74 fs/affs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook acd4fd07d4 fs/adfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Acked-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook c904197533 fs/9p: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Jan Kara 9734c971aa udf: Write LVID to disk after opening / closing
So far we just marked the buffer as dirty and left writing on flusher thread
but especially on opening that opens possible race window where we could write
other modified fs structures to disk before we mark filesystem as open. So sync
LVID buffer to disk after opening and closing fs.

Reported-by: Steve Nickel <snickel58@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:58 +01:00
Wang Shilong c04e88e271 Ext3: return ENOMEM rather than EIO if sb_getblk fails
It will be better to use ENOMEM rather than EIO, because the only
reason that sb_getblk fails is that allocation fails.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong ab6a773dbc Ext2: return ENOMEM rather than EIO if sb_getblk fails
As the only reason that sb_getblks fails is that allocation fails.
It will be better to use ENOMEM rather than EIO.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong 1b7d76e9b1 Ext3: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return
NULL value,it will be better to use unlikely to check it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong 2b0542a4a0 Ext2: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return
NULL value. It will be better to use unlikely to optimize it.

Signed-off-by: Wang shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Wang Shilong 61f43e6880 Ext3: add necessary check in case IO error happens
As we know io error may happen when the function 'sb_getblk'
is called.Add necessary check for it

The patch also fix a coding style problem.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Wang Shilong 8d8759eb48 Ext2: free memory allocated and forget buffer head when io error happens
Add a necessary check when an io error happens.
If io error happens,free the memory allocated and forget
buffer head.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Jan Kara f56426ae4d ext3: Fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Guo Chao 306a74920b ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
This macro, initially introduced by ext2 in v0.99.15, does not
have any users from the beginning. It has been removed in later
ext2 version but still remains in the code of ext3, ext4, ocfs2.
Remove this macro there.

Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Nickolai Zeldovich e3e2775ced cifs: fix srcip_matches() for ipv6
srcip_matches() previously had code like this:

  srcip_matches(..., struct sockaddr *rhs) {
    /* ... */
    struct sockaddr_in6 *vaddr6 = (struct sockaddr_in6 *) &rhs;
    return ipv6_addr_equal(..., &vaddr6->sin6_addr);
  }

which interpreted the values on the stack after the 'rhs' pointer as an
ipv6 address.  The correct thing to do is to use 'rhs', not '&rhs'.

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-01-21 01:37:26 -06:00
Ilya Dryomov 25122d15e2 Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:22 +02:00
Ilya Dryomov 4ac20c70b0 Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Fix unlock order in btrfs_ioctl_rm_dev().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:21 +02:00
Ilya Dryomov 18f39c416d Btrfs: fix unlock order in btrfs_ioctl_resize
Fix unlock order in btrfs_ioctl_resize().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:20 +02:00
Ilya Dryomov 2c0c9da02a Btrfs: fix "mutually exclusive op is running" error code
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add.  Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading.  Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:18 +02:00
Ilya Dryomov ed0fb78fb6 Btrfs: bring back balance pause/resume logic
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge).  Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away.  Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:17 +02:00
Joe Millenbach 4f73bc4dd3 tty: Added a CONFIG_TTY option to allow removal of TTY
The option allows you to remove TTY and compile without errors. This
saves space on systems that won't support TTY interfaces anyway.
bloat-o-meter output is below.

The bulk of this patch consists of Kconfig changes adding "depends on
TTY" to various serial devices and similar drivers that require the TTY
layer.  Ideally, these dependencies would occur on a common intermediate
symbol such as SERIO, but most drivers "select SERIO" rather than
"depends on SERIO", and "select" does not respect dependencies.

bloat-o-meter output comparing our previous minimal to new minimal by
removing TTY.  The list is filtered to not show removed entries with awk
'$3 != "-"' as the list was very long.

add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)
function                                     old     new   delta
chr_dev_init                                 166     170      +4
allow_signal                                  80      82      +2
static.__warned                              143     142      -1
disallow_signal                               63      62      -1
__set_special_pids                            95      94      -1
unregister_console                           126     121      -5
start_kernel                                 546     541      -5
register_console                             593     588      -5
copy_from_user                                45      40      -5
sys_setsid                                   128     120      -8
sys_vhangup                                   32      19     -13
do_exit                                     1543    1526     -17
bitmap_zero                                   60      40     -20
arch_local_irq_save                          137     117     -20
release_task                                 674     652     -22
static.spin_unlock_irqrestore                308     260     -48

Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Reviewed-by: Jamey Sharp <jamey@minilop.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-18 16:15:27 -08:00
Ben Myers 003fd6c8be xfs: fix fs/xfs/xfs_log.c:1740:39: error: 'B_TRUE' undeclared
Commit 667a9291c5 "xfs: Remove boolean_t typedef completely." didn't.

Remove a stray B_TRUE that breaks CONFIG_XFS_DEBUG=y.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
2013-01-18 15:11:57 -06:00
Greg Kroah-Hartman ed408f7c0f Merge 3.9-rc4 into driver-core-next
This is to fix up a build problem with a wireless driver due to the
dynamic-debug patches in this branch.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 19:48:18 -08:00
Brian Foster 9e96fe6df4 xfs: pull up stack_switch check into xfs_bmapi_write
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-17 17:53:37 -06:00
Thiago Farina 667a9291c5 xfs: Remove boolean_t typedef completely.
Since we are using C99 we have one builtin defined in include/linux/types.h,
use that instead.

v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm

Signed-off-by: Thiago Farina <tfarina@chromium.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-17 17:32:57 -06:00
Alex Elder e8afad656c libceph: pass length to ceph_calc_file_object_mapping()
ceph_calc_file_object_mapping() takes (among other things) a "file"
offset and length, and based on the layout, determines the object
number ("bno") backing the affected portion of the file's data and
the offset into that object where the desired range begins.  It also
computes the size that should be used for the request--either the
amount requested or something less if that would exceed the end of
the object.

This patch changes the input length parameter in this function so it
is used only for input.  That is, the argument will be passed by
value rather than by address, so the value provided won't get
updated by the function.

The value would only get updated if the length would surpass the
current object, and in that case the value it got updated to would
be exactly that returned in *oxlen.

Only one of the two callers is affected by this change.  Update
ceph_calc_raw_layout() so it records any updated value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:04 -06:00
Greg Kroah-Hartman 595e0eb067 Revert "sysfs: Convert print_symbol to %pSR"
This reverts commit 6ad58fa82d as %pSR
isn't in the tree yet.

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 13:09:57 -08:00
Sasha Levin 1884bd4b14 debugfs: remove redundant initialization of dentry
We already initialize it to NULL when declaring it, no need to do
that twice.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 13:02:08 -08:00
Joe Perches 6ad58fa82d sysfs: Convert print_symbol to %pSR
Use the new vsprintf extension to avoid any possible
message interleaving.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 12:57:07 -08:00
Bin Wang 6b8fbde418 sysfs: Fixed a trailing white space error
This patch removes the trailing white space in fs/sysfs/mount.c.

Signed-off-by: Bin Wang <wbin00@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 12:54:47 -08:00
Yan, Zheng 390306c38d ceph: check mds_wanted for imported cap
The MDS may have incorrect wanted caps after importing caps. So the
client should check the value mds has and send cap update if necessary.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng 66f58691c5 ceph: allocate cap_release message when receiving cap import
When client wants to release an imported cap, it's possible there
is no reserved cap_release message in corresponding mds session.
so __queue_cap_release causes kernel panic.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng 395c312b9c ceph: allow revoking duplicated caps issued by non-auth MDS
Allow revoking duplicated caps issued by non-auth MDS if these caps
are also issued by auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:38 -06:00
Yan, Zheng 8a92a119b2 ceph: move dirty inode to migrating list when clearing auth caps
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:37 -06:00
Sam Lang 6e8575faa8 ceph: Check for created flag in response from mds
The mds now sends back a created inode if the create request
performed the create.  If the file already existed, no inode is
returned in the reply.  This allows ceph to set the created flag
in atomic_open so that permissions are properly checked in the case
that the file wasn't created by the create call to the mds.

To ensure compability with previous kernels, a feature for sending
back the inode in the create reply was added, so that the mds will
only send back the inode if the client indicates it supports the
feature.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:36 -06:00
Sam Lang 79aec9844d ceph: Check for err on mds request in atomic_open
The error returned by ceph_mdsc_do_request includes errors sending the
request, errors on timeout, or any errors coming from the mds.  If
ceph_mdsc_do_request returns an error, the reply struct will most likely
be bogus.  We need to bail out and propogate the error instead of
overwriting it.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-01-17 12:42:36 -06:00
Wei Yongjun 8f706111a8 fuse: remove unused variable in fuse_try_move_page()
The variables mapping,index are initialized but never used
otherwise, so remove the unused variables.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:09:59 +01:00
Miklos Szeredi cdadb11cef fuse: make fuse_file_fallocate() static
Fix the following sparse warning:

fs/fuse/file.c:2249:6: warning: symbol 'fuse_file_fallocate' was not declared. Should it be static?

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:09:47 +01:00
Robert P. J. Day 807185eb3e fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig
Given that CUSE depends on FUSE, it only makes sense to move its
Kconfig entry into the FUSE Kconfig file.  Also, add a few grammatical
and semantic touchups.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:08:45 +01:00
Miklos Szeredi e2560362cc cuse: fix uninitialized variable warnings
Fix the following compiler warnings:

fs/fuse/cuse.c: In function 'cuse_process_init_reply':
fs/fuse/cuse.c:288:24: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:14: note: 'val' was declared here
fs/fuse/cuse.c:284:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:8: note: 'key' was declared here

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:05:52 +01:00
David Herrmann 30783587b0 cuse: do not register multiple devices with identical names
Sysfs doesn't allow two devices with the same name, but we register a
sysfs entry for each cuse device without checking for name collisions.
This extends the registration to first check whether the name was already
registered.

To avoid race-conditions between the name-check and linking the device, we
need to protect the whole registration with a mutex.

Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:04:57 +01:00
David Herrmann 8ce03fd76d cuse: use mutex as registration lock instead of spinlocks
We need to check for name-collisions during cuse-device registration. To
avoid race-conditions, this needs to be protected during the whole device
registration. Therefore, replace the spinlocks by mutexes first so we can
safely extend the locked regions to include more expensive or sleeping
code paths.

Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:04:51 +01:00
Philippe De Muyter 4de80273a2 fs/jfs: Fix typo in comment : 'how may' -> 'how many'
Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-17 10:18:02 +01:00
Zheng Liu aaddea812c ext4: add tracepoint in punching hole
This patch adds a tracepoint in ext4_punch_hole.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-16 20:21:26 -05:00
Linus Torvalds dfdebc2483 xfs: bugfixes for 3.8-rc4
- fix(es) for compound buffers
 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit
 - fix for regression in dir v2 code introduced in commit 20f7e9f3
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ9zKnAAoJENaLyazVq6ZORGcP/RemqCHJEw0a89Y0tLLLAcz/
 Es97kJMESdvi3gX3JTdz3vC8LP21dSCR3k3MvVgucb8RsvGoiLixrmluIRxKb79M
 DEmz9YJ/qxFIpnM9y46VxCYV+/ezxUDEv68wA6T2wJbof26nTLlTj2gAgqjvyWiF
 R1c1OmdCsTfA257UvxfxSVixVnWv7E2io2ZXUGsrBkP4J9OMaMtn00UYOuP1YL8S
 NJ44z9QAzTqVEbAfGeaeV/QVUJzMj/IqWCwF7YKEhfmccO/tPyN0+nMG2DI0Fp5e
 cYGsi4JnaFbqE6Aa/7mu3kv8lYnPe0n3t9d3EwzxOEx+PAvuY8N0EW8Qa4c+805n
 zXFvAroLgP0jYEEuIfEGYIwDPxG0xjor6ztu8e2twcIj6cDHzSpeYaDPnYvWJlwu
 FiupnVu+3FX6mVY1jCealI47nOwzM12R7nXysqF3F6Sf95xGJtG3BoTIKioNqk1g
 dzJGMQvwg/WLvquYb9W/ZNb1T314R23wdYtmI7gWJ74z9IQqWCZBWFYyBhQ8y1Pr
 Vf3LFjzqNqqnYNzoe8Wnn9wKQ57Es7onAo34Y9HZCOkslZsn5nKriNTXNN6Q9Upc
 5RKvj1CbTpKAJYrrhWryI1HtlDKqqtMFdmRQulSu+O9ZJuWZh4XNTu4t3oewt0Ac
 5otZwOdk53V3tGxt3prs
 =gA4q
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:

 - fix(es) for compound buffers

 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit

 - fix for regression in dir v2 code introduced in commit 20f7e9f372
   ("xfs: factor dir2 block read operations")

* tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: recalculate leaf entry pointer after compacting a dir2 block
  xfs: remove int casts from debug dquot soft limit timer asserts
  xfs: fix the multi-segment log buffer format
  xfs: fix segment in xfs_buf_item_format_segment
  xfs: rename bli_format to avoid confusion with bli_formats
  xfs: use b_maps[] for discontiguous buffers
2013-01-16 16:19:54 -08:00
Eric Sandeen aeb4f20a02 xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 17:33:53 -06:00
Eric Sandeen 37f13561de xfs: recalculate leaf entry pointer after compacting a dir2 block
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:

XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828

Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.

                        memmove(&blp[mid + 1], &blp[mid],
                                (highstale - mid) * sizeof(*blp));

overwrote it.

What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array.  So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().

Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.

Thanks also to dchinner for looking at this one with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:55 -06:00
Brian Foster ab7eac2200 xfs: remove int casts from debug dquot soft limit timer asserts
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:40 -06:00
Mark Tinguely 91e4bac0b7 xfs: fix the multi-segment log buffer format
Per Dave Chinner suggestion, this patch:
 1) Corrects the detection of whether a multi-segment buffer is
    still tracking data.
 2) Clears all the buffer log formats for a multi-segment buffer.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:08 -06:00
Mark Tinguely 2d0e9df579 xfs: fix segment in xfs_buf_item_format_segment
Not every segment in a multi-segment buffer is dirty in a
transaction and they will not be outputted. The assert in
xfs_buf_item_format_segment() that checks for the at least
one chunk of data in the segment to be used is not necessary
true for multi-segmented buffers.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:56 -06:00
Mark Tinguely 0f22f9d0cd xfs: rename bli_format to avoid confusion with bli_formats
Rename the bli_format structure to __bli_format to avoid
accidently confusing them with the bli_formats pointer.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:37 -06:00
Mark Tinguely d44d9bc68e xfs: use b_maps[] for discontiguous buffers
Commits starting at 77c1a08 introduced a multiple segment support
to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment
buffer in the transaction because it was looking at the single segment
block number rather than the multi-segment b_maps[0].bm.bn. This
results on a recursive buffer lock that can never be satisfied.

This patch:
 1) Changed the remaining b_map accesses to be b_maps[0] accesses.
 2) Renames the single segment b_map structure to __b_map to avoid
    future confusion.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:11 -06:00
Linus Torvalds 31db720643 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 and udf fixes from Jan Kara:
 "One ext3 performance regression fix and one udf regression fix (oops
  on interrupted mount)."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  UDF: Fix a null pointer dereference in udf_sb_free_partitions
  jbd: don't wake kjournald unnecessarily
2013-01-16 10:55:10 -08:00
Kees Cook 1e817fb62c time: create __getnstimeofday for WARNless calls
The pstore RAM backend can get called during resume, and must be defensive
against a suspended time source. Expose getnstimeofday logic that returns
an error instead of a WARN. This can be detected and the timestamp can
be zeroed out.

Reported-by: Doug Anderson <dianders@chromium.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-01-15 18:16:02 -08:00
Akinobu Mita 3d251a5b9e UBIFS: rename random32() to prandom_u32()
Use more preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2013-01-15 15:45:27 +02:00
Namjae Jeon 4589d25d01 f2fs: fix the debugfs entry creation path
As the "status" debugfs entry will be maintained for entire F2FS filesystem
irrespective of the number of partitions.
So, we can move the initialization to the init part of the f2fs and destroy will
be done from exit part. After making changes, for individual partition mount -
entry creation code will not be executed.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:19:15 +09:00
majianpeng 66af62ce75 f2fs: add global mutex_lock to protect f2fs_stat_list
There is an race condition between umounting f2fs and reading f2fs/status, which
results in oops.

Fox example:
Thread A			Thread B
umount f2fs 			cat f2fs/status

f2fs_destroy_stats() {		stat_show() {
				 list_for_each_entry_safe(&f2fs_stat_list)
 list_del(&si->stat_list);
 mutex_lock(&si->stat_lock);
 si->sbi = NULL;
 mutex_unlock(&si->stat_lock);
 kfree(sbi->stat_info);
} 				 mutex_lock(&si->stat_lock) <- si is gone.
				 ...
				}

Solution with a global lock: f2fs_stat_mutex:
Thread A			Thread B
umount f2fs 			cat f2fs/status

f2fs_destroy_stats() {		stat_show() {
 mutex_lock(&f2fs_stat_mutex);
 list_del(&si->stat_list);
 mutex_unlock(&f2fs_stat_mutex);
 kfree(sbi->stat_info);		 mutex_lock(&f2fs_stat_mutex);
}				 list_for_each_entry_safe(&f2fs_stat_list)
				 ...
				 mutex_unlock(&f2fs_stat_mutex);
				}

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
[jaegeuk.kim@samsung.com: fix typos, description, and remove the existing lock]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:18:29 +09:00
Namjae Jeon fa9150a84c f2fs: remove the blk_plug usage in f2fs_write_data_pages
Let's consider the usage of blk_plug in f2fs_write_data_pages().
We can come up with the two issues: lock contention and task awareness.

1. Merging bios prior to grabing "queue lock"
 The f2fs merges consecutive IOs in the file system level before
 submitting any bios, which is similar with the back merge by the
 plugging mechanism in attempt_plug_merge(). Both of them need to acquire
 no queue lock.

2. Merging policy with respect to tasks
 The f2fs merges IOs as much as possible regardless of tasks, while
 blk-plugging is conducted on a basis of tasks. As we can understand
 there are trade-offs, f2fs tries to maximize the write performance with
 well-merged bios.

As a result, if f2fs produces many consecutive but separated bios in
writepages(), it would be good to use blk-plugging since f2fs would be
able to avoid queue lock contention in the block layer by merging them.
But, f2fs merges IOs and submit one bio, which means that there are not
much chances to merge bios by attempt_plug_merge().

However, f2fs has already been used blk_plug by triggering generic_writepages()
in f2fs_write_data_pages().
So to make the overall code consistency, I'd like to remove blk_plug there.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:18:16 +09:00
Namjae Jeon 1b1baff6e5 UDF: Fix a null pointer dereference in udf_sb_free_partitions
This patch fixes a regression caused by commit bff943af6f "udf: Fix memory
leak when mounting" due to which it was triggering a kernel null point
dereference in case of interrupted mount OR when allocating memory to
sbi->s_partmaps failed in function udf_sb_alloc_partition_maps.

Reported-and-tested-by: James Hogan <james@albanarts.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-14 22:53:47 +01:00
Eric Sandeen 7e2fb2d7e6 jbd: don't wake kjournald unnecessarily
Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
commit has already been requested for commit.

commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-14 22:50:45 +01:00
Linus Torvalds 6d283dba37 vfs: add missing virtual cache flush after editing partial pages
Andrew Morton pointed this out a month ago, and then I completely forgot
about it.

If we read a partial last page of a block device, we will zero out the
end of the page, but since that page can then be mapped into user space,
we should also make sure to flush the cache on architectures that have
virtual caches.  We have the flush_dcache_page() function for this, so
use it.

Now, in practice this really never matters, because nobody sane uses
virtual caches to begin with, and they largely exist on old broken RISC
arhitectures.

And even if you did run on one of those obsolete CPU's, the whole "mmap
and access the last partial page of a block device" behavior probably
doesn't actually exist.  The normal IO functions (read/write) will never
see the zeroed-out part of the page that migth not be coherent in the
cache, because they honor the size of the device.

So I'm marking this for stable (3.7 only), but I'm not sure anybody will
ever care.

Pointed-out-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org  # 3.7
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-14 13:17:50 -08:00
Eric Sandeen 3972f2603d btrfs: update timestamps on truncate()
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:37 -05:00
Zach Brown f276795627 btrfs: fix btrfs_cont_expand() freeing IS_ERR em
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.

An instance of -EEXIST was reported in the wild:

  https://bugzilla.redhat.com/show_bug.cgi?id=874407

I have no idea if that -EEXIST is surprising, or not.  Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).

This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:23 -05:00
Liu Bo f9e4fb5393 Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
xfstests case 285 complains.

It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo 1214b53f90 Btrfs: fix off-by-one in lseek
Lock end is inclusive.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo 3268a2468e Btrfs: reset path lock state to zero
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.

Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:53 -05:00
Liu Bo ac5c93005b Btrfs: let allocation start from the right raid type
This'd avoid us empty looping.

Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik f3fe820c20 Btrfs: add orphan before truncating pagecache
Running xfstests 83 in a loop would sometimes fail the fsck.  This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up.  The problem with this is there is plenty of time
for the truncate to fail after we've done this work.  So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe.  This
fixes the btrfsck failures I was seeing while running 83 in a loop.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik 72bcd99d45 Btrfs: set flushing if we're limited flushing
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie 9754767657 Btrfs: fix missing write access release in btrfs_ioctl_resize()
We forget to give up the write access after we find some device operation
is going on. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie dba60f3f5d Btrfs: fix resize a readonly device
We should not resize a readonly device, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:49 -05:00
Miao Xie 5c39da5b6c Btrfs: do not delete a subvolume which is in a R/O subvolume
Step to reproduce:
 # mkfs.btrfs <disk>
 # mount <disk> <mnt>
 # btrfs sub create <mnt>/subv0
 # btrfs sub snap <mnt> <mnt>/subv0/snap0
 # change <mnt>/subv0 from R/W to R/O
 # btrfs sub del <mnt>/subv0/snap0

We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:32 -05:00
Miao Xie d86e56cf7d Btrfs: disable qgroup id 0
Qgroup id 0 is a special number, we should set the id of a qgroup to 0.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:31 -05:00
Lukas Czerner cc975eb460 btrfs: get the device in write mode when deleting it
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-01-14 13:52:31 -05:00
Tsutomu Itoh cfa7a9ccda Btrfs: fix memory leak in name_cache_insert()
We should free name_cache_entry before returning from the
error handling code.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2013-01-14 13:52:30 -05:00
Linus Torvalds 3441f0d26d Driver core fixes for 3.8-rc3
Here are two patches for 3.8-rc3.
 
 One removes the __dev* defines from init.h now that all usages of it are gone
 from your tree.  The other fix is for debugfs's paramater that was using the
 wrong base for the option.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDzjcAACgkQMUfUDdst+ykJVwCcDqiKrO9p0dcH9WXN5aukBWX/
 N8EAoK786v7PjtiVyNOJ/cPUDU8OHUpg
 =U4nL
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg Kroah-Hartman:
 "Here are two patches for 3.8-rc3.

  One removes the __dev* defines from init.h now that all usages of it
  are gone from your tree.  The other fix is for debugfs's paramater
  that was using the wrong base for the option.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  debugfs: convert gid= argument from decimal, not octal
  Remove __dev* markings from init.h
2013-01-14 09:07:11 -08:00
Tejun Heo 9fb0a7da0c writeback: add more tracepoints
Add tracepoints for page dirtying, writeback_single_inode start, inode
dirtying and writeback.  For the latter two inode events, a pair of
events are defined to denote start and end of the operations (the
starting one has _start suffix and the one w/o suffix happens after
the operation is complete).  These inode ops are FS specific and can
be non-trivial and having enclosing tracepoints is useful for external
tracers.

This is part of tracepoint additions to improve visiblity into
dirtying / writeback operations for io tracer and userland.

v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
    before dereferencing.

v3: buffer dirtying moved to a block TP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Tejun Heo 5305cb8308 block: add block_{touch|dirty}_buffer tracepoint
The former is triggered from touch_buffer() and the latter
mark_buffer_dirty().

This is part of tracepoint additions to improve visiblity into
dirtying / writeback operations for io tracer and userland.

v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
    it share TP definition with block_touch_buffer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Tejun Heo f0059afd3e buffer: make touch_buffer() an exported function
We want to add a trace point to touch_buffer() but macros and inline
functions defined in header files can't have tracing points.  Move
touch_buffer() to fs/buffer.c and make it a proper function.

The new exported function is also declared inline.  As most uses of
touch_buffer() are inside buffer.c with nilfs2 as the only other user,
the effect of this change should be negligible.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Tejun Heo 3a366e614d block: add missing block_bio_complete() tracepoint
bio completion didn't kick block_bio_complete TP.  Only dm was
explicitly triggering the TP on IO completion.  This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.

This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.

* Explicit trace_block_bio_complete() invocation removed from dm and
  the trace point is unexported.

* @rq dropped from trace_block_bio_complete().  bios may fly around
  w/o queue associated.  Verifying and accessing the assocaited queue
  belongs to TP probes.

* blktrace now gets both request and bio completions.  Make it ignore
  bio completions if request completion path is happening.

This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.

v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev.  Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-14 15:00:36 +01:00
Namjae Jeon 163799872b f2fs: avoid redundant time update for parent directory in f2fs_delete_entry
In call to f2fs_delete_entry, 'dir' time modification code is put
at two places.
So, remove the redundant code for timing update.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-14 09:43:27 +09:00
Namjae Jeon ff9234ad4e f2fs: remove redundant call to set_blocksize in f2fs_fill_super
Since, f2fs supports only 4KB blocksize, which is set at the beginning in
f2fs_fill_super. So, we do not need to again check this blocksize setting
in such case.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-14 09:41:30 +09:00
Abhijit Pawar a17164e54b fs/xfs remove obsolete simple_strto<foo>
This patch replaces usages of obsolete simple_strtoul with kstrtoint in
xfs_args and suffix_strtoul.

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-13 14:42:07 -06:00
Eric Sandeen d4608632ec xfs: recalculate leaf entry pointer after compacting a dir2 block
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:

XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828

Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.

                        memmove(&blp[mid + 1], &blp[mid],
                                (highstale - mid) * sizeof(*blp));

overwrote it.

What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array.  So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().

Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.

Thanks also to dchinner for looking at this one with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-13 14:36:17 -06:00
Theodore Ts'o 7f5118629f ext4: trigger the lazy inode table initialization after resize
After we have finished extending the file system, we need to trigger a
the lazy inode table thread to zero out the inode tables.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-13 08:41:45 -05:00
Eryu Guan 15b49132fc ext4: check bh in ext4_read_block_bitmap()
Validate the bh pointer before using it, since
ext4_read_block_bitmap_nowait() might return NULL.

I've seen this in fsfuzz testing.

 EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0
 ...
 Call Trace:
  [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60
  [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80
  [<ffffffff811d0d36>] ? __getblk+0x36/0x70
  [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210
  [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140
  [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0
  [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100
  [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0
  [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230
  [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490
  [<ffffffff811b8cd7>] evict+0xa7/0x1c0
  [<ffffffff811b8ed3>] iput_final+0xe3/0x170
  [<ffffffff811b8f9e>] iput+0x3e/0x50
  [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90
  [<ffffffff81231d0b>] ext4_create+0xeb/0x170
  [<ffffffff811aae9c>] vfs_create+0xac/0xd0
  [<ffffffff811ac845>] lookup_open+0x185/0x1c0
  [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170
  [<ffffffff811acb54>] do_last+0x2d4/0x7a0
  [<ffffffff811af743>] path_openat+0xb3/0x480
  [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0
  [<ffffffff811afc49>] do_filp_open+0x49/0xa0
  [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150
  [<ffffffff8119da28>] do_sys_open+0x108/0x1f0
  [<ffffffff8119db51>] sys_open+0x21/0x30
  [<ffffffff81618959>] system_call_fastpath+0x16/0x1b

Also fix comment for ext4_read_block_bitmap_nowait()

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:33:25 -05:00
Wang Shilong aebf02430d ext4: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12 16:28:47 -05:00
Theodore Ts'o 860d21e2c5 ext4: return ENOMEM if sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:19:36 -05:00
Miao Xie 10ee27a06c vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
with down_read_trylock() because

- If ->s_umount is write locked, then the sb is not idle. That is
  writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.

- writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
  writeback, it may bring us deadlock problem when doing umount. In order to
  fix the problem, ext4 and btrfs implemented their own writeback functions
  instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant
  code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().

The name of these two functions is cumbersome, so rename them to
try_to_writeback_inodes_sb(_nr).

This idea came from Christoph Hellwig.
Some code is from the patch of Kamal Mostafa.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-12 10:47:43 +08:00
Xi Wang 6d92d4f6a7 fs/exec.c: work around icc miscompilation
The tricky problem is this check:

	if (i++ >= max)

icc (mis)optimizes this check as:

	if (++i > max)

The check now becomes a no-op since max is MAX_ARG_STRINGS (0x7FFFFFFF).

This is "allowed" by the C standard, assuming i++ never overflows,
because signed integer overflow is undefined behavior.  This
optimization effectively reverts the previous commit 362e6663ef
("exec.c, compat.c: fix count(), compat_count() bounds checking") that
tries to fix the check.

This patch simply moves ++ after the check.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Kees Cook d9777b8de4 fs/xfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ben Myers <bpm@sgi.com>
2013-01-11 11:39:04 -08:00
Kees Cook f11cb2271f fs/nilfs2: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2013-01-11 11:39:04 -08:00
Kees Cook 336d6d0323 fs/ecryptfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Tyler Hicks <tyhicks@canonical.com>
CC: Dustin Kirkland <dustin.kirkland@gazzang.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
2013-01-11 11:39:04 -08:00
Kees Cook 1b6a78a522 fs/ceph: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Sage Weil <sage@inktank.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Sage Weil <sage@inktank.com>
2013-01-11 11:39:04 -08:00
Seiji Aguchi 9f244e9cfd pstore: Avoid deadlock in panic and emergency-restart path
[Issue]

When pstore is in panic and emergency-restart paths, it may be blocked
in those paths because it simply takes spin_lock.

This is an example scenario which pstore may hang up in a panic path:

 - cpuA grabs psinfo->buf_lock
 - cpuB panics and calls smp_send_stop
 - smp_send_stop sends IRQ to cpuA
 - after 1 second, cpuB gives up on cpuA and sends an NMI instead
 - cpuA is now in an NMI handler while still holding buf_lock
 - cpuB is deadlocked

This case may happen if a firmware has a bug and
cpuA is stuck talking with it more than one second.

Also, this is a similar scenario in an emergency-restart path:

 - cpuA grabs psinfo->buf_lock and stucks in a firmware
 - cpuB kicks emergency-restart via either sysrq-b or hangcheck timer.
   And then, cpuB is deadlocked by taking psinfo->buf_lock again.

[Solution]

This patch avoids the deadlocking issues in both panic and emergency_restart
paths by introducing a function, is_non_blocking_path(), to check if a cpu
can be blocked in current path.

With this patch, pstore is not blocked even if another cpu has
taken a spin_lock, in those paths by changing from spin_lock_irqsave
to spin_trylock_irqsave.

In addition, according to a comment of emergency_restart() in kernel/sys.c,
spin_lock shouldn't be taken in an emergency_restart path to avoid
deadlock. This patch fits the comment below.

<snip>
/**
 *      emergency_restart - reboot the system
 *
 *      Without shutting down any hardware or taking any locks
 *      reboot the system.  This is called when we know we are in
 *      trouble so this is our best effort to reboot.  This is
 *      safe to call in interrupt context.
 */
void emergency_restart(void)
<snip>

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-01-11 10:20:50 -08:00
Dave Reisner f1688e0431 debugfs: convert gid= argument from decimal, not octal
This patch technically breaks userspace, but I suspect that anyone who
actually used this flag would have encountered this brokenness, declared
it lunacy, and already sent a patch.

Signed-off-by: Dave Reisner <dreisner@archlinux.org>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11 05:56:01 -08:00
Jaegeuk Kim 9eaeba7013 f2fs: move f2fs_balance_fs to punch_hole
The f2fs_fallocate() has two operations: punch_hole and expand_size.

Only in the case of punch_hole, dirty node pages can be produced, so let's
trigger f2fs_balance_fs() in this case only.
Furthermore, let's trigger it at every data truncation routine.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-11 15:09:23 +09:00
Jaegeuk Kim 7d82db8316 f2fs: add f2fs_balance_fs in several interfaces
The f2fs_balance_fs() is to check the number of free sections and decide whether
it needs to conduct cleaning or not. If there are not enough free sections, the
cleaning job should be started.

In order to control an amount of free sections even under high utilization, f2fs
should call f2fs_balance_fs at all the VFS interfaces that are able to produce
dirty pages.
This patch adds the function calls in the missing interfaces as follows.

1. f2fs_setxattr()
The f2fs_setxattr() produces dirty node pages so that we should call
f2fs_balance_fs() either likewise doing in other VFS interfaces such as
f2fs_lookup(), f2fs_mkdir(), and so on.

2. f2fs_sync_file()
We should guarantee serving free sections for syncing metadata during fsync.
Previously, there is no space check before triggering checkpoint and
sync_node_pages.
Therefore, if a bunch of fsync calls are triggered under 100% of FS utilization,
f2fs is able to be faced with no free sections, resulting in BUG_ON().

3. f2fs_sync_fs()
Before calling write_checkpoint(), we should guarantee that there are minimum
free sections.

4. f2fs_write_inode()
f2fs_write_inode() is also able to produce dirty node pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-11 15:09:17 +09:00
Randy Dunlap 254adaa465 seq_file: fix new kernel-doc warnings
Fix kernel-doc warnings in fs/seq_file.c:

  Warning(fs/seq_file.c:304): No description found for parameter 'whence'
  Warning(fs/seq_file.c:304): Excess function parameter 'origin' description in 'seq_lseek'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-10 14:35:24 -08:00
Jaegeuk Kim 408e937561 f2fs: revisit the f2fs_gc flow
I'd like to revisit the f2fs_gc flow and rewrite as follows.

1. In practical, the nGC parameter of f2fs_gc is meaningless. So, let's
  remove it.
2. Background GC marks victim blocks as dirty one at a time.
3. Foreground GC should do cleaning job until acquiring enough free
  sections. Afterwards, it needs to do checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-10 07:42:59 +09:00
Wang Sheng-Hui 210b907acf btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
In the big loop, cur_trans will be set fs_info->running_transaction
before it's used. And after kmem_cache_free it and goto loop, it will
be setup again. No need to setup it immediately after freed.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:48:33 +01:00
Masanari Iida 8a168ca707 treewide: Fix typo in various drivers
Correct spelling typo in printk within various drivers.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:43:32 +01:00
Liu Bo 2c016dc2cb btrfs: fix comment typos
Convert 'hepler' to 'helper'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:40:52 +01:00
Linus Torvalds 5c33d9b248 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) New sysctl ndisc_notify needs some documentation, from Hanns
    Frederic Sowa.

 2) Netfilter REJECT target doesn't set transport header of SKB
    correctly, from Mukund Jampala.

 3) Forcedeth driver needs to check for DMA mapping failures, from Larry
    Finger.

 4) brcmsmac driver can't use usleep_range while holding locks, use
    udelay instead.  From Niels Ole Salscheider.

 5) Fix unregister of netlink bridge multicast database handlers, from
    Vlad Yasevich and Rami Rosen.

 6) Fix checksum calculations in netfilter's ipv6 network prefix
    translation module.

 7) Fix high order page allocation failures in netfilter xt_recent, from
    Eric Dumazet.

 8) mac802154 needs to use netif_rx_ni() instead of netif_rx() because
    mac802154_process_data() can execute in process rather than
    interrupt context.  From Alexander Aring.

 9) Fix splice handling of MSG_SENDPAGE_NOTLAST, otherwise we elide one
    tcp_push() too many.  From Eric Dumazet and Willy Tarreau.

10) Fix skb->truesize tracking in XEN netfront driver, from Ian
    Campbell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  xen/netfront: improve truesize tracking
  ipv4: fix NULL checking in devinet_ioctl()
  tcp: fix MSG_SENDPAGE_NOTLAST logic
  net/ipv4/ipconfig: really display the BOOTP/DHCP server's address.
  ip-sysctl: fix spelling errors
  mac802154: fix NOHZ local_softirq_pending 08 warning
  ipv6: document ndisc_notify in networking/ip-sysctl.txt
  ath9k: Fix Kconfig for ATH9K_HTC
  netfilter: xt_recent: avoid high order page allocations
  netfilter: fix missing dependencies for the NOTRACK target
  netfilter: ip6t_NPT: fix IPv6 NTP checksum calculation
  bridge: add empty br_mdb_init() and br_mdb_uninit() definitions.
  vxlan: allow live mac address change
  bridge: Correctly unregister MDB rtnetlink handlers
  brcmfmac: fix parsing rsn ie for ap mode.
  brcmsmac: add copyright information for Canonical
  rtlwifi: rtl8723ae: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192se: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192de: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192ce: Fix warning for unchecked pci_map_single() call
  ...
2013-01-08 07:31:49 -08:00
Linus Torvalds 127aa93066 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Misc small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Don't let read only caching for mandatory byte-range locked files
  CIFS: Fix write after setting a read lock for read oplock files
  Revert "CIFS: Fix write after setting a read lock for read oplock files"
  cifs: adjust sequence number downward after signing NT_CANCEL request
  cifs: move check for NULL socket into smb_send_rqst
2013-01-07 13:21:55 -08:00
David Teigland f117228346 dlm: avoid scanning unchanged toss lists
Keep track of whether a toss list contains any
shrinkable rsbs.  If not, dlm_scand can avoid
scanning the list for rsbs to shrink.  Unnecessary
scanning can otherwise waste a lot of time because
the toss lists can contain a large number of rsbs
that are non-shrinkable (directory records).

Signed-off-by: David Teigland <teigland@redhat.com>
2013-01-07 12:02:49 -06:00
Linus Torvalds f77637206d Bug fixes, including two regressions introduced in v3.8. The most
serious of these regressions is a buffer cache leak.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ6rvpAAoJENNvdpvBGATwRrIP/1bokaspEpaHVKFAqgJzRNew
 cPdINNqc85m5MmVGmvPq4EMx7+86Z459sZAKafaXV/qVR/m7vmtfIiWi8bWPZUkv
 MrblMSPAQHERypMWA8eJkNfkyw39HJb4GMAIgh1TBUXO2ocb46cGpl0Fcum0twT4
 pv2C2JqNcLtIsekJsrqmvdqrNW+bMoMJZtzjFwHuIknmbo7eSFtgV17EFlcfWhJ7
 CZoJwvk2s/dnS6l4icwKZBNbKnap6oR9SptoymvH+ATPIEO4qJisRbID6XVhTZZ8
 3lsOWfGlGIVEy6wsQKwfeWdCzAyAjyQza7eeMdVkqm97YynFSUD0pMZ9tQ1FpQ9E
 JGAWshLFyA8+atY/JZ8xGQisY2R57WoJd2m0Bf3ockhB963iu+vnIxZ4BaKW3K9l
 WMRqkouA1ijaeeUNHV4FulJ0cG6ioIYTF6nM/jGTJXhF8ZXjJHb4ZiwjjWS8fZv1
 Ooe6GkHz29txi+hOET0vtwDUqkihGFlfNDbZ48/JjlS4sCy5ntcwt321Tn7olbo5
 O72k+oQLtMLJshrZwTuSQZZtiv/9638gtNGC6EUy7p5LTmo5CgaueEh2qhXweVGR
 f8X25RjNWREvE78JiJw6SzYfNeaLO6I0f+Hs4z8PLVc8K02IO0tK8mw8LNgvFHRF
 xCFhD2w916Dsh9cwyiRT
 =pnxL
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 regression fixes from Ted Ts'o:
 "Bug fixes, including two regressions introduced in v3.8.  The most
  serious of these regressions is a buffer cache leak."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
  ext4: release buffer in failed path in dx_probe()
  ext4: fix configuration dependencies for ext4 ACLs and security labels
2013-01-07 09:22:50 -08:00
Linus Torvalds 4c9014f2ca NFS client bugfixes for Linux 3.8
- Fix a permissions problem when opening NFSv4 files that only have the
   exec bit set.
 - Fix a couple of typos in pNFS (inverted logic), and the mount parsing
   (missing pointer dereference).
 - Work around a series of deadlock issues due to workqueues using
   struct work_struct pointer address comparisons in the re-entrancy
   tests. Ensure that we don't free struct work_struct prematurely if
   our work function involves waiting for completion of other work
   items (e.g. by calling rpc_shutdown_client).
 - Revert the part of commit 168e4b3 that is causing unnecessary warnings
   to be issued in the nfsd callback code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ6uzfAAoJEGcL54qWCgDyRvIP/06F4xZChZxH3prNz2cU2EDJ
 2RO3lWoZf8aBk2+bhU5Fm9cuUcCm1raoRFsdWoJC/dnlXL+A7ZeuYDvXycNfTJW3
 vCIRvZTU8TAxwR9szNkPhRIB09FAacioP/K0q6pBfwx8my5NC1yIpOMDVmND+40Z
 oWm7ICip4vblxNVQMYp6/JrDcc7LCDcOG5j0EKO5aSxRE0Ki2TYqN0nk20v9caDe
 43jOA2HWGPQ+Zg3sty2o728u57RB70OBqRJ//2pzW2fEwSut+pf5pC1MoVJvtgST
 Utwbt/tMEFAuvo0WGHqjaYHoAezcLkjSvYdrh7Vz6WJ0gsonCB8V8UvgKsFt0XSM
 YWtXa/PzkDfXDVNwkzpjmDCUDJwzAiTllQP5cJzAhsz7GG0xzGypyqmIMcJ6UGxy
 sADGXAhpJR8sWn7oxh4znMh+JlxW3MLA+jSgkrlzGM7wMinncnp5RSQIlY2/8qhZ
 GiO6b6wiUYYHbZtV456S4M+PwA0kGqkSDP/PlrquXAW8jPFCBlKLf4+SFK2kqs25
 yyJQXD0QycbtfbCGxaFJkt1qUEgIGkTpU8UnyxHKccIXhR4ZDi/IZ3OQSLexE+8V
 L8HZXeT32+fGgPz479bplXJ2VnLQo6VXOrA5ofylqesWW5klDPMTz7KIEJSyyHwG
 kiSLyjhDx/2vAG+Nwusr
 =yxcq
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix a permissions problem when opening NFSv4 files that only have the
   exec bit set.

 - Fix a couple of typos in pNFS (inverted logic), and the mount parsing
   (missing pointer dereference).

 - Work around a series of deadlock issues due to workqueues using
   struct work_struct pointer address comparisons in the re-entrancy
   tests.  Ensure that we don't free struct work_struct prematurely if
   our work function involves waiting for completion of other work items
   (e.g. by calling rpc_shutdown_client).

 - Revert the part of commit 168e4b3 that is causing unnecessary
   warnings to be issued in the nfsd callback code.

* tag 'nfs-for-3.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: avoid dereferencing null pointer in initiate_bulk_draining
  SUNRPC: Partial revert of commit 168e4b39d1
  NFS: Ensure that we free the rpc_task after read and write cleanups are done
  SUNRPC: Ensure that we free the rpc_task after cleanups are done
  nfs: fix null checking in nfs_get_option_str()
  pnfs: Increase the refcount when LAYOUTGET fails the first time
  NFS: Fix access to suid/sgid executables
2013-01-07 08:36:45 -08:00
Eric Dumazet ae62ca7b03 tcp: fix MSG_SENDPAGE_NOTLAST logic
commit 35f9c09fe9 (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all
frags but the last one for a splice() call.

The condition used to set the flag in pipe_to_sendpage() relied on
splice() user passing the exact number of bytes present in the pipe,
or a smaller one.

But some programs pass an arbitrary high value, and the test fails.

The effect of this bug is a lack of tcp_push() at the end of a
splice(pipe -> socket) call, and possibly very slow or erratic TCP
sessions.

We should both test sd->total_len and fact that another fragment
is in the pipe (pipe->nrbufs > 1)

Many thanks to Willy for providing very clear bug report, bisection
and test programs.

Reported-by: Willy Tarreau <w@1wt.eu>
Bisected-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-06 20:58:13 -08:00
Guo Chao fef0ebdb22 ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
This fixes a buffer cache leak when creating a directory, introduced
in commit a774f9c20.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:40:25 -05:00
Guo Chao 0ecaef0644 ext4: release buffer in failed path in dx_probe()
If checksum fails, we should also release the buffer
read from previous iteration.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>-
Cc: stable@vger.kernel.org
--
 fs/ext4/namei.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2013-01-06 23:38:47 -05:00
Valerie Aurora 96465efee1 ext4: fix configuration dependencies for ext4 ACLs and security labels
Commit "ext4: Remove CONFIG_EXT4_FS_XATTR" removed the configuration
dependencies for ext4 xattrs from the ext4 ACLs and security labels
configuration options, but did not replace them with a dependency on
ext4 itself.  Add back the dependency on ext4 so the options only show
up if ext4 is enabled.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:38:44 -05:00
Nickolai Zeldovich ecf0eb9edb nfs: avoid dereferencing null pointer in initiate_bulk_draining
Fix an inverted null pointer check in initiate_bulk_draining().

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-05 14:26:51 -05:00
Trond Myklebust 6db6dd7d3f NFS: Ensure that we free the rpc_task after read and write cleanups are done
This patch ensures that we free the rpc_task after the cleanup callbacks
are done in order to avoid a deadlock problem that can be triggered if
the callback needs to wait for another workqueue item to complete.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Weston Andros Adamson <dros@netapp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Bruce Fields <bfields@fieldses.org>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-04 12:59:10 -05:00
Xi Wang e25fbe380c nfs: fix null checking in nfs_get_option_str()
The following null pointer check is broken.

	*option = match_strdup(args);
	return !option;

The pointer `option' must be non-null, and thus `!option' is always false.
Use `!*option' instead.

The bug was introduced in commit c5cb09b6f8 ("Cleanup: Factor out some
cut-and-paste code.").

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-01-04 10:54:43 -05:00
Yanchuan Nian 39e88fcfb1 pnfs: Increase the refcount when LAYOUTGET fails the first time
The layout will be set unusable if LAYOUTGET fails. Is it reasonable to
increase the refcount iff LAYOUTGET fails the first time?

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-04 10:50:42 -05:00
Jaegeuk Kim c335a86930 f2fs: check return value during recovery
This patch resolves Coverity #753102:

>>> No check of the return value of "f2fs_add_link(&dent, inode)".

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Jaegeuk Kim c1b75eabec f2fs: avoid null dereference in f2fs_acl_from_disk
This patch resolves Coverity #751303:

>>> CID 753103: Explicit null dereferenced (FORWARD_NULL) Passing null
>>> pointer "value" to function "f2fs_acl_from_disk(char const *, size_t)",
	which dereferences it.

[Error path]
- value = NULL;
- retval = 0 by f2fs_getxattr();
- f2fs_acl_from_disk(value:NULL, ...);

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Jaegeuk Kim d66d1f7687 f2fs: initialize newly allocated dnode structure
This patch resolves Coverity #753112.

In practical, the existing code flow does not fall into the reported errorneous
path. But, anyway, let's avoid this for future.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Huajun Li 7880ceedec f2fs: update f2fs partition info about SIT/NAT layout
Update partition info output under debug FS to reflect segment layout correctly.

Signed-off-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon 24c366a9ea f2fs: remove unneeded INIT_LIST_HEAD at few places
While creating a new entry for addition to the list(orphan inode list
and fsync inode entry list), there is no need to call HEAD initialization
for these entries. So, remove that init part.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon 3af60a49fd f2fs: fix time update in case of f2fs fallocate
After doing a punch hole or expanding inode doing fallocation.
The change and modification time are not update for the file.
So, update time after no issue is observed in fallocate.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon a07ef78435 f2fs: introduce f2fs_msg to ease adding information prints
Introduced f2fs_msg function to differentiate f2fs specific messages in
the log. And, added few informative prints in the mount path, to convey
proper error in case of mount failure.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Linus Torvalds 49569646b2 Driver core __dev* removal patches
Here are the remaining __dev* removal patches against the 3.8-rc2 tree.
 All of these patches were previously sent to the subsystem maintainers,
 most of them were picked up and pushed to you, but there were a number
 that fell through the cracks, and new drivers were added during the
 merge window, so this series cleans up the rest of the instances of
 these markings.
 
 Third time's the charm...
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDmHOIACgkQMUfUDdst+ykTZgCePgK84Im3FFooEXJwaPbaf4ls
 lO4AoMEDoWK+BHWOsjQwFPOwFFPEN2Xh
 =6oAQ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core __dev* removal patches - take 3 - from Greg Kroah-Hartman:
 "Here are the remaining __dev* removal patches against the 3.8-rc2
  tree.  All of these patches were previously sent to the subsystem
  maintainers, most of them were picked up and pushed to you, but there
  were a number that fell through the cracks, and new drivers were added
  during the merge window, so this series cleans up the rest of the
  instances of these markings.

  Third time's the charm...

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up trivial conflict with the pinctrl pull in pinctrl-sirf.c.

* tag 'driver-core-3.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
  misc: remove __dev* attributes.
  include: remove __dev* attributes.
  Documentation: remove __dev* attributes.
  Drivers: misc: remove __dev* attributes.
  Drivers: block: remove __dev* attributes.
  Drivers: bcma: remove __dev* attributes.
  Drivers: char: remove __dev* attributes.
  Drivers: clocksource: remove __dev* attributes.
  Drivers: ssb: remove __dev* attributes.
  Drivers: dma: remove __dev* attributes.
  Drivers: gpu: remove __dev* attributes.
  Drivers: infinband: remove __dev* attributes.
  Drivers: memory: remove __dev* attributes.
  Drivers: mmc: remove __dev* attributes.
  Drivers: iommu: remove __dev* attributes.
  Drivers: power: remove __dev* attributes.
  Drivers: message: remove __dev* attributes.
  Drivers: macintosh: remove __dev* attributes.
  Drivers: mfd: remove __dev* attributes.
  pstore: remove __dev* attributes.
  ...
2013-01-03 16:17:50 -08:00
Greg Kroah-Hartman 6ae141718e misc: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the last of the __dev* markings from the kernel from
a variety of different, tiny, places.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:16 -08:00
Greg Kroah-Hartman f568f6ca81 pstore: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit from the pstore filesystem.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Anton Vorontsov <cbouatmailru@gmail.com>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:14 -08:00
Weston Andros Adamson f8d9a897d4 NFS: Fix access to suid/sgid executables
nfs_open_permission_mask() should only check MAY_EXEC for files that
are opened with __FMODE_EXEC.

Also fix NFSv4 access-in-open path in a similar way -- openflags must be
used because fmode will not always have FMODE_EXEC set.

This patch fixes https://bugzilla.kernel.org/show_bug.cgi?id=49101

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-03 17:06:27 -05:00
Eric Sandeen 83a9ba0057 xfs: don't zero structure members after a memset(0)
Commit 408cc4e97a
added memset(0, ...) to allocation args structures,
so there is no need to explicitly set any of the fields
to 0 after that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-03 16:00:07 -06:00
Brian Foster f755503206 xfs: remove int casts from debug dquot soft limit timer asserts
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-03 15:58:39 -06:00
Linus Torvalds 2318aa2720 f2fs: bug fixes for 3.8-rc2
This patch-set includes two major bug fixes:
 - incorrect IUsed provided by *df -i*, and
 - lookup failure of parent inodes in corner cases.
 
 [Other Bug Fixes]
 - Fix error handling routines
 - Trigger recovery process correctly
 - Resolve build failures due to missing header files
 
 [Etc]
 - Add a MAINTAINERS entry for f2fs
 - Fix and clean up variables, functions, and equations
 - Avoid warnings during compilation
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ3QYaAAoJEEAUqH6CSFDSZwYP/jyx4ilQC5pT85NCBs744JGZ
 zGKcNLlWJcVpSwoepmjEJiMfvYE63imwmsG0gHcnUosvndTivrOkLUPReBE4bLLO
 6KR0J/HWxfRY+FAM6nGfoVDWAG/2mU/cCwDKCwGgAZp//5YmxQZTcp3Xcak6dHUQ
 z64O2XC/roK4w827lwpp7lFBnhY3snpaA+EFGe2Wwm+9r9BrJYP3FyFG9VVRR1Cw
 SQphbZrC2yo+03IN1sJV83QLfKt3+tONhctizAtMzVOVgM2ToVLNbz32SVj8pbnD
 WSMwVWYxVQwC8R9ZhUc2Z3hV+m9m9MswBMyK+U9QF5r/avFK4vKHJYLOcXmpzRcX
 voH5tl7OfVERqfMsle+7X6/Edz5xd4abF4b5yM+3h9pz4LRJYWkp3daeWPY++ZUR
 esM81f0+W55I/mk1STlI7N+KdVn3/Zpqi0UVkcZQ8y15NpOR+5zeLF4+8Skk4NGL
 emYJiho4GB2cpKw7gIQQtnGKSTIPnwRK6Lart7qJ1b2FfNwtJqAgALm7HDxbqSTN
 r3XXJv2E1u9HBu0gEKs5g223Poj7nBLwOQPkmIyx1ozD7bx19vQGnfsHza8a9L1y
 pXQ3dKU3o93+dHjMGky2cpe6B52nuvN+MBicqBSCJAIbSafmp8/JrBiN7hgQFdpj
 x/nXXZd+WPakvjHz03e5
 =dH3T
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs bug fixes from Jaegeuk Kim:
 "This patch-set includes two major bug fixes:
   - incorrect IUsed provided by *df -i*, and
   - lookup failure of parent inodes in corner cases.

  [Other Bug Fixes]
   - Fix error handling routines
   - Trigger recovery process correctly
   - Resolve build failures due to missing header files

  [Etc]
   - Add a MAINTAINERS entry for f2fs
   - Fix and clean up variables, functions, and equations
   - Avoid warnings during compilation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: unify string length declarations and usage
  f2fs: clean up unused variables and return values
  f2fs: clean up the start_bidx_of_node function
  f2fs: remove unneeded variable from f2fs_sync_fs
  f2fs: fix fsync_inode list addition logic and avoid invalid access to memory
  f2fs: remove unneeded initialization of nr_dirty in dirty_seglist_info
  f2fs: handle error from f2fs_iget_nowait
  f2fs: fix equation of has_not_enough_free_secs()
  f2fs: add MAINTAINERS entry
  f2fs: return a default value for non-void function
  f2fs: invalidate the node page if allocation is failed
  f2fs: add missing #include <linux/prefetch.h>
  f2fs: do f2fs_balance_fs in front of dir operations
  f2fs: should recover orphan and fsync data
  f2fs: fix handling errors got by f2fs_write_inode
  f2fs: fix up f2fs_get_parent issue to retrieve correct parent inode number
  f2fs: fix wrong calculation on f_files in statfs
  f2fs: remove set_page_dirty for atomic f2fs_end_io_write
2013-01-03 11:41:43 -08:00
Linus Torvalds ed4e6a94d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull GFS2 fixes from Steven Whitehouse:
 "Here are four small bug fixes for GFS2.  There is no common theme here
  really, just a few items that were fixed recently.

  The first fixes lock name generation when the glock number is 0.  The
  second fixes a race allocating reservation structures and the final
  two fix a performance issue by making small changes in the allocation
  code."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Reset rd_last_alloc when it reaches the end of the rgrp
  GFS2: Stop looking for free blocks at end of rgrp
  GFS2: Fix race in gfs2_rs_alloc
  GFS2: Initialize hex string to '0'
2013-01-03 11:38:14 -08:00
Linus Torvalds 007f6c3a63 Two self-explanatory fixes and a third patch which improves performance. When
overwriting a full page in the eCryptfs page cache, skip reading in and
 decrypting the corresponding lower page.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABCgAGBQJQ5HR+AAoJENaSAD2qAscKg/gQAJSGpz9Frh3QqV30smvbKASI
 vBcHpbEBMhpExzkcLF3Gqdj7KqcwpN3Nh+oAD1vNyvermeczazEebr5wFfNTv4eE
 TetUfa2e92RS0c0yxgS+9k1Fhxi8BCovNxmFfiq5iPFHSNwjixPBHLLZVFPCdp9N
 il/dV8Y7wg1exDikZQc8lqiVULZxvkBc+R/dgXFhAnwFxDMT2jiInXbBU4Onct0P
 +YX4FwrKnDCOg7bk8Mk/lW6mwAuhoelnuF3dy9v/soBeclOeTfmUmO44dv0D3IPY
 iGpGofhs+cDSKxOZ0XXocAdFdmY7fbcijppoF00XyZiuqcd59zc0l+LDRuCBcXD7
 SFSTzR0uFf8C0rM4Mjfz6WGbwW7Ae0KqLbFIVg03MJDCquOtDBr0Xdpviy1GYNo3
 H0Z3400olyGqp/3ZoEjefOoz9DbzqHtzhcMtGBN/ihyaolPJzS81pLTYCsja2SJM
 pHUjId3abWOVRgtrAk+XUO9Sn6W8Or5bug4+idYwD6LfUILz9OpHin/mplnHoF9F
 8lEjhzNHyvU3HQPyR4v/TidExyx7IBeP0tOLk4X2N+fmH45ukl/pPDNfpF/2lxpd
 mN7HK2H2cYtGrYSwSmwuG0q9W365vmk8mvu2Xz5aIMe9r5SeucgPjzZ3zg+kHgRE
 OqJljwln6TaSB/7o0MQ5
 =JNeQ
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull ecryptfs fixes from Tyler Hicks:
 "Two self-explanatory fixes and a third patch which improves
  performance: when overwriting a full page in the eCryptfs page cache,
  skip reading in and decrypting the corresponding lower page."

* tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  fs/ecryptfs/crypto.c: make ecryptfs_encode_for_filename() static
  eCryptfs: fix to use list_for_each_entry_safe() when delete items
  eCryptfs: Avoid unnecessary disk read and data decryption during writing
2013-01-02 17:33:50 -08:00
Linus Torvalds 5439ca6b8f Various bug fixes for ext4. Perhaps the most serious bug fixed is one
which could cause file system corruptions when performing file punch
 operations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ374OAAoJENNvdpvBGATwEGAP/jKUwjQhBZiF0k9dg1kQ5eTz
 bdli4fy1vxrEMIOym8IZa4nBQJVCkArwRgjc28gCBD6k9u6X3GPa26vUydsoPfP6
 odPdc9c9HtsbYQGuaq1SohID5HfjxHewTcUmCs4X4SpGcSurUcT7eQYWqSuIxFHR
 0nKk8NO4EcWh2uqIoGPrc8QpSdor0DXXYYjZmHCeVLH1n6PyoMsnrFMfO9KqMLUL
 vNR54CX9n1GRTfAfJNkNzcwfs8IfNkDUyv5hFpDh15tLltogU0TqnlAl3vSeZGSx
 vVfhwHmQTK/bJyC3YaoRZqq9CQJVk2f/OTBpJDFY/USaapuitJd6vqbmh7NiRNAN
 LaKmFt99MPfwyjEhIA7+J0LCTraAxc536q43oWWK5dAJhWI7DW0lbHARVeQTixNy
 KJ1Lp0pmmz1mX8/lugOnK1SPBF525kTaoiz2bWqg4oQgn7mBzUlgj+EV22/6Rq83
 TpKOKstl4BiZi8t5AhmFiwqtknCDiT5vUKQNy2kuM/oXtPJID/lM/TJbR5viYD3l
 AH3Ef7xj61CynFZ0oBeraGwtXc2BHJpJdWz+8uj0/VhFfC+uNUYapSLFwyiAVZKO
 xxaItT3ylfKpa0AWK6HBc2SLuL72SCHAPks06YKFtSyHtr5C8SCcafxU2DSOSi7K
 VrhkcH6STa77Br7a1ORt
 =9R/D
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  Perhaps the most serious bug fixed is one
  which could cause file system corruptions when performing file punch
  operations."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid hang when mounting non-journal filesystems with orphan list
  ext4: lock i_mutex when truncating orphan inodes
  ext4: do not try to write superblock on ro remount w/o journal
  ext4: include journal blocks in df overhead calcs
  ext4: remove unaligned AIO warning printk
  ext4: fix an incorrect comment about i_mutex
  ext4: fix deadlock in journal_unmap_buffer()
  ext4: split off ext4_journalled_invalidatepage()
  jbd2: fix assertion failure in jbd2_journal_flush()
  ext4: check dioread_nolock on remount
  ext4: fix extent tree corruption caused by hole punch
2013-01-02 09:57:34 -08:00
Hugh Dickins a7a88b2373 mempolicy: remove arg from mpol_parse_str, mpol_to_str
Remove the unused argument (formerly no_context) from mpol_parse_str()
and from mpol_to_str().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:27:10 -08:00
Eric Wong 128dd1759d epoll: prevent missed events on EPOLL_CTL_MOD
EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
ensure events are not missed.  Since the modifications to the interest
mask are not protected by the same lock as ep_poll_callback, we need to
ensure the change is visible to other CPUs calling ep_poll_callback.

We also need to ensure f_op->poll() has an up-to-date view of past
events which occured before we modified the interest mask.  So this
barrier also pairs with the barrier in wq_has_sleeper().

This should guarantee either ep_poll_callback or f_op->poll() (or both)
will notice the readiness of a recently-ready/modified item.

This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
http://thread.gmane.org/gmane.linux.kernel/1408782/

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
Tested-by: "Junchang(Jason) Wang" <junchang.wang@yale.edu>
Cc: netdev@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:16:43 -08:00
Bob Peterson 13d2eb0129 GFS2: Reset rd_last_alloc when it reaches the end of the rgrp
In function rg_mblk_search, it's searching for multiple blocks in
a given state (e.g. "free"). If there's an active block reservation
its goal is the next free block of that. If the resource group
contains the dinode's goal block, that's used for the search. But
if neither is the case, it uses the rgrp's last allocated block.
That way, consecutive allocations appear after one another on media.
The problem comes in when you hit the end of the rgrp; it would never
start over and search from the beginning. This became a problem,
since if you deleted all the files and data from the rgrp, it would
never start over and find free blocks. So it had to keep searching
further out on the media to allocate blocks. This patch resets the
rd_last_alloc after it does an unsuccessful search at the end of
the rgrp.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:05:27 +00:00
Bob Peterson 15bd50ad82 GFS2: Stop looking for free blocks at end of rgrp
This patch adds a return code check after calling function
gfs2_rbm_from_block while determining the free extent size.
That way, when the end of an rgrp is reached, it won't try
to process unaligned blocks after the end.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:05:10 +00:00
Abhijith Das f1213cacc7 GFS2: Fix race in gfs2_rs_alloc
QE aio tests uncovered a race condition in gfs2_rs_alloc where it's possible
to come out of the function with a valid ip->i_res allocation but it gets
freed before use resulting in a NULL ptr dereference.

This patch envelopes the initial short-circuit check for non-NULL ip->i_res
into the mutex lock. With this patch, I was able to successfully run the
reproducer test multiple times.

Resolves: rhbz#878476
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:04:53 +00:00
Nathan Straz ec1487528b GFS2: Initialize hex string to '0'
When generating the DLM lock name, a value of 0 would skip
the loop and leave the string unchanged.  This left locks with
a value of 0 unlabeled.  Initializing the string to '0' fixes this.

Signed-off-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:04:00 +00:00
Pavel Shilovsky 63b7d3a41c CIFS: Don't let read only caching for mandatory byte-range locked files
If we have mandatory byte-range locks on a file we can't cache reads
because pagereading may have conflicts with these locks on the server.
That's why we should allow level2 oplocks for files without mandatory
locks only.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 23:04:30 -06:00
Pavel Shilovsky 88cf75aaaf CIFS: Fix write after setting a read lock for read oplock files
If we have a read oplock and set a read lock in it, we can't write to the
locked area - so, filemap_fdatawrite may fail with a no information for a
userspace application even if we request a write to non-locked area. Fix
this by writing directly to the server and then breaking oplock level from
level2 to None.

Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS
and SMB2 protocols.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 23:04:14 -06:00
Pavel Shilovsky ca8aa29c60 Revert "CIFS: Fix write after setting a read lock for read oplock files"
that solution has data races and can end up two identical writes to the
server: when clientCanCacheAll value can be changed during the execution
of __generic_file_aio_write.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 22:59:55 -06:00
Ben Myers 56431cd194 Merge branch 'xfs-for-3.9' 2012-12-31 10:41:39 -06:00
Jeff Layton 31efee60f4 cifs: adjust sequence number downward after signing NT_CANCEL request
When a call goes out, the signing code adjusts the sequence number
upward by two to account for the request and the response. An NT_CANCEL
however doesn't get a response of its own, it just hurries the server
along to get it to respond to the original request more quickly.
Therefore, we must adjust the sequence number back down by one after
signing a NT_CANCEL request.

Cc: <stable@vger.kernel.org>
Reported-by: Tim Perry <tdparmor-sambabugs@yahoo.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-30 11:43:51 -06:00
Jeff Layton ea702b80e0 cifs: move check for NULL socket into smb_send_rqst
Cai reported this oops:

[90701.616664] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[90701.625438] IP: [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.632167] PGD fea319067 PUD 103fda4067 PMD 0
[90701.637255] Oops: 0000 [#1] SMP
[90701.640878] Modules linked in: des_generic md4 nls_utf8 cifs dns_resolver binfmt_misc tun sg igb iTCO_wdt iTCO_vendor_support lpc_ich pcspkr i2c_i801 i2c_core i7core_edac edac_core ioatdma dca mfd_core coretemp kvm_intel kvm crc32c_intel microcode sr_mod cdrom ata_generic sd_mod pata_acpi crc_t10dif ata_piix libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[90701.677655] CPU 10
[90701.679808] Pid: 9627, comm: ls Tainted: G        W    3.7.1+ #10 QCI QSSC-S4R/QSSC-S4R
[90701.688950] RIP: 0010:[<ffffffff814a343e>]  [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.698383] RSP: 0018:ffff88177b431bb8  EFLAGS: 00010206
[90701.704309] RAX: ffff88177b431fd8 RBX: 00007ffffffff000 RCX: ffff88177b431bec
[90701.712271] RDX: 0000000000000003 RSI: 0000000000000006 RDI: 0000000000000000
[90701.720223] RBP: ffff88177b431bc8 R08: 0000000000000004 R09: 0000000000000000
[90701.728185] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[90701.736147] R13: ffff88184ef92000 R14: 0000000000000023 R15: ffff88177b431c88
[90701.744109] FS:  00007fd56a1a47c0(0000) GS:ffff88105fc40000(0000) knlGS:0000000000000000
[90701.753137] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[90701.759550] CR2: 0000000000000028 CR3: 000000104f15f000 CR4: 00000000000007e0
[90701.767512] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[90701.775465] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[90701.783428] Process ls (pid: 9627, threadinfo ffff88177b430000, task ffff88185ca4cb60)
[90701.792261] Stack:
[90701.794505]  0000000000000023 ffff88177b431c50 ffff88177b431c38 ffffffffa014fcb1
[90701.802809]  ffff88184ef921bc 0000000000000000 00000001ffffffff ffff88184ef921c0
[90701.811123]  ffff88177b431c08 ffffffff815ca3d9 ffff88177b431c18 ffff880857758000
[90701.819433] Call Trace:
[90701.822183]  [<ffffffffa014fcb1>] smb_send_rqst+0x71/0x1f0 [cifs]
[90701.828991]  [<ffffffff815ca3d9>] ? schedule+0x29/0x70
[90701.834736]  [<ffffffffa014fe6d>] smb_sendv+0x3d/0x40 [cifs]
[90701.841062]  [<ffffffffa014fe96>] smb_send+0x26/0x30 [cifs]
[90701.847291]  [<ffffffffa015801f>] send_nt_cancel+0x6f/0xd0 [cifs]
[90701.854102]  [<ffffffffa015075e>] SendReceive+0x18e/0x360 [cifs]
[90701.860814]  [<ffffffffa0134a78>] CIFSFindFirst+0x1a8/0x3f0 [cifs]
[90701.867724]  [<ffffffffa013f731>] ? build_path_from_dentry+0xf1/0x260 [cifs]
[90701.875601]  [<ffffffffa013f731>] ? build_path_from_dentry+0xf1/0x260 [cifs]
[90701.883477]  [<ffffffffa01578e6>] cifs_query_dir_first+0x26/0x30 [cifs]
[90701.890869]  [<ffffffffa015480d>] initiate_cifs_search+0xed/0x250 [cifs]
[90701.898354]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.904486]  [<ffffffffa01554cb>] cifs_readdir+0x45b/0x8f0 [cifs]
[90701.911288]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.917410]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.923533]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.929657]  [<ffffffff81195848>] vfs_readdir+0xb8/0xe0
[90701.935490]  [<ffffffff81195b9f>] sys_getdents+0x8f/0x110
[90701.941521]  [<ffffffff815d3b99>] system_call_fastpath+0x16/0x1b
[90701.948222] Code: 66 90 55 65 48 8b 04 25 f0 c6 00 00 48 89 e5 53 48 83 ec 08 83 fe 01 48 8b 98 48 e0 ff ff 48 c7 80 48 e0 ff ff ff ff ff ff 74 22 <48> 8b 47 28 ff 50 68 65 48 8b 14 25 f0 c6 00 00 48 89 9a 48 e0
[90701.970313] RIP  [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.977125]  RSP <ffff88177b431bb8>
[90701.981018] CR2: 0000000000000028
[90701.984809] ---[ end trace 24bd602971110a43 ]---

This is likely due to a race vs. a reconnection event.

The current code checks for a NULL socket in smb_send_kvec, but that's
too late. By the time that check is done, the socket will already have
been passed to kernel_setsockopt. Move the check into smb_send_rqst, so
that it's checked earlier.

In truth, this is a bit of a half-assed fix. The -ENOTSOCK error
return here looks like it could bubble back up to userspace. The locking
rules around the ssocket pointer are really unclear as well. There are
cases where the ssocket pointer is changed without holding the srv_mutex,
but I'm not clear whether there's a potential race here yet or not.

This code seems like it could benefit from some fundamental re-think of
how the socket handling should behave. Until then though, this patch
should at least fix the above oops in most cases.

Cc: <stable@vger.kernel.org> # 3.7+
Reported-and-Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-30 11:38:58 -06:00
Leon Romanovsky 9836b8b949 f2fs: unify string length declarations and usage
This patch is intended to unify string length declarations and usage.
There are number of calls to strlen which return size_t object.
The size of this object depends on compiler if it will be bigger,
equal or even smaller than an unsigned int

Signed-off-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:53 +09:00
Jaegeuk Kim 2b50638dec f2fs: clean up unused variables and return values
This patch cleans up a couple of unnecessary codes related to unused variables
and return values.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:52 +09:00
Jaegeuk Kim ce19a5d432 f2fs: clean up the start_bidx_of_node function
This patch also resolves the following warning reported by kbuild test robot.

fs/f2fs/gc.c: In function 'start_bidx_of_node':
fs/f2fs/gc.c:453:21: warning: 'bidx' may be used uninitialized in this function

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:52 +09:00
Namjae Jeon 64c576fe51 f2fs: remove unneeded variable from f2fs_sync_fs
We can directly return '0' from the function, instead of introducing a
'ret' variable.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:45 +09:00
Namjae Jeon fd8bb65f79 f2fs: fix fsync_inode list addition logic and avoid invalid access to memory
In function find_fsync_dnodes() - the fsync inodes gets added to the list, but
in one path suppose f2fs_iget results in error, in such case - error gets added
to the fsync inode list.
In next call to recover_data()->get_fsync_inode()
entry = list_entry(this, struct fsync_inode_entry, list);
                if (entry->inode->i_ino == ino)
This can result in "invalid access to memory" when it encounters 'error' as
entry in the fsync inode list.
So, add the fsync inode entry to the list only in case of no errors.
And, free the object at that point itself in case of issue.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:36 +09:00
Namjae Jeon 344324f10f f2fs: remove unneeded initialization of nr_dirty in dirty_seglist_info
Since, the memory for the object of dirty_seglist_info is allocated
using kzalloc - which returns zeroed out memory. So, there is no need
to initialize the nr_dirty values with zeroes.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:05 +09:00
Namjae Jeon 06025f4df8 f2fs: handle error from f2fs_iget_nowait
In case f2fs_iget_nowait returns error, it results in truncate_hole being
called with 'error' value as inode pointer. There is no check in truncate_hole
for valid inode, so it could result in crash due "invalid access to memory".
Avoid this by handling error condition properly.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:26:13 +09:00
Jaegeuk Kim 029cd28c1f f2fs: fix equation of has_not_enough_free_secs()
Practically, has_not_enough_free_secs() should calculate with the numbers of
current node and directory data blocks together.
Actually the equation was implemented in need_to_flush().

So, this patch removes need_flush() and moves the equation into
has_not_enough_free_secs().

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:10 +09:00
Jaegeuk Kim 12a67146e3 f2fs: return a default value for non-void function
This patch resolves a build warning reported by kbuild test robot.

"
fs/f2fs/segment.c: In function '__get_segment_type':
fs/f2fs/segment.c:806:1: warning: control reaches end of non-void
function [-Wreturn-type]
"

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:09 +09:00
Jaegeuk Kim 71e9fec548 f2fs: invalidate the node page if allocation is failed
The new_node_page() is processed as the following procedure.

1. A new node page is allocated.
2. Set PageUptodate with proper footer information.
3. Check if there is a free space for allocation
 4.a. If there is no space, f2fs returns with -ENOSPC.
 4.b. Otherwise, go next.

In the case of step #4.a, f2fs remains a wrong node page in the page cache
with the uptodate flag.

Also, even though a new node page is allocated successfully, an error can be
occurred afterwards due to allocation failure of the other data structures.
In such a case, remove_inode_page() would be triggered, so that we have to
clear uptodate flag in truncate_node() too.

So, we should remove the uptodate flag, if allocation is failed.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:09 +09:00
Geert Uytterhoeven 690e4a3ead f2fs: add missing #include <linux/prefetch.h>
m68k allmodconfig:

fs/f2fs/data.c: In function ‘read_end_io’:
fs/f2fs/data.c:311: error: implicit declaration of function ‘prefetchw’

fs/f2fs/segment.c: In function ‘f2fs_end_io_write’:
fs/f2fs/segment.c:628: error: implicit declaration of function ‘prefetchw’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:22:43 +09:00
Theodore Ts'o 0e9a9a1ad6 ext4: avoid hang when mounting non-journal filesystems with orphan list
When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree.  If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack.  (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:50 -05:00
Theodore Ts'o 721e3eba21 ext4: lock i_mutex when truncating orphan inodes
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is
called without i_mutex being taken.  It had previously not been taken
during orphan cleanup since races weren't possible at that point in
the mount process, but as a result of this c278531d39, we will now see
a kernel WARN_ON in this case.  Take the i_mutex in
ext4_orphan_cleanup() to suppress this warning.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:48 -05:00
Eric W. Biederman 48c6d1217e f2fs: Don't assign e_id in f2fs_acl_from_disk
With user namespaces enabled building f2fs fails with:

 CC      fs/f2fs/acl.o
fs/f2fs/acl.c: In function ‘f2fs_acl_from_disk’:
fs/f2fs/acl.c:85:21: error: ‘struct posix_acl_entry’ has no member named ‘e_id’
make[2]: *** [fs/f2fs/acl.o] Error 1
make[2]: Target `__build' not remade because of errors.

e_id is a backwards compatibility field only used for file systems
that haven't been converted to use kuids and kgids.  When the posix
acl tag field is neither ACL_USER nor ACL_GROUP assigning e_id is
unnecessary.  Remove the assignment so f2fs will build with user
namespaces enabled.

Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 20:05:15 -08:00
Jaegeuk Kim 1efef83202 f2fs: do f2fs_balance_fs in front of dir operations
In order to conserve free sections to deal with the worst-case scenarios, f2fs
should be able to freeze all the directory operations especially when there are
not enough free sections. The f2fs_balance_fs() is for this use.

When FS utilization becomes almost 100%, directory operations can be failed due
to -ENOSPC frequently, which produces some dirty node pages occasionally.

Previously, in such a case, f2fs_balance_fs() is not able to be triggered since
it is triggered only if the directory operation ends up with success.

So, this patch triggers f2fs_balance_fs() at first before handling directory
operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim 30f0c75858 f2fs: should recover orphan and fsync data
The recovery routine should do all the time regardless of normal umount action.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim 398b1ac5a5 f2fs: fix handling errors got by f2fs_write_inode
Ruslan reported that f2fs hangs with an infinite loop in f2fs_sync_file():

	while (sync_node_pages(sbi, inode->i_ino, &wbc) == 0)
		f2fs_write_inode(inode, NULL);

The reason was revealed that the cold flag is not set even thought this inode is
a normal file. Therefore, sync_node_pages() skips to write node blocks since it
only writes cold node blocks.

The cold flag is stored to the node_footer in node block, and whenever a new
node page is allocated, it is set according to its file type, file or directory.

But, after sudden-power-off, when recovering the inode page, f2fs doesn't recover
its cold flag.

So, let's assign the cold flag in more right places.

One more thing:
If f2fs_write_inode() returns an error due to whatever situations, there would
be no dirty node pages so that sync_node_pages() returns zero.
(i.e., zero means nothing was written.)

Reported-by: Ruslan N. Marchenko <me@ruff.mobi>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Namjae Jeon 38e0abdcfb f2fs: fix up f2fs_get_parent issue to retrieve correct parent inode number
Test Case:
[NFS Client]
ls -lR .

[NFS Server]
while [ 1 ]
do
echo 3 > /proc/sys/vm/drop_caches
done

Error on NFS Client: "No such file or directory"

When cache is dropped at the server, it results in lookup failure at the
NFS client due to non-connection with the parent. The default path is it
initiates a lookup by calculating the hash value for the name, even though
the hash values stored on the disk for "." and ".." is maintained as zero,
which results in failure from find_in_block due to not matching HASH values.
Fix up, by using the correct hashing values for these entries.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim 1362b5e347 f2fs: fix wrong calculation on f_files in statfs
In f2fs_statfs(), f_files should be the total number of available inodes
instead of the currently allocated inodes.
So, this patch should resolve the reported bug below.

Note that, showing 10% usage is not a bug, since f2fs reveals whole volume size
as much as possible and shows the space overhead as *used*.
This policy is fair enough with respect to other file systems.

<Reported Bug>
(loop0 is backed by 1GiB file)

$ mkfs.f2fs /dev/loop0

F2FS-tools: Ver: 1.1.0 (2012-12-11)
Info: sector size = 512
Info: total sectors = 2097152 (in 512bytes)
Info: zone aligned segment0 blkaddr: 512
Info: format successful

$ mount /dev/loop0 mnt/

$ df mnt/
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0       1046528 98312    929784  10%
/home/zeta/linux-devel/mtd-bench/mnt

$ df mnt/ -i
Filesystem     Inodes   IUsed  IFree IUse% Mounted on
/dev/loop0       1 -465918 465919     - /home/zeta/linux-devel/mtd-bench/mnt

Notice IUsed is negative. Also, 10% usage on a fresh f2fs seems too
much to be correct.

Reported-and-Tested-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:51 +09:00
Jaegeuk Kim dfb7c0ceab f2fs: remove set_page_dirty for atomic f2fs_end_io_write
We should guarantee not to do *scheduling while atomic*.
I found, in atomic f2fs_end_io_write(), there is a set_page_dirty() call
to deal with IO errors.

But, set_page_dirty() calls:
 -> f2fs_set_data_page_dirty()
   -> set_dirty_dir_page()
      -> cond_resched() which results in scheduling.

In order to avoid this, I'd like to remove simply set_page_dirty(),
since the page is already marked as ERROR and f2fs will be operated
as the read-only mode as well.
So, there is no recovery issue with this.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:51 +09:00
Eric W. Biederman dfb2ea45be proc: Allow proc_free_inum to be called from any context
While testing the pid namespace code I hit this nasty warning.

[  176.262617] ------------[ cut here ]------------
[  176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
[  176.265145] Hardware name: Bochs
[  176.265677] Modules linked in:
[  176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
[  176.266564] Call Trace:
[  176.266564]  [<ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0
[  176.266564]  [<ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20
[  176.266564]  [<ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0
[  176.266564]  [<ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20
[  176.266564]  [<ffffffff8123dbda>] proc_free_inum+0x3a/0x50
[  176.266564]  [<ffffffff8111d0dc>] free_pid_ns+0x1c/0x80
[  176.266564]  [<ffffffff8111d195>] put_pid_ns+0x35/0x50
[  176.266564]  [<ffffffff810c608a>] put_pid+0x4a/0x60
[  176.266564]  [<ffffffff8146b177>] tty_ioctl+0x717/0xc10
[  176.266564]  [<ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90
[  176.266564]  [<ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10
[  176.266564]  [<ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70
[  176.266564]  [<ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550
[  176.266564]  [<ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60
[  176.266564]  [<ffffffff810b9127>] ? __set_task_blocked+0x37/0x80
[  176.266564]  [<ffffffff810ab95b>] ? sys_wait4+0xab/0xf0
[  176.266564]  [<ffffffff811e3d31>] sys_ioctl+0x91/0xb0
[  176.266564]  [<ffffffff810a95f0>] ? task_stopped_code+0x50/0x50
[  176.266564]  [<ffffffff81939199>] system_call_fastpath+0x16/0x1b
[  176.266564] ---[ end trace 387af88219ad6143 ]---

It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
put_pid is called with another spinlock held and irqs disabled.

For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 16:23:12 -08:00
Michael Tokarev d096ad0f79 ext4: do not try to write superblock on ro remount w/o journal
When a journal-less ext4 filesystem is mounted on a read-only block
device (blockdev --setro will do), each remount (for other, unrelated,
flags, like suid=>nosuid etc) results in a series of scary messages
from kernel telling about I/O errors on the device.

This is becauese of the following code ext4_remount():

       if (sbi->s_journal == NULL)
                ext4_commit_super(sb, 1);

at the end of remount procedure, which forces writing (flushing) of
a superblock regardless whenever it is dirty or not, if the filesystem
is readonly or not, and whenever the device itself is readonly or not.

We only need call ext4_commit_super when the file system had been
previously mounted read/write.

Thanks to Eric Sandeen for help in diagnosing this issue.

Signed-off-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-25 14:08:16 -05:00
Eric Sandeen 0875a2b448 ext4: include journal blocks in df overhead calcs
To more accurately calculate overhead for "bsd" style
df reporting, we should count the journal blocks as
overhead as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Eric Whitney <enwlinux@gmail.com>
2012-12-25 13:56:01 -05:00
Eric Sandeen a28a9178e8 ext4: remove unaligned AIO warning printk
Although I put this in, I now think it was a bad decision.  For most
users, there is very little to be done in this case.  They get the
message, once per day, with no real context or proposed action.  TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.

Just nuke it.  Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:33:13 -05:00
Andy Lutomirski ad96f71155 ext4: fix an incorrect comment about i_mutex
i_mutex is not held when ->sync_file is called.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:31:52 -05:00
Jan Kara 53e872681f ext4: fix deadlock in journal_unmap_buffer()
We cannot wait for transaction commit in journal_unmap_buffer()
because we hold page lock which ranks below transaction start.  We
solve the issue by bailing out of journal_unmap_buffer() and
jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
for waiting for transaction commit to finish and try invalidation
again. Since the issue can happen only for page stradding i_size, it
is simple enough to manually call jbd2_journal_invalidatepage() for
such page from ext4_setattr(), check the return value and wait if
necessary.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:29:52 -05:00
Jan Kara 4520fb3c36 ext4: split off ext4_journalled_invalidatepage()
In data=journal mode we don't need delalloc or DIO handling in invalidatepage
and similarly in other modes we don't need the journal handling. So split
invalidatepage implementations.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:28:54 -05:00
Linus Torvalds 769cb858c2 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Misc small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: eliminate cifsERROR variable
  cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use
  cifs: fix double-free of "string" in cifs_parse_mount_options
2012-12-21 17:09:07 -08:00
J. Bruce Fields 10532b560b Revert "nfsd: warn on odd reply state in nfsd_vfs_read"
This reverts commit 79f77bf9a4.

This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs.  The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 17:07:45 -08:00
Trond Myklebust c4271c6e37 NFS: Kill fscache warnings when mounting without -ofsc
The fscache code will currently bleat a "non-unique superblock keys"
warning even if the user is mounting without the 'fsc' option.

There should be no reason to even initialise the superblock cache cookie
unless we're planning on using fscache for something, so ensure that we
check for the NFS_OPTION_FSCACHE flag before calling into the fscache
code.

Reported-by: Paweł Sikora <pawel.sikora@agmk.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:32:09 -08:00
David Howells c129c29347 NFS: Provide stub nfs_fscache_wait_on_invalidate() for when CONFIG_NFS_FSCACHE=n
Provide a stub nfs_fscache_wait_on_invalidate() function for when
CONFIG_NFS_FSCACHE=n lest the following error appear:

  fs/nfs/inode.c: In function 'nfs_invalidate_mapping':
  fs/nfs/inode.c:887:2: error: implicit declaration of function 'nfs_fscache_wait_on_invalidate' [-Werror=implicit-function-declaration]
  cc1: some warnings being treated as errors

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:06:48 -08:00
Jan Kara d7961c7fa4 jbd2: fix assertion failure in jbd2_journal_flush()
The following race is possible between start_this_handle() and someone
calling jbd2_journal_flush().

Process A                              Process B
start_this_handle().
  if (journal->j_barrier_count) # false
  if (!journal->j_running_transaction) { #true
    read_unlock(&journal->j_state_lock);
                                       jbd2_journal_lock_updates()
                                       jbd2_journal_flush()
                                         write_lock(&journal->j_state_lock);
                                         if (journal->j_running_transaction) {
                                           # false
                                         ... wait for committing trans ...
                                         write_unlock(&journal->j_state_lock);
    ...
    write_lock(&journal->j_state_lock);
    if (!journal->j_running_transaction) { # true
      jbd2_get_transaction(journal, new_transaction);
    write_unlock(&journal->j_state_lock);
    goto repeat; # eventually blocks on j_barrier_count > 0
                                         ...
                                         J_ASSERT(!journal->j_running_transaction);
                                           # fails

We fix the race by rechecking j_barrier_count after reacquiring j_state_lock
in exclusive mode.

Reported-by: yjwsignal@empal.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-21 00:15:51 -05:00
Linus Torvalds 96680d2b91 Merge branch 'for-next' of git://git.infradead.org/users/eparis/notify
Pull filesystem notification updates from Eric Paris:
 "This pull mostly is about locking changes in the fsnotify system.  By
  switching the group lock from a spin_lock() to a mutex() we can now
  hold the lock across things like iput().  This fixes a problem
  involving unmounting a fs and having inodes be busy, first pointed out
  by FAT, but reproducible with tmpfs.

  This also restores signal driven I/O for inotify, which has been
  broken since about 2.6.32."

Ugh.  I *hate* the timing of this.  It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes.  That's just crap.  But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.

Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.

* 'for-next' of git://git.infradead.org/users/eparis/notify:
  inotify: automatically restart syscalls
  inotify: dont skip removal of watch descriptor if creation of ignored event failed
  fanotify: dont merge permission events
  fsnotify: make fasync generic for both inotify and fanotify
  fsnotify: change locking order
  fsnotify: dont put marks on temporary list when clearing marks by group
  fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
  fsnotify: pass group to fsnotify_destroy_mark()
  fsnotify: use a mutex instead of a spinlock to protect a groups mark list
  fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
  fsnotify: take groups mark_lock before mark lock
  fsnotify: use reference counting for groups
  fsnotify: introduce fsnotify_get_group()
  inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
2012-12-20 20:11:52 -08:00
Linus Torvalds 4c9a44aebe Merge branch 'akpm' (Andrew's patch-bomb)
Merge the rest of Andrew's patches for -rc1:
 "A bunch of fixes and misc missed-out-on things.

  That'll do for -rc1.  I still have a batch of IPC patches which still
  have a possible bug report which I'm chasing down."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  keys: use keyring_alloc() to create module signing keyring
  keys: fix unreachable code
  sendfile: allows bypassing of notifier events
  SGI-XP: handle non-fatal traps
  fat: fix incorrect function comment
  Documentation: ABI: remove testing/sysfs-devices-node
  proc: fix inconsistent lock state
  linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
  memcg: don't register hotcpu notifier from ->css_alloc()
  checkpatch: warn on uapi #includes that #include <uapi/...
  revert "rtc: recycle id when unloading a rtc driver"
  mm: clean up transparent hugepage sysfs error messages
  hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
  hfsplus: rework processing of hfs_btree_write() returned error
  hfsplus: rework processing errors in hfsplus_free_extents()
  hfsplus: avoid crash on failed block map free
  kcmp: include linux/ptrace.h
  drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
  mm: cma: WARN if freed memory is still in use
  exec: do not leave bprm->interp on stack
  ...
2012-12-20 20:00:43 -08:00
Linus Torvalds 1f0377ff08 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS update from Al Viro:
 "fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
  misc stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
  vfs: make lremovexattr retry once on ESTALE error
  vfs: make removexattr retry once on ESTALE
  vfs: make llistxattr retry once on ESTALE error
  vfs: make listxattr retry once on ESTALE error
  vfs: make lgetxattr retry once on ESTALE
  vfs: make getxattr retry once on an ESTALE error
  vfs: allow lsetxattr() to retry once on ESTALE errors
  vfs: allow setxattr to retry once on ESTALE errors
  vfs: allow utimensat() calls to retry once on an ESTALE error
  vfs: fix user_statfs to retry once on ESTALE errors
  vfs: make fchownat retry once on ESTALE errors
  vfs: make fchmodat retry once on ESTALE errors
  vfs: have chroot retry once on ESTALE error
  vfs: have chdir retry lookup and call once on ESTALE error
  vfs: have faccessat retry once on an ESTALE error
  vfs: have do_sys_truncate retry once on an ESTALE error
  vfs: fix renameat to retry on ESTALE errors
  vfs: make do_unlinkat retry once on ESTALE errors
  vfs: make do_rmdir retry once on ESTALE errors
  vfs: add a flags argument to user_path_parent
  ...
2012-12-20 18:14:31 -08:00
Linus Torvalds 54d46ea993 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "sigaltstack infrastructure + conversion for x86, alpha and um,
  COMPAT_SYSCALL_DEFINE infrastructure.

  Note that there are several conflicts between "unify
  SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
  resolution is trivial - just remove definitions of SS_ONSTACK and
  SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
  include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to generic sigaltstack
  new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
  generic compat_sys_sigaltstack()
  introduce generic sys_sigaltstack(), switch x86 and um to it
  new helper: compat_user_stack_pointer()
  new helper: restore_altstack()
  unify SS_ONSTACK/SS_DISABLE definitions
  new helper: current_user_stack_pointer()
  missing user_stack_pointer() instances
  Bury the conditionals from kernel_thread/kernel_execve series
  COMPAT_SYSCALL_DEFINE: infrastructure
2012-12-20 18:05:28 -08:00
Scott Wolchok a68c2f12b4 sendfile: allows bypassing of notifier events
do_sendfile() in fs/read_write.c does not call the fsnotify functions,
unlike its neighbors.  This manifests as a lack of inotify ACCESS events
when a file is sent using sendfile(2).

Addresses
  https://bugzilla.kernel.org/show_bug.cgi?id=12812

[akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Scott Wolchok <swolchok@umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:21 -08:00
Ravishankar N c39540c6d1 fat: fix incorrect function comment
fat_search_long() returns 0 on success, -ENOENT/ENOMEM on failure.
Change the function comment accordingly.

While at it, fix some trivial typos.

Signed-off-by: Ravishankar N <cyberax82@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Xiaotian Feng ee297209bf proc: fix inconsistent lock state
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq.  Currently, kernel is using spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.

Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.

  =================================
  [ INFO: inconsistent lock state ]
  3.7.0 #36 Not tainted
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0x8ae/0xca0
     lock_acquire+0x199/0x200
     _raw_spin_lock+0x41/0x50
     proc_alloc_inum+0x4c/0xd0
     alloc_mnt_ns+0x49/0xc0
     create_mnt_ns+0x25/0x70
     mnt_init+0x161/0x1c7
     vfs_caches_init+0x107/0x11a
     start_kernel+0x348/0x38c
     x86_64_start_reservations+0x131/0x136
     x86_64_start_kernel+0x103/0x112
  irq event stamp: 2993422
  hardirqs last  enabled at (2993422):  _raw_spin_unlock_irqrestore+0x55/0x80
  hardirqs last disabled at (2993421):  _raw_spin_lock_irqsave+0x29/0x70
  softirqs last  enabled at (2993394):  _local_bh_enable+0x13/0x20
  softirqs last disabled at (2993395):  call_softirq+0x1c/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(proc_inum_lock);
    <Interrupt>
      lock(proc_inum_lock);

   *** DEADLOCK ***

  no locks held by swapper/1/0.

  stack backtrace:
  Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
  Call Trace:
   <IRQ>  [<ffffffff810a40f1>] ? vprintk_emit+0x471/0x510
    print_usage_bug+0x2a5/0x2c0
    mark_lock+0x33b/0x5e0
    __lock_acquire+0x813/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_free_inum+0x1c/0x50
    free_pid_ns+0x1c/0x50
    put_pid_ns+0x2e/0x50
    put_pid+0x4a/0x60
    delayed_put_pid+0x12/0x20
    rcu_process_callbacks+0x462/0x790
    __do_softirq+0x1b4/0x3b0
    call_softirq+0x1c/0x30
    do_softirq+0x59/0xd0
    irq_exit+0x54/0xd0
    smp_apic_timer_interrupt+0x95/0xa3
    apic_timer_interrupt+0x72/0x80
    cpuidle_enter_tk+0x10/0x20
    cpuidle_enter_state+0x17/0x50
    cpuidle_idle_call+0x287/0x520
    cpu_idle+0xba/0x130
    start_secondary+0x2b3/0x2bc

Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Vyacheslav Dubeyko bffdd661bd hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
Add an error message for the case of failure of sync fs in
delayed_sync_fs() method.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Vyacheslav Dubeyko 81cc7fad55 hfsplus: rework processing of hfs_btree_write() returned error
Add to hfs_btree_write() a return of -EIO on failure of b-tree node
searching.  Also add logic ofor processing errors from hfs_btree_write()
in hfsplus_system_write_inode() with a message about b-tree writing
failure.

[akpm@linux-foundation.org: reduce scope of `err', print errno on error]
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Vyacheslav Dubeyko 1b243fd39b hfsplus: rework processing errors in hfsplus_free_extents()
Currently, it doesn't process error codes from the hfsplus_block_free()
call in hfsplus_free_extents() method.  Add some error code processing.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Alan Cox 5daa669c80 hfsplus: avoid crash on failed block map free
If the read fails we kmap an error code.  This doesn't end well.  Instead
print a critical error and pray.  This mirrors the rest of the fs
behaviour with critical error cases.

Acked-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Kees Cook b66c598401 exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.

Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules.  Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted.  They leave bprm->interp
pointing to their local stack.  This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.

After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules.  As
such, we need to protect the changes to interp.

This changes the logic to require allocation for any changes to the
bprm->interp.  To avoid adding a new kmalloc to every exec, the default
value is left as-is.  Only when passing through binfmt_script or
binfmt_misc does an allocation take place.

For a proof of concept, see DoTest.sh from:

   http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Jeff Layton b729d75d19 vfs: make lremovexattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:11 -05:00
Jeff Layton 12f0621299 vfs: make removexattr retry once on ESTALE
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton bd9bbc9842 vfs: make llistxattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton 10a90cf36e vfs: make listxattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton 3a3e159dbf vfs: make lgetxattr retry once on ESTALE
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton 60e66b48ca vfs: make getxattr retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton 49e09e1cc5 vfs: allow lsetxattr() to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton 68f1bb8bb8 vfs: allow setxattr to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:08 -05:00
Jeff Layton a69201d6f0 vfs: allow utimensat() calls to retry once on an ESTALE error
Clearly, we can't handle the NULL filename case, but we can deal with
the case where there's a real pathname.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:08 -05:00
Jeff Layton 96948fc606 vfs: fix user_statfs to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:07 -05:00
Jeff Layton 99a5df37a0 vfs: make fchownat retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:07 -05:00