Commit Graph

38175 Commits

Author SHA1 Message Date
NeilBrown ef16cc5909 autofs4: d_manage() should return -EISDIR when appropriate in rcu-walk mode.
If rcu-walk mode we don't *have* to return -EISDIR for non-mount-traps
as we will simply drop into REF-walk and handling DCACHE_NEED_AUTOMOUNT
dentrys the slow way.  But it is better if we do when possible.

In 'oz_mode', use the same condition as ref-walk: if not a mountpoint,
then it must be -EISDIR.

In regular mode there are most tests needed.  Most of them can be
performed without taking any spinlocks.  If we find a directory that
isn't obviously empty, and isn't mounted on, we need to call
'simple_empty()' which does take a spinlock.  If this turned out to hurt
performance, some other approach could be found to signal when a
directory is known to be empty.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 4d885f90e3 autofs4: avoid taking fs_lock during rcu-walk
->fs_lock protects AUTOFS_INF_EXPIRING.  We need to be sure that once
the flag is set, no new references beneath the dentry are taken.  So
rcu-walk currently needs to take fs_lock before checking the flag.  This
hurts performance.

Change the expiry to a two-stage process.  First set AUTOFS_INF_NO_RCU
which forces any path walk into ref-walk mode, then drop the lock and
call synchronize_rcu().  Once that returns we can be sure no rcu-walk is
active beneath the dentry and we can check reference counts again.

Now during an RCU-walk we can test AUTOFS_INF_EXPIRING without taking
the lock as along as we test AUTOFS_INF_NO_RCU too.  If either are set,
we must abort the RCU-walk If neither are set, we know that refcounts
will be tested again after we finish the RCU-walk so we are safe to
continue.

->fs_lock is still taken in d_manage() to check for a non-trap
directory.  That will be resolved in the next patch.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 6ece08e618 autofs4: make "autofs4_can_expire" idempotent.
Have a "test" function change the value it is testing can be confusing,
particularly as a future patch will be calling this function twice.

So move the update for 'last_used' to avoid repeat expiry to the place
where the final determination on what to expire is known.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown a5d1dba143 autofs4: factor should_expire() out of autofs4_expire_indirect.
Future patch will potentially call this twice, so make it separate.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 23bfc2a24e autofs4: allow RCU-walk to walk through autofs4
This series teaches autofs about RCU-walk so that we don't drop straight
into REF-walk when we hit an autofs directory, and so that we avoid
spinlocks as much as possible when performing an RCU-walk.

This is needed so that the benefits of the recent NFS support for
RCU-walk are fully available when NFS filesystems are automounted.

Patches have been carefully reviewed and tested both with test suites
and in production - thanks a lot to Ian Kent for his support there.

This patch (of 6):

Any attempt to look up a pathname that passes though an autofs4 mount is
currently forced out of RCU-walk into REF-walk.

This can significantly hurt performance of many-thread work loads on
many-core systems, especially if the automounted filesystem supports
RCU-walk but doesn't get to benefit from it.

So if autofs4_d_manage is called with rcu_walk set, only fail with -ECHILD
if it is necessary to wait longer than a spinlock.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Fabian Frederick 8a273345dc fs/ncpfs/dir.c: remove redundant sys_tz declaration
sys_tz is already declared in include/linux/time.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Arnd Bergmann de8288b1f8 binfmt_misc: work around gcc-4.9 warning
gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
parse_command function:

  fs/binfmt_misc.c: In function 'parse_command.part.3':
  fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]

I've managed to trace this back to the ARM implementation of memset,
which is called from copy_from_user in case of a fault and which does

 #define memset(p,v,n)                                                  \
        ({                                                              \
                void *__p = (p); size_t __n = n;                        \
                if ((__n) != 0) {                                       \
                        if (__builtin_constant_p((v)) && (v) == 0)      \
                                __memzero((__p),(__n));                 \
                        else                                            \
                                memset((__p),(v),(__n));                \
                }                                                       \
                (__p);                                                  \
        })

Apparently gcc gets confused by the check for "size != 0" and believes
that the size might be zero when it gets to the line that does "if
(s[count-1] == '\n')", so it would access data outside of the array.

gcc is clearly wrong here, since this condition was already checked
earlier in the function and the 'size' value can not change in the
meantime.

Fortunately, we can work around it and get rid of the warning by
rearranging the function to check for zero size after doing the
copy_from_user.  It is still safe to pass a zero size into
copy_from_user, so it does not cause any side effects.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Mike Frysinger bbaecc0882 binfmt_misc: expand the register format limit to 1920 bytes
The current code places a 256 byte limit on the registration format.
This ends up being fairly limited when you try to do matching against a
binary format like ELF:

 - the magic & mask formats cannot have any embedded NUL chars
   (string_unescape_inplace halts at the first NUL)
 - each escape sequence quadruples the size: \x00 is needed for NUL
 - trying to match bytes at the start of the file as well as further
   on leads to a lot of \x00 sequences in the mask
 - magic & mask have to be the same length (when decoded)
 - still need bytes for the other fields
 - impossible!

Let's look at a concrete (and common) example: using QEMU to run MIPS
ELFs.  The name field uses 11 bytes "qemu-mipsel".  The interp uses 20
bytes "/usr/bin/qemu-mipsel".  The type & flags takes up 4 bytes.  We
need 7 bytes for the delimiter (usually ":").  We can skip offset.  So
already we're down to 107 bytes to use with the magic/mask instead of
the real limit of 128 (BINPRM_BUF_SIZE).  If people use shell code to
register (which they do the majority of the time), they're down to ~26
possible bytes since the escape sequence must be \x##.

The ELF format looks like (both 32 & 64 bit):

	e_ident: 16 bytes
	e_type: 2 bytes
	e_machine: 2 bytes

Those 20 bytes are enough for most architectures because they have so few
formats in the first place, thus they can be uniquely identified.  That
also means for shell users, since 20 is smaller than 26, they can sanely
register a handler.

But for some targets (like MIPS), we need to poke further.  The ELF fields
continue on:

	e_entry: 4 or 8 bytes
	e_phoff: 4 or 8 bytes
	e_shoff: 4 or 8 bytes
	e_flags: 4 bytes

We only care about e_flags here as that includes the bits to identify
whether the ELF is O32/N32/N64.  But now we have to consume another 16
bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
flags.  If every byte is escaped, we send 288 more bytes to the kernel
((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
{e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
budget.

Even if we try to be clever and do the decoding ourselves (rather than
relying on the kernel to process \x##), we still can't hit the mark --
string_unescape_inplace treats mask & magic as C strings so NUL cannot
be embedded.  That leaves us with having to pass \x00 for the 12/24
entry/phoff/shoff bytes (as those will be completely random addresses),
and that is a minimum requirement of 48/96 bytes for the mask alone.
Add up the rest and we blow through it (this is for 64 bit ELFs):
magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
       4 {e_flags} = 48              # ^^ See note below.
mask: 20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
       4 {e_flags} = 120
Remember above we had 107 left over, and now we're at 168.  This is of
course the *best* case scenario -- you'll also want to have NUL bytes
in the magic & mask too to match literal zeros.

Note: the reason we can use 24 in the magic is that we can work off of the
fact that for bytes the mask would clobber, we can stuff any value into
magic that we want.  So when mask is \x00, we don't need the magic to also
be \x00, it can be an unescaped raw byte like '!'.  This lets us handle
more formats (barely) under the current 256 limit, but that's a pretty
tall hoop to force people to jump through.

With all that said, let's bump the limit from 256 bytes to 1920.  This way
we support escaping every byte of the mask & magic field (which is 1024
bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
fields.  Like long paths to the interpreter (when you have source in your
/really/long/homedir/qemu/foo).  Since the current code stuffs more than
one structure into the same buffer, we leave a bit of space to easily
round up to 2k.  1920 is just as arbitrary as 256 ;).

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:15 +02:00
Linus Torvalds faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Linus Torvalds d6dd50e07c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...
2014-10-13 15:44:12 +02:00
Linus Torvalds 5ff0b9e1a1 xfs: update for 3.18-rc1
This update contains:
 o various cleanups
 o log recovery debug hooks
 o seek hole/data implementation merge
 o extent shift rework to fix collapse range bugs
 o various sparse warning fixes
 o log recovery transaction processing rework to fix use after free bugs
 o metadata buffer IO infrastructuer rework to ensure all buffers under IO have
   valid reference counts
 o various fixes for ondisk flags, writeback and zero range corner cases
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUOxyCAAoJEK3oKUf0dfodzt8QAKcFdE8hyCAnD8IK85v46gWG
 IHnxOlTLrhs/22wfD1fSUcjCBQsQAIloQihvVGStugFnkEUHOUjlZ/oMcGNFPECC
 L7B4Ns6WmA9TA8ibgYvLZepautNjzhS5/lGfqSWpw4hQPsJJp2fGyCVF/ZhwnP6D
 qPeflVic8E8rgaJp98X8uFyZ+9EEoSF7/9EhmvVNwsO6UaThhIO/oPydx8oNrhKS
 k6aADmxNYtFWJb6kUjFbXaJwrFIFLvJc60FZz2eUViVGBx6K8D5FBiVbzZKe2WZ6
 VOz4fj63BYI7Nxk4rZGJPoyql+ChO/pIVwH15ZmYRkkgUXs8FGy85mNKMg7DHnFm
 K/ZUhW5IBc6GtkwPCjNIM642IQYnTR5SdQfFxMS2EYPBUumcQ3EbD44aGZY69YYu
 pP+2g4b1diadNkGACccj6teQ9V0fbyF0lfZqoZMeN/W0As6l9oYa0yFBGsK9sblq
 yrPfce+wEy5HBy9M7Fqpvm3bwMunNViqilGZXKlOyodSgXxSF3JwXuc+8/TNwcnL
 O0RSD7R7k6TvrmAntTgwT4beZi4ziG+/tVa0rD3mJM/sXyzcP2bwbA1APM74NcHh
 p8mrJRci6vtKPwIylQ1xeCeK/WD21OhbJWBYR+0JOEJSnAjtv8flk7mqGLhy+M+Y
 yCdHJIfuJ4NKj4X3f0Kc
 =TdAB
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs update from Dave Chinner:
 "This update contains:
   - various cleanups
   - log recovery debug hooks
   - seek hole/data implementation merge
   - extent shift rework to fix collapse range bugs
   - various sparse warning fixes
   - log recovery transaction processing rework to fix use after free
     bugs
   - metadata buffer IO infrastructuer rework to ensure all buffers
     under IO have valid reference counts
   - various fixes for ondisk flags, writeback and zero range corner
     cases"

* tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (56 commits)
  xfs: fix agno increment in xfs_inumbers() loop
  xfs: xfs_iflush_done checks the wrong log item callback
  xfs: flush the range before zero range conversion
  xfs: restore buffer_head unwritten bit on ioend cancel
  xfs: check for null dquot in xfs_quota_calc_throttle()
  xfs: fix crc field handling in xfs_sb_to/from_disk
  xfs: don't send null bp to xfs_trans_brelse()
  xfs: check for inode size overflow in xfs_new_eof()
  xfs: only set extent size hint when asked
  xfs: project id inheritance is a directory only flag
  xfs: kill time.h
  xfs: compat_xfs_bstat does not have forkoff
  xfs: simplify xfs_zero_remaining_bytes
  xfs: check xfs_buf_read_uncached returns correctly
  xfs: introduce xfs_buf_submit[_wait]
  xfs: kill xfs_bioerror_relse
  xfs: xfs_bioerror can die.
  xfs: kill xfs_bdstrat_cb
  xfs: rework xfs_buf_bio_endio error handling
  xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality
  ...
2014-10-13 12:06:54 +02:00
Linus Torvalds 77c688ac87 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The big thing in this pile is Eric's unmount-on-rmdir series; we
  finally have everything we need for that.  The final piece of prereqs
  is delayed mntput() - now filesystem shutdown always happens on
  shallow stack.

  Other than that, we have several new primitives for iov_iter (Matt
  Wilcox, culled from his XIP-related series) pushing the conversion to
  ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
  cleanups and fixes (including the external name refcounting, which
  gives consistent behaviour of d_move() wrt procfs symlinks for long
  and short names alike) and assorted cleanups and fixes all over the
  place.

  This is just the first pile; there's a lot of stuff from various
  people that ought to go in this window.  Starting with
  unionmount/overlayfs mess...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
  fs/file_table.c: Update alloc_file() comment
  vfs: Deduplicate code shared by xattr system calls operating on paths
  reiserfs: remove pointless forward declaration of struct nameidata
  don't need that forward declaration of struct nameidata in dcache.h anymore
  take dname_external() into fs/dcache.c
  let path_init() failures treated the same way as subsequent link_path_walk()
  fix misuses of f_count() in ppp and netlink
  ncpfs: use list_for_each_entry() for d_subdirs walk
  vfs: move getname() from callers to do_mount()
  gfs2_atomic_open(): skip lookups on hashed dentry
  [infiniband] remove pointless assignments
  gadgetfs: saner API for gadgetfs_create_file()
  f_fs: saner API for ffs_sb_create_file()
  jfs: don't hash direct inode
  [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
  ecryptfs: ->f_op is never NULL
  android: ->f_op is never NULL
  nouveau: __iomem misannotations
  missing annotation in fs/file.c
  fs: namespace: suppress 'may be used uninitialized' warnings
  ...
2014-10-13 11:28:42 +02:00
Dave Chinner 6889e783cd Merge branch 'xfs-misc-fixes-for-3.18-3' into for-next 2014-10-13 10:22:45 +11:00
Eric Sandeen a8b1ee8baf xfs: fix agno increment in xfs_inumbers() loop
caused a regression in xfs_inumbers, which in turn broke
xfsdump, causing incomplete dumps.

The loop in xfs_inumbers() needs to fill the user-supplied
buffers, and iterates via xfs_btree_increment, reading new
ags as needed.

But the first time through the loop, if xfs_btree_increment()
succeeds, we continue, which triggers the ++agno at the bottom
of the loop, and we skip to soon to the next ag - without
the proper setup under next_ag to read the next ag.

Fix this by removing the agno increment from the loop conditional,
and only increment agno if we have actually hit the code under
the next_ag: target.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-13 10:21:53 +11:00
Eric Biggers a457606a6f fs/file_table.c: Update alloc_file() comment
This comment is 5 years outdated; init_file() no longer exists.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:10 -04:00
Eric Biggers 8cc431165d vfs: Deduplicate code shared by xattr system calls operating on paths
The following pairs of system calls dealing with extended attributes only
differ in their behavior on whether the symbolic link is followed (when
the named file is a symbolic link):

- setxattr() and lsetxattr()
- getxattr() and lgetxattr()
- listxattr() and llistxattr()
- removexattr() and lremovexattr()

Despite this, the implementations all had duplicated code, so this commit
redirects each of the above pairs of system calls to a corresponding
function to which different lookup flags (LOOKUP_FOLLOW or 0) are passed.

For me this reduced the stripped size of xattr.o from 8824 to 8248 bytes.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:10 -04:00
Al Viro 50b220bbe7 reiserfs: remove pointless forward declaration of struct nameidata
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:07 -04:00
Al Viro 810bb17267 take dname_external() into fs/dcache.c
never used outside and it's too low-level for legitimate uses outside
of fs/dcache.c anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:05 -04:00
Al Viro 115cbfdc60 let path_init() failures treated the same way as subsequent link_path_walk()
As it is, path_lookupat() and path_mounpoint() might end up leaking struct file
reference in some cases.

Spotted-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:04 -04:00
Linus Torvalds 5e40d331bd Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris.

Mostly ima, selinux, smack and key handling updates.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  integrity: do zero padding of the key id
  KEYS: output last portion of fingerprint in /proc/keys
  KEYS: strip 'id:' from ca_keyid
  KEYS: use swapped SKID for performing partial matching
  KEYS: Restore partial ID matching functionality for asymmetric keys
  X.509: If available, use the raw subjKeyId to form the key description
  KEYS: handle error code encoded in pointer
  selinux: normalize audit log formatting
  selinux: cleanup error reporting in selinux_nlmsg_perm()
  KEYS: Check hex2bin()'s return when generating an asymmetric key ID
  ima: detect violations for mmaped files
  ima: fix race condition on ima_rdwr_violation_check and process_measurement
  ima: added ima_policy_flag variable
  ima: return an error code from ima_add_boot_aggregate()
  ima: provide 'ima_appraise=log' kernel option
  ima: move keyring initialization to ima_init()
  PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
  PKCS#7: Better handling of unsupported crypto
  KEYS: Overhaul key identification when searching for asymmetric keys
  KEYS: Implement binary asymmetric key ID handling
  ...
2014-10-12 10:13:55 -04:00
Linus Torvalds ef4a48c513 File locking related changes for v3.18 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUNZK4AAoJEAAOaEEZVoIVI08P/iM7eaIVRnqaqtWw/JBzxiba
 EMDlJYUBSlv6lYk9s8RJT4bMmcmGAKSYzVAHSoPahzNcqTDdFLeDTLGxJ8uKBbjf
 d1qRRdH1yZHGUzCvJq3mEendjfXn435Y3YburUxjLfmzrzW7EbMvndiQsS5dhAm9
 PEZ+wrKF/zFL7LuXa1YznYrbqOD/GRsJAXGEWc3kNwfS9avephVG/RI3GtpI2PJj
 RY1mf8P7+WOlrShYoEuUo5aqs01MnU70LbqGHzY8/QKH+Cb0SOkCHZPZyClpiA+G
 MMJ+o2XWcif3BZYz+dobwz/FpNZ0Bar102xvm2E8fqByr/T20JFjzooTKsQ+PtCk
 DetQptrU2gtyZDKtInJUQSDPrs4cvA13TW+OEB1tT8rKBnmyEbY3/TxBpBTB9E6j
 eb/V3iuWnywR3iE+yyvx24Qe7Pov6deM31s46+Vj+GQDuWmAUJXemhfzPtZiYpMT
 exMXTyDS3j+W+kKqHblfU5f+Bh1eYGpG2m43wJVMLXKV7NwDf8nVV+Wea962ga+w
 BAM3ia4JRVgRWJBPsnre3lvGT5kKPyfTZsoG+kOfRxiorus2OABoK+SIZBZ+c65V
 Xh8VH5p3qyCUBOynXlHJWFqYWe2wH0LfbPrwe9dQwTwON51WF082EMG5zxTG0Ymf
 J2z9Shz68zu0ok8cuSlo
 =Hhee
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux

Pull file locking related changes from Jeff Layton:
 "This release is a little more busy for file locking changes than the
  last:

   - a set of patches from Kinglong Mee to fix the lockowner handling in
     knfsd
   - a pile of cleanups to the internal file lease API.  This should get
     us a bit closer to allowing for setlease methods that can block.

  There are some dependencies between mine and Bruce's trees this cycle,
  and I based my tree on top of the requisite patches in Bruce's tree"

* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
  locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
  locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
  locks: set fl_owner for leases to filp instead of current->files
  locks: give lm_break a return value
  locks: __break_lease cleanup in preparation of allowing direct removal of leases
  locks: remove i_have_this_lease check from __break_lease
  locks: move freeing of leases outside of i_lock
  locks: move i_lock acquisition into generic_*_lease handlers
  locks: define a lm_setup handler for leases
  locks: plumb a "priv" pointer into the setlease routines
  nfsd: don't keep a pointer to the lease in nfs4_file
  locks: clean up vfs_setlease kerneldoc comments
  locks: generic_delete_lease doesn't need a file_lock at all
  nfsd: fix potential lease memory leak in nfs4_setlease
  locks: close potential race in lease_get_mtime
  security: make security_file_set_fowner, f_setown and __f_setown void return
  locks: consolidate "nolease" routines
  locks: remove lock_may_read and lock_may_write
  lockd: rip out deferred lock handling from testlock codepath
  NFSD: Get reference of lockowner when coping file_lock
  ...
2014-10-11 13:21:34 -04:00
Linus Torvalds 90d0c376f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The largest set of changes here come from Miao Xie.  He's cleaning up
  and improving read recovery/repair for raid, and has a number of
  related fixes.

  I've merged another set of fsync fixes from Filipe, and he's also
  improved the way we handle metadata write errors to make sure we force
  the FS readonly if things go wrong.

  Otherwise we have a collection of fixes and cleanups.  Dave Sterba
  gets a cookie for removing the most lines (thanks Dave)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
  btrfs: Fix compile error when CONFIG_SECURITY is not set.
  Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
  btrfs: Make btrfs handle security mount options internally to avoid losing security label.
  Btrfs: send, don't delay dir move if there's a new parent inode
  btrfs: add more superblock checks
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: be aware of btree inode write errors to avoid fs corruption
  Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
  btrfs: fix shadow warning on cmp
  Btrfs: fix compilation errors under DEBUG
  Btrfs: fix crash of btrfs_release_extent_buffer_page
  Btrfs: add missing end_page_writeback on submit_extent_page failure
  btrfs: Fix the wrong condition judgment about subset extent map
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: cleanup error handling in build_backref_tree
  btrfs: move checks for DUMMY_ROOT into a helper
  btrfs: new define for the inline extent data start
  btrfs: kill extent_buffer_page helper
  btrfs: drop constant param from btrfs_release_extent_buffer_page
  btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
  ...
2014-10-11 08:03:52 -04:00
Linus Torvalds ac0c49396d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
 "A few UDF fixes and also a few patches which are preparing filesystems
  for support of project quotas in VFS"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix loading of special inodes
  ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
  udf: remove redundant sys_tz declaration
  ocfs2: Don't use MAXQUOTAS value
  reiserfs: Don't use MAXQUOTAS value
  ext3: Don't use MAXQUOTAS value
  udf: Fix race between write(2) and close(2)
2014-10-11 08:02:31 -04:00
Linus Torvalds eca9fdf32d Minor code cleanups and a fix for when eCryptfs metadata is stored in xattrs
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJUNtXXAAoJENaSAD2qAscKx1UP/jqt4gm1AYFBpkwBhVRUssIQ
 wHckk8QPasIdEGeKvyCKXl88sUDLSsJwf/mUpl8pBfKm64LokP2fmUBU9Pkf9hVU
 lcuaFNIEmHh8p1IqcfaFbnZOjuuHc9M2ULQLmo5ShoTHNu6JYAP2zRBMfFrEdcMR
 vKh+RCARa5jr1CdwTHX+dH5vJQIQXW/qgRK5G5Z6KeBI766jK2BvZxYHivUgLWuC
 dV6K4RzvHHJYEVoXjCUhrgGepGwHlDoEgx/Y0GK9vbFPG38IrfSlN6fgvzV6mMYE
 ien8FsuHPv5oiBmFM2byRmWpJtWUOViCbMaMmKqY5Ix21E0lUafA7ixH3nSpOHNZ
 b29dxmHnEDomCJXLAF0NQUE84yTw6ITLp2FldUR2o+sidnJsDx/hph/KvmsK6d6P
 sDfEN/DtzPluZoXKY0jrRtoAhi0citNTgKfrujmx6baBstxRp7AfwGcP4skXJq7w
 wkg0Seo449CUaBJK9A4s7nIjMBQ6/3hjvF/NVcry1aY+/RyhJDz6uFrtnOEqW3So
 6khwJOOcRkmLLXrk+DzvthDRCVNJhWM80cB5/UBBfVG2Kk3PHa86Jfp5Y4teWugx
 zyoacEWiitKcK4Po8skZpLd2vUwny/cD0qS43LT+SMvgRsYZ4XdUnceoY8FkOnLr
 ooIFKHCR/+2XVnAaC101
 =mjS7
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs updates from Tyler Hicks:
 "Minor code cleanups and a fix for when eCryptfs metadata is stored in
  xattrs"

* tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: remove unneeded buggy code in ecryptfs_do_create()
  ecryptfs: avoid to access NULL pointer when write metadata in xattr
  ecryptfs: remove unnecessary break after goto
  ecryptfs: Remove unnecessary include of syscall.h in keystore.c
  fs/ecryptfs/messaging.c: remove null test before kfree
  ecryptfs: Drop cast
  Use %pd in eCryptFS
2014-10-11 08:01:27 -04:00
Linus Torvalds 41e46ac0fa This time we have a couple of bug fixes, one relating to bad i_goal values
which are now ignored (i_goal is basically a hint so it is safe to so this)
 and another relating to the saving of the dirent location during rename.
 
 There is one performance improvement, which is an optimisation in rgblk_free
 so that multiple block deallocations will now be more efficient,
 and one clean up patch to use _RET_IP_ rather than writing it out longhand.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJUNQIcAAoJEMrg3m4a/8jSKaQP+wT4SCjKI0TMkVqfyuyDjTAW
 5TxV9h3B5sOxlroJY063WSZUFFWCuooLn4rIK+IT72Jju/AtW1NOb+kx6T8PZ+bT
 5Qj7sReGNJADwWdFNlrE3l+7SecVHO2fxfoI5zKX06YgL8WDptR+nYo/Hn1kwQFL
 4/cukJSkKMLzvrParR/S7RildlG98jQUhS1WtgxUhyyLVC3+b/9HvwAznq6JUX2U
 DLePL2rliCTPZEEvq7tKZGi4uv3ZSbhfSgJHeQeYW565OAiYiJ4fmPFXK9LSyxvv
 +WWoRLGkKDjwHLejkBwjwVbpDOrzVSXuwGrRh0DixL/0xwSAsraBMhiIr6XzXxq8
 3S2uTbl+i+aeuIEwcE/OjDW1Mxp9GovuW0cKdpscVW/Hz6Id5asEtF+YUPBWFi/R
 fFydsn7Gjrjtd2K++n+094N6kqhYo6Vud9es0kigp9D7pq4NSaoKSu74qz2sCWel
 51SgysflmL92lXM2f619LgVL4IBdciECNKaC/JPiUJSc8K0MgUNWE4wreGoeTsuc
 ceQZ8dW1i0w/3MVy5RbjPiEkNpN2ISNs0vDga07dr1Imy/6CJQR+hq9IdLEoku9V
 iBJH46EorVWN4Flt2jeM3uvIc7j+w3kMc5Mxia8D4+aF5eIx/D44ucPb1zPDrILV
 O+LD3XOy+kxecCn1wesC
 =+QNn
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull gfs2 updates from Steven Whitehouse:
 "This time we have a couple of bug fixes, one relating to bad i_goal
  values which are now ignored (i_goal is basically a hint so it is safe
  to so this) and another relating to the saving of the dirent location
  during rename.

  There is one performance improvement, which is an optimisation in
  rgblk_free so that multiple block deallocations will now be more
  efficient, and one clean up patch to use _RET_IP_ rather than writing
  it out longhand"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)
  GFS2: Use gfs2_rbm_incr in rgblk_free
  GFS2: Make rename not save dirent location
  GFS2: fix bad inode i_goal values during block allocation
2014-10-11 08:00:16 -04:00
Linus Torvalds c798360cd1 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "A lot of activities on percpu front.  Notable changes are...

   - percpu allocator now can take @gfp.  If @gfp doesn't contain
     GFP_KERNEL, it tries to allocate from what's already available to
     the allocator and a work item tries to keep the reserve around
     certain level so that these atomic allocations usually succeed.

     This will replace the ad-hoc percpu memory pool used by
     blk-throttle and also be used by the planned blkcg support for
     writeback IOs.

     Please note that I noticed a bug in how @gfp is interpreted while
     preparing this pull request and applied the fix 6ae833c7fe
     ("percpu: fix how @gfp is interpreted by the percpu allocator")
     just now.

   - percpu_ref now uses longs for percpu and global counters instead of
     ints.  It leads to more sparse packing of the percpu counters on
     64bit machines but the overhead should be negligible and this
     allows using percpu_ref for refcnting pages and in-memory objects
     directly.

   - The switching between percpu and single counter modes of a
     percpu_ref is made independent of putting the base ref and a
     percpu_ref can now optionally be initialized in single or killed
     mode.  This allows avoiding percpu shutdown latency for cases where
     the refcounted objects may be synchronously created and destroyed
     in rapid succession with only a fraction of them reaching fully
     operational status (SCSI probing does this when combined with
     blk-mq support).  It's also planned to be used to implement forced
     single mode to detect underflow more timely for debugging.

  There's a separate branch percpu/for-3.18-consistent-ops which cleans
  up the duplicate percpu accessors.  That branch causes a number of
  conflicts with s390 and other trees.  I'll send a separate pull
  request w/ resolutions once other branches are merged"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
  percpu: fix how @gfp is interpreted by the percpu allocator
  blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
  percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
  percpu_ref: add PERCPU_REF_INIT_* flags
  percpu_ref: decouple switching to percpu mode and reinit
  percpu_ref: decouple switching to atomic mode and killing
  percpu_ref: add PCPU_REF_DEAD
  percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
  percpu_ref: replace pcpu_ prefix with percpu_
  percpu_ref: minor code and comment updates
  percpu_ref: relocate percpu_ref_reinit()
  Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
  Revert "percpu: free percpu allocation info for uniprocessor system"
  percpu-refcount: make percpu_ref based on longs instead of ints
  percpu-refcount: improve WARN messages
  percpu: fix locking regression in the failure path of pcpu_alloc()
  percpu-refcount: add @gfp to percpu_ref_init()
  proportions: add @gfp to init functions
  percpu_counter: add @gfp to percpu_counter_init()
  percpu_counter: make percpu_counters_lock irq-safe
  ...
2014-10-10 07:26:02 -04:00
Linus Torvalds b211e9d7c8 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Just a handful of cleanup patches"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: remove redundant variable in cgroup_mount()"
  cgroup: remove redundant variable in cgroup_mount()
  cgroup: fix missing unlock in cgroup_release_agent()
  cgroup: remove CGRP_RELEASABLE flag
  perf/cgroup: Remove perf_put_cgroup()
  cgroup: remove redundant check in cgroup_ino()
  cpuset: simplify proc_cpuset_show()
  cgroup: simplify proc_cgroup_show()
  cgroup: use a per-cgroup work for release agent
  cgroup: remove bogus comments
  cgroup: remove redundant code in cgroup_rmdir()
  cgroup: remove some useless forward declarations
  cgroup: fix a typo in comment.
2014-10-10 07:24:40 -04:00
Sebastien Buisson 86cf78d73d fs/buffer.c: increase the buffer-head per-CPU LRU size
Increase the buffer-head per-CPU LRU size to allow efficient filesystem
operations that access many blocks for each transaction.  For example,
creating a file in a large ext4 directory with quota enabled will access
multiple buffer heads and will overflow the LRU at the default 8-block LRU
size:

* parent directory inode table block (ctime, nlinks for subdirs)
* new inode bitmap
* inode table block
* 2 quota blocks
* directory leaf block (not reused, but pollutes one cache entry)
* 2 levels htree blocks (only one is reused, other pollutes cache)
* 2 levels indirect/index blocks (only one is reused)

The buffer-head per-CPU LRU size is raised to 16, as it shows in metadata
performance benchmarks up to 10% gain for create, 4% for lookup and 7% for
destroy.

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Sebastien Buisson <sebastien.buisson@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:26:02 -04:00
Konstantin Khlebnikov 09316c09dd mm/balloon_compaction: add vmstat counters and kpageflags bit
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.

Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate".  They accumulate balloon
activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
pages.

All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.

Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:26:01 -04:00
Peter Feiner 81d0fa623c mm: softdirty: unmapped addresses between VMAs are clean
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
	uint64_t pme[3];
	int fd = open("/proc/self/pagemap", O_RDONLY);;
	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	munmap(m + getpagesize(), getpagesize());
	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
	return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
	Used the program given above. I'm going to include this code in
	a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Xue jiufei b246d3d11e ocfs2: fix a deadlock while o2net_wq doing direct memory reclaim
Fix a deadlock problem caused by direct memory reclaim in o2net_wq.  The
situation is as follows:

1) Receive a connect message from another node, node queues a
   work_struct o2net_listen_work.

2) o2net_wq processes this work and call the following functions:

o2net_wq
-> o2net_accept_one
  -> sock_create_lite
    -> sock_alloc()
      -> kmem_cache_alloc with GFP_KERNEL
        -> ____cache_alloc_node
          ->__alloc_pages_nodemask
            -> do_try_to_free_pages
              -> shrink_slab
                -> evict
                  -> ocfs2_evict_inode
                    -> ocfs2_drop_lock
                      -> dlmunlock
                        -> o2net_send_message_vec

   then o2net_wq wait for the unlock reply from master.

3) tcp layer received the reply, call o2net_data_ready() and queue
   sc_rx_work, waiting o2net_wq to process this work.

4) o2net_wq is a single thread workqueue, it process the work one by
   one.  Right now it is still doing o2net_listen_work and cannot handle
   sc_rx_work.  so we deadlock.

Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
(http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
clears __GFP_FS in memalloc_noio_flags() besides __GFP_IO.  We use
memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
allocations done by this process are done as if GFP_NOIO was specified.
We are not reentering filesystem while doing memory reclaim.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Oleg Nesterov 498f237178 mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race
9e7814404b "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.

The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.

However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.

Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.

Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:56 -04:00
Akinobu Mita 447f05bb48 block_dev: implement readpages() to optimize sequential read
Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem.  But it is not correct due to the lack of
effective readpages() in the address space operations for block device.

This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

Install 1GB of RAM disk storage:

	# modprobe scsi_debug dev_size_mb=1024 delay=0

Sequential read from file on a filesystem:

	# mkfs.ext4 /dev/$DEV
	# mount /dev/$DEV /mnt
	# fio --name=t --size=512m --rw=read --filename=/mnt/file
	...
	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec

Sequential read from a block device:
	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
	...
(Without this commit)
	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec

(With this commit)
	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Akinobu Mita 4db96b71e3 vfs: guard end of device for mpage interface
Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.

Using mpage_readpages() for block device requires this guard check.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Akinobu Mita 59d43914ed vfs: make guard_bh_eod() more generic
This patchset implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

This patch (of 3):

guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
last sectors of a device, even if the block size is some multiple of the
physical sector size.  This makes guard_bh_eod() more generic and renames
it guard_bio_eod() so that we can use it without struct buffer_head
argument.

The reason for this change is that using mpage_readpages() for block
device requires to add this guard check in mpage code.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Baoquan He bf3e269246 fs/proc/kcore.c: don't add modules range to kcore if it's equal to vmcore range
On some ARCHs modules range is eauql to vmalloc range. E.g on i686

	"#define MODULES_VADDR   VMALLOC_START"
	"#define MODULES_END     VMALLOC_END"

This will cause 2 duplicate program segments in /proc/kcore, and no flag
to indicate they are different.  This is confusing.  And usually people
who need check the elf header or read the content of kcore will check
memory ranges.  Two program segments which are the same are unnecessary.

So check if the modules range is equal to vmalloc range.  If so, just skip
adding the modules range.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 58cb65487e proc/maps: make vm_is_stack() logic namespace-friendly
- Rename vm_is_stack() to task_of_stack() and change it to return
  "struct task_struct *" rather than the global (and thus wrong in
  general) pid_t.

- Add the new pid_of_stack() helper which calls task_of_stack() and
  uses the right namespace to report the correct pid_t.

  Unfortunately we need to define this helper twice, in task_mmu.c
  and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
  and move at least pid_of_stack/task_of_stack there to avoid the
  code duplication.

- Change show_map_vma() and show_numa_map() to use the new helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 2c03376d2d proc/maps: replace proc_maps_private->pid with "struct inode *inode"
m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 47fecca15c fs/proc/task_nommu.c: don't use priv->task->mm
I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.

Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 27692cd56e fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open()
Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.

Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ce34fddb5b fs/proc/task_nommu.c: change maps_open() to use __seq_open_private()
Cleanup and preparation. maps_open() can use __seq_open_private()
like proc_maps_open() does.

[akpm@linux-foundation.org: deuglify]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 557c2d8a73 fs/proc/task_mmu.c: update m->version in the main loop in m_start()
Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov b8c20a9b85 fs/proc/task_mmu.c: reintroduce m->version logic
Add the "last_addr" optimization back. Like before, every ->show()
method checks !seq_overflow() and sets m->version = vma->vm_start.

However, it also checks that m_next_vma(vma) != NULL, otherwise it
sets m->version = -1 for the lockless "EOF" fast-path in m_start().

m_start() can simply do find_vma() + m_next_vma() if last_addr is
not zero, the code looks clear and simple and this case is clearly
separated from "scan vmas" path.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ad2a00e4b7 fs/proc/task_mmu.c: introduce m_next_vma() helper
Extract the tail_vma/vm_next calculation from m_next() into the new
trivial helper, m_next_vma().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 0c255321f8 fs/proc/task_mmu.c: simplify m_start() to make it readable
Now that m->version is gone we can cleanup m_start(). In particular,

  - Remove the "unsigned long" typecast, m->index can't be negative
    or exceed ->map_count. But lets use "unsigned int pos" to make
    it clear that "pos < map_count" is safe.

  - Remove the unnecessary "vma != NULL" check in the main loop. It
    can't be NULL unless we have a vm bug.

  - This also means that "pos < map_count" case can simply return the
    valid vma and avoid "goto" and subsequent checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ebb6cdde1a fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic
m_start() carefully documents, checks, and sets "m->version = -1" if
we are going to return NULL. The only problem is that we will be never
called again if m_start() returns NULL, so this is simply pointless
and misleading.

Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is
just wrong, we want -1 in this case. And in fact we also want -1 if
->vm_next == NULL and ->tail_vma == NULL.

And it is not used consistently, the "scan vmas" loop in m_start()
should update last_addr too.

Finally, imo the whole "last_addr" logic in m_start() looks horrible.
find_vma(last_addr) is called unconditionally even if we are not going
to use the result. But the main problem is that this code participates
in tail_vma-or-NULL mess, and this looks simply unfixable.

Remove this optimization. We will add it back after some cleanups.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 0d5f5f45f9 fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop()
1. There is no reason to reset ->tail_vma in m_start(), if we return
   IS_ERR_OR_NULL() it won't be used.

2. m_start() also clears priv->task to ensure that m_stop() won't use
   the stale pointer if we fail before get_task_struct(). But this is
   ugly and confusing, move this initialization in m_stop().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 23d54837e4 fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next()
1. Kill the first "vma != NULL" check. Firstly this is not possible,
   m_next() won't be called if ->start() or the previous ->next()
   returns NULL.

   And if it was possible the 2nd "vma != tail_vma" check is buggy,
   we should not wrongly return ->tail_vma.

2. Make this function readable. The logic is very simple, we should
   return check "vma != tail" once and return "vm_next || tail_vma".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 59b4bf12d4 fs/proc/task_mmu.c: simplify the vma_stop() logic
m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall
vma. This is because in this case m_stop()->vma_stop() obviously
can't use gate_vma->vm_mm.

Now that we have proc_maps_private->mm we can simplify this logic:

  - Change m_start() to return with ->mmap_sem held unless it returns
    IS_ERR_OR_NULL().

  - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
    this makes "vm_area_struct *vma" unnecessary.

  - This also allows m_start() to use vm_stop().

  - Cleanup m_next() to follow the new locking rule.

    Note: m_stop() looks very ugly, and this temporary uglifies it
    even more. Fixed by the next change.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 29a40ace84 fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
A simple test-case from Kirill Shutemov

	cat /proc/self/maps >/dev/null
	chmod +x /proc/self/net/packet
	exec /proc/self/net/packet

makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
the opposite order.

It's a false positive and probably we should not allow "chmod +x" on proc
files. Still I think that we should avoid mm_access() and cred_guard_mutex
in sys_read() paths, security checking should happen at open time. Besides,
this doesn't even look right if the task changes its ->mm between m_stop()
and m_start().

Add the new "mm_struct *mm" member into struct proc_maps_private and change
proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
otherwise.

The only complication is that proc_maps_open() users should additionally do
mmdrop() in fop->release(), add the new proc_map_release() helper for that.

Note: this is the user-visible change, if the task execs after open("maps")
the new ->mm won't be visible via this file. I hope this is fine, and this
matches /proc/pid/mem bahaviour.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00