Commit Graph

33499 Commits

Author SHA1 Message Date
Stefan Behrens 79556c3d88 btrfs: show compiled-in config features at module load time
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.

(This commit message is a copy of David Sterba's commit message when
he introduced btrfs_print_info()).

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:56 -04:00
Filipe David Borba Manana cef2193729 Btrfs: more efficient inode tree replace operation
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:55 -04:00
Ilya Dryomov 55e50e458e Btrfs: do not add replace target to the alloc_list
If replace was suspended by the umount, replace target device is added
to the fs_devices->alloc_list during a later mount.  This is obviously
wrong.  ->is_tgtdev_for_dev_replace is supposed to guard against that,
but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
*after* everything is opened and fs_devices lists are populated.  Fix
this by checking the devid instead: for replace targets it's always
equal to BTRFS_DEV_REPLACE_DEVID.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:55 -04:00
Josef Bacik 83d4cfd4da Btrfs: fixup error handling in btrfs_reloc_cow
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent.  So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:54 -04:00
Chris Mason 07f0e62e7f Linux 3.11
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSJPkeAAoJEHm+PkMAQRiGVWMH/jo5f01Ra7G4/CYS59K+AlBQ
 /oWL3W81r5MORlsMxwUwGtJ3sZ7UulKwiDrluWeOkz2+/9SmoHoUfkpbByq1bSIV
 y0eqhmjtkHQZz5radJIHeyz1gJIICBIgAM0l45j8SpK4n9EXRcjLSZjdjAkPzxZp
 qZpfxKhVSTu79m96bud7F+HrboHDQEyhD9zqdSi4xPQNnOmTc7K3tvui9AB3rMbV
 ablM3C+LqBYjZx+pKS/rOdfATxZvtU392HU53XTALt6VD1e8alMmhmpe0I9Zxvjv
 scsB6hfRkevfe7VaK3aVoDnQnLKd61yxs+/XdzTtkWPbVGp+kiuFUdDv/5y2r1g=
 =7Xf6
 -----END PGP SIGNATURE-----

Merge tag 'v3.11' into for-linus

Linux 3.11
2013-09-21 10:44:55 -04:00
David Howells 509bf24d18 CacheFiles: Don't try to dump the index key if the cookie has been cleared
Don't try to dump the index key that distinguishes an object if netfs
data in the cookie the object refers to has been cleared (ie.  the
cookie has passed most of the way through
__fscache_relinquish_cookie()).

Since the netfs holds the index key, we can't get at it once the ->def
and ->netfs_data pointers have been cleared - and a NULL pointer
exception will ensue, usually just after a:

	CacheFiles: Error: Unexpected object collision

error is reported.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-20 15:15:43 -07:00
Josh Boyer 607566aecc CacheFiles: Fix memory leak in cachefiles_check_auxdata error paths
In cachefiles_check_auxdata(), we allocate auxbuf but fail to free it if
we determine there's an error or that the data is stale.

Further, assigning the output of vfs_getxattr() to auxbuf->len gives
problems with checking for errors as auxbuf->len is a u16.  We don't
actually need to set auxbuf->len, so keep the length in a variable for
now.  We shouldn't need to check the upper limit of the buffer as an
overflow there should be indicated by -ERANGE.

While we're at it, fscache_check_aux() returns an enum value, not an
int, so assign it to an appropriately typed variable rather than to ret.

Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hongyi Jia <jiayisuse@gmail.com>
cc: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-20 15:15:42 -07:00
Linus Torvalds e9ff04dd94 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "These fix several bugs with RBD from 3.11 that didn't get tested in
  time for the merge window: some error handling, a use-after-free, and
  a sequencing issue when unmapping and image races with a notify
  operation.

  There is also a patch fixing a problem with the new ceph + fscache
  code that just went in"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  fscache: check consistency does not decrement refcount
  rbd: fix error handling from rbd_snap_name()
  rbd: ignore unmapped snapshots that no longer exist
  rbd: fix use-after free of rbd_dev->disk
  rbd: make rbd_obj_notify_ack() synchronous
  rbd: complete notifies before cleaning up osd_client and rbd_dev
  libceph: add function to ensure notifies are complete
2013-09-19 12:50:37 -05:00
Linus Torvalds 3fe03debfc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "atomic_open-related fixes (Miklos' series, with EEXIST-related parts
  replaced with fix in fs/namei.c:atomic_open() instead of messing with
  the instances) + race fix in autofs + leak on failure exit in 9p"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: don't forget to destroy inode cache if fscache registration fails
  atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
  vfs: don't set FILE_CREATED before calling ->atomic_open()
  nfs: set FILE_CREATED
  gfs2: set FILE_CREATED
  cifs: fix filp leak in cifs_atomic_open()
  vfs: improve i_op->atomic_open() documentation
  autofs4: close the races around autofs4_notify_daemon()
2013-09-18 19:22:22 -05:00
Linus Torvalds 9baa505948 Three pstore fixes related to compression:
1) Better adjustment of size of compression buffer (was too big
    for EFIVARS backend resulting in compression failure
 2) Use zlib_inflateInit2 instead of zlib_inflateInit
 3) Don't print messages about compression failure.  They will
    waste space that may better be used to log console output
    leading to the crash.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSOeAIAAoJEKurIx+X31iBq8wP/1MthA3CDTVFl2beFNXEo8G/
 Sq3YAfTHj61f+UKT2489WSyYwc6Q3y4iEia+shCu28DkuQZMifH8KoDfsoJAKF1X
 SVsm5MkelhXEDlmt94AnEXmNIgQMnJ1c5uToTanNz/UbpUZdsdVzP+c4ifUC1mX3
 m+uARA2oy7obVm0RihXEzRhMZAOdkq0TXxL4TVaZShjDPuxN5BSQGlNB13+6LAEM
 Q54HI/j9RHVFiIxT7INttyOMvDps2zDNJtsVgiphp0bBQBWzY1puJJykM/T64ZJV
 /UMsycoKLJdLi3pnwWtZ1USTk4EwkjjVWCtUHtan6wEt1rDbrkWaMU1RvTASBz9Z
 418EUAob0FZuL0ZdaN4WgYc04xwgc748S/PcUtkFfvk8KqhQbmkgbdVu6cs/mJmQ
 Jbi+ATJda1zmCEQXZBLENfe7o4yiGgKjOWWy5/tbtMi8a6cpMIPUn9phNXNoRvBb
 II0iMKwZetuOkDDqJAtZwPUiYNdRHWLosn+66AjpYARXqrCnRfi87x4WMWYJ4CVR
 RMxrn6YQT3DIDxnBd00zVepdK9ee8It10t7k07f6Ve/EdvOJZK9lSg/FUp9MhL5a
 N6S9X2gQ0R2wDHjFNRyL8p0xIoe45zFXPICLYaqcDxEcC0G7bd1AxGZ5y9v+/qvK
 76dJvg0f1E/TsoqhQw79
 =E5IH
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore/compression fixes from Tony Luck:
 "Three pstore fixes related to compression:
   1) Better adjustment of size of compression buffer (was too big for
      EFIVARS backend resulting in compression failure
   2) Use zlib_inflateInit2 instead of zlib_inflateInit
   3) Don't print messages about compression failure.  They will waste
      space that may better be used to log console output leading to the
      crash"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore: Remove the messages related to compression failure
  pstore: Use zlib_inflateInit2 instead of zlib_inflateInit
  pstore: Adjust buffer size for compression for smaller registered buffers
2013-09-18 12:39:40 -05:00
Al Viro 8061a6fa56 9p: don't forget to destroy inode cache if fscache registration fails
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-17 22:31:01 -04:00
Al Viro 03da633aa7 atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-17 17:08:50 -04:00
Bjorn Helgaas adbe6991ef bio-integrity: Fix use of bs->bio_integrity_pool after free
This fixes a copy and paste error introduced by 9f060e2231
("block: Convert integrity to bvec_alloc_bs()").

Found by Coverity (CID 1020654).

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-09-17 12:46:24 -06:00
Miklos Szeredi 116cc02253 vfs: don't set FILE_CREATED before calling ->atomic_open()
If O_CREAT|O_EXCL are passed to open, then we know that either

 - the file is successfully created, or
 - the operation fails in some way.

So previously we set FILE_CREATED before calling ->atomic_open() so the
filesystem doesn't have to.  This, however, led to bugs in the
implementation that went unnoticed when the filesystem didn't check for
existence, yet returned success.  To prevent this kind of bug, require
filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
verify this in the VFS.

Also added a couple more verifications for the result of atomic_open():

 - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
 - Warn if filesystem set FILE_CREATED but gave a negative dentry.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi 01c919abaf nfs: set FILE_CREATED
Set FILE_CREATED on O_CREAT|O_EXCL.  If the NFS server honored our request
for exclusivity then this must be correct.

Currently this is a no-op, since the VFS sets FILE_CREATED anyway.  The
next patch will, however, require this flag to be always set by
filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi c5bf8fef52 gfs2: set FILE_CREATED
In gfs2_create_inode() set FILE_CREATED in *opened.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi dfb1d61b0e cifs: fix filp leak in cifs_atomic_open()
If an error occurs after having called finish_open() then fput() needs to
be called on the already opened file.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steve French <sfrench@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Miklos Szeredi 0854d450e2 vfs: improve i_op->atomic_open() documentation
Fix documentation of ->atomic_open() and related functions: finish_open()
and finish_no_open().  Also add details that seem to be unclear and a
source of bugs (some of which are fixed in the following series).

Cc-ing maintainers of all filesystems implementing ->atomic_open().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Steve French <sfrench@samba.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Al Viro 606035e76e autofs4: close the races around autofs4_notify_daemon()
Don't drop ->wq_mutex before calling autofs4_notify_daemon() only to regain it
there.  Besides being pointless, that opens a race window where autofs4_wait_release()
could've come and freed wq->name.name.  And do the debugging printk in the "reused an
existing wq" case before dropping ->wq_mutex - the same reason...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Ian Kent <raven@themaw.net>
2013-09-16 19:16:38 -04:00
Linus Torvalds 3369d11693 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Two minor cifs fixes and a minor documentation cleanup for cifs.txt"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update cifs.txt and remove some outdated infos
  cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache
  cifs: Do not take a reference to the page in cifs_readpage_worker()
2013-09-16 15:39:21 -04:00
Linus Torvalds 098e7f1665 Just one patch which fixes the power-cut recovery testing mode.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSNrA6AAoJECmIfjd9wqK0r60P/ijFSSZxYEr5/ChOVt1Jjs/q
 cx0FcOO3r4RnXJXEQ9yNNlHWDZ+ZWYrSalaaKAAeh0WGvmCkHEyUbrAuL3Y76GEw
 O37eM9Qlbpb23iQ+gTtapIhdBjABGwo556UebzUsSkJZef+B7aCdgxNjOYAYitF6
 mcG3dndj91XUuhNd+93R8ovVHFjXwndruCYp+UsAajSHYGs3ThocWXXVRF/Rv0mG
 GDeJD4MGuNOGG5t6WjeOYlVE5WuDHJBUYRoUqhnzHfEx7hQ60m26H6Oir8ncXj7/
 3IIrfkF9pbIFiQ1jBRmcGFzzaY2UTqXaDoZN5MUc1w/1DH9PGkfeF7OfpREvDIJY
 rvbT/lX/iHUbQ7lQ+CBZqc3orJT0t1nJy/mhtRy3rb2xFf2gRaFwMwuLPFgeBarm
 hbUpZu3VQpi0Anx7pTavbYn5ZCoobBHvnzuOGg/2EjOFhW0baTnXzmXgHGoJAW+v
 ZxcLEMsTFERr3T6pqxu6v9CNL3DVkO2jvKNR/0I30cE4XDjcd81tXvOAfw0pVp3x
 bEhWLJSG2UFybQ2/PLgvuTriZ4wuJ2Mw5KCGmfp3i0IM9J7/1e9tMNvUOickcnz2
 qkSFuL8Ee47QmTV95tdRwM2T679MXmDoPY6QulIl2bSMnshfMEKbL83wNCpVzXee
 wwV0z4EbGlNtbR254LVF
 =0WcB
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs

Pull ubifs fix from Artem Bityutskiy:
 "Just one patch which fixes the power-cut recovery testing mode.

  I'll start using a single UBI/UBIFS tree instead of 2 trees from now
  on.  So in the future you'll get 1 small pull request instead of 2
  tiny ones"

* tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs:
  UBIFS: remove invalid warn msg with tst_recovery enabled
2013-09-16 15:36:55 -04:00
Aruna Balakrishnaiah 802e4c6f58 pstore: Remove the messages related to compression failure
Remove the messages indicating compression failure as it will
add to the space during panic path.

Reported-by: Seiji Aguchi <seiji.aguchi@hds.com>
Tested-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-09-16 09:28:29 -07:00
Aruna Balakrishnaiah b61edf8e7c pstore: Use zlib_inflateInit2 instead of zlib_inflateInit
Since zlib_deflateInit2() is used for specifying window bit during compression,
zlib_inflateInit2() is appropriate for decompression.

Reported-by: Seiji Aguchi <seiji.aguchi@hds.com>
Tested-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-09-16 09:28:29 -07:00
Aruna Balakrishnaiah 7de8fe2fa8 pstore: Adjust buffer size for compression for smaller registered buffers
When backends (ex: efivars) have smaller registered buffers, the
big_oops_buf is too big for them as number of repeated occurences
in the text captured will be less. What happens is that pstore takes
too big a bite from the dmesg log and then finds it cannot compress it
enough to meet the backend block size. Patch takes care of adjusting
the buffer size based on the registered buffer size. cmpr values have
been arrived after doing experiments with plain text for buffers of
size 1k - 4k (Smaller the buffer size repeated occurence will be less)
and with sample crash log for buffers ranging from 4k - 10k.

Reported-by: Seiji Aguchi <seiji.aguchi@hds.com>
Tested-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-09-16 09:28:28 -07:00
Linus Torvalds 05a8252bde vfs: fix typo in comment in recent dentry work
Sedat points out that I transposed some letters in "LRU" and wrote "RLU"
instead in one of the new comments explaining the flow.  Let's just fix
it.

Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-15 07:11:01 -04:00
Linus Torvalds 3711d86a2d a trivial writeback fix
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMwgMAAoJECvKgwp+S8JaiJUP/RGA98MkWnl5eio9mG5eEbF/
 DC6bP5UOzPo+6oZbwH4LTc4EB04q728SSOU1nG6q1yfuSF0I1Kzt/Um6aS3P5wdk
 okyYW1SjieE0xpmfQpvMEX6TZ7L/FpYjAg47GI0TaJMUdKRmJK0fkZ22hfv6uJzr
 PMVmdJKKgxs85usrn4JyNY93xpKZgncJVuwpfFF1k9oSNIXHAk7OxT7JWj51UdqP
 k/L/HXNhT3MRVvsjyqURHMIXfqRvqcgn47LAkM/IYVdgaFkpLPvwp8RZr/CcKr7U
 KqJsQqqegRyoQ73yqgWXGAGLLXujKllsfKLu/d0vtqY2J4z6lHKTcRGpAGCDyH+3
 bLe4hk+/d+Tz0xBSPaHryy/4yiQ4O+h9rLZCwGdxMX1duoqvThL9S8fLoUkrNBai
 OU7cd4iWPlCmiquATjk0bgthCcKw3wlg+rsiSzUcaO3JbdwTp8P45Mie0ZtZ5jpa
 UcczrT6osOAAswoEPMMeySQ+BVLewSPwmYKaETniYXB5Bb/IHkliX1MkXnA1D9bI
 DNijgB2g2561BVhdkDHf2q8D4Cbrq6UhK7plATB90DB7bwNaAxmtRVJ3zDaQGKOM
 VWBbloNf5QcodshEttj9ZLko7JNF/DjNOcNomb5ZtzY+EGzMksUHBUMPld3yOcna
 LTNApshhbx92MemJ02FC
 =FB22
 -----END PGP SIGNATURE-----

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback fix from Wu Fengguang:
 "A trivial writeback fix"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Do not sort b_io list only because of block device inode
2013-09-13 23:06:40 -04:00
Linus Torvalds 89dc77bcda vfs: fix dentry LRU list handling and nr_dentry_unused accounting
The LRU list changes interacted badly with our nr_dentry_unused
accounting, and even worse with the new DCACHE_LRU_LIST bit logic.

This introduces helper functions to make sure everything follows the
proper dcache d_lru list rules: the dentry cache is complicated by the
fact that some of the hotpaths don't even want to look at the LRU list
at all, and the fact that we use the same list entry in the dentry for
both the LRU list and for our temporary shrinking lists when removing
things from the LRU.

The helper functions temporarily have some extra sanity checking for the
flag bits that have to match the current LRU state of the dentry.  We'll
remove that before the final 3.12 release, but considering how easy it
is to get wrong, this first cleanup version has some very particular
sanity checking.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13 22:55:10 -04:00
Sachin Prabhu 466bd31bbd cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache
When reading a single page with cifs_readpage(), we make a call to
fscache_read_or_alloc_page() which once done, asynchronously calls
the completion function cifs_readpage_from_fscache_complete(). This
completion function unlocks the page once it has been populated from
cache. The module then attempts to unlock the page a second time in
cifs_readpage() which leads to warning messages.

In case of a successful call to fscache_read_or_alloc_page() we should skip
the second unlock_page() since this will be called by the
cifs_readpage_from_fscache_complete() once the page has been populated by
fscache.

With the modifications to cifs_readpage_worker(), we will need to re-grab the
page lock in cifs_write_begin().

The problem was first noticed when testing new fscache patches for cifs.
https://bugzilla.redhat.com/show_bug.cgi?id=1005737

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13 16:24:49 -05:00
Sachin Prabhu a9e9b7bc15 cifs: Do not take a reference to the page in cifs_readpage_worker()
We do not need to take a reference to the pagecache in
cifs_readpage_worker() since the calling function will have already
taken one before passing the pointer to the page as an argument to the
function.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13 16:24:43 -05:00
Linus Torvalds 9bf12df31f Merge git://git.kvack.org/~bcrl/aio-next
Pull aio changes from Ben LaHaise:
 "First off, sorry for this pull request being late in the merge window.
  Al had raised a couple of concerns about 2 items in the series below.
  I addressed the first issue (the race introduced by Gu's use of
  mm_populate()), but he has not provided any further details on how he
  wants to rework the anon_inode.c changes (which were sent out months
  ago but have yet to be commented on).

  The bulk of the changes have been sitting in the -next tree for a few
  months, with all the issues raised being addressed"

* git://git.kvack.org/~bcrl/aio-next: (22 commits)
  aio: rcu_read_lock protection for new rcu_dereference calls
  aio: fix race in ring buffer page lookup introduced by page migration support
  aio: fix rcu sparse warnings introduced by ioctx table lookup patch
  aio: remove unnecessary debugging from aio_free_ring()
  aio: table lookup: verify ctx pointer
  staging/lustre: kiocb->ki_left is removed
  aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
  aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
  aio: convert the ioctx list to table lookup v3
  aio: double aio_max_nr in calculations
  aio: Kill ki_dtor
  aio: Kill ki_users
  aio: Kill unneeded kiocb members
  aio: Kill aio_rw_vect_retry()
  aio: Don't use ctx->tail unnecessarily
  aio: io_cancel() no longer returns the io_event
  aio: percpu ioctx refcount
  aio: percpu reqs_available
  aio: reqs_active -> reqs_available
  aio: fix build when migration is disabled
  ...
2013-09-13 10:55:58 -07:00
Linus Torvalds e0ea4045bc xfs: update #2 for v3.12-rc1
Here we have defrag support for v5 superblock, a number of bugfixes and
 a cleanup or two.
 
 - defrag support for CRC filesystems
 - fix endian worning in xlog_recover_get_buf_lsn
 - fixes for sparse warnings
 - fix for assert in xfs_dir3_leaf_hdr_from_disk
 - fix for log recovery of remote symlinks
 - fix for log recovery of btree root splits
 - fixes formemory allocation failures with ACLs
 - fix for assert in xfs_buf_item_relse
 - fix for assert in xfs_inode_buf_verify
 - fix an assignment in an assert that should be a test in
   xfs_bmbt_change_owner
 - remove dead code in xlog_recover_inode_pass2
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMjQUAAoJENaLyazVq6ZOu2IP/1OHZYy+Bkmj0tO9pdsdEa4s
 w4FEBPsQePMJPjwdN693rKpW1exZue5sUmPMErH3ENzc2DPAwpUAlc9XAIohtdFx
 rTqrz2q+qTfZTq8oYBIA/RCOifJ2cHWN8tDYZPJpp5wceV7CRGYQeR1foiudE3ZH
 QDIPXioy8P9IkfGaXCtr/iWf9kycMO2lgNTNfdL6qtwX99HCqHZanTlsWx1BIYGQ
 Fa5TaOsXis6idPMCFMuEC15iEwA+YXc0HmXuHkMFLj+9mwFc4h/Aq65bwUkYZLmy
 +T1Wo/uQ/21rl6im/rWqgCh6fFS8NJQp8NIJeCIyihUEHbarfPyJIJRJjoP457YO
 cv8OkixCkt4zX6CkTxaL5ZFEBW9FYbRb13Gg96J6hb4WfdAFMtQg7FAjThSU/+Qr
 HwjaAso3GXimEaZD1C3c0TtZEQ0x9E6pENVI7/ewB1I0p92p7GJBMq4C7CTAYThV
 5zhdcOnViSrJTJvVQxm+gfOYzubkWWiVmbVku3RCO6//kvPBOvJ9juSYsl0mKeRu
 v2DZZB3AYJE/qnbYfZBlktX9obE6k+keKF6w8Eiufr2IqwJaqfaM4h9eogzAwTJA
 vyXKeLxUEmgHuqivFSZjw3sEK6sY654GCMMTP+2IpD19vlAIioYXdgp0ZbkkdiE3
 6twrzdFZAr1zy80xlM8W
 =2Uq6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull xfs update #2 from Ben Myers:
 "Here we have defrag support for v5 superblock, a number of bugfixes
  and a cleanup or two.

   - defrag support for CRC filesystems
   - fix endian worning in xlog_recover_get_buf_lsn
   - fixes for sparse warnings
   - fix for assert in xfs_dir3_leaf_hdr_from_disk
   - fix for log recovery of remote symlinks
   - fix for log recovery of btree root splits
   - fixes formemory allocation failures with ACLs
   - fix for assert in xfs_buf_item_relse
   - fix for assert in xfs_inode_buf_verify
   - fix an assignment in an assert that should be a test in
     xfs_bmbt_change_owner
   - remove dead code in xlog_recover_inode_pass2"

* tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: remove dead code from xlog_recover_inode_pass2
  xfs: = vs == typo in ASSERT()
  xfs: don't assert fail on bad inode numbers
  xfs: aborted buf items can be in the AIL.
  xfs: factor all the kmalloc-or-vmalloc fallback allocations
  xfs: fix memory allocation failures with ACLs
  xfs: ensure we copy buffer type in da btree root splits
  xfs: set remote symlink buffer type for recovery
  xfs: recovery of swap extents operations for CRC filesystems
  xfs: swap extents operations for CRC filesystems
  xfs: check magic numbers in dir3 leaf verifier first
  xfs: fix some minor sparse warnings
  xfs: fix endian warning in xlog_recover_get_buf_lsn()
2013-09-12 16:13:41 -07:00
Linus Torvalds ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Kirill A. Shutemov 3cd14fcd3f thp: account anon transparent huge pages into NR_ANON_PAGES
We use NR_ANON_PAGES as base for reporting AnonPages to user.  There's
not much sense in not accounting transparent huge pages there, but add
them on printing to user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Ning Qu <quning@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:03 -07:00
Kirill A. Shutemov 7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Linus Torvalds 26935fb06e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 4 from Al Viro:
 "list_lru pile, mostly"

This came out of Andrew's pile, Al ended up doing the merge work so that
Andrew didn't have to.

Additionally, a few fixes.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
  super: fix for destroy lrus
  list_lru: dynamically adjust node arrays
  shrinker: Kill old ->shrink API.
  shrinker: convert remaining shrinkers to count/scan API
  staging/lustre/libcfs: cleanup linux-mem.h
  staging/lustre/ptlrpc: convert to new shrinker API
  staging/lustre/obdclass: convert lu_object shrinker to count/scan API
  staging/lustre/ldlm: convert to shrinkers to count/scan API
  hugepage: convert huge zero page shrinker to new shrinker API
  i915: bail out earlier when shrinker cannot acquire mutex
  drivers: convert shrinkers to new count/scan API
  fs: convert fs shrinkers to new scan/count API
  xfs: fix dquot isolation hang
  xfs-convert-dquot-cache-lru-to-list_lru-fix
  xfs: convert dquot cache lru to list_lru
  xfs: rework buffer dispose list tracking
  xfs-convert-buftarg-lru-to-generic-code-fix
  xfs: convert buftarg LRU to generic code
  fs: convert inode and dentry shrinking to be node aware
  vmscan: per-node deferred work
  ...
2013-09-12 15:01:38 -07:00
Linus Torvalds 1d7b24ff33 NFS client bugfixes:
- Fix a few credential reference leaks resulting from the SP4_MACH_CRED
   NFSv4.1 state protection code.
 - Fix the SUNRPC bloatometer footprint: convert a 256K hashtable into the
   intended 64 byte structure.
 - Fix a long standing XDR issue with FREE_STATEID
 - Fix a potential WARN_ON spamming issue
 - Fix a missing dprintk() kuid conversion
 
 New features:
 - Enable the NFSv4.1 state protection support for the WRITE and COMMIT
   operations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMiO+AAoJEGcL54qWCgDyuwEQALNAMpcRhASpqrRSuX94aKn3
 ATENr87ov2FCXcTP/OBjdlcryyjp+0e5JBW5T0nHn90Uylz4p/87eOILlqIq4ax2
 4QldKAuHdk5gLwiX5ebWpDtlwjTwyth1PRD7iPHT8lvIlO0IT7S/VDaa/04J37PL
 Lw1zaTD0cpdRkdTnA12RDJ5oTW0YwmSBb5qJQROjinwa/ALuIZJpoBNCV01lIP2k
 VaW0Yd8A+hqtawmxnf3G14r50Ds269AZ5K4hcRjQMEWeetlwfXFSTSjx8dzgsQkx
 4VF6wiCSwsKEdrp8csRv+fsHiGRjNfzdSTrQxcJa+ssP6qX0KWHYPdw2jgbozX+2
 kUQw2bFgxug+zdNjp+z1daJzw4QAfkjfNBWzt4w7a+8VOnR+/fydJzmka4mlJUKB
 IDy8l/KrSCjCHi9VYal27+IQs/bcLAIvASUF14cZ/+ZY9MUsWhYXVPHNLhwTPds2
 jFvawh77V6MHg/wA2+D7yHbHmOOmZaH2/Af9v3HKsVhhoLwqr5LO9qfAq63KSxzW
 udzmjlSEhlOiJKDMZo9HigjKhU+Ndujr7RqsP6WFjTPa4yn6499cbTy7izze6MPB
 JZDlmkInnZAtLDOuHAwxSNuNfBD6Yrzk1PV8Gv2xMEdp41bxgAg//K3WXx2vSGWa
 4TQMHjaegAkdHyTK0rJD
 =IdGo
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes (part 2) from Trond Myklebust:
 "Bugfixes:
   - Fix a few credential reference leaks resulting from the
     SP4_MACH_CRED NFSv4.1 state protection code.
   - Fix the SUNRPC bloatometer footprint: convert a 256K hashtable into
     the intended 64 byte structure.
   - Fix a long standing XDR issue with FREE_STATEID
   - Fix a potential WARN_ON spamming issue
   - Fix a missing dprintk() kuid conversion

  New features:
   - Enable the NFSv4.1 state protection support for the WRITE and
     COMMIT operations"

* tag 'nfs-for-3.12-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: No, I did not intend to create a 256KiB hashtable
  sunrpc: Add missing kuids conversion for printing
  NFSv4.1: sp4_mach_cred: WARN_ON -> WARN_ON_ONCE
  NFSv4.1: sp4_mach_cred: no need to ref count creds
  NFSv4.1: fix SECINFO* use of put_rpccred
  NFSv4.1: sp4_mach_cred: ask for WRITE and COMMIT
  NFSv4.1 fix decode_free_stateid
2013-09-12 13:39:34 -07:00
Linus Torvalds 68f0d9d92e vfs: make d_path() get the root path under RCU
This avoids the spinlocks and refcounts in the d_path() sequence too
(used by /proc and various other entities).  See commit 8b19e34188 for
the equivalent getcwd() system call path.

And unlike getcwd(), d_path() doesn't copy the result to user space, so
I don't need to fear _that_ particular bug happening again.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 13:24:55 -07:00
Linus Torvalds 3272c544da vfs: use __getname/__putname for getcwd() system call
It's a pathname.  It should use the pathname allocators and
deallocators, and PATH_MAX instead of PAGE_SIZE.  Never mind that the
two are commonly the same.

With this, the allocations scale up nicely too, and I can do getcwd()
system calls at a rate of about 300M/s, with no lock contention
anywhere.

Of course, nobody sane does that, especially since getcwd() is
traditionally a very slow operation in Unix.  But this was also the
simplest way to benchmark the prepend_path() improvements by Waiman, and
once I saw the profiles I couldn't leave it well enough alone.

But apart from being an performance improvement (from using per-cpu slab
allocators instead of the raw page allocator), it's actually a valid and
real cleanup.

Signed-off-by: Linus "OCD" Torvalds <torvalds@linux-foundation.org>
2013-09-12 12:40:15 -07:00
Linus Torvalds ff812d7242 vfs: don't copy things to user space holding the rcu readlock
Oops.  That wasn't very smart.  We don't actually need the RCU lock any
more by the time we copy the cwd string to user space, but I had
stupidly surrounded the whole thing with it.

Introduced by commit 8b19e34188 ("vfs: make getcwd() get the root and
pwd path under rcu")

Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 11:57:01 -07:00
Linus Torvalds 8b19e34188 vfs: make getcwd() get the root and pwd path under rcu
This allows us to skip all the crazy spinlocks and reference count
updates, and instead use the fs sequence read-lock to get an atomic
snapshot of the root and cwd information.

We might want to make the rule that "prepend_path()" is always called
with the RCU lock held, but the RCU lock nests fine and this is the
minimal fix.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 10:35:47 -07:00
Linus Torvalds 5762482f54 vfs: move get_fs_root_and_pwd() to single caller
Let's not pollute the include files with inline functions that are only
used in a single place.  Especially not if we decide we might want to
change the semantics of said function to make it more efficient..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 10:12:47 -07:00
Linus Torvalds b7c09ad401 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This is against 3.11-rc7, but was pulled and tested against your tree
  as of yesterday.  We do have two small incrementals queued up, but I
  wanted to get this bunch out the door before I hop on an airplane.

  This is a fairly large batch of fixes, performance improvements, and
  cleanups from the usual Btrfs suspects.

  We've included Stefan Behren's work to index subvolume UUIDs, which is
  targeted at speeding up send/receive with many subvolumes or snapshots
  in place.  It closes a long standing performance issue that was built
  in to the disk format.

  Mark Fasheh's offline dedup work is also here.  In this case offline
  means the FS is mounted and active, but the dedup work is not done
  inline during file IO.  This is a building block where utilities are
  able to ask the FS to dedup a series of extents.  The kernel takes
  care of verifying the data involved really is the same.  Today this
  involves reading both extents, but we'll continue to evolve the
  patches"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
  Btrfs: optimize key searches in btrfs_search_slot
  Btrfs: don't use an async starter for most of our workers
  Btrfs: only update disk_i_size as we remove extents
  Btrfs: fix deadlock in uuid scan kthread
  Btrfs: stop refusing the relocation of chunk 0
  Btrfs: fix memory leak of uuid_root in free_fs_info
  btrfs: reuse kbasename helper
  btrfs: return btrfs error code for dev excl ops err
  Btrfs: allow partial ordered extent completion
  Btrfs: convert all bug_ons in free-space-cache.c
  Btrfs: add support for asserts
  Btrfs: adjust the fs_devices->missing count on unmount
  Btrf: cleanup: don't check for root_refs == 0 twice
  Btrfs: fix for patch "cleanup: don't check the same thing twice"
  Btrfs: get rid of one BUG() in write_all_supers()
  Btrfs: allocate prelim_ref with a slab allocater
  Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
  Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
  Btrfs: fix race between removing a dev and writing sbs
  Btrfs: remove ourselves from the cluster list under lock
  ...
2013-09-12 09:58:51 -07:00
Waiman Long 1812997720 dcache: get/release read lock in read_seqbegin_or_lock() & friend
This patch modifies read_seqbegin_or_lock() and need_seqretry() to use
newly introduced read_seqlock_excl() and read_sequnlock_excl()
primitives so that they won't change the sequence number even if they
fall back to take the lock.  This is OK as no change to the protected
data structure is being made.

It will prevent one fallback to lock taking from cascading into a series
of lock taking reducing performance because of the sequence number
change.  It will also allow other sequence readers to go forward while
an exclusive reader lock is taken.

This patch also updates some of the inaccurate comments in the code.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
To: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 09:25:23 -07:00
Mark Tinguely 08474ed639 xfs: remove dead code from xlog_recover_inode_pass2
Additional code in the error handler of xlog_recover_inode_pass2()
results in the following error:

static checker warning: "fs/xfs/xfs_log_recover.c:2999
xlog_recover_inode_pass2()
	 info: ignoring unreachable code."

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-12 09:51:49 -05:00
Dan Carpenter aa9e10409e xfs: = vs == typo in ASSERT()
There is a '=' vs '==' typo so the ASSERT()s are always true.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-12 09:42:08 -05:00
Linus Torvalds 7b7a2f0a31 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "CIFS update including case insensitive file name matching improvements
  for UTF-8 to Unicode, various small cifs fixes, SMB2/SMB3 leasing
  improvements, support for following SMB2 symlinks, SMB3 packet signing
  improvements"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (25 commits)
  CIFS: Respect epoch value from create lease context v2
  CIFS: Add create lease v2 context for SMB3
  CIFS: Move parsing lease buffer to ops struct
  CIFS: Move creating lease buffer to ops struct
  CIFS: Store lease state itself rather than a mapped oplock value
  CIFS: Replace clientCanCache* bools with an integer
  [CIFS] quiet sparse compile warning
  cifs: Start using per session key for smb2/3 for signature generation
  cifs: Add a variable specific to NTLMSSP for key exchange.
  cifs: Process post session setup code in respective dialect functions.
  CIFS: convert to use le32_add_cpu()
  CIFS: Fix missing lease break
  CIFS: Fix a memory leak when a lease break comes
  cifs: add winucase_convert.pl to Documentation/ directory
  cifs: convert case-insensitive dentry ops to use new case conversion routines
  cifs: add new case-insensitive conversion routines that are based on wchar_t's
  [CIFS] Add Scott to list of cifs contributors
  cifs: Move and expand MAX_SERVER_SIZE definition
  cifs: Expand max share name length to 256
  cifs: Move string length definitions to uapi
  ...
2013-09-12 07:41:12 -07:00
Linus Torvalds 1ae276a911 Two small fixes to the code that initializes the per-file crypto
contexts.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCgAGBQJSMPiLAAoJENaSAD2qAscKbTUP/iYjgdQGDodEYTVg9ofUaJ8O
 ltzlIbVweglEW+Z7tr83vM1R29ta95WQK2PpK4SxV6+Jh6nJz9p9WLCvrugUXKOB
 GOd+WA/8i8lHGdydtnC8Cd3vhHG76oLwR+iv8HzI6TIdMWJMV5bNGT1D6GqADXca
 4dl8pD8QTrh4jldjmYyiT8dFR4wfAvfcTvKKemMFY68LXpntVgt580hd7893LOUJ
 7elAG0l1sygOWbgroJf1Rqm2OnRP9brET1+TgKAcJrv9AciidVkMCB72srkX82Bz
 eBGipzFaYT+3lDrK5iM+9l8NnQeYOFIp4JuId1wv28DH06/ExTWqfOZiq5VCq1gb
 /6spqQGj7mRp7oGk1yIvtTr7TxlbGqmUeP3wbClSmG+nsjAyC7ZsqVzgyJtgxd45
 ox06Rf7jufkxbztYOQBa6qWerbvW3zS2Not9Usp5oBWlTLBV1xEVRdEX6QXii1nL
 z4CQTWgapx0AvuIWAsJbQMVLiHMEGA8luapo9GihODBdaHtX4lnQ3L2GURvjyy3I
 0agE37ITpEDAFE4YzR5XquPvqHqlHFHb2PoE+7a96YXXFlR+ZkklAwQd4cbiomCT
 czFNLcWTmmKbW/i8IS/5wgOfQuNVfDjFXKw1ynCKcuB6mCK+ugImqTG8dT793ldB
 QVkmgx/s//v2/WbvxzNW
 =yb4P
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Two small fixes to the code that initializes the per-file crypto
  contexts"

* tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: avoid ctx initialization race
  ecryptfs: remove check for if an array is NULL
2013-09-11 19:17:04 -07:00
Linus Torvalds c2d95729e3 Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 - Some pidns/fork/exec tweaks
 - OCFS2 updates
 - Most of MM - there remain quite a few memcg parts which depend on
   pending core cgroups changes.  Which might have been already merged -
   I'll check tomorrow...
 - Various misc stuff all over the place
 - A few block bits which I never got around to sending to Jens -
   relatively minor things.
 - MAINTAINERS maintenance
 - A small number of lib/ updates
 - checkpatch updates
 - epoll
 - firmware/dmi-scan
 - Some kprobes work for S390
 - drivers/rtc updates
 - hfsplus feature work
 - vmcore feature work
 - rbtree upgrades
 - AOE updates
 - pktcdvd cleanups
 - PPS
 - memstick
 - w1
 - New "inittmpfs" feature, which does the obvious
 - More IPC work from Davidlohr.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (303 commits)
  lz4: fix compression/decompression signedness mismatch
  ipc: drop ipc_lock_check
  ipc, shm: drop shm_lock_check
  ipc: drop ipc_lock_by_ptr
  ipc, shm: guard against non-existant vma in shmdt(2)
  ipc: document general ipc locking scheme
  ipc,msg: drop msg_unlock
  ipc: rename ids->rw_mutex
  ipc,shm: shorten critical region for shmat
  ipc,shm: cleanup do_shmat pasta
  ipc,shm: shorten critical region for shmctl
  ipc,shm: make shmctl_nolock lockless
  ipc,shm: introduce shmctl_nolock
  ipc: drop ipcctl_pre_down
  ipc,shm: shorten critical region in shmctl_down
  ipc,shm: introduce lockless functions to obtain the ipc object
  initmpfs: use initramfs if rootfstype= or root= specified
  initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled
  initmpfs: move rootfs code from fs/ramfs/ to init/
  initmpfs: move bdi setup from init_rootfs to init_ramfs
  ...
2013-09-11 16:08:54 -07:00
Rob Landley 57f150a58c initmpfs: move rootfs code from fs/ramfs/ to init/
When the rootfs code was a wrapper around ramfs, having them in the same
file made sense.  Now that it can wrap another filesystem type, move it in
with the init code instead.

This also allows a subsequent patch to access rootfstype= command line
arg.

Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:37 -07:00
Rob Landley 4bbee76bc9 initmpfs: move bdi setup from init_rootfs to init_ramfs
Even though ramfs hasn't got a backing device, commit e0bf68ddec ("mm:
bdi init hooks") added one anyway, and put the initialization in
init_rootfs() since that's the first user, leaving it out of init_ramfs()
to avoid duplication.

But initmpfs uses init_tmpfs() instead, so move the init into the
filesystem's init function, add a "once" guard to prevent duplicate
initialization, and call the filesystem init from rootfs init.

This goes part of the way to allowing ramfs to be built as a module.

[akpm@linux-foundation.org; using bit 1 was odd]
Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:37 -07:00
Rob Landley 137fdcc18a initmpfs: replace MS_NOUSER in initramfs
Mounting MS_NOUSER prevents --bind mounts from rootfs.  Prevent new rootfs
mounts with a different mechanism that doesn't affect bind mounts.

Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:36 -07:00
Jan Kara 5e4c0d9741 lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:

radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];

And we give out one radix tree node twice.  That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.

We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt.  Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.

Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes.  However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed.  To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:36 -07:00
Dan Carpenter 6325932666 affs: use loff_t in affs_truncate()
It seems pretty unlikely that AFFS supports files over 4GB but we may as
well leave use loff_t just for cleanness sake instead of truncating it to
32 bits.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:29 -07:00
Michael Holzheu 11e376a3f9 vmcore: enable /proc/vmcore mmap for s390
The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" allows
now to use mmap also on s390.

So enable mmap for s390 again.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:14 -07:00
Michael Holzheu 9cb218131d vmcore: introduce remap_oldmem_pfn_range()
For zfcpdump we can't map the HSA storage because it is only available via
a read interface.  Therefore, for the new vmcore mmap feature we have
introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped.  For zfcpdump this is everything
besides of the HSA memory.  For the areas that are not mapped by
remap_oldmem_pfn_range() a generic vmcore a new generic vmcore fault
handler mmap_vmcore_fault() is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:10 -07:00
Michael Holzheu be8a8d069e vmcore: introduce ELF header in new memory feature
For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
(zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
a hypervisor owned memory area (HSA) before the kdump kernel is booted.
When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

The advantages of this mechanism are:

 * No crashkernel memory has to be defined in the old kernel.
 * Early boot problems (before kexec_load has been done) can be dumped
 * Non-Linux systems can be dumped.

We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.

Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.

So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.

The following steps are done during zfcpdump execution:

1.  Production system crashes
2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4.  Boot loader loads kernel into low memory area
5.  Kernel boots and uses only HSA_SIZE bytes of memory
6.  Kernel saves registers of non-boot CPUs
7.  Kernel does memory detection for dump memory map
8.  Kernel creates ELF header for /proc/vmcore
9.  /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
    - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
    - copy_oldmem_page() copies from real memory for memory above HSA_SIZE

Currently for s390 we create the ELF core header in the 2nd kernel with a
small trick.  We relocate the addresses in the ELF header in a way that
for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
and the read_from_oldmem() returns the correct data.  This allows the
/proc/vmcore code to use the ELF header in the 2nd kernel.

This patch:

Exchange the old mechanism with the new and much cleaner function call
override feature that now offcially allows to create the ELF core header
in the 2nd kernel.

To use the new feature the following function have to be defined
by the architecture backend code to read from new memory:

 * elfcorehdr_alloc: Allocate ELF header
 * elfcorehdr_free: Free the memory of the ELF header
 * elfcorehdr_read: Read from ELF header
 * elfcorehdr_read_notes: Read from ELF notes

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:10 -07:00
Oleg Nesterov 6b3c538f5b exec: cleanup the error handling in search_binary_handler()
The error hanling and ret-from-loop look confusing and inconsistent.

- "retval >= 0" simply returns

- "!bprm->file" returns too but with read_unlock() because
   binfmt_lock was already re-acquired

- "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
  relies on the same check after the main loop

Consolidate these checks into a single if/return statement.

need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
the main loop are not needed.  This is only for pathological and
impossible list_empty(&formats) case.

It is not clear why do we check "bprm->mm == NULL", probably this
should be removed.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:09 -07:00
Oleg Nesterov 4e0621a07e exec: don't retry if request_module() fails
A separate one-liner for better documentation.

It doesn't make sense to retry if request_module() fails to exec
/sbin/modprobe, add the additional "request_module() < 0" check.

However, this logic still doesn't look exactly right:

1. It would be better to check "request_module() != 0", the user
   space modprobe process should report the correct exit code.
   But I didn't dare to add the user-visible change.

2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
   to exec a "#!path-to-unsupported-binary" script. In this case
   request_module() + "retry" will be done twice: first by the
   "depth == 1" code, and then again by the "depth == 0" caller
   which doesn't make sense.

3. And note that in the case above bprm->buf was already changed
   by load_script()->prepare_binprm(), so this looks even more
   ugly.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:07 -07:00
Oleg Nesterov cb7b6b1cbc exec: cleanup the CONFIG_MODULES logic
search_binary_handler() uses "for (try=0; try<2; try++)" to avoid "goto"
but the code looks too complicated and horrible imho.  We still need to
check "try == 0" before request_module() and add the additional "break"
for !CONFIG_MODULES case.

Kill this loop and use a simple "bool need_retry" + "goto retry".  The
code looks much simpler and we do not even need ifdef's, gcc can optimize
out the "if (need_retry)" block if !IS_ENABLED().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:05 -07:00
Oleg Nesterov 92eaa565ad exec: kill ->load_binary != NULL check in search_binary_handler()
search_binary_handler() checks ->load_binary != NULL for no reason, this
method should be always defined.  Turn this check into WARN_ON() and move
it into __register_binfmt().

Also, kill the function pointer.  The current code looks confusing, as if
->load_binary can go away after read_unlock(&binfmt_lock).  But we rely on
module_get(fmt->module), this fmt can't be changed or unregistered,
otherwise this code is buggy anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:05 -07:00
Oleg Nesterov 52f14282bb exec: move allow_write_access/fput to exec_binprm()
When search_binary_handler() succeeds it does allow_write_access() and
fput(), then it clears bprm->file to ensure the caller will not do the
same.

We can simply move this code to exec_binprm() which is called only once.
In fact we could move this to free_bprm() and remove the same code in
do_execve_common's error path.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:05 -07:00
Oleg Nesterov 9beb266f2d exec: proc_exec_connector() should be called only once
A separate one-liner with the minor fix.

PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
least twice if search_binary_handler() is called by ->load_binary()
recursively, say, load_script().

Move it to exec_binprm(), this is "depth == 0" code too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:05 -07:00
Oleg Nesterov 131b2f9f12 exec: kill "int depth" in search_binary_handler()
Nobody except search_binary_handler() should touch ->recursion_depth, "int
depth" buys nothing but complicates the code, kill it.

Probably we should also kill "fn" and the !NULL check, ->load_binary
should be always defined.  And it can not go away after read_unlock() or
this code is buggy anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:04 -07:00
Oleg Nesterov 5d1baf3b63 exec: introduce exec_binprm() for "depth == 0" code
task_pid_nr_ns() and trace/ptrace code in the middle of the recursive
search_binary_handler() looks confusing and imho annoying.  We only need
this code if "depth == 0", lets add a simple helper which calls
search_binary_handler() and does trace_sched_process_exec() +
ptrace_event().

The patch also moves the setting of task->did_exec, we need to do this
only once.

Note: we can kill either task->did_exec or PF_FORKNOEXEC.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Zach Levis <zml@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:03 -07:00
Oleg Nesterov 96d0df79f2 proc: make proc_fd_permission() thread-friendly
proc_fd_permission() says "process can still access /proc/self/fd after it
has executed a setuid()", but the "task_pid() = proc_pid() check only
helps if the task is group leader, /proc/self points to
/proc/<leader-pid>.

Change this check to use task_tgid() so that the whole thread group can
access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd.

Notes:
	- CLONE_THREAD does not require CLONE_FILES so task->files
	  can differ, but I don't think this can lead to any security
	  problem. And this matches same_thread_group() in
	  __ptrace_may_access().

	- /proc/self should probably point to /proc/<thread-tid>, but
	  it is too late to change the rules. Perhaps it makes sense
	  to add /proc/thread though.

Test-case:

	void *tfunc(void *arg)
	{
		assert(opendir("/proc/self/fd"));
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		pthread_create(&t, NULL, tfunc, NULL);
		pthread_join(t, NULL);
		return 0;
	}

fails if, say, this executable is not readable and suid_dumpable = 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:03 -07:00
Chen Gang a3c039929d fs/proc/task_mmu.c: check the return value of mpol_to_str()
mpol_to_str() may fail, and not fill the buffer (e.g. -EINVAL), so need
check about it, or buffer may not be zero based, and next seq_printf()
will cause issue.

The failure return need after mpol_cond_put() to match get_vma_policy().

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:03 -07:00
Andrew Morton be49b30a98 fs/file_table.c:fput(): make comment more truthful
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:01 -07:00
Stéphane Graber 65aafb1e74 coredump: add new %P variable in core_pattern
Add a new %P variable to be used in core_pattern.  This variable contains
the global PID (PID in the init namespace) as %p contains the PID in the
current namespace which isn't always what we want.

The main use for this is to make it easier to handle crashes that happened
within a container.  With that new variables it's possible to have the
crashes dumped into the container or forwarded to the host with the right
PID (from the host's point of view).

Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
Reported-by: Hans Feldt <hans.feldt@ericsson.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andy Whitcroft <apw@canonical.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:01 -07:00
Vyacheslav Dubeyko b4c1107cc9 hfsplus: integrate POSIX ACLs support into driver
Integrate implemented POSIX ACLs support into hfsplus driver.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:01 -07:00
Vyacheslav Dubeyko eef80d4ad1 hfsplus: implement POSIX ACLs support
Implement POSIX ACLs support in hfsplus driver.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:01 -07:00
Vyacheslav Dubeyko 2c92057e45 hfsplus: add necessary declarations for POSIX ACLs support
This patchset implements POSIX ACLs support in hfsplus driver.

Mac OS X beginning with version 10.4 ("Tiger") support NFSv4 ACLs, which
are part of the NFSv4 standard.  HFS+ stores ACLs in the form of
specially named extended attributes (com.apple.system.Security).

But this patchset doesn't use "com.apple.system.Security" extended
attributes.  It implements support of POSIX ACLs in the form of extended
attributes with names "system.posix_acl_access" and
"system.posix_acl_default".  These xattrs are treated only under Linux.
POSIX ACLs doesn't mean something under Mac OS X.  Thereby, this patch
set provides opportunity to use POSIX ACLs under Linux on HFS+
filesystem.

This patch:

Add CONFIG_HFSPLUS_FS_POSIX_ACL kernel configuration option, DBG_ACL_MOD
debugging flag and acl.h file with declaration of essential functions
for support POSIX ACLs in hfsplus driver.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:59:00 -07:00
Eric Dumazet 91cf5ab60f epoll: add a reschedule point in ep_free()
ep_free() might iterate on a huge set of epitems and hold cpu too long.
Add two cond_resched() in order to yield cpu to other tasks.  This is safe
as we only hold mutexes in this function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:50 -07:00
Gu Zheng bc5c8f0783 fs/bio-integrity: fix a potential mem leak
Free the bio_integrity_pool in the fail path of biovec_create_pool in
function bioset_integrity_create().

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:21 -07:00
Junxiao Bi 146d7009b4 writeback: fix race that cause writeback hung
There is a race between mark inode dirty and writeback thread, see the
following scenario.  In this case, writeback thread will not run though
there is dirty_io.

__mark_inode_dirty()                                          bdi_writeback_workfn()
	...                                                       	...
	spin_lock(&inode->i_lock);
	...
	if (bdi_cap_writeback_dirty(bdi)) {
	    <<< assume wb has dirty_io, so wakeup_bdi is false.
	    <<< the following inode_dirty also have wakeup_bdi false.
	    if (!wb_has_dirty_io(&bdi->wb))
		    wakeup_bdi = true;
	}
	spin_unlock(&inode->i_lock);
	                                                            <<< assume last dirty_io is removed here.
	                                                            pages_written = wb_do_writeback(wb);
	                                                            ...
	                                                            <<< work_list empty and wb has no dirty_io,
	                                                            <<< delayed_work will not be queued.
	                                                            if (!list_empty(&bdi->work_list) ||
	                                                                (wb_has_dirty_io(wb) && dirty_writeback_interval))
	                                                                queue_delayed_work(bdi_wq, &wb->dwork,
	                                                                    msecs_to_jiffies(dirty_writeback_interval * 10));
	spin_lock(&bdi->wb.list_lock);
	inode->dirtied_when = jiffies;
	<<< new dirty_io is added.
	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
	spin_unlock(&bdi->wb.list_lock);

	<<< though there is dirty_io, but wakeup_bdi is false,
	<<< so writeback thread will not be waked up and
	<<< the new dirty_io will not be flushed.
	if (wakeup_bdi)
	    bdi_wakeup_thread_delayed(bdi);

Writeback will run until there is a new flush work queued.  This may cause
a lot of dirty pages stay in memory for a long time.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:13 -07:00
Maxim Patlasov 5a53748568 mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always check bdi
counters against bdi limits.  I.e.  even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks.  The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region.  An
hour later I observed almost all RAM consumed by fuse writeback.  Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users.  This is write-through
page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation.  Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.

And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature.  Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain.  Let's make simple
computations.  Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.

After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule.  May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:04 -07:00
Wanpeng Li 7d9f073b8d mm/writeback: make writeback_inodes_wb static
It's not used globally and could be static.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:02 -07:00
Cyrill Gorcunov d9104d1ca9 mm: track vma changes with VM_SOFTDIRTY bit
Pavel reported that in case if vma area get unmapped and then mapped (or
expanded) in-place, the soft dirty tracker won't be able to recognize this
situation since it works on pte level and ptes are get zapped on unmap,
loosing soft dirty bit of course.

So to resolve this situation we need to track actions on vma level, there
VM_SOFTDIRTY flag comes in.  When new vma area created (or old expanded)
we set this bit, and keep it here until application calls for clearing
soft dirty bit.

Thus when user space application track memory changes now it can detect if
vma area is renewed.

Reported-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:56 -07:00
Jan Kara 47df3ddedd writeback: fix occasional slow sync(1)
In case when system contains no dirty pages, wakeup_flusher_threads() will
submit WB_SYNC_NONE writeback for 0 pages so wb_writeback() exits
immediately without doing anything, even though there are dirty inodes in
the system.  Thus sync(1) will write all the dirty inodes from a
WB_SYNC_ALL writeback pass which is slow.

Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
instead of calculating number of dirty pages manually.  That function also
takes number of dirty inodes into account.

Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Paul Taysom <taysom@chromium.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:55 -07:00
Jie Liu 28e8be3180 ocfs2: fix the end cluster offset of FIEMAP
Call fiemap ioctl(2) with given start offset as well as an desired mapping
range should show extents if possible.  However, we somehow figure out the
end offset of mapping via 'mapping_end -= cpos' before iterating the
extent records which would cause problems if the given fiemap length is
too small to a cluster size, e.g,

Cluster size 4096:
debugfs.ocfs2 1.6.3
        Block Size Bits: 12   Cluster Size Bits: 12

The extended fiemap test utility From David:
https://gist.github.com/anonymous/6172331

# dd if=/dev/urandom of=/ocfs2/test_file bs=1M count=1000
# ./fiemap /ocfs2/test_file 4096 10
start: 4096, length: 10
File /ocfs2/test_file has 0 extents:
#	Logical          Physical         Length           Flags
	^^^^^ <-- No extent is shown

In this case, at ocfs2_fiemap(): cpos == mapping_end == 1. Hence the
loop of searching extent records was not executed at all.

This patch remove the in question 'mapping_end -= cpos', and loops
until the cpos is larger than the mapping_end as usual.

# ./fiemap /ocfs2/test_file 4096 10
start: 4096, length: 10
File /ocfs2/test_file has 1 extents:
#	Logical          Physical         Length           Flags
0:	0000000000000000 0000000056a01000 0000000006a00000 0000

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: David Weber <wb@munzinger.de>
Tested-by: David Weber <wb@munzinger.de>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fashen <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:53 -07:00
Joseph Qi a72e27d372 ocfs2: remove unused variable ip in dlmfs_get_root_inode()
Variable ip in dlmfs_get_root_inode() is defined but not used.  So clean
it up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:52 -07:00
Joyce 6f8648e894 ocfs2: fix a tiny race case when firing callbacks
In o2hb_shutdown_slot() and o2hb_check_slot(), since event is defined as
local, it is only valid during the call stack.  So the following tiny race
case may happen in a multi-volumes mounted environment:

o2hb-vol1                         o2hb-vol2
1) o2hb_shutdown_slot
allocate local event1
2) queue_node_event
add event1 to global o2hb_node_events
                                  3) o2hb_shutdown_slot
                                  allocate local event2
                                  4) queue_node_event
                                  add event2 to global o2hb_node_events
                                  5) o2hb_run_event_list
                                  delete event1 from o2hb_node_events
6) o2hb_run_event_list
event1 empty, return
7) o2hb_shutdown_slot
event1 lifecycle ends
                                  8) o2hb_fire_callbacks
                                  event1 is already *invalid*

This patch lets it wait on o2hb_callback_sem when another thread is firing
callbacks.  And for performance consideration, we only call
o2hb_run_event_list when there is an event queued.

Signed-off-by: Joyce <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:51 -07:00
Joseph Qi 03dbe88aa9 ocfs2: avoid possible NULL pointer dereference in o2net_accept_one()
Since o2nm_get_node_by_num() may return NULL, we add this check in
o2net_accept_one() to avoid possible NULL pointer dereference.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:50 -07:00
Joseph Qi 9a239e4c68 ocfs2: adjust code style for o2net_handler_tree_lookup()
Code in o2net_handler_tree_lookup() may be corrupted by mistake.  So
adjust it to promote readability.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:50 -07:00
Younger Liu 7aebff18b9 ocfs2: free path in ocfs2_remove_inode_range()
In ocfs2_remove_inode_range(), there is a memory leak.  The variable path
has allocated memory with ocfs2_new_path_from_et(), but it is not free.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:50 -07:00
Joseph Qi 6cae6d3189 ocfs2: fix possible double free in ocfs2_reflink_xattr_rec
In ocfs2_reflink_xattr_rec(), meta_ac and data_ac are allocated by calling
ocfs2_lock_reflink_xattr_rec_allocators().

Once an error occurs when allocating *data_ac, it frees *meta_ac which is
allocated before.  Here it mistakenly sets meta_ac to NULL but *meta_ac.
Then ocfs2_reflink_xattr_rec() will try to free meta_ac again which is
already invalid.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:49 -07:00
Xue jiufei 69b2bd16d9 ocfs2/dlm: force clean refmap when doing local cleanup
dlm_do_local_recovery_cleanup() should force clean refmap if the owner of
lockres is UNKNOWN.  Otherwise node may hang when umounting filesystems.
Here's the situation:

	Node1                                    Node2
dlmlock()
  -> dlm_get_lock_resource()
send DLM_MASTER_REQUEST_MSG to
other nodes.

                                       trying to master this lockres,
                                       return MAYBE.

selected as the master of lockresA,
set mle->master to Node1,
and do assert_master,
send DLM_ASSERT_MASTER_MSG to Node2.
                                       Node 2 has interest on lockresA
                                       and return
                                       DLM_ASSERT_RESPONSE_MASTERY_REF
                                       then something happened and
                                       Node2 crashed.

Receiving DLM_ASSERT_RESPONSE_MASTERY_REF, set Node2 into refmap, and keep
sending DLM_ASSERT_MASTER_MSG to other nodes

o2hb found node2 down, calling dlm_hb_node_down() -->
dlm_do_local_recovery_cleanup() the master of lockresA is still UNKNOWN,
no need to call dlm_free_dead_locks().

Set the master of lockresA to Node1, but Node2 stills remains in refmap.

When Node1 umount, it found that the refmap of lockresA is not empty and
attempted to migrate it to Node2, But Node2 is already down, so umount
hang, trying to migrate lockresA again and again.

Signed-off-by: joyce <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:49 -07:00
Younger Liu 6ea437a363 ocfs2: free meta_ac and data_ac when ocfs2_start_trans fails in ocfs2_xattr_set()
In ocfs2_xattr_set(), if ocfs2_start_trans failed, meta_ac and data_ac
should be free.  Otherwise, It would lead to a memory leak.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:47 -07:00
Joseph Qi 17caf9555e ocfs2: add the missing return value check of ocfs2_xattr_get_clusters
In ocfs2_xattr_value_attach_refcount(), if error occurs when calling
ocfs2_xattr_get_clusters(), it will go with unexpected behavior since
local variables p_cluster, num_clusters and ext_flags are declared without
initialization.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:45 -07:00
Jie Liu 4704aa30fc ocfs2: fix a memory leak in __ocfs2_move_extents()
The ocfs2 path is not properly freed which leads to a memory leak at
__ocfs2_move_extents().

This patch stops the leaks of the ocfs2_path structure.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Younger Liu <younger.liu@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:44 -07:00
Joseph Qi 2b0f6eae2d ocfs2: add missing return value check of ocfs2_get_clusters()
In ocfs2_attach_refcount_tree() and ocfs2_duplicate_extent_list(), if
error occurs when calling ocfs2_get_clusters(), it will go with
unexpected behavior as local variables p_cluster, num_clusters and
ext_flags are declared without initialization.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:44 -07:00
Joseph Qi 3d94ea51c1 ocfs2: clean up dead code in ocfs2_acl_from_xattr()
In ocfs2_acl_from_xattr(), if size is less than sizeof(struct
posix_acl_entry), it returns ERR_PTR(-EINVAL) directly.  Then assign (size
/ sizeof(struct posix_acl_entry)) to count which will be at least 1, that
means the following branch (count < 0) and (count == 0) will never be
true.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:39 -07:00
Dong Fang df53cd3b70 ocfs2: use list_for_each_entry() instead of list_for_each()
[dan.carpenter@oracle.com: fix up some NULL dereference bugs]
Signed-off-by: Dong Fang <yp.fangdong@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:36 -07:00
Sunil Mushran 8dd7903e48 fs/ocfs2/cluster/tcp.c: fix possible null pointer dereferences
Fix some possible null pointer dereferences that were detected by the
static code analyser, smatch.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:34 -07:00
Younger Liu 7e9b793707 ocfs2: ac_bits_wanted should be local_alloc_bits when returns -ENOSPC
There is an issue in reserving and claiming space for localalloc, When
localalloc space is not enough, it would claim space from global_bitmap.
And if there is not enough free space in global_bitmap, the size of
claiming space would set to half of orignal size and retry.

The issue is as follows: osb->local_alloc_bits is set to half of orignal
size in ocfs2_recalc_la_window(), but ac->ac_bits_wanted is set to
osb->local_alloc_default_bits which is not changed.  localalloc always
reserves and claims local_alloc_default_bits space and returns ENOSPC.

So, ac->ac_bits_wanted should be osb->local_alloc_bits which would be
changed.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:31 -07:00
Xue jiufei 98ac9125c5 ocfs2: dlm_request_all_locks() should deal with the status sent from target node
dlm_request_all_locks() should deal with the status sent from target node
if DLM_LOCK_REQUEST_MSG is sent successfully, or recovery master will fall
into endless loop, waiting for other nodes to send locks and
DLM_RECO_DATA_DONE_MSG to me.

        NodeA                                  NodeB
                                     selected as recovery master
                                     dlm_remaster_locks()
                                     ->dlm_request_all_locks()
                                     send DLM_LOCK_REQUEST_MSG to nodeA

It happened that NodeA cannot alloc memory when it processes this
message.  dlm_request_all_locks_handler() do not queue
dlm_request_all_locks_worker and returns -ENOMEM.  It will never send
locks and DLM_RECO_DATA_DONE_MSG to NodeB.

                                    NodeB do not deal with the status
                                    sent from nodeA, and will fall in
                                    endless loop waiting for the
                                    recovery state of NodeA to be
                                    changed.

Signed-off-by: joyce <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:31 -07:00
Junxiao Bi f17c20dd2e ocfs2: use i_size_read() to access i_size
Though ocfs2 uses inode->i_mutex to protect i_size, there are both
i_size_read/write() and direct accesses.  Clean up all direct access to
eliminate confusion.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:30 -07:00
Younger Liu 2b1e55c389 ocfs2: lighten up allocate transaction
The issue scenario is as following:

When fallocating a very large disk space for a small file,
__ocfs2_extend_allocation attempts to get a very large transaction.  For
some journal sizes, there may be not enough room for this transaction,
and the fallocate will fail.

The patch below extends & restarts the transaction as necessary while
allocating space, and should work with even the smallest journal.  This
patch refers ext4 resize.

Test:
# mkfs.ocfs2 -b 4K -C 32K -T datafiles /dev/sdc
...(jounral size is 32M)
# mount.ocfs2 /dev/sdc /mnt/ocfs2/
# touch /mnt/ocfs2/1.log
# fallocate -o 0 -l 400G /mnt/ocfs2/1.log
fallocate: /mnt/ocfs2/1.log: fallocate failed: Cannot allocate memory
# tail -f /var/log/messages
[ 7372.278591] JBD: fallocate wants too many credits (2051 > 2048)
[ 7372.278597] (fallocate,6438,0):__ocfs2_extend_allocation:709 ERROR: status = -12
[ 7372.278603] (fallocate,6438,0):ocfs2_allocate_unwritten_extents:1504 ERROR: status = -12
[ 7372.278607] (fallocate,6438,0):__ocfs2_change_file_space:1955 ERROR: status = -12
^C
With this patch, the test works well.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:56:28 -07:00
Linus Torvalds 2b76db6a0f for-linus-3.12-merge minor 9p fixes and tweaks for 3.12 merge window
The first fixes namespace issues which causes a kernel
 NULL pointer dereference, the second fixes uevent
 handling to work better with udev, and the third
 switches some code to use srlcpy instead of strncpy
 in order to be safer.
 
 All changes have been baking in for-next for at least
 2 weeks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJSMJjZAAoJEDZk62b0Tg6x81sQAKa60QStBKhnL65bvG+ooIsS
 mhwfmFyaWOKw1ezwY2Vk0+JnmKDBpKmqjjwyL3nLP18TcRZStPiFdcJBKWl+czge
 FTv14t54CcjysYPbYN7+gUap4F5mfg0mcHaR0UGow505dNyjwd7mqkZhy1IqhdvP
 Ue/h0RE46GeNtdirxrKBdEfW/7TAL0tcoRgjKu0ev1V2sXCJZywuXgkzWjByRXwT
 JOg04gGnYThuek0/KUPRhf0KxB0CyKrZiics7LGb40HkYYxs7ahADACttLyiDr8l
 GntfHXLgvVlU5QcSbKRfLp0zNbi7AxWmJrwYsEwpas4tUw1Q+pVJ2EE2Ameuq5G+
 LrMGmRVQCVYw8UN+OYUO7glhXEJcCPJj6vxgm+NVXx24yaQyGI1aTsIEjHwZ/hkm
 wlQHC47z6/fIypkXpsU6pYWF/r3GwXHokYReejATQWEPIzIxvHeThe0jjqMLth7F
 zmsHZTpmECqtti1fizy5wBZD25wAIxdf+rf8nKy1VvcSN4s08ESSlC/kV/siNeko
 efFnL8xbjP5SPEVoBtXM6eTDHrQ0S+ACSGWtp0FGXKOW4PKzS60ve2Stp+FYZgQc
 WgXI7+NBU6Z9z+cZ9bsY0hrGwK1YZiR4F3KJ5ofTuxAO6n7zd+N3fGBuQJ2tiW9P
 pKtIXNozWqnAU9Wx4rGa
 =YbFT
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p updates from Eric Van Hensbergen:
 "Minor 9p fixes and tweaks for 3.12 merge window

  The first fixes namespace issues which causes a kernel NULL pointer
  dereference, the second fixes uevent handling to work better with
  udev, and the third switches some code to use srlcpy instead of
  strncpy in order to be safer.

  All changes have been baking in for-next for at least 2 weeks"

* tag 'for-linus-3.12-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: avoid accessing utsname after namespace has been torn down
  9p: send uevent after adding/removing mount_tag attribute
  fs: 9p: use strlcpy instead of strncpy
2013-09-11 12:34:13 -07:00
Linus Torvalds 53bf710832 A couple of minor additional sanity check patches for corrupted information,
and some fixes.  Apart from that there's a minor loop optimisation.
 
 These sanity checks mainly exist to trap maliciously corrupted
 filesystems either through using a deliberately modified mksquashfs,
 or where the user has deliberately chosen to generate uncompressed
 metadata and then corrupted it.
 
 Normally metadata in Squashfs filesystems is compressed, which means
 corruption (either accidental or malicious) is detected when
 trying to decompress the metadata.  So corrupted data does not normally
 get as far as the code paths in question here.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSMJLmAAoJEJAch/D1fbHUXSUP/1kCIu6EgpFHtrgjiaOV2RZh
 WiQ+Hxnn6F6f6Zvt3GKwCrj0nmIiGxhPeOrg8EXt26UYlFQi2NBzqHU/kcrHwz/H
 re2BObRb50FdUb/MfqtIiyXPmjwb0AlY+XuaG8+xnc6snoZ9076T6qSEeoqvehU5
 cVqgTsFoGNpU+4ymtXUe2bgGsSZN+o0zPdqiM49b9zfHCgbe66fLrI7NTTiygdg5
 vE9mAq0WGxSv6x7SvQyl2+PIZyBpZkYth7kJ9tBk6iEXFIhCy4EGHff/U08WaT9E
 sLMKpy65E2rSxr9L5TXXquKinyWFhlx6yhBYzkJ1TKM8FmGSclmixWzQVhI+5S1B
 1IJR4W/Nq8cfiBaGTOGLtGywuvYAFRuAgjZyodiIgCU2QTxhu5/7h5HZx8ZpjO0x
 wiEuL+6iup8tzPi0x+2XoFKZpV5wi8ozKZ9hiE2dQe2ckTSKYAxUDh0dez+ZiI/W
 J1dyvwn8aB4PQMBSRM7g7PKdGrnb1ZaFzdUt9iloqX1BXmE2ZSG1sHKnBL8Lcrmq
 dRylQ1hzr3o+xVJLZKB+fWlOWZe7YiQM0uMeAXtPyT3gumlO3tqVHOxvHo4DyHx/
 1YmNBL0nSjZJMwG5TCLiV//s/qFXV4d8YN9KngIkliEgGHRvRuIWXHI37WaPt9k5
 OV6uNKK4sy2roTjc8pL/
 =r/lz
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs updates from Phillip Lougher:
 "A couple of minor additional sanity check patches for corrupted
  information, and some fixes.  Apart from that there's a minor loop
  optimisation.

  These sanity checks mainly exist to trap maliciously corrupted
  filesystems either through using a deliberately modified mksquashfs,
  or where the user has deliberately chosen to generate uncompressed
  metadata and then corrupted it.

  Normally metadata in Squashfs filesystems is compressed, which means
  corruption (either accidental or malicious) is detected when trying to
  decompress the metadata.  So corrupted data does not normally get as
  far as the code paths in question here"

* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: add corruption check for type in squashfs_readdir()
  Squashfs: add corruption check in get_dir_index_using_offset()
  Squashfs: fix corruption checks in squashfs_readdir()
  Squashfs: fix corruption checks in squashfs_lookup()
  Squashfs: fix corruption check in get_dir_index_using_name()
  Squashfs: Optimized uncompressed buffer loop
  Squashfs: sanity check information from disk
2013-09-11 12:33:12 -07:00
Weston Andros Adamson 312cd958a7 NFSv4.1: sp4_mach_cred: WARN_ON -> WARN_ON_ONCE
No need to spam the logs

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-09-11 09:08:08 -04:00