Commit Graph

47399 Commits

Author SHA1 Message Date
Linus Torvalds 52bce91165 splice: reinstate SIGPIPE/EPIPE handling
Commit 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
caused a regression when there were no more readers left on a pipe that
was being spliced into: rather than the expected SIGPIPE and -EPIPE
return value, the writer would end up waiting forever for space to free
up (which obviously was not going to happen with no readers around).

Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Debugged-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org   # v4.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-21 10:59:34 -08:00
Linus Torvalds bc1ecd626b NFS client updates for Linux 4.10
Highlights include:
 
 - Further attribute cache improvements to make revalidation more fine grained
 - NFSv4 locking improvements
 
 Bugfixes:
 - nfs4_fl_prepare_ds must be careful about reporting success in files layout
 - pNFS/flexfiles: Instead of marking a device inactive, remove it from the cache
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYWoWvAAoJEGcL54qWCgDydigQAMFTUZTjHGx34l3yqie3uV0D
 P0Ws6qfQJwN6xwgbhLMt6pbECadN7d55KszIM71gu/Iqa1UbtHZht2Slf/Tw5wY8
 vjptzLkvOie5g9sGIhhQqkgrXnG3A96aEuaknIj9JXWALJb01M94ZfC0STgUjmil
 sp100cDmmOonqflk7Tgh/jmfiUucBw9HnyNRY59sUaR0lAmHRrpJuqftmarYKRrq
 1rTCeGx6MyM2kVyqGsY7FP1prQcNcIJXwti/EXrOywXbHZxmjl+l0WLu6iUOEj/I
 zvfWCeMFyMXeVsTrRLYeM1vQ/maUbSev6gms2RshbmDITE6Ir9nHfBTCfOmh95wV
 paKnAzuEKsUR+LhEbsORk03efQYkeDQuJW3QqhdVuqGnEwhaboTiP4GRJLjEo287
 dHy6sEGEbAcieUsolXhDIaxggoF39gGihOT3+m+WgIlXRd38W2mvlQC9oaq89ehw
 Ipetm3swnkY8bTavsNYFW1x+V5zrGH9Dlu2/mpnqVAIwGT+ByolD8tYDAHf4zj3b
 cy9XhFzdLDRy4V95l4hzZFDCdjy4dZZu7cJEO5zqUB3236qhiZbGLjL232ZEHFi3
 wNHn13nADV+DzoOuB5HJwfhKyOOBQsQpBIvzAgt1n7muZwxa35nX/2waX1xt2HX5
 J+s+e02xJnMStcr/2O5v
 =8gLp
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull more NFS client updates from Trond Myklebust:
 "Highlights include:

   - further attribute cache improvements to make revalidation more fine
     grained

   - NFSv4 locking improvements

  Bugfixes:

   - nfs4_fl_prepare_ds must be careful about reporting success in files
     layout

   - pNFS/flexfiles: Instead of marking a device inactive, remove it
     from the cache"

* tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
  NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
  NFSv4: Place the GETATTR operation before the CLOSE
  NFSv4: Also ask for attributes when downgrading to a READ-only state
  NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
  pNFS: Return RW layouts on OPEN_DOWNGRADE
  NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
  NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
  NFSv4: ensure __nfs4_find_lock_state returns consistent result.
  NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
  pNFS/flexfiles: delete deviceid, don't mark inactive
  NFS: Clean up nfs_attribute_timeout()
  NFS: Remove unused function nfs_revalidate_inode_rcu()
  NFS: Fix and clean up the access cache validity checking
  NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
  NFS: Clean up cache validity checking
  NFS: Don't revalidate the file on close if we hold a delegation
  NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
  NFSv4: Update the attribute cache info in update_changeattr
2016-12-21 10:40:30 -08:00
Trond Myklebust 8ac2b42238 NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
If our DELEGRETURN RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:30:03 -05:00
Trond Myklebust f07d4a31cc NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
If our CLOSE RPC call is rejected with an EACCES call, then we should
remove the GETATTR call from the compound RPC and retry.
This could potentially happen when there is a conflict between an
ACL denying attribute reads and our use of SP4_MACH_CRED.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:30:01 -05:00
Trond Myklebust d8d849835e NFSv4: Place the GETATTR operation before the CLOSE
In order to benefit from the DENY share lock protection, we should
put the GETATTR operation before the CLOSE. Otherwise, we might race
with a Windows machine that thinks it is now safe to modify the file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:30:00 -05:00
Trond Myklebust 9413a1a1bf NFSv4: Also ask for attributes when downgrading to a READ-only state
If we're downgrading from a READ+WRITE mode to a READ-only mode, then
ask for cache consistency attributes so that we avoid the revalidation
in nfs_close_context()

Fixes: 3947b74d0f ("NFSv4: Don't request a GETATTR on open_downgrade.")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:58 -05:00
Trond Myklebust a5f925bccc NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
The NFS_INO_REVAL_FORCED flag now really only has meaning for the
case when we've just been handed a delegation for a file that was already
cached, and we're unsure about that cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:56 -05:00
Trond Myklebust e71708d4df pNFS: Return RW layouts on OPEN_DOWNGRADE
If the client holds no more writeable open state, and does not hold a
write delegation, then send a layoutreturn as part of the OPEN_DOWNGRADE.

We do this only for writes, since some layout drivers may require you to
also hold a read layout if you are doing a R/W workload.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:55 -05:00
Trond Myklebust b6808145ad NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
While we do not need to return the RW layout when downgrading from a
read/write open state to read-only, we might want to do so in order
to reduce the burden on the metadataserver so that it does not need
to check for changed data when responding to GETATTR requests.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:52 -05:00
NeilBrown 86cfb04185 NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
When an NFS4ERR_BAD_SEQID is received the open-owner is removed from
the ->state_owners rbtree so that it will no longer be used.

If any stateids attached to this open-owner are still in use, and if a
request using one gets an NFS4ERR_BAD_STATEID reply, this can for bad.

The state is marked as needing recovery and the nfs4_state_manager()
is scheduled to clean up.  nfs4_state_manager() finds states to be
recovered by walking the state_owners rbtree.  As the open-owner is
not in the rbtree, the bad state is not found so nfs4_state_manager()
completes having done nothing.  The request is then retried, with a
predicatable result (indefinite retries).

If the stateid is for a delegation, this open_owner will be used
to open files when the delegation is returned.  For that to work,
a new open-owner needs to be presented to the server.

This patch changes NFS4ERR_BAD_SEQID handling to leave the open-owner
in the rbtree but updates the 'create_time' so it looks like a new
open-owner.  With this the indefinite retries no longer happen.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:51 -05:00
NeilBrown 3f8f25489f NFSv4: ensure __nfs4_find_lock_state returns consistent result.
If a file has both flock locks and OFD locks, then it is possible that
two different nfs4 lock states could apply to file accesses from a
single process.

It is not possible to know, efficiently, which one is "correct".
Presumably the state which represents a lock that covers the region
undergoing IO would be the "correct" one to use, but finding that has
a non-trivial cost and would provide miniscule value.

Currently we just return whichever is first in the list, which could
result in inconsistent behaviour if an application ever put it self in
this position.  As consistent behaviour is preferable (when perfectly
correct behaviour is not available), change the search to return a
consistent result in this circumstance.
Specifically: if there is both a flock and OFD lock state, always return
the flock one.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:50 -05:00
NeilBrown cfd278c280 NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.

This is not necessasrily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.

So add a test for ds_clp == NULL and return NULL in that case.

Fixes: c23266d532 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Olga Kornievskaia <aglo@umich.edu>
Acked-by: Adamson, Andy <William.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:48 -05:00
Weston Andros Adamson 1c48cee83b pNFS/flexfiles: delete deviceid, don't mark inactive
Instead of marking a device inactive, remove it from the cache entirely.

Flexfiles has a way to report errors back to the server, so we don't want
to stop devices from being tried again for 120 seconds.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:45 -05:00
Trond Myklebust 187e593d27 NFS: Clean up nfs_attribute_timeout()
It can be made static.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:44 -05:00
Trond Myklebust 3f642a1335 NFS: Remove unused function nfs_revalidate_inode_rcu()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:41 -05:00
Trond Myklebust 21c3ba7e5d NFS: Fix and clean up the access cache validity checking
The access cache needs to check whether or not the mode bits, ownership,
or ACL has changed or the cache has timed out.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:39 -05:00
Trond Myklebust 9cdd1d3f1a NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
Just like in nfs_check_verifier(), we want to use
nfs_mapping_need_revalidate_inode() to check our knowledge of the
change attribute is up to date.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:37 -05:00
Trond Myklebust 61540bf6bb NFS: Clean up cache validity checking
Consolidate the open-coded checking of NFS_I(inode)->cache_validity
into a couple of helper functions.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:35 -05:00
Trond Myklebust 58ff41842c NFS: Don't revalidate the file on close if we hold a delegation
If we're holding a delegation, we can skip sending the close-to-open
GETATTR until we're returning that delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:32 -05:00
Trond Myklebust 0bc2c9b4dc NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
DELEGRETURN will always carry a reference to the inode except when
the latter is being freed, so let's ensure that we always use that
inode information to ensure close-to-open cache consistency, even
when the DELEGRETURN call is asynchronous.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:29 -05:00
Trond Myklebust e603a4c1b5 NFSv4: Update the attribute cache info in update_changeattr
If we successfully updated the change attribute, we should timestamp the
cache. While we do know that the other attributes are not completely up
to date, we have the NFS_INO_INVALID_ATTR flag that let us know that,
so it is valid to say that the cache has not timed out.
We can also clear NFS_INO_REVAL_PAGECACHE, since our change attribute
is now known to be valid.

Conversely, if the change attribute did not match, we should make sure to
also revalidate the access and ACL caches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-19 17:29:27 -05:00
Linus Torvalds e93b1cc8a8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, fsnotify and ext2 updates from Jan Kara:
 "Changes to locking of some quota operations from dedicated quota mutex
  to s_umount semaphore, a fsnotify fix and a simple ext2 fix"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix bogus warning in dquot_disable()
  fsnotify: Fix possible use-after-free in inode iteration on umount
  ext2: reject inodes with negative size
  quota: Remove dqonoff_mutex
  ocfs2: Use s_umount for quota recovery protection
  quota: Remove dqonoff_mutex from dquot_scan_active()
  ocfs2: Protect periodic quota syncing with s_umount semaphore
  quota: Use s_umount protection for quota operations
  quota: Hold s_umount in exclusive mode when enabling / disabling quotas
  fs: Provide function to get superblock with exclusive s_umount
2016-12-19 08:23:53 -08:00
Jan Kara 2700e6067c quota: Fix bogus warning in dquot_disable()
dquot_disable() was warning when sb_has_quota_loaded() was true when
invalidating page cache for quota files. The thinking behind this
warning was that we must have raced with somebody else turning quotas on
and this should not happen because all places modifying quota state must
hold s_umount exclusively now. However sb_has_quota_loaded() can be also
true at this point when we are just suspending quotas on remount
read-only. Just restore the behavior to situation before commit
c3b004460d ("quota: Remove dqonoff_mutex") which introduced the
warning.

The code in dquot_disable() can be further simplified with the new
locking of quota state changes however let's leave that to a separate
commit that can get more testing exposure.

Fixes: c3b004460d
Signed-off-by: Jan Kara <jack@suse.cz>
2016-12-19 14:01:39 +01:00
Linus Torvalds 231753ef78 Merge uncontroversial parts of branch 'readlink' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull partial readlink cleanups from Miklos Szeredi.

This is the uncontroversial part of the readlink cleanup patch-set that
simplifies the default readlink handling.

Miklos and Al are still discussing the rest of the series.

* git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  vfs: make generic_readlink() static
  vfs: remove ".readlink = generic_readlink" assignments
  vfs: default to generic_readlink()
  vfs: replace calling i_op->readlink with vfs_readlink()
  proc/self: use generic_readlink
  ecryptfs: use vfs_get_link()
  bad_inode: add missing i_op initializers
2016-12-17 19:16:12 -08:00
Linus Torvalds 0110c350c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "In this pile:

   - autofs-namespace series
   - dedupe stuff
   - more struct path constification"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  ocfs2: charge quota for reflinked blocks
  ocfs2: fix bad pointer cast
  ocfs2: always unlock when completing dio writes
  ocfs2: don't eat io errors during _dio_end_io_write
  ocfs2: budget for extent tree splits when adding refcount flag
  ocfs2: prohibit refcounted swapfiles
  ocfs2: add newlines to some error messages
  ocfs2: convert inode refcount test to a helper
  simple_write_end(): don't zero in short copy into uptodate
  exofs: don't mess with simple_write_{begin,end}
  9p: saner ->write_end() on failing copy into non-uptodate page
  fix gfs2_stuffed_write_end() on short copies
  fix ceph_write_end()
  nfs_write_end(): fix handling of short copies
  vfs: refactor clone/dedupe_file_range common functions
  fs: try to clone files first in vfs_copy_file_range
  vfs: misc struct path constification
  namespace.c: constify struct path passed to a bunch of primitives
  quota: constify struct path in quota_on
  ...
2016-12-17 18:44:00 -08:00
Al Viro 9763f7a4a5 Merge branch 'work.autofs' into for-linus
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-16 16:34:52 -05:00
Al Viro 3c55d6bcfe Merge remote-tracking branch 'djwong/ocfs2-vfs-reflink-6' into for-linus 2016-12-16 16:21:05 -05:00
Al Viro 4da00fd1b9 Merge branch 'work.write_end' into for-linus 2016-12-16 16:19:49 -05:00
Linus Torvalds 59331c215d A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
   (myself).  Also fixed a deadlock caused by a bogus allocation on the
   writeback path and authorize reply verification.
 
 - a fix for long stalls during fsync (Jeff Layton).  The client now
   has a way to force the MDS log flush, leading to ~100x speedups in
   some synthetic tests.
 
 - a new [no]require_active_mds mount option (Zheng Yan).  On mount, we
   will now check whether any of the MDSes are available and bail rather
   than block if none are.  This check can be avoided by specifying the
   "no" option.
 
 - a couple of MDS cap handling fixes and a few assorted patches
   throughout.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
 4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
 7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
 8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
 lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
 KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
 =ypWx
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A varied set of changes:

   - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
     (myself). Also fixed a deadlock caused by a bogus allocation on the
     writeback path and authorize reply verification.

   - a fix for long stalls during fsync (Jeff Layton). The client now
     has a way to force the MDS log flush, leading to ~100x speedups in
     some synthetic tests.

   - a new [no]require_active_mds mount option (Zheng Yan).

     On mount, we will now check whether any of the MDSes are available
     and bail rather than block if none are. This check can be avoided
     by specifying the "no" option.

   - a couple of MDS cap handling fixes and a few assorted patches
     throughout"

* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
  libceph: remove now unused finish_request() wrapper
  libceph: always signal completion when done
  ceph: avoid creating orphan object when checking pool permission
  ceph: properly set issue_seq for cap release
  ceph: add flags parameter to send_cap_msg
  ceph: update cap message struct version to 10
  ceph: define new argument structure for send_cap_msg
  ceph: move xattr initialzation before the encoding past the ceph_mds_caps
  ceph: fix minor typo in unsafe_request_wait
  ceph: record truncate size/seq for snap data writeback
  ceph: check availability of mds cluster on mount
  ceph: fix splice read for no Fc capability case
  ceph: try getting buffer capability for readahead/fadvise
  ceph: fix scheduler warning due to nested blocking
  ceph: fix printing wrong return variable in ceph_direct_read_write()
  crush: include mapper.h in mapper.c
  rbd: silence bogus -Wmaybe-uninitialized warning
  libceph: no need to drop con->mutex for ->get_authorizer()
  libceph: drop len argument of *verify_authorizer_reply()
  libceph: verify authorize reply on connect
  ...
2016-12-16 11:23:34 -08:00
Linus Torvalds ff0f962ca3 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This update contains:

   - try to clone on copy-up

   - allow renaming a directory

   - split source into managable chunks

   - misc cleanups and fixes

  It does not contain the read-only fd data inconsistency fix, which Al
  didn't like. I'll leave that to the next year..."

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
  ovl: fix reStructuredText syntax errors in documentation
  ovl: fix return value of ovl_fill_super
  ovl: clean up kstat usage
  ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
  ovl: create directories inside merged parent opaque
  ovl: opaque cleanup
  ovl: show redirect_dir mount option
  ovl: allow setting max size of redirect
  ovl: allow redirect_dir to default to "on"
  ovl: check for emptiness of redirect dir
  ovl: redirect on rename-dir
  ovl: lookup redirects
  ovl: consolidate lookup for underlying layers
  ovl: fix nested overlayfs mount
  ovl: check namelen
  ovl: split super.c
  ovl: use d_is_dir()
  ovl: simplify lookup
  ovl: check lower existence of rename target
  ovl: rename: simplify handling of lower/merged directory
  ...
2016-12-16 10:58:12 -08:00
Linus Torvalds 087a76d390 Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
  here, and Christoph pitched in corrections/improvements to make btrfs
  use proper helpers for bio walking instead of doing it by hand.

  There are some key fixes as well, including some long standing bugs
  that took forever to track down in btrfs_drop_extents and during
  balance"

* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
  btrfs: limit async_work allocation and worker func duration
  Revert "Btrfs: adjust len of writes if following a preallocated extent"
  Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
  btrfs: opencode chunk locking, remove helpers
  btrfs: remove root parameter from transaction commit/end routines
  btrfs: split btrfs_wait_marked_extents into normal and tree log functions
  btrfs: take an fs_info directly when the root is not used otherwise
  btrfs: simplify btrfs_wait_cache_io prototype
  btrfs: convert extent-tree tracepoints to use fs_info
  btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
  btrfs: root->fs_info cleanup, add fs_info convenience variables
  btrfs: root->fs_info cleanup, update_block_group{,flags}
  btrfs: root->fs_info cleanup, lock/unlock_chunks
  btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
  btrfs: pull node/sector/stripe sizes out of root and into fs_info
  btrfs: root->fs_info cleanup, io_ctl_init
  btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
  btrfs: struct reada_control.root -> reada_control.fs_info
  btrfs: struct btrfsic_state->root should be an fs_info
  btrfs: alloc_reserved_file_extent trace point should use extent_root
  ...
2016-12-16 10:53:01 -08:00
Linus Torvalds 759b2656b2 The one new feature is support for a new NFSv4.2 mode_umask attribute
that makes ACL inheritance a little more useful in environments that
 default to restrictive umasks.  Requires client-side support, also on
 its way for 4.10.
 
 Other than that, miscellaneous smaller fixes and cleanup, especially to
 the server rdma code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYVAEqAAoJECebzXlCjuG+VM0QAKaR+ibSM31Ahpnrgit5/wrb
 n630KDFztO7iqEeuHfPQ4/n05T2QR0JWsLpjLMFvx88Gy4gyXYk9cuDPIrNKX1IS
 3/nnhBo0+EVnjODjufommCrtbPZlqOSsS3N03vWkB7rTi8QYsWBOThh+XLRJYOXo
 LZzJE1WmXNeCXV1kXPBsauryywql1fmwTXBzmIf1HbzoGAVROMEA2qqh4Z3nb7BP
 sJuGchWx0STBOuAa278ighXQPUW2lUft9uzw2bssOtMwfNyOs/Pd6nx4F1Lg6WwD
 1UQXoiR8K3PqelZfoeFJ05v0css/sbNKep+huWRdOXZj3Kjpa20lKBX8xHfat7sN
 1OQ4FHx8ToigX3c+wwtlCqRMCcIxqUYkRjqzPHyeBiSSSp0rLrId44rI5x/K0yay
 3bkGw7hFDSzc0Nq2uZgmtlbyTC71hLNhkWe7ThofcVG/pS0JtAqBiKIVwXJPh/e0
 PLmVHYGU6Xowjag5edJlXY1tlIlxtWfqsWUarCXS5bfKUa3UjMVSjyuljsDqqJsn
 96fEWu7DiUo4HeGYmf8MJoeZYV2y0DKSQGeguVkUKWp2DoTzinQHTfdKvrZVwNuu
 hVE9/QeWzUvPY13HOUaKD2skozhbUChqv0NHESKUv8gxE3svTEpYZkXrE74WNqMk
 l/WXAhw+RdKZof4+qdjU
 =JANY
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "The one new feature is support for a new NFSv4.2 mode_umask attribute
  that makes ACL inheritance a little more useful in environments that
  default to restrictive umasks. Requires client-side support, also on
  its way for 4.10.

  Other than that, miscellaneous smaller fixes and cleanup, especially
  to the server rdma code"

[ The client side of the umask attribute was merged yesterday ]

* tag 'nfsd-4.10' of git://linux-nfs.org/~bfields/linux:
  nfsd: add support for the umask attribute
  sunrpc: use DEFINE_SPINLOCK()
  svcrdma: Further clean-up of svc_rdma_get_inv_rkey()
  svcrdma: Break up dprintk format in svc_rdma_accept()
  svcrdma: Remove unused variable in rdma_copy_tail()
  svcrdma: Remove unused variables in xprt_rdma_bc_allocate()
  svcrdma: Remove svc_rdma_op_ctxt::wc_status
  svcrdma: Remove DMA map accounting
  svcrdma: Remove BH-disabled spin locking in svc_rdma_send()
  svcrdma: Renovate sendto chunk list parsing
  svcauth_gss: Close connection when dropping an incoming message
  svcrdma: Clear xpt_bc_xps in xprt_setup_rdma_bc() error exit arm
  nfsd: constify reply_cache_stats_operations structure
  nfsd: update workqueue creation
  sunrpc: GFP_KERNEL should be GFP_NOFS in crypto code
  nfsd: catch errors in decode_fattr earlier
  nfsd: clean up supported attribute handling
  nfsd: fix error handling for clients that fail to return the layout
  nfsd: more robust allocation failure handling in nfsd_reply_cache_init
2016-12-16 10:48:28 -08:00
Linus Torvalds 9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Geliang Tang 313684c48c ovl: fix return value of ovl_fill_super
If kcalloc() failed, the return value of ovl_fill_super() is -EINVAL,
not -ENOMEM. So this patch sets this value to -ENOMEM before calling
kcalloc(), and sets it back to -EINVAL after calling kcalloc().

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Al Viro 32a3d848eb ovl: clean up kstat usage
FWIW, there's a bit of abuse of struct kstat in overlayfs object
creation paths - for one thing, it ends up with a very small subset
of struct kstat (mode + rdev), for another it also needs link in
case of symlinks and ends up passing it separately.

IMO it would be better to introduce a separate object for that.

In principle, we might even lift that thing into general API and switch
 ->mkdir()/->mknod()/->symlink() to identical calling conventions.  Hell
knows, perhaps ->create() as well...

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein 9aba652190 ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
This removes code duplication.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein 97c684cc91 ovl: create directories inside merged parent opaque
The benefit of making directories opaque on creation is that lookups can
stop short when they reach the original created directory, instead of
continue lookup the entire depth of parent directory stack.

The best case is overlay with N layers, performing lookup for first level
directory, which exists only in upper.  In that case, there will be only
one lookup instead of N.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi 5cf5b477f0 ovl: opaque cleanup
oe->opaque is set for

 a) whiteouts
 b) directories having the "trusted.overlay.opaque" xattr

Case b can be simplified, since setting the xattr always implies setting
oe->opaque.  Also once set, the opaque flag is never cleared.

Don't need to set opaque flag for non-directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein c5bef3a72b ovl: show redirect_dir mount option
Show the value of redirect_dir in /proc/mounts.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi 3ea22a71b6 ovl: allow setting max size of redirect
Add a module option to allow tuning the max size of absolute redirects.
Default is 256.

Size of relative redirects is naturally limited by the the underlying
filesystem's max filename length (usually 255).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi 688ea0e5a0 ovl: allow redirect_dir to default to "on"
This patch introduces a kernel config option and a module param.  Both can
be used independently to turn the default value of redirect_dir on or off.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Amir Goldstein d15951198e ovl: check for emptiness of redirect dir
Before introducing redirect_dir feature, the condition
!ovl_lower_positive(dentry) for a directory, implied that it is a pure
upper directory, which may be removed if empty.

Now that directory can be redirect, it is possible that upper does not
cover any lower (i.e. !ovl_lower_positive(dentry)), but the directory is a
merge (with redirected path) and maybe non empty.

Check for this case in ovl_remove_upper().

This change fixes the following test case from rename-pop-dir.py
of unionmount-testsuite:

    """Remove dir and rename old name"""
    d = ctx.non_empty_dir()
    d2 = ctx.no_dir()

    ctx.rmdir(d, err=ENOTEMPTY)
    ctx.rename(d, d2)
    ctx.rmdir(d, err=ENOENT)
    ctx.rmdir(d2, err=ENOTEMPTY)

./run --ov rename-pop-dir
/mnt/a/no_dir103: Expected error (Directory not empty) was not produced

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:57 +01:00
Miklos Szeredi a6c6065511 ovl: redirect on rename-dir
Current code returns EXDEV when a directory would need to be copied up to
move.  We could copy up the directory tree in this case, but there's
another, simpler solution: point to old lower directory from moved upper
directory.

This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay.  After such attribute has been set,
the directory can be moved without further actions required.

This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi 02b69b284c ovl: lookup redirects
If a directory has the "trusted.overlay.redirect" xattr, it means that the
value of the xattr should be used to find the underlying directory on the
next lower layer.

The redirect may be relative or absolute.  Absolute redirects begin with a
slash.

A relative redirect means: instead of the current dentry's name use the
value of the redirect to find the directory in the next lower
layer. Relative redirects must not contain a slash.

An absolute redirect means: look up the directory relative to the root of
the overlay using the value of the redirect in the next lower layer.

Redirects work on lower layers as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi e28edc46b8 ovl: consolidate lookup for underlying layers
Use a common helper for lookup of upper and lower layers.  This paves the
way for looking up directory redirects.

No functional change.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Amir Goldstein 48fab5d7c7 ovl: fix nested overlayfs mount
When the upper overlayfs checks "trusted.overlay.*" xattr on the underlying
overlayfs mount, it gets -EPERM, which confuses the upper overlayfs.

Fix this by returning -EOPNOTSUPP instead of -EPERM from
ovl_own_xattr_get() and ovl_own_xattr_set().  This behavior is consistent
with the behavior of ovl_listxattr(), which filters out the private
overlayfs xattrs.

Note: nested overlays are deprecated.  But this change makes sense
regardless: these xattrs are private to the overlay and should always be
hidden.  Hence getting and setting them should indicate this.

[SzMi: Use EOPNOTSUPP instead of ENODATA and use it for both getting and
setting "trusted.overlay." xattrs.  This is a perfectly valid error code
for "we don't support this prefix", which is the case here.]

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi 6b2d5fe46f ovl: check namelen
We already calculate f_namelen in statfs as the maximum of the name lengths
provided by the filesystems taking part in the overlay.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi bbb1e54dd5 ovl: split super.c
fs/overlayfs/super.c is the biggest of the overlayfs source files and it
contains various utility functions as well as the rather complicated lookup
code.  Split these parts out to separate files.

Before:

 1446 fs/overlayfs/super.c

After:

  919 fs/overlayfs/super.c
  267 fs/overlayfs/namei.c
  235 fs/overlayfs/util.c
   51 fs/overlayfs/ovl_entry.h

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi 2b8c30e9ef ovl: use d_is_dir()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00
Miklos Szeredi 8ee6059c58 ovl: simplify lookup
If encountering a non-directory, then stop looking at lower layers.

In this case the oe->opaque flag is not set anymore, which doesn't matter
since existence of lower file is now checked at remove/rename time.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-16 11:02:56 +01:00