Commit Graph

29148 Commits

Author SHA1 Message Date
Zhao Hongjiang ae11e0f184 userns: fix return value on mntns_install() failure
Change return value from -EINVAL to -EPERM when the permission check fails.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:22 -08:00
Eric W. Biederman 0c55cfc416 vfs: Allow unprivileged manipulation of the mount namespace.
- Add a filesystem flag to mark filesystems that are safe to mount as
  an unprivileged user.

- Add a filesystem flag to mark filesystems that don't need MNT_NODEV
  when mounted by an unprivileged user.

- Relax the permission checks to allow unprivileged users that have
  CAP_SYS_ADMIN permissions in the user namespace referred to by the
  current mount namespace to be allowed to mount, unmount, and move
  filesystems.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:21 -08:00
Eric W. Biederman 7a472ef4be vfs: Only support slave subtrees across different user namespaces
Sharing mount subtress with mount namespaces created by unprivileged
users allows unprivileged mounts created by unprivileged users to
propagate to mount namespaces controlled by privileged users.

Prevent nasty consequences by changing shared subtrees to slave
subtress when an unprivileged users creates a new mount namespace.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:20 -08:00
Eric W. Biederman 771b137168 vfs: Add a user namespace reference from struct mnt_namespace
This will allow for support for unprivileged mounts in a new user namespace.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:19 -08:00
Eric W. Biederman 8823c079ba vfs: Add setns support for the mount namespace
setns support for the mount namespace is a little tricky as an
arbitrary decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship between
the two mount namespaces.  Therefore I arbitrarily find the root
mount point, and follow every mount on top of it to find the top
of the mount stack.  Then I set fs->root and fs->pwd to that
location.  The topmost root of the mount stack seems like a
reasonable place to be.

Bind mount support for the mount namespace inodes has the
possibility of creating circular dependencies between mount
namespaces.  Circular dependencies can result in loops that
prevent mount namespaces from every being freed.  I avoid
creating those circular dependencies by adding a sequence number
to the mount namespace and require all bind mounts be of a
younger mount namespace into an older mount namespace.

Add a helper function proc_ns_inode so it is possible to
detect when we are attempting to bind mound a namespace inode.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:18 -08:00
Eric W. Biederman a85fb273c9 vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
Once you are confined to a user namespace applications can not gain
privilege and escape the user namespace so there is no longer a reason
to restrict chroot.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:17 -08:00
Eric W. Biederman 57e8391d32 pidns: Add setns support
- Pid namespaces are designed to be inescapable so verify that the
  passed in pid namespace is a child of the currently active
  pid namespace or the currently active pid namespace itself.

  Allowing the currently active pid namespace is important so
  the effects of an earlier setns can be cancelled.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:14 -08:00
Eric W. Biederman 0a01f2cc39 pidns: Make the pidns proc mount/umount logic obvious.
Track the number of pids in the proc hash table.  When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.

Move the mount of proc into alloc_pid when we allocate the pid for
init.

Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task.  Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.

Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue.  This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.

In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.

Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-11-19 05:59:10 -08:00
Eric W. Biederman 17cf22c33e pidns: Use task_active_pid_ns where appropriate
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.

So I have used task_active_pid_ns everywhere I can.

In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 05:59:09 -08:00
Eric W. Biederman ae06c7c83f procfs: Don't cache a pid in the root inode.
Now that we have s_fs_info pointing to our pid namespace
the original reason for the proc root inode having a struct
pid is gone.

Caching a pid in the root inode has led to some complicated
code.  Now that we don't need the struct pid, just remove it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 03:09:35 -08:00
Eric W. Biederman e656d8a6f7 procfs: Use the proc generic infrastructure for proc/self.
I had visions at one point of splitting proc into two filesystems.  If
that had happened proc/self being the the part of proc that actually deals
with pids would have been a nice cleanup.  As it is proc/self requires
a lot of unnecessary infrastructure for a single file.

The only user visible change is that a mounted /proc for a pid namespace
that is dead now shows a broken proc symlink, instead of being completely
invisible.  I don't think anyone will notice or care.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-19 03:09:34 -08:00
Eric W. Biederman 499dcf2024 userns: Support fuse interacting with multiple user namespaces
Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

The connection between between a fuse filesystem and a fuse daemon is
established when a fuse filesystem is mounted and provided with a file
descriptor the fuse daemon created by opening /dev/fuse.

For now restrict the communication of uids and gids between the fuse
filesystem and the fuse daemon to the initial user namespace.  Enforce
this by verifying the file descriptor passed to the mount of fuse was
opened in the initial user namespace.  Ensuring the mount happens in
the initial user namespace is not necessary as mounts from non-initial
user namespaces are not yet allowed.

In fuse_req_init_context convert the currrent fsuid and fsgid into the
initial user namespace for the request that will be sent to the fuse
daemon.

In fuse_fill_attr convert the uid and gid passed from the fuse daemon
from the initial user namespace into kuids and kgids.

In iattr_to_fattr called from fuse_setattr convert kuids and kgids
into the uids and gids in the initial user namespace before passing
them to the fuse filesystem.

In fuse_change_attributes_common called from fuse_dentry_revalidate,
fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert
the uid and gid from the fuse daemon into a kuid and a kgid to store
on the fuse inode.

By default fuse mounts are restricted to task whose uid, suid, and
euid matches the fuse user_id and whose gid, sgid, and egid matches
the fuse group id.  Convert the user_id and group_id mount options
into kuids and kgids at mount time, and use uid_eq and gid_eq to
compare the in fuse_allow_task.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14 22:05:33 -08:00
Eric W. Biederman 45634cd8cb userns: Support autofs4 interacing with multiple user namespaces
Use kuid_t and kgid_t in struct autofs_info and struct autofs_wait_queue.

When creating directories and symlinks default the uid and gid of
the mount requester to the global root uid and gid.  autofs4_wait
will update these fields when a mount is requested.

When generating autofsv5 packets report the uid and gid of the mount
requestor in user namespace of the process that opened the pipe,
reporting unmapped uids and gids as overflowuid and overflowgid.

In autofs_dev_ioctl_requester return the uid and gid of the last mount
requester converted into the calling processes user namespace.  When the
uid or gid don't map return overflowuid and overflowgid as appropriate,
allowing failure to find a mount requester to be distinguished from
failure to map a mount requester.

The uid and gid mount options specifying the user and group of the
root autofs inode are converted into kuid and kgid as they are parsed
defaulting to the current uid and current gid of the process that
mounts autofs.

Mounting of autofs for the present remains confined to processes in
the initial user namespace.

Cc: Ian Kent <raven@themaw.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14 22:05:32 -08:00
Mikulas Patocka 1a25b1c4ce Lock splice_read and splice_write functions
Functions generic_file_splice_read and generic_file_splice_write access
the pagecache directly. For block devices these functions must be locked
so that block size is not changed while they are in progress.

This patch is an additional fix for commit b87570f5d3 ("Fix a crash
when block device is read and block size is changed at the same time")
that locked aio_read, aio_write and mmap against block size change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-28 10:59:37 -07:00
Linus Torvalds 64b1cbaa10 Power management and ACPI fixes for 3.7-rc3
* Fix for a memory leak in acpi_bind_one() from Jesper Juhl.
 
 * Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
   from Jonghwan Choi.
 
 * Fix for smp_processor_id() usage in preemptible code in powernow-k8 from
   Andreas Herrmann.
 
 * Fix for a suspend-related memory leak in cpufreq stats from Xiaobing Tu.
 
 * Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD
   in flush_old_exec() from Oleg Nesterov.
 
 * acpi_processor_notify() fix from Alan Cox.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJQivlUAAoJEKhOf7ml8uNsPHMP/jGv0umbDl0CrJBjd9eF+Tdt
 52DpJ/c2HjghsmSG26MCsFVO026DcPoO/t7faUVpWiUy/I38TwyjTFMKrxouY5AC
 8p929Gt5yjf/pB7w/P5C3exhv9zSWdVzCZ4rmlt7knBn7vN7jfI5Lv4kaEwAcu4o
 mnGbUVzaFLaiHsKFa8iBOpkdr01Fn9FRINddMQ/+PdiFR+wkqOKMZBExjRoQgS31
 aH90LL5Nfv5pSH126TH5S6GDdAXw0g4eHEfxGjNodEXdmAS+GXrD6QoGabab99ZD
 SaZA41kTv3+ls4Z9uhhpNBgqEDQEWiNVBVfTs0PWTUpemYKlx85Ihdl8PbH1H0TM
 QeHsM3dfHJsfhK/EjUFwx31oWrvfM0Qqw8CxOc/ASm2rpPOZVBgqRzKsqSMQE805
 y3lteaoT9nObnKdy871QmIhAk3fN25u1txCtmNFc0S5VZyiAnD2RVS/a8y93gMse
 5lxSMjOfUqvq3APlz4HIn3YovswjiAOOw0PlD3nK2qLj7tyEVOl5CMyL/05v58wJ
 SeOic10v1oOEDYT3uM+aVERmK9APAsMbcecj7Xd5yqPu8NPx7zrj+z30wr0znS5x
 KWnyZR83F4ZCb00geSZW0LliXuazjj+lWX/TFri9XY6ZdMvmTJOUVKBD8JwhHN0j
 WH9iMOOBTMnJdbnf88sM
 =8Tcc
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael J Wysocki:

 - Fix for a memory leak in acpi_bind_one() from Jesper Juhl.

 - Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
   from Jonghwan Choi.

 - Fix for smp_processor_id() usage in preemptible code in powernow-k8
   from Andreas Herrmann.

 - Fix for a suspend-related memory leak in cpufreq stats from Xiaobing
   Tu.

 - Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD in
   flush_old_exec() from Oleg Nesterov.

 - acpi_processor_notify() fix from Alan Cox.

* tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: missing break
  freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
  Fix memory leak in cpufreq stats.
  cpufreq / powernow-k8: Remove usage of smp_processor_id() in preemptible code
  PM / Domains: Fix memory leak on error path in pm_genpd_attach_cpuidle
  ACPI: Fix memory leak in acpi_bind_one()
2012-10-26 14:23:35 -07:00
Linus Torvalds 299650cad6 Driver core fixes for 3.7-rc3
Here are a number of firmware core fixes for 3.7, and some other minor fixes.
 And some documentation updates thrown in for good measure.
 
 All have been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlCKwkIACgkQMUfUDdst+ynzUgCfQDwxUw1PVqQyWy7SakpsjFJJ
 8kwAoITyjppn39v1WuZbg0+FZ6JpocyY
 =2mDG
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg Kroah-Hartman:
 "Here are a number of firmware core fixes for 3.7, and some other minor
  fixes.  And some documentation updates thrown in for good measure.

  All have been in the linux-next tree for a while.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  Documentation:Chinese translation of Documentation/arm64/memory.txt
  Documentation:Chinese translation of Documentation/arm64/booting.txt
  Documentation:Chinese translation of Documentation/IRQ.txt
  firmware loader: document kernel direct loading
  sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
  dynamic_debug: Remove unnecessary __used
  firmware loader: sync firmware cache by async_synchronize_full_domain
  firmware loader: let direct loading back on 'firmware_buf'
  firmware loader: fix one reqeust_firmware race
  firmware loader: cancel uncache work before caching firmware
2012-10-26 10:24:51 -07:00
Linus Torvalds 561ec64ae6 VFS: don't do protected {sym,hard}links by default
In commit 800179c9b8 ("This adds symlink and hardlink restrictions to
the Linux VFS"), the new link protections were enabled by default, in
the hope that no actual application would care, despite it being
technically against legacy UNIX (and documented POSIX) behavior.

However, it does turn out to break some applications.  It's rare, and
it's unfortunate, but it's unacceptable to break existing systems, so
we'll have to default to legacy behavior.

In particular, it has broken the way AFD distributes files, see

  http://www.dwd.de/AFD/

along with some legacy scripts.

Distributions can end up setting this at initrd time or in system
scripts: if you have security problems due to link attacks during your
early boot sequence, you have bigger problems than some kernel sysctl
setting. Do:

	echo 1 > /proc/sys/fs/protected_symlinks
	echo 1 > /proc/sys/fs/protected_hardlinks

to re-enable the link protections.

Alternatively, we may at some point introduce a kernel config option
that sets these kinds of "more secure but not traditional" behavioural
options automatically.

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # v3.6
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-26 10:05:07 -07:00
Linus Torvalds f48d42773b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has our series of fixes for the next rc.  The biggest batch is
  from Jan Schmidt, fixing up some problems in our subvolume quota code
  and fixing btrfs send/receive to work with the new extended inode
  refs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: do not bug when we fail to commit the transaction
  Btrfs: fix memory leak when cloning root's node
  Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
  Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
  Btrfs: fix deadlock caused by the nested chunk allocation
  btrfs: Return EINVAL when length to trim is less than FSB
  Btrfs: fix memory leak in btrfs_quota_enable()
  Btrfs: send correct rdev and mode in btrfs-send
  Btrfs: extended inode refs support for send mechanism
  Btrfs: Fix wrong error handling code
  Fix a sign bug causing invalid memory access in the ino_paths ioctl.
  Btrfs: comment for loop in tree_mod_log_insert_move
  Btrfs: fix extent buffer reference for tree mod log roots
  Btrfs: determine level of old roots
  Btrfs: tree mod log's old roots could still be part of the tree
  Btrfs: fix a tree mod logging issue for root replacement operations
  Btrfs: don't put removals from push_node_left into tree mod log twice
2012-10-26 09:34:04 -07:00
Linus Torvalds fec4fba6e4 NFS bugfixes for Linux 3.7
- Fix the NFSv2/v3 kernel statd protocol, which broke due to net namespace
   related changes.
 - Fix a number of races in the SUNRPC TCP disconnect/reconnect code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQieKUAAoJEGcL54qWCgDylBsP/R8UkhkNuL92x+KMSlNG28rr
 EYLIl5XNpBy/xIPhQmxuzV3KjytNzwe/3caNZ/Cl6thS4l/CRiqPKsq62xIOEMKn
 NBsY0ZULzhi2xwcR3BZxgn055PSTMxbQ9a3oeNbTvUbsrk5AnX7br6ol/oW1s7Fi
 ytCLrW/A4zB4DCJ/HMUnnQTCKCBGXBs2VD0ib6lKzr/6OqMO26cbQfHddTZW9BC3
 9qDKDVUQbLr9lNouYZbqWhZ6g/ecIkFyDyxW6ElVELaTNZFo0MTkQMYBgTQPDFd8
 PAxnq/Dcik2/Ls+qQgtqK2gs67RH7oAWvPR34ozA2pVJJj6wojXHbtIpH6IDuwIM
 XvNCOkQNebzB2glf0WFLqUjmoQVvONiDp0zA9CYz5xBgjgA6AlmGsnW5dHKFiAkS
 RxqZKPPinMOBzULOQYcC5Bu/hkXDMtvSi0UwMnuM5IZh1GAadtw+Do+0k6ociRuK
 ALD4tzfV+o9VM+/714eyYyJ7DEL4HgeEEpoQb6tRyNj8YIlibxM3XKGBBzwssI2d
 L2YuWmBE6N+zYUrwOqY1v9hYwh7qayqQiAgIac8e4/oC+UnQM8tfNNABcouA8a3d
 fSxOK+S0DFYTRN1EMDvhasnxUqmilj0Wuw+g6ytL95k8T2Q5b8PdYt6QCBc9h6SJ
 Qb+E/wkMEGAavm6w2T42
 =8Dt4
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS bugfixes from Trond Myklebust:

 - Fix the NFSv2/v3 kernel statd protocol, which broke due to net
   namespace related changes.

 - Fix a number of races in the SUNRPC TCP disconnect/reconnect code.

* tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
  LOCKD: fix races in nsm_client_get
  SUNRPC: Get rid of the xs_error_report socket callback
  SUNRPC: Prevent races in xs_abort_connection()
  Revert "SUNRPC: Ensure we close the socket on EPIPE errors too..."
  SUNRPC: Clear the connect flag when socket state is TCP_CLOSE_WAIT
2012-10-25 19:26:16 -07:00
Kees Cook 1217650336 fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
The compat ioctl for VIDEO_SET_SPU_PALETTE was missing an error check
while converting ioctl arguments.  This could lead to leaking kernel
stack contents into userspace.

Patch extracted from existing fix in grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-25 14:37:53 -07:00
Oleg Nesterov b40a79591c freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
flush_old_exec() clears PF_KTHREAD but forgets about PF_NOFREEZE.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-10-25 22:28:12 +02:00
Josef Bacik c37b2b6269 Btrfs: do not bug when we fail to commit the transaction
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious.  Remove the BUG_ON().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:59:57 -04:00
Liu Bo 7bfdcf7fba Btrfs: fix memory leak when cloning root's node
After cloning root's node, we forgot to dec the src's ref
which can lead to a memory leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-25 15:55:21 -04:00
Chris Mason c657c3ef1a Merge branch 'for-chris-fixed' of git://git.jan-o-sch.net/btrfs-unstable 2012-10-25 15:53:10 -04:00
Josef Bacik be6aef6049 Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot.  Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-25 15:50:18 -04:00
Alex Lyakas e2d044fe77 Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.

Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>

 	if (S_ISREG(sctx->cur_inode_mode)) {
2012-10-25 15:47:31 -04:00
Miao Xie 671415b7db Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
 # mkfs.btrfs -m raid1 <disk1> <disk2>
 # btrfstune -S 1 <disk1>
 # mount <disk1> <mnt>
 # btrfs device add <disk3> <disk4> <mnt>
 # mount -o remount,rw <mnt>
 # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
 Deadlock happened.

It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.

By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.

So, I fix this problem by commiting the transaction at the end of the first
device insertion.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-25 15:47:00 -04:00
Lukas Czerner e515c18bfe btrfs: Return EINVAL when length to trim is less than FSB
Currently if len argument in btrfs_ioctl_fitrim() is smaller than
one FSB we will continue and finally return 0 bytes discarded.
However if the length to discard is smaller then file system block
we should really return EINVAL.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2012-10-25 15:46:22 -04:00
Tsutomu Itoh 5b7ff5b3c4 Btrfs: fix memory leak in btrfs_quota_enable()
We should free quota_root before returning from the error
handling code.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-10-25 15:45:43 -04:00
Arne Jansen d79e50433b Btrfs: send correct rdev and mode in btrfs-send
When sending a device file, the stream was missing the mode. Also the
rdev was encoded wrongly.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2012-10-25 15:45:25 -04:00
Jan Schmidt 96b5bd7771 Btrfs: extended inode refs support for send mechanism
This adds support for the new extended inode refs to btrfs send.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-25 15:45:16 -04:00
Stefan Behrens 84167d1905 Btrfs: Fix wrong error handling code
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-25 15:40:03 -04:00
Gabriel de Perthuis 661bec6ba8 Fix a sign bug causing invalid memory access in the ino_paths ioctl.
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:

  ls -i
  btrfs inspect inode-resolve -v $ino /mnt/btrfs

I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.
2012-10-25 15:39:47 -04:00
Geert Uytterhoeven 66081a7251 sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
The warning check for duplicate sysfs entries can cause a buffer overflow
when printing the warning, as strcat() doesn't check buffer sizes.
Use strlcat() instead.

Since strlcat() doesn't return a pointer to the passed buffer, unlike
strcat(), I had to convert the nested concatenation in sysfs_add_one() to
an admittedly more obscure comma operator construct, to avoid emitting code
for the concatenation if CONFIG_BUG is disabled.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24 15:57:14 -07:00
Trond Myklebust e498daa812 LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
The current code is clearing it in all cases _except_ when zero.

Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-24 10:46:22 -04:00
Trond Myklebust a4ee8d978e LOCKD: fix races in nsm_client_get
Commit e9406db20f (lockd: per-net
NSM client creation and destruction helpers introduced) contains
a nasty race on initialisation of the per-net NSM client because
it doesn't check whether or not the client is set after grabbing
the nsm_create_mutex.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-24 10:46:22 -04:00
Jan Schmidt 01763a2e37 Btrfs: comment for loop in tree_mod_log_insert_move
Emphasis the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:40 +02:00
Jan Schmidt d638108484 Btrfs: fix extent buffer reference for tree mod log roots
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is changed now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:39 +02:00
Jan Schmidt 5b6602e762 Btrfs: determine level of old roots
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:38 +02:00
Jan Schmidt 834328a849 Btrfs: tree mod log's old roots could still be part of the tree
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-24 12:36:37 +02:00
Linus Torvalds 684baeb1d7 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core kernel fixes from Ingo Molnar:
 "Two small fixes"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: Reflect the new location of the NMI watchdog info
  nohz: Fix idle ticks in cpu summary line of /proc/stat
2012-10-24 04:07:02 +03:00
Jan Schmidt ba1bfbd592 Btrfs: fix a tree mod logging issue for root replacement operations
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in
two places. Where needed, we call tree_mod_log_free_eb explicitly now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23 15:09:14 +02:00
Jan Schmidt 57911b8ba8 Btrfs: don't put removals from push_node_left into tree mod log twice
Independant of the check (push_items < src_items) tree_mod_log_eb_copy did
log the removal of the old data entries from the source buffer. Therefore,
we must not call tree_mod_log_eb_move if the check evaluates to true, as
that would log the removal twice, finally resulting in (rewinded) buffers
with wrong values for header_nritems.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2012-10-23 15:09:11 +02:00
Linus Torvalds d888af9654 Bug fix: Fix FITRIM argument handling
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJQhZDSAAoJEDaohF61QIxkwgkP/0j/ygUXo50k7EkdwFsC0mHw
 lQ5iirP7oUJ1p1v2PeaBqBulql5yseqTj6keLTTrYSBG/Ocp8qir9z0pyqWjN112
 DU8o/CPSVXnAj64d7Nu/YZBGdUHWQu+1Wex+nDyh2QXdJx94G3SoE3ZZ86BiXJ01
 1rVvHB+FuyzbKFCEk8hslKyJ8WHv1flu+9ulxQXDATrxJxqqTvBkRdT21h3Ds+d8
 HQhxj2RHT7yVHWvyQumTNLLVWOtuT+Svd3Mglzl4bbfAQ49THJFHaxGp1ufGZlR/
 iZ2+yoTU9ze+2JZ8FuCCkSfhv55iugLS8FRHUT6VdErOKnHum4aZgkHCz3oA0c5B
 FI/8pVUqEkf3wENxaM96O+XEIxF4ChMeF+fm3niMb0nL3Fiv7Sx97XExTqMhVhlv
 Xn+t71oNaU2xhus2A9Nd95oGEg9qQ2L0fvDWDV7/p+PvRRkCd0oKBVILZhVCMJ/V
 hvhbhNObUNWAb8x1pmrgWyk2H+dn4tX5SbyrbxnEAmX3oifw7CCfmZGDF+nanmMt
 cnqIjQ3qZxvQ/io6Edye0NX21K0HQcS2Bdh09uuWT+C3UNjwqGceyZdeay2gdWpP
 R7DgiaSm7+590InRweuREttVFigzOVDhN3zZTwlPGLUZ3q9EwdOcnhrPMD8ThT48
 mVWzYaPwGax3Ro3IjWsJ
 =M5iC
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.7-2' of git://github.com/kleikamp/linux-shaggy

Pull jfs fix from Dave Kleikamp:
 "Bug fix: Fix FITRIM argument handling"

* tag 'jfs-3.7-2' of git://github.com/kleikamp/linux-shaggy:
  jfs: Fix FITRIM argument handling
2012-10-23 08:49:34 +03:00
Linus Torvalds e589db7a6a Various bug fixes for ext4. The most serious of them fixes a security
bug (CVE-2012-4508) which leads to stale data exposure when we have
 fallocate racing against writes to files undergoing delayed
 allocation.  We also have two fixes for the metadata checksum feature,
 the most serious of which can cause the superblock to have a invalid
 checksum after a power failure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQhcJVAAoJENNvdpvBGATwlc0QAJ5eRVSXoQ9DL/rpycZtWsiR
 1HofZCBbeVJq7JkazypYZPV+ncm2Nxljx61EBMpReDgx+hgJS8VD7BcjxblXT1gK
 cvIk7tYXS1E5++TWZzQd0v3GMDoMJsfzb0Ao6vefaVgqh07MKE9Zvx0L8JR4tsH1
 YlRs2/ZALFqqMficemXpDuWRRoBTEcYkvaW9PtUIpeuk9i71iSCDiHvi0mRy4dYe
 nLftjBOjcsIuK0I7DfUYrbZNQuYacFcFTM5foE6lhdT+tlL1/od2M00IpopSSjF8
 7RoqV351FqL74Stu71wDp+q9n8t8bR9gnvEuDisHXXH6PKIYo83vawvuDKtP05lt
 lF0l2nKy/QorQtUNRnrWiRshPNEplmKM1yfRXwzfq5CX4Mjox1PM9g1AfMT/Pzbq
 wNPMqtiaNnVzfcSP94MTExKMR5axFgeFsIwuCtPVNDAUEbEFuwKARIeFjCGxYYsr
 81rIKD4lgvJjaHChtE/NzslQysMmr6qiZa17s+NteCwNRJX7U4xN99SO2BXSW7lW
 xGb1ZjdESiBZGzsmuOXqAQw7KWIRS7bQ+s4dewEbqQJomPD3NQUKsRVo1wdWeUqI
 6SI1YsBYEDPiViiohsFyn1Zl3BHgIvypWMW/ChhKtYsmEJapTnJ3SR3a+eafFZ99
 HgCMCF1i0KN1tvMDrLRn
 =qL29
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most serious of them fixes a security
  bug (CVE-2012-4508) which leads to stale data exposure when we have
  fallocate racing against writes to files undergoing delayed
  allocation.  We also have two fixes for the metadata checksum feature,
  the most serious of which can cause the superblock to have a invalid
  checksum after a power failure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Avoid underflow in ext4_trim_fs()
  ext4: Checksum the block bitmap properly with bigalloc enabled
  ext4: fix undefined bit shift result in ext4_fill_flex_info
  ext4: fix metadata checksum calculation for the superblock
  ext4: race-condition protection for ext4_convert_unwritten_extents_endio
  ext4: serialize fallocate with ext4_convert_unwritten_extents
2012-10-23 08:48:26 +03:00
Linus Torvalds 344ba37bdc NFS client bugfixes for Linux 3.7
- Do not call pnfs_return_layout() from an rpciod context
 - nfs4_ds_disconnect can cause Oopses. Kill it...
 - Fix the return value for nfs_callback_start_svc
 - Fix a number of compile warnings
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQhYQSAAoJEGcL54qWCgDyX7UP/0u2lexD08QtE+rMO4+9PHeI
 6vpTqd6LURhbjZ5JGkThBHU3H8LwGb588SBMzeU6ULk5WituoBki6WCBTEw9TxCN
 4nBr9/7FEyzM/6JKSICeTs/x4ZGS+lHE2M6Q77auUj8nclykNs+oVmYCAyc1k/2J
 bL7Hz+MDLvpPLqcVthMpaSEfuPHOaawkHOD3Iqh7ZqtqcePm6K+GadD9s/pSY8RA
 8sslffyw4oEfSThaHyCuX3kcq3IL1rCez4QL986f/c1Owi8zud1c/63sSBFHpX3h
 cmQSR3R1tWNllIrwU8XD2ypVHM8RK3DFgO+KWyv9c55CCIAMeaviou9YNeWZ40d3
 iptigAjQNNzx9pMCEKtC0G/B4XLkByF9iL0/xEQzCFeX4z34mLYGpw9CpEv6bpKl
 RABPRzvJGyuodZEMxfzFKEr13M4LBV7XG6HjRt+r0ytlHOlkMil6q4sY5OhX62lG
 EwpiWx/52Lf2P7YPgkjEZeaPRq92uIX9oEGUYL1cEmXb7pVNRuPfXYtxyKy1FHXr
 xyBtbCOg/HwhYZdsHeT0C56Id0rFRP5Z+fQKdKNxyLQ9CPhpSZwAgdk/QLVeWedt
 IE5IDQpehq2v8rKyhLu08ny4YHSrUwopw6s9mZLDDFxOW9R1p23u28t50PjRSMao
 tWq7F03fspqbKeNXeB9I
 =bDBm
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Do not call pnfs_return_layout() from an rpciod context
 - nfs4_ds_disconnect can cause Oopses.  Kill it...
 - Fix the return value for nfs_callback_start_svc
 - Fix a number of compile warnings

* tag 'nfs-for-3.7-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix the return value for nfs_callback_start_svc
  NFSv4.1: Declare osd_pri_2_pnfs_err(), objio_init_read/write to be static
  NFSv4: fs/nfs/nfs4getroot.c needs to include "internal.h"
  NFSv4.1: Use kcalloc() to allocate zeroed arrays instead of kzalloc()
  NFSv4.1: Do not call pnfs_return_layout() from an rpciod context
  NFSv4.1: Kill nfs4_ds_disconnect()
2012-10-23 08:47:38 +03:00
Lukas Czerner 5de35e8d5c ext4: Avoid underflow in ext4_trim_fs()
Currently if len argument in ext4_trim_fs() is smaller than one block,
the 'end' variable underflow. Avoid that by returning EINVAL if len is
smaller than file system block.

Also remove useless unlikely().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-10-22 18:01:19 -04:00
Dmitry Torokhov 2f0157f13f char_dev: pin parent kobject
In certain cases (for example when a cdev structure is embedded into
another object whose lifetime is controlled by a separate kobject) it is
beneficial to tie lifetime of another object to the lifetime of
character device so that related object is not freed until after
char_dev object is freed.

To achieve this let's pin kobject's parent when doing cdev_add() and
unpin when last reference to cdev structure is being released.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-22 08:50:37 +03:00
Tao Ma 79f1ba4956 ext4: Checksum the block bitmap properly with bigalloc enabled
In mke2fs, we only checksum the whole bitmap block and it is right.
While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksumed bitmap which is wrong when we enable bigalloc.
The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes
it.

Also as every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better removes this parameter and sets it in the function itself.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-10-22 00:34:32 -04:00
KAMEZAWA Hiroyuki 9e7814404b hold task->mempolicy while numa_maps scans.
/proc/<pid>/numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

V2->v3
  -  updated comments to be more verbose.
  -  removed task_lock() in numa_maps code.
V1->V2
  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
     can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 14:32:10 -07:00