2019-05-23 17:14:55 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
2006-06-02 04:10:59 +08:00
|
|
|
/*
|
|
|
|
* fs/inotify_user.c - inotify support for userspace
|
|
|
|
*
|
|
|
|
* Authors:
|
|
|
|
* John McCutchan <ttb@tentacle.dhs.org>
|
|
|
|
* Robert Love <rml@novell.com>
|
|
|
|
*
|
|
|
|
* Copyright (C) 2005 John McCutchan
|
|
|
|
* Copyright 2006 Hewlett-Packard Development Company, L.P.
|
|
|
|
*
|
2009-05-22 05:02:01 +08:00
|
|
|
* Copyright (C) 2009 Eric Paris <Red Hat Inc>
|
|
|
|
* inotify was largely rewritten to make use of the fsnotify infrastructure
|
2006-06-02 04:10:59 +08:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/file.h>
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <linux/fs.h> /* struct inode */
|
|
|
|
#include <linux/fsnotify_backend.h>
|
|
|
|
#include <linux/idr.h>
|
2015-05-02 08:08:20 +08:00
|
|
|
#include <linux/init.h> /* fs_initcall */
|
2006-06-02 04:10:59 +08:00
|
|
|
#include <linux/inotify.h>
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <linux/kernel.h> /* roundup() */
|
|
|
|
#include <linux/namei.h> /* LOOKUP_FOLLOW */
|
2017-02-03 02:15:33 +08:00
|
|
|
#include <linux/sched/signal.h>
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <linux/slab.h> /* struct kmem_cache */
|
2006-06-02 04:10:59 +08:00
|
|
|
#include <linux/syscalls.h>
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <linux/types.h>
|
2010-02-11 15:24:46 +08:00
|
|
|
#include <linux/anon_inodes.h>
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <linux/uaccess.h>
|
|
|
|
#include <linux/poll.h>
|
|
|
|
#include <linux/wait.h>
|
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
|
|
|
#include <linux/memcontrol.h>
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely, unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 23:20:00 +08:00
|
|
|
#include <linux/security.h>
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
#include "inotify.h"
|
2012-12-18 08:05:12 +08:00
|
|
|
#include "../fdinfo.h"
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
#include <asm/ioctls.h>
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2016-12-14 21:56:33 +08:00
|
|
|
/* configurable via /proc/sys/fs/inotify/ */
|
2008-02-15 11:31:21 +08:00
|
|
|
static int inotify_max_queued_events __read_mostly;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2016-12-22 01:06:12 +08:00
|
|
|
struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
|
|
|
|
#include <linux/sysctl.h>
|
|
|
|
|
2014-06-07 05:38:04 +08:00
|
|
|
struct ctl_table inotify_table[] = {
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
|
|
|
.procname = "max_user_instances",
|
2016-12-14 21:56:33 +08:00
|
|
|
.data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
|
2006-06-02 04:10:59 +08:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-19 06:58:50 +08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2006-06-02 04:10:59 +08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "max_user_watches",
|
2016-12-14 21:56:33 +08:00
|
|
|
.data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
|
2006-06-02 04:10:59 +08:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-19 06:58:50 +08:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
2006-06-02 04:10:59 +08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "max_queued_events",
|
|
|
|
.data = &inotify_max_queued_events,
|
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2019-07-19 06:58:50 +08:00
|
|
|
.extra1 = SYSCTL_ZERO
|
2006-06-02 04:10:59 +08:00
|
|
|
},
|
2009-11-06 06:25:10 +08:00
|
|
|
{ }
|
2006-06-02 04:10:59 +08:00
|
|
|
};
|
|
|
|
#endif /* CONFIG_SYSCTL */
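The table entries above bound each value from below through `.extra1`. A minimal userspace sketch of that kind of range check follows. `struct demo_ctl`, `demo_vals` and `demo_store_minmax()` are hypothetical names invented for illustration; this is not the kernel's `proc_dointvec_minmax()` implementation, only the idea of rejecting a write that falls outside optional bounds.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical shared constants, standing in for a table of common bounds. */
static const int demo_vals[] = { 0, 1, INT_MAX };

/* Hypothetical, simplified control entry: a value plus optional bounds. */
struct demo_ctl {
	int *data;      /* where the accepted value is stored */
	const int *min; /* plays the role of extra1; NULL means no lower bound */
	const int *max; /* plays the role of extra2; NULL means no upper bound */
};

/* Store val only if it lies within the configured bounds. */
static int demo_store_minmax(const struct demo_ctl *ctl, int val)
{
	if (ctl->min && val < *ctl->min)
		return -EINVAL;
	if (ctl->max && val > *ctl->max)
		return -EINVAL;
	*ctl->data = val;
	return 0;
}
```

With only a lower bound of zero configured, as in the entries above, a negative write is rejected and the stored value is left untouched.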
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
static inline __u32 inotify_arg_to_mask(u32 arg)
|
2008-02-06 17:36:09 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
__u32 mask;
|
2008-02-06 17:36:09 +08:00
|
|
|
|
2010-07-28 22:18:37 +08:00
|
|
|
/*
|
|
|
|
* every watch should accept its own IN_IGNORED event, care about its children,
|
|
|
|
* and should receive events when the inode is unmounted
|
|
|
|
*/
|
|
|
|
mask = (FS_IN_IGNORED | FS_EVENT_ON_CHILD | FS_UNMOUNT);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* keep only the event bits and per-watch flags from the user-supplied arg */
|
2010-07-28 22:18:37 +08:00
|
|
|
mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT | IN_EXCL_UNLINK));
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
return mask;
|
2006-06-02 04:10:59 +08:00
|
|
|
}
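As a rough userspace illustration of the conversion above: a few bits are always set, and only the event bits plus the one-shot and exclude-unlink flags are carried over from the caller's argument. The `DEMO_*` constants and `demo_arg_to_mask()` are hypothetical stand-ins with illustrative bit values; the authoritative definitions are in the kernel headers.

```c
#include <stdint.h>

/* Illustrative bit values only; consult the kernel headers for the real ones. */
#define DEMO_IN_MODIFY         0x00000002u
#define DEMO_IN_ALL_EVENTS     0x00000fffu
#define DEMO_FS_UNMOUNT        0x00002000u
#define DEMO_FS_IN_IGNORED     0x00008000u
#define DEMO_IN_EXCL_UNLINK    0x04000000u
#define DEMO_FS_EVENT_ON_CHILD 0x08000000u
#define DEMO_IN_MASK_ADD       0x20000000u
#define DEMO_IN_ONESHOT        0x80000000u

/* Bits every watch gets regardless of what the caller asked for. */
#define DEMO_ALWAYS_ON \
	(DEMO_FS_IN_IGNORED | DEMO_FS_EVENT_ON_CHILD | DEMO_FS_UNMOUNT)

static uint32_t demo_arg_to_mask(uint32_t arg)
{
	uint32_t mask = DEMO_ALWAYS_ON;

	/* Anything outside the accepted set (e.g. DEMO_IN_MASK_ADD) is dropped. */
	mask |= arg & (DEMO_IN_ALL_EVENTS | DEMO_IN_ONESHOT | DEMO_IN_EXCL_UNLINK);
	return mask;
}
```

In this sketch a control flag such as `DEMO_IN_MASK_ADD` never reaches the resulting mask, while a requested event such as `DEMO_IN_MODIFY` does.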
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
static inline u32 inotify_mask_to_arg(__u32 mask)
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
return mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED |
|
|
|
|
IN_Q_OVERFLOW);
|
2006-06-02 04:10:59 +08:00
|
|
|
}
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* inotify userspace file descriptor functions */
|
2017-07-03 13:02:18 +08:00
|
|
|
static __poll_t inotify_poll(struct file *file, poll_table *wait)
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group = file->private_data;
|
2017-07-03 13:02:18 +08:00
|
|
|
__poll_t ret = 0;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
poll_wait(file, &group->notification_waitq, wait);
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2009-05-22 05:02:01 +08:00
|
|
|
if (!fsnotify_notify_queue_is_empty(group))
|
2018-02-12 06:34:03 +08:00
|
|
|
ret = EPOLLIN | EPOLLRDNORM;
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-01-22 07:48:14 +08:00
|
|
|
static int round_event_name_len(struct fsnotify_event *fsn_event)
|
2014-01-22 07:48:13 +08:00
|
|
|
{
|
2014-01-22 07:48:14 +08:00
|
|
|
struct inotify_event_info *event;
|
|
|
|
|
|
|
|
event = INOTIFY_E(fsn_event);
|
2014-01-22 07:48:13 +08:00
|
|
|
if (!event->name_len)
|
|
|
|
return 0;
|
|
|
|
return roundup(event->name_len + 1, sizeof(struct inotify_event));
|
|
|
|
}
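The padding rule in round_event_name_len() can be tried out in userspace. The sketch below assumes a record header laid out like four 32-bit fields followed by the name; `struct demo_inotify_event`, `DEMO_ROUNDUP` and `demo_padded_name_len()` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the event header: four 32-bit fields, then the name. */
struct demo_inotify_event {
	int32_t  wd;
	uint32_t mask;
	uint32_t cookie;
	uint32_t len;
	char     name[];
};

/* Round x up to the next multiple of y (y > 0). */
#define DEMO_ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

/*
 * Bytes reserved for a name of name_len characters: nothing when there is
 * no name, otherwise the name plus its NUL terminator, padded to a multiple
 * of the header size.
 */
static size_t demo_padded_name_len(size_t name_len)
{
	if (!name_len)
		return 0;
	return DEMO_ROUNDUP(name_len + 1, sizeof(struct demo_inotify_event));
}
```

The `+ 1` accounts for the terminating NUL, so a name that exactly fills one header-sized slot spills into a second slot.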
|
|
|
|
|
inotify: clean up inotify_read and fix locking problems
If userspace supplies an invalid pointer to a read() of an inotify
instance, the inotify device's event list mutex is unlocked twice.
This causes an unbalance which effectively leaves the data structure
unprotected, and we can trigger oopses by accessing the inotify
instance from different tasks concurrently.
The best fix (contributed largely by Linus) is a total rewrite
of the function in question:
On Thu, Jan 22, 2009 at 7:05 AM, Linus Torvalds wrote:
> The thing to notice is that:
>
> - locking is done in just one place, and there is no question about it
> not having an unlock.
>
> - that whole double-while(1)-loop thing is gone.
>
> - use multiple functions to make nesting and error handling sane
>
> - do error testing after doing the things you always need to do, ie do
> this:
>
> mutex_lock(..)
> ret = function_call();
> mutex_unlock(..)
>
> .. test ret here ..
>
> instead of doing conditional exits with unlocking or freeing.
>
> So if the code is written in this way, it may still be buggy, but at least
> it's not buggy because of subtle "forgot to unlock" or "forgot to free"
> issues.
>
> This _always_ unlocks if it locked, and it always frees if it got a
> non-error kevent.
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Robert Love <rlove@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
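The locking shape described in the quoted message (lock in one place, call a helper, unlock, then test the result) might look like the following userspace sketch. Everything here is hypothetical: `struct demo_lock` is a toy counter rather than a real mutex, chosen so that lock balance can be observed, and `demo_pop()` is not the kernel's read path.

```c
#include <errno.h>

/* Toy lock that only counts acquisitions, so an imbalance would be visible. */
struct demo_lock {
	int depth;
};

static void demo_lock_acquire(struct demo_lock *l) { l->depth++; }
static void demo_lock_release(struct demo_lock *l) { l->depth--; }

struct demo_queue {
	struct demo_lock lock;
	int items[4];
	int count;
};

/* Caller must hold q->lock. The helper never touches the lock itself. */
static int demo_pop_locked(struct demo_queue *q, int *out)
{
	if (q->count == 0)
		return -EAGAIN;
	*out = q->items[--q->count];
	return 0;
}

static int demo_pop(struct demo_queue *q, int *out)
{
	int ret;

	demo_lock_acquire(&q->lock);
	ret = demo_pop_locked(q, out);
	demo_lock_release(&q->lock);

	/* The result is examined only after the lock has been dropped. */
	return ret;
}
```

Because the helper has no early exit that touches the lock, every path through `demo_pop()` releases exactly what it acquired, whether or not the helper failed.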
2009-01-22 22:29:45 +08:00
|
|
|
/*
|
|
|
|
* Get the first queued fsnotify_event if one exists and is small
|
|
|
|
* enough to fit in "count". Return NULL if the queue is empty, or
|
|
|
|
* ERR_PTR(-EINVAL) if "count" is too small for the first event.
|
|
|
|
*
|
2016-10-08 07:56:52 +08:00
|
|
|
* Called with the group->notification_lock held.
|
2009-01-22 22:29:45 +08:00
|
|
|
*/
|
2009-05-22 05:02:01 +08:00
|
|
|
static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
|
|
|
|
size_t count)
|
2009-01-22 22:29:45 +08:00
|
|
|
{
|
|
|
|
size_t event_size = sizeof(struct inotify_event);
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_event *event;
|
2009-01-22 22:29:45 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
if (fsnotify_notify_queue_is_empty(group))
|
2009-01-22 22:29:45 +08:00
|
|
|
return NULL;
|
|
|
|
|
2014-08-07 07:03:26 +08:00
|
|
|
event = fsnotify_peek_first_event(group);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2010-07-28 22:18:37 +08:00
|
|
|
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
|
|
|
|
|
2014-01-22 07:48:13 +08:00
|
|
|
event_size += round_event_name_len(event);
|
2009-01-22 22:29:45 +08:00
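The lock / call / unlock / test ordering recommended in the quoted message can be sketched in userspace C. This is only an illustration under stated assumptions: `struct queue`, `queue_pop_locked()` and `queue_pop()` are hypothetical names that do not appear in the kernel source, and a pthread mutex stands in for whatever lock the real code takes.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical queue, for illustration only (not a kernel structure). */
struct queue {
	pthread_mutex_t lock;
	int items[8];
	size_t count;
};

/* Caller must hold q->lock. This helper never locks or unlocks. */
static int queue_pop_locked(struct queue *q, int *out)
{
	if (q->count == 0)
		return -EAGAIN;
	*out = q->items[--q->count];
	return 0;
}

/* Lock and unlock in exactly one place; inspect the result afterwards. */
static int queue_pop(struct queue *q, int *out)
{
	int ret;

	assert(q != NULL && out != NULL);

	pthread_mutex_lock(&q->lock);
	ret = queue_pop_locked(q, out);
	pthread_mutex_unlock(&q->lock);

	/* Error testing happens only after the unconditional unlock. */
	if (ret < 0)
		return ret;
	return 0;
}
```

Because the helper has no exit path that touches the lock, an error return cannot skip the unlock or perform it twice, which is the class of bug the commit describes.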
|
|
|
if (event_size > count)
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
2016-10-08 07:56:52 +08:00
|
|
|
/* held the notification_lock the whole time, so this is the
|
2009-05-22 05:02:01 +08:00
|
|
|
* same event we peeked above */
|
2014-08-07 07:03:26 +08:00
|
|
|
fsnotify_remove_first_event(group);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
return event;
|
2009-01-22 22:29:45 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy an event to user space, returning how much we copied.
|
|
|
|
*
|
|
|
|
* We already checked that the event size is smaller than the
|
|
|
|
* buffer we had in "get_one_event()" above.
|
|
|
|
*/
|
2009-05-22 05:02:01 +08:00
|
|
|
static ssize_t copy_event_to_user(struct fsnotify_group *group,
|
2014-01-22 07:48:14 +08:00
|
|
|
struct fsnotify_event *fsn_event,
|
2009-01-22 22:29:45 +08:00
|
|
|
char __user *buf)
|
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct inotify_event inotify_event;
|
2014-01-22 07:48:14 +08:00
|
|
|
struct inotify_event_info *event;
|
2009-01-22 22:29:45 +08:00
|
|
|
size_t event_size = sizeof(struct inotify_event);
|
2014-01-22 07:48:13 +08:00
|
|
|
size_t name_len;
|
|
|
|
size_t pad_name_len;
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2014-01-22 07:48:14 +08:00
|
|
|
pr_debug("%s: group=%p event=%p\n", __func__, group, fsn_event);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2014-01-22 07:48:14 +08:00
|
|
|
event = INOTIFY_E(fsn_event);
|
2014-01-22 07:48:13 +08:00
|
|
|
name_len = event->name_len;
|
2009-08-28 22:00:05 +08:00
|
|
|
/*
|
2014-01-22 07:48:13 +08:00
|
|
|
* round up name length so it is a multiple of event_size
|
2009-08-27 18:20:04 +08:00
|
|
|
* plus an extra byte for the terminating '\0'.
|
|
|
|
*/
|
2014-01-22 07:48:14 +08:00
|
|
|
pad_name_len = round_event_name_len(fsn_event);
|
2014-01-22 07:48:13 +08:00
|
|
|
inotify_event.len = pad_name_len;
|
2019-01-11 01:04:31 +08:00
|
|
|
inotify_event.mask = inotify_mask_to_arg(event->mask);
|
2014-01-22 07:48:14 +08:00
|
|
|
inotify_event.wd = event->wd;
|
2009-05-22 05:02:01 +08:00
|
|
|
inotify_event.cookie = event->sync_cookie;
|
2009-01-22 22:29:45 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* send the main event */
|
|
|
|
if (copy_to_user(buf, &inotify_event, event_size))
|
2009-01-22 22:29:45 +08:00
|
|
|
return -EFAULT;
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
buf += event_size;
|
2009-01-22 22:29:45 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/*
|
|
|
|
* fsnotify only stores the pathname, so here we have to send the pathname
|
|
|
|
* and then pad that pathname out to a multiple of sizeof(inotify_event)
|
2014-01-22 07:48:13 +08:00
|
|
|
* with zeros.
|
2009-05-22 05:02:01 +08:00
|
|
|
*/
|
2014-01-22 07:48:13 +08:00
|
|
|
if (pad_name_len) {
|
2009-05-22 05:02:01 +08:00
|
|
|
/* copy the path name */
|
2014-01-22 07:48:14 +08:00
|
|
|
if (copy_to_user(buf, event->name, name_len))
|
2009-01-22 22:29:45 +08:00
|
|
|
return -EFAULT;
|
2014-01-22 07:48:13 +08:00
|
|
|
buf += name_len;
|
2009-01-22 22:29:45 +08:00
|
|
|
|
2009-08-27 18:20:04 +08:00
|
|
|
/* fill userspace with 0's */
|
2014-01-22 07:48:13 +08:00
|
|
|
if (clear_user(buf, pad_name_len - name_len))
|
2009-05-22 05:02:01 +08:00
|
|
|
return -EFAULT;
|
2014-01-22 07:48:13 +08:00
|
|
|
event_size += pad_name_len;
|
2009-01-22 22:29:45 +08:00
|
|
|
}
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2009-01-22 22:29:45 +08:00
|
|
|
return event_size;
|
|
|
|
}
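The size of one record as written by copy_event_to_user() can be reproduced from userspace. The sketch below assumes, based on the comment in the function and on the `if (pad_name_len)` branch, that the padded name length is `name_len + 1` rounded up to a multiple of `sizeof(struct inotify_event)`, and that an event without a name gets no padding at all. `padded_name_len()` and `record_size()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/inotify.h>

/*
 * Hypothetical helper: room for the name plus its terminating '\0',
 * rounded up to a multiple of the header size. Assumed to return 0
 * when the event carries no name.
 */
static size_t padded_name_len(size_t name_len)
{
	const size_t unit = sizeof(struct inotify_event);

	if (name_len == 0)
		return 0;
	assert(name_len < SIZE_MAX - unit);	/* guard the rounding arithmetic */
	return ((name_len + 1 + unit - 1) / unit) * unit;
}

/* Total bytes one event occupies in the read() buffer: header + padded name. */
static size_t record_size(size_t name_len)
{
	return sizeof(struct inotify_event) + padded_name_len(name_len);
}
```

Under these assumptions, a name that is exactly `sizeof(struct inotify_event)` bytes long needs two units of padding, because the terminating NUL no longer fits in the first one.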
|
|
|
|
|
2006-06-02 04:10:59 +08:00
|
|
|
static ssize_t inotify_read(struct file *file, char __user *buf,
|
|
|
|
size_t count, loff_t *pos)
|
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group;
|
|
|
|
struct fsnotify_event *kevent;
|
2006-06-02 04:10:59 +08:00
|
|
|
char __user *start;
|
|
|
|
int ret;
|
2014-09-24 16:18:50 +08:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
start = buf;
|
2009-05-22 05:02:01 +08:00
|
|
|
group = file->private_data;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2014-09-24 16:18:50 +08:00
|
|
|
add_wait_queue(&group->notification_waitq, &wait);
|
2006-06-02 04:10:59 +08:00
|
|
|
while (1) {
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2009-05-22 05:02:01 +08:00
|
|
|
kevent = get_one_event(group, count);
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2010-07-28 22:18:37 +08:00
|
|
|
pr_debug("%s: group=%p kevent=%p\n", __func__, group, kevent);
|
|
|
|
|
2009-01-22 22:29:45 +08:00
|
|
|
if (kevent) {
|
|
|
|
ret = PTR_ERR(kevent);
|
|
|
|
if (IS_ERR(kevent))
|
|
|
|
break;
|
2009-05-22 05:02:01 +08:00
|
|
|
ret = copy_event_to_user(group, kevent, buf);
|
2014-01-22 07:48:14 +08:00
|
|
|
fsnotify_destroy_event(group, kevent);
|
2009-01-22 22:29:45 +08:00
|
|
|
if (ret < 0)
|
|
|
|
break;
|
|
|
|
buf += ret;
|
|
|
|
count -= ret;
|
|
|
|
continue;
|
2006-06-02 04:10:59 +08:00
|
|
|
}
|
|
|
|
|
2009-01-22 22:29:45 +08:00
|
|
|
ret = -EAGAIN;
|
|
|
|
if (file->f_flags & O_NONBLOCK)
|
2006-06-02 04:10:59 +08:00
|
|
|
break;
|
2012-03-27 01:07:59 +08:00
|
|
|
ret = -ERESTARTSYS;
|
2009-01-22 22:29:45 +08:00
|
|
|
if (signal_pending(current))
|
2006-06-02 04:10:59 +08:00
|
|
|
break;
|
2008-10-03 05:50:12 +08:00
|
|
|
|
2009-01-22 22:29:45 +08:00
|
|
|
if (start != buf)
|
2006-06-02 04:10:59 +08:00
|
|
|
break;
|
2008-10-03 05:50:12 +08:00
|
|
|
|
2014-09-24 16:18:50 +08:00
|
|
|
wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
|
2006-06-02 04:10:59 +08:00
|
|
|
}
|
2014-09-24 16:18:50 +08:00
|
|
|
remove_wait_queue(&group->notification_waitq, &wait);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-01-22 22:29:45 +08:00
|
|
|
if (start != buf && ret != -EFAULT)
|
|
|
|
ret = buf - start;
|
2006-06-02 04:10:59 +08:00
|
|
|
return ret;
|
|
|
|
}
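From the caller's side, one read() on an inotify descriptor returns zero or more whole records back to back, since inotify_read() only copies an event when get_one_event() found that it fits in the remaining `count`. A consumer can therefore walk the buffer using each header's `len` field. The following is a minimal userspace sketch; `count_events()` and `read_and_count()` are hypothetical helper names, and the 4096-byte buffer size is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the complete records in a buffer filled by read() on an inotify fd.
 * Each record is a struct inotify_event header followed by hdr.len name bytes.
 */
static int count_events(const char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (len - off >= sizeof(struct inotify_event)) {
		struct inotify_event hdr;

		/* memcpy avoids assuming that buf + off is suitably aligned. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len > len - off - sizeof(hdr))
			break;	/* truncated record; stop walking */
		off += sizeof(hdr) + hdr.len;
		n++;
	}
	return n;
}

/* Hypothetical wrapper: a single read(), then walk whatever was returned. */
int read_and_count(int fd)
{
	char buf[4096];
	ssize_t got = read(fd, buf, sizeof(buf));

	if (got < 0)
		return -1;	/* errno set by read(), e.g. EAGAIN or EINVAL */
	assert((size_t)got <= sizeof(buf));
	return count_events(buf, (size_t)got);
}
```

With the kernel code shown above, a non-blocking descriptor with no queued events fails with EAGAIN, and a buffer too small for the first queued event fails with EINVAL, so callers of a wrapper like this would normally check `errno` when it returns -1.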
|
|
|
|
|
|
|
|
static int inotify_release(struct inode *ignored, struct file *file)
|
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group = file->private_data;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2010-07-28 22:18:37 +08:00
|
|
|
pr_debug("%s: group=%p\n", __func__, group);
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* free this group, matching get was inotify_init->fsnotify_obtain_group */
|
2011-06-14 23:29:45 +08:00
|
|
|
fsnotify_destroy_group(group);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static long inotify_ioctl(struct file *file, unsigned int cmd,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group;
|
2014-01-22 07:48:14 +08:00
|
|
|
struct fsnotify_event *fsn_event;
|
2006-06-02 04:10:59 +08:00
|
|
|
void __user *p;
|
|
|
|
int ret = -ENOTTY;
|
2009-05-22 05:02:01 +08:00
|
|
|
size_t send_len = 0;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
group = file->private_data;
|
2006-06-02 04:10:59 +08:00
|
|
|
p = (void __user *) arg;
|
|
|
|
|
2010-07-28 22:18:37 +08:00
|
|
|
pr_debug("%s: group=%p cmd=%u\n", __func__, group, cmd);
|
|
|
|
|
2006-06-02 04:10:59 +08:00
|
|
|
switch (cmd) {
|
|
|
|
case FIONREAD:
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_lock(&group->notification_lock);
|
2014-01-22 07:48:14 +08:00
|
|
|
list_for_each_entry(fsn_event, &group->notification_list,
|
|
|
|
list) {
|
2009-05-22 05:02:01 +08:00
|
|
|
send_len += sizeof(struct inotify_event);
|
2014-01-22 07:48:14 +08:00
|
|
|
send_len += round_event_name_len(fsn_event);
|
2009-05-22 05:02:01 +08:00
|
|
|
}
|
2016-10-08 07:56:52 +08:00
|
|
|
spin_unlock(&group->notification_lock);
|
2009-05-22 05:02:01 +08:00
|
|
|
ret = put_user(send_len, (int __user *) p);
|
2006-06-02 04:10:59 +08:00
|
|
|
break;
|
inotify: Extend ioctl to allow to request id of new watch descriptor
Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().
CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:
	while (1) {
		int wd;

		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0) {
			break;
		} else if (wd == desired_wd_id) {
			ret = 0;
			break;
		}
		inotify_rm_watch(inotify_fd, wd);
	}
(You may find the actual code at the below link:
https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
The cycle is suboptimal and very expensive, but since there was no better
kernel support, it was the only way to restore that. Happily, we had mostly
met descriptors with small ids, and this approach worked somehow.
But recently containers using inotify with big watch descriptors
began to appear, and this way stopped working at all. When the descriptor id
is around 0x34d71d6, the restoring process spins in a busy loop
for a long time, the restore hangs, and the delay of migration from node
to node is easily noticeable.
This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.
The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.
v2: Use INT_MAX instead of custom definition of max id,
as IDR subsystem guarantees id is between 0 and INT_MAX.
CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <willy@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-09 23:04:54 +08:00
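For comparison with the loop quoted in the commit message above, a userspace restore step built on the new ioctl might look like the following sketch. It is not CRIU's code: `restore_watch()` is a hypothetical helper, and the fallback `#define` assumes the request number `_IOW('I', 0, int32_t)` from the uapi header, for builds against older headers. The wd is passed by value in the ioctl argument, matching the range check on `arg` in `inotify_ioctl()`. The request is expected to fail on kernels built without CONFIG_CHECKPOINT_RESTORE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <sys/ioctl.h>

#ifndef INOTIFY_IOC_SETNEXTWD
/* Assumption: mirrors the uapi definition in <linux/inotify.h>. */
#define INOTIFY_IOC_SETNEXTWD _IOW('I', 0, int32_t)
#endif

/*
 * Hypothetical helper: ask the kernel to hand out desired_wd next, then
 * add the watch. Returns 0 on success, -1 with errno set on failure.
 */
static int restore_watch(int inotify_fd, const char *path, uint32_t mask,
			 int desired_wd)
{
	int wd;

	assert(path != NULL);

	if (desired_wd < 1) {
		errno = EINVAL;	/* the kernel rejects values below 1 too */
		return -1;
	}

	/* The wd travels in the ioctl argument itself, not via a pointer. */
	if (ioctl(inotify_fd, INOTIFY_IOC_SETNEXTWD,
		  (unsigned long)desired_wd) < 0)
		return -1;

	wd = inotify_add_watch(inotify_fd, path, mask);
	if (wd < 0)
		return -1;

	if (wd != desired_wd) {
		/* The requested id was already occupied; undo and report. */
		inotify_rm_watch(inotify_fd, wd);
		errno = EEXIST;
		return -1;
	}
	return 0;
}
```

Unlike the loop, this makes one attempt per watch. A caller restoring several watches would presumably call it once per saved descriptor id.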
|
|
|
#ifdef CONFIG_CHECKPOINT_RESTORE
|
|
|
|
case INOTIFY_IOC_SETNEXTWD:
|
|
|
|
ret = -EINVAL;
|
|
|
|
if (arg >= 1 && arg <= INT_MAX) {
|
|
|
|
struct inotify_group_private_data *data;
|
|
|
|
|
|
|
|
data = &group->inotify_data;
|
|
|
|
spin_lock(&data->idr_lock);
|
|
|
|
idr_set_cursor(&data->idr, (unsigned int)arg);
|
|
|
|
spin_unlock(&data->idr_lock);
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
#endif /* CONFIG_CHECKPOINT_RESTORE */
|
2006-06-02 04:10:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
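The FIONREAD branch above reports how many bytes a subsequent read() would return: one fixed-size header per queued event plus that event's padded name. A minimal userspace model of that arithmetic follows. `padded_name_len()` and `pending_bytes()` are hypothetical names, and the padding rule (name plus NUL terminator, rounded up to a multiple of the header size, or zero for nameless events) is an assumption about `round_event_name_len()`, which is defined outside this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/inotify.h>

/* Assumed rule: name + NUL, rounded up to a multiple of the header size. */
static size_t padded_name_len(size_t name_len)
{
	const size_t unit = sizeof(struct inotify_event);

	if (name_len == 0)
		return 0;
	return (name_len + 1 + unit - 1) / unit * unit;
}

/* Sum header and padded name per queued event, as the kernel loop does. */
static size_t pending_bytes(const size_t *name_lens, size_t nr_events)
{
	size_t send_len = 0;

	assert(name_lens != NULL || nr_events == 0);
	for (size_t i = 0; i < nr_events; i++) {
		send_len += sizeof(struct inotify_event);
		send_len += padded_name_len(name_lens[i]);
	}
	return send_len;
}
```

On a real inotify descriptor the same figure is obtained with `ioctl(fd, FIONREAD, &n)`, where `n` is an `int`.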
|
|
|
|
|
|
|
|
static const struct file_operations inotify_fops = {
|
2012-12-18 08:05:12 +08:00
|
|
|
.show_fdinfo = inotify_show_fdinfo,
|
2009-05-22 05:02:01 +08:00
|
|
|
.poll = inotify_poll,
|
|
|
|
.read = inotify_read,
|
2011-10-15 05:43:39 +08:00
|
|
|
.fasync = fsnotify_fasync,
|
2009-05-22 05:02:01 +08:00
|
|
|
.release = inotify_release,
|
|
|
|
.unlocked_ioctl = inotify_ioctl,
|
2006-06-02 04:10:59 +08:00
|
|
|
.compat_ioctl = inotify_ioctl,
|
llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-08-16 00:52:59 +08:00
|
|
|
.llseek = noop_llseek,
|
2006-06-02 04:10:59 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/*
|
|
|
|
* inotify_find_inode - resolve a user-given path to a specific inode
|
|
|
|
*/
|
fanotify, inotify, dnotify, security: add security hook for fs notifications
As of now, setting watches on filesystem objects has, at most, applied a
check for read access to the inode, and in the case of fanotify, requires
CAP_SYS_ADMIN. No specific security hook or permission check has been
provided to control the setting of watches. Using any of inotify, dnotify,
or fanotify, it is possible to observe, not only write-like operations, but
even read access to a file. Modeling the watch as being merely a read from
the file is insufficient for the needs of SELinux. This is due to the fact
that read access should not necessarily imply access to information about
when another process reads from a file. Furthermore, fanotify watches grant
more power to an application in the form of permission events. While
notification events are solely, unidirectional (i.e. they only pass
information to the receiving application), permission events are blocking.
Permission events make a request to the receiving application which will
then reply with a decision as to whether or not that action may be
completed. This causes the issue of the watching application having the
ability to exercise control over the triggering process. Without drawing a
distinction within the permission check, the ability to read would imply
the greater ability to control an application. Additionally, mount and
superblock watches apply to all files within the same mount or superblock.
Read access to one file should not necessarily imply the ability to watch
all files accessed within a given mount or superblock.
In order to solve these issues, a new LSM hook is implemented and has been
placed within the system calls for marking filesystem objects with inotify,
fanotify, and dnotify watches. These calls to the hook are placed at the
point at which the target path has been resolved and are provided with the
path struct, the mask of requested notification events, and the type of
object on which the mark is being set (inode, superblock, or mount). The
mask and obj_type have already been translated into common FS_* values
shared by the entirety of the fs notification infrastructure. The path
struct is passed rather than just the inode so that the mount is available,
particularly for mount watches. This also allows for use of the hook by
pathname-based security modules. However, since the hook is intended for
use even by inode based security modules, it is not placed under the
CONFIG_SECURITY_PATH conditional. Otherwise, the inode-based security
modules would need to enable all of the path hooks, even though they do not
use any of them.
This only provides a hook at the point of setting a watch, and presumes
that permission to set a particular watch implies the ability to receive
all notification about that object which match the mask. This is all that
is required for SELinux. If other security modules require additional hooks
or infrastructure to control delivery of notification, these can be added
by them. It does not make sense for us to propose hooks for which we have
no implementation. The understanding that all notifications received by the
requesting application are all strictly of a type for which the application
has been granted permission shows that this implementation is sufficient in
its coverage.
Security modules wishing to provide complete control over fanotify must
also implement a security_file_open hook that validates that the access
requested by the watching application is authorized. Fanotify has the issue
that it returns a file descriptor with the file mode specified during
fanotify_init() to the watching process on event. This is already covered
by the LSM security_file_open hook if the security module implements
checking of the requested file mode there. Otherwise, a watching process
can obtain escalated access to a file for which it has not been authorized.
The selinux_path_notify hook implementation works by adding five new file
permissions: watch, watch_mount, watch_sb, watch_reads, and watch_with_perm
(descriptions about which will follow), and one new filesystem permission:
watch (which is applied to superblock checks). The hook then decides which
subset of these permissions must be held by the requesting application
based on the contents of the provided mask and the obj_type. The
selinux_file_open hook already checks the requested file mode and therefore
ensures that a watching process cannot escalate its access through
fanotify.
The watch, watch_mount, and watch_sb permissions are the baseline
permissions for setting a watch on an object and each are a requirement for
any watch to be set on a file, mount, or superblock respectively. It should
be noted that having either of the other two permissions (watch_reads and
watch_with_perm) does not imply the watch, watch_mount, or watch_sb
permission. Superblock watches further require the filesystem watch
permission to the superblock. As there is no labeled object in view for
mounts, there is no specific check for mount watches beyond watch_mount to
the inode. Such a check could be added in the future, if a suitable labeled
object existed representing the mount.
The watch_reads permission is required to receive notifications from
read-exclusive events on filesystem objects. These events include accessing
a file for the purpose of reading and closing a file which has been opened
read-only. This distinction has been drawn in order to provide a direct
indication in the policy for this otherwise not obvious capability. Read
access to a file should not necessarily imply the ability to observe read
events on a file.
Finally, watch_with_perm only applies to fanotify masks since it is the
only way to set a mask which allows for the blocking, permission event.
This permission is needed for any watch which is of this type. Though
fanotify requires CAP_SYS_ADMIN, this is insufficient as it gives implicit
trust to root, which we do not do, and does not support least privilege.
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-08-12 23:20:00 +08:00
|
|
|
static int inotify_find_inode(const char __user *dirname, struct path *path,
|
|
|
|
unsigned int flags, __u64 mask)
|
2009-05-22 05:02:01 +08:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
error = user_path_at(AT_FDCWD, dirname, flags, path);
|
|
|
|
if (error)
|
|
|
|
return error;
|
|
|
|
/* you can only watch an inode if you have read permissions on it */
|
|
|
|
error = inode_permission(path->dentry->d_inode, MAY_READ);
|
2019-08-12 23:20:00 +08:00
|
|
|
if (error) {
|
|
|
|
path_put(path);
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
error = security_path_notify(path, mask,
|
|
|
|
FSNOTIFY_OBJ_TYPE_INODE);
|
2009-05-22 05:02:01 +08:00
|
|
|
if (error)
|
|
|
|
path_put(path);
|
2019-08-12 23:20:00 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
static int inotify_add_to_idr(struct idr *idr, spinlock_t *idr_lock,
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark)
|
2009-12-18 09:12:04 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2013-02-28 09:04:50 +08:00
|
|
|
idr_preload(GFP_KERNEL);
|
|
|
|
spin_lock(idr_lock);
|
2009-12-18 09:12:04 +08:00
|
|
|
|
2013-04-30 07:21:21 +08:00
|
|
|
ret = idr_alloc_cyclic(idr, i_mark, 1, 0, GFP_NOWAIT);
|
2013-02-28 09:04:50 +08:00
|
|
|
if (ret >= 0) {
|
2009-12-18 09:12:04 +08:00
|
|
|
/* we added the mark to the idr, take a reference */
|
2013-02-28 09:04:50 +08:00
|
|
|
i_mark->wd = ret;
|
|
|
|
fsnotify_get_mark(&i_mark->fsn_mark);
|
|
|
|
}
|
2009-12-18 09:12:04 +08:00
|
|
|
|
2013-02-28 09:04:50 +08:00
|
|
|
spin_unlock(idr_lock);
|
|
|
|
idr_preload_end();
|
|
|
|
return ret < 0 ? ret : 0;
|
2009-12-18 09:12:04 +08:00
|
|
|
}
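The allocation above starts at 1 and is cyclic: each new watch descriptor normally continues from the previous one instead of reusing freed numbers, and INOTIFY_IOC_SETNEXTWD moves the starting point. The toy model below illustrates that behaviour only. It is not the kernel IDR: every `toy_*` name is hypothetical, a small fixed array replaces the real data structure, and `TOY_MAX_ID` stands in for INT_MAX.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ID 7	/* stand-in for INT_MAX to keep the model tiny */

struct toy_idr {
	bool used[TOY_MAX_ID + 1];
	int next;		/* cursor: where the next search begins */
};

/* First free id in [from, TOY_MAX_ID], or -1 if there is none. */
static int toy_find_free(const struct toy_idr *idr, int from)
{
	for (int id = from; id <= TOY_MAX_ID; id++)
		if (!idr->used[id])
			return id;
	return -1;
}

/* Search from the cursor; wrap around to start once if nothing is free. */
static int toy_alloc_cyclic(struct toy_idr *idr, int start)
{
	int from = idr->next > start ? idr->next : start;
	int id;

	assert(start >= 0 && start <= TOY_MAX_ID);
	id = toy_find_free(idr, from);
	if (id < 0 && from > start)
		id = toy_find_free(idr, start);
	if (id < 0)
		return -1;
	idr->used[id] = true;
	idr->next = id + 1;
	return id;
}

/* Models the cursor update performed for INOTIFY_IOC_SETNEXTWD. */
static void toy_set_cursor(struct toy_idr *idr, int val)
{
	idr->next = val;
}

static void toy_remove(struct toy_idr *idr, int id)
{
	assert(id >= 0 && id <= TOY_MAX_ID);
	idr->used[id] = false;
}
```

In this model a freed id is not handed out again until the search wraps, and a cursor pointing at an occupied id yields the next free id after it.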
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
static struct inotify_inode_mark *inotify_idr_find_locked(struct fsnotify_group *group,
|
2009-12-18 09:12:04 +08:00
|
|
|
int wd)
|
|
|
|
{
|
|
|
|
struct idr *idr = &group->inotify_data.idr;
|
|
|
|
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark;
|
2009-12-18 09:12:04 +08:00
|
|
|
|
|
|
|
assert_spin_locked(idr_lock);
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
i_mark = idr_find(idr, wd);
|
|
|
|
if (i_mark) {
|
|
|
|
struct fsnotify_mark *fsn_mark = &i_mark->fsn_mark;
|
2009-12-18 09:12:04 +08:00
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
fsnotify_get_mark(fsn_mark);
|
2009-12-18 09:12:04 +08:00
|
|
|
/* One ref for being in the idr, one ref we just took */
|
2017-10-20 18:26:02 +08:00
|
|
|
BUG_ON(refcount_read(&fsn_mark->refcnt) < 2);
|
2009-12-18 09:12:04 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
return i_mark;
|
2009-12-18 09:12:04 +08:00
|
|
|
}
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
static struct inotify_inode_mark *inotify_idr_find(struct fsnotify_group *group,
|
2009-12-18 09:12:04 +08:00
|
|
|
int wd)
|
|
|
|
{
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark;
|
2009-12-18 09:12:04 +08:00
|
|
|
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
|
|
|
|
|
|
|
|
spin_lock(idr_lock);
|
2009-12-18 10:24:24 +08:00
|
|
|
i_mark = inotify_idr_find_locked(group, wd);
|
2009-12-18 09:12:04 +08:00
|
|
|
spin_unlock(idr_lock);
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
return i_mark;
|
2009-12-18 09:12:04 +08:00
|
|
|
}
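The two lookup helpers above follow a common split: a `_locked` variant that requires the caller to hold the lock, and a thin wrapper that takes the lock itself. The sketch below models that split in userspace C11. All `toy_*` names are hypothetical, the test-and-set flag only stands in for the kernel spinlock, and the plain `int` reference count is a simplification of `refcount_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 8

struct toy_mark {
	int wd;
	int refcnt;	/* starts at 1: the table's own reference */
};

struct toy_group {
	atomic_bool lock;
	struct toy_mark *slots[TOY_SLOTS];
};

static void toy_lock(struct toy_group *g)
{
	while (atomic_exchange(&g->lock, true))
		;	/* spin until the previous holder releases */
}

static void toy_unlock(struct toy_group *g)
{
	atomic_store(&g->lock, false);
}

/* Caller must hold the lock; returns the mark with an extra reference. */
static struct toy_mark *toy_find_locked(struct toy_group *g, int wd)
{
	struct toy_mark *mark;

	assert(atomic_load(&g->lock));
	if (wd < 0 || wd >= TOY_SLOTS)
		return NULL;
	mark = g->slots[wd];
	if (mark) {
		mark->refcnt++;
		/* one reference for the table, one just taken */
		assert(mark->refcnt >= 2);
	}
	return mark;
}

/* Convenience wrapper for callers that do not hold the lock. */
static struct toy_mark *toy_find(struct toy_group *g, int wd)
{
	struct toy_mark *mark;

	toy_lock(g);
	mark = toy_find_locked(g, wd);
	toy_unlock(g);
	return mark;
}
```

Keeping the `_locked` variant separate lets a caller that already holds the lock, such as the removal path, do the lookup without releasing and retaking it.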
|
|
|
|
|
2009-08-25 04:03:35 +08:00
|
|
|
/*
|
|
|
|
* Remove the mark from the idr (if present) and drop the reference
|
|
|
|
* on the mark because it was in the idr.
|
|
|
|
*/
|
2009-07-07 22:28:24 +08:00
|
|
|
static void inotify_remove_from_idr(struct fsnotify_group *group,
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark)
|
2009-07-07 22:28:24 +08:00
|
|
|
{
|
2016-12-21 18:50:39 +08:00
|
|
|
struct idr *idr = &group->inotify_data.idr;
|
2009-12-18 09:12:04 +08:00
|
|
|
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *found_i_mark = NULL;
|
2009-08-25 04:03:35 +08:00
|
|
|
int wd;
|
2009-07-07 22:28:24 +08:00
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
spin_lock(idr_lock);
|
2009-12-18 10:24:24 +08:00
|
|
|
wd = i_mark->wd;
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
/*
|
2009-12-18 10:24:24 +08:00
|
|
|
* does this i_mark think it is in the idr? we shouldn't get called
|
2009-12-18 09:12:04 +08:00
|
|
|
* if it wasn't....
|
|
|
|
*/
|
|
|
|
if (wd == -1) {
|
2016-12-09 16:38:55 +08:00
|
|
|
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
|
|
|
|
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
|
2009-08-25 04:03:35 +08:00
|
|
|
goto out;
|
2009-12-18 09:12:04 +08:00
|
|
|
}
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
/* Let's look in the idr to see if we find it */
|
2009-12-18 10:24:24 +08:00
|
|
|
found_i_mark = inotify_idr_find_locked(group, wd);
|
|
|
|
if (unlikely(!found_i_mark)) {
|
2016-12-09 16:38:55 +08:00
|
|
|
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
|
|
|
|
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
|
2009-08-25 04:03:35 +08:00
|
|
|
goto out;
|
2009-12-18 09:12:04 +08:00
|
|
|
}
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
/*
|
2009-12-18 10:24:24 +08:00
|
|
|
* We found a mark in the idr at the right wd, but it's
|
|
|
|
* not the mark we were told to remove. eparis seriously
|
2009-12-18 09:12:04 +08:00
|
|
|
* fucked up somewhere.
|
|
|
|
*/
|
2009-12-18 10:24:24 +08:00
|
|
|
if (unlikely(found_i_mark != i_mark)) {
|
|
|
|
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p "
|
2016-12-09 16:38:55 +08:00
|
|
|
"found_i_mark=%p found_i_mark->wd=%d "
|
|
|
|
"found_i_mark->group=%p\n", __func__, i_mark,
|
|
|
|
i_mark->wd, i_mark->fsn_mark.group, found_i_mark,
|
|
|
|
found_i_mark->wd, found_i_mark->fsn_mark.group);
|
2009-08-25 04:03:35 +08:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
/*
|
|
|
|
* One ref for being in the idr
|
|
|
|
* one ref grabbed by inotify_idr_find_locked
|
|
|
|
*/
|
2017-10-20 18:26:02 +08:00
|
|
|
if (unlikely(refcount_read(&i_mark->fsn_mark.refcnt) < 2)) {
|
2016-12-09 16:38:55 +08:00
|
|
|
printk(KERN_ERR "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
|
|
|
|
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
|
2009-12-18 09:12:04 +08:00
|
|
|
/* we can't really recover from bad ref counting */
|
|
|
|
BUG();
|
|
|
|
}
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2016-12-21 18:50:39 +08:00
|
|
|
idr_remove(idr, wd);
|
|
|
|
/* Removed from the idr, drop that ref. */
|
|
|
|
fsnotify_put_mark(&i_mark->fsn_mark);
|
2009-08-25 04:03:35 +08:00
|
|
|
out:
|
2016-12-21 18:50:39 +08:00
|
|
|
i_mark->wd = -1;
|
|
|
|
spin_unlock(idr_lock);
|
2009-12-18 09:12:04 +08:00
|
|
|
/* match the ref taken by inotify_idr_find_locked() */
|
2009-12-18 10:24:24 +08:00
|
|
|
if (found_i_mark)
|
|
|
|
fsnotify_put_mark(&found_i_mark->fsn_mark);
|
2009-07-07 22:28:24 +08:00
|
|
|
}
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/*
|
2009-08-25 04:03:35 +08:00
|
|
|
* Send IN_IGNORED for this wd, remove this wd from the idr.
|
2009-05-22 05:02:01 +08:00
|
|
|
*/
|
2009-12-18 10:24:24 +08:00
|
|
|
void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
|
2009-06-13 04:04:26 +08:00
|
|
|
struct fsnotify_group *group)
|
2009-05-22 05:02:01 +08:00
|
|
|
{
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark;
|
2018-04-21 07:10:52 +08:00
|
|
|
struct fsnotify_iter_info iter_info = { };
|
|
|
|
|
|
|
|
fsnotify_iter_set_report_type_mark(&iter_info, FSNOTIFY_OBJ_TYPE_INODE,
|
|
|
|
fsn_mark);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2014-01-22 07:48:14 +08:00
|
|
|
/* Queue ignore event for the watch */
|
2018-04-21 07:10:50 +08:00
|
|
|
inotify_handle_event(group, NULL, FS_IN_IGNORED, NULL,
|
|
|
|
FSNOTIFY_EVENT_NONE, NULL, 0, &iter_info);
|
2009-07-16 03:49:52 +08:00
|
|
|
|
2014-01-22 07:48:14 +08:00
|
|
|
i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark);
|
2009-12-18 10:24:24 +08:00
|
|
|
/* remove this mark from the idr */
|
|
|
|
inotify_remove_from_idr(group, i_mark);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2016-12-14 21:56:33 +08:00
|
|
|
dec_inotify_watches(group->inotify_data.ucounts);
|
2009-05-22 05:02:01 +08:00
|
|
|
}
|
|
|
|
|
2009-08-25 04:03:35 +08:00
|
|
|
static int inotify_update_existing_watch(struct fsnotify_group *group,
|
|
|
|
struct inode *inode,
|
|
|
|
u32 arg)
|
2009-05-22 05:02:01 +08:00
|
|
|
{
|
2009-12-18 10:24:24 +08:00
|
|
|
struct fsnotify_mark *fsn_mark;
|
|
|
|
struct inotify_inode_mark *i_mark;
|
2009-05-22 05:02:01 +08:00
|
|
|
__u32 old_mask, new_mask;
|
2009-08-25 04:03:35 +08:00
|
|
|
__u32 mask;
|
|
|
|
int add = (arg & IN_MASK_ADD);
|
2018-05-31 17:43:03 +08:00
|
|
|
int create = (arg & IN_MASK_CREATE);
|
2009-08-25 04:03:35 +08:00
|
|
|
int ret;
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
mask = inotify_arg_to_mask(arg);
|
|
|
|
|
2016-12-21 23:28:45 +08:00
|
|
|
fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, group);
|
2009-12-18 10:24:24 +08:00
|
|
|
if (!fsn_mark)
|
2009-08-25 04:03:35 +08:00
|
|
|
return -ENOENT;
|
2019-03-02 09:17:32 +08:00
|
|
|
else if (create) {
|
|
|
|
ret = -EEXIST;
|
|
|
|
goto out;
|
|
|
|
}
|
2009-07-07 22:28:24 +08:00
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark);
|
2009-07-07 22:28:23 +08:00
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
spin_lock(&fsn_mark->lock);
|
|
|
|
old_mask = fsn_mark->mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
if (add)
|
2016-12-21 23:03:59 +08:00
|
|
|
fsn_mark->mask |= mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
else
|
2016-12-21 23:03:59 +08:00
|
|
|
fsn_mark->mask = mask;
|
2009-12-18 10:24:33 +08:00
|
|
|
new_mask = fsn_mark->mask;
|
2009-12-18 10:24:24 +08:00
|
|
|
spin_unlock(&fsn_mark->lock);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
if (old_mask != new_mask) {
|
|
|
|
/* more bits in old than in new? */
|
|
|
|
int dropped = (old_mask & ~new_mask);
|
2009-12-18 10:24:24 +08:00
|
|
|
/* more bits in this fsn_mark than the inode's mask? */
|
2009-05-22 05:02:01 +08:00
|
|
|
int do_inode = (new_mask & ~inode->i_fsnotify_mask);
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
/* update the inode with this new fsn_mark */
|
2009-05-22 05:02:01 +08:00
|
|
|
if (dropped || do_inode)
|
2016-12-21 23:13:54 +08:00
|
|
|
fsnotify_recalc_mask(inode->i_fsnotify_marks);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
}
|
|
|
|
|
2009-08-25 04:03:35 +08:00
|
|
|
/* return the wd */
|
2009-12-18 10:24:24 +08:00
|
|
|
ret = i_mark->wd;
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2019-03-02 09:17:32 +08:00
|
|
|
out:
|
2009-12-18 10:24:24 +08:00
|
|
|
/* match the get from fsnotify_find_mark() */
|
2009-12-18 10:24:24 +08:00
|
|
|
fsnotify_put_mark(fsn_mark);
|
2009-07-07 22:28:23 +08:00
|
|
|
|
2009-08-25 04:03:35 +08:00
|
|
|
return ret;
|
|
|
|
}
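The mask handling in inotify_update_existing_watch() can be modelled in user space. The block below is a minimal sketch under stated assumptions, not kernel code: apply_watch_mask() and needs_inode_recalc() are hypothetical helper names introduced here for illustration, and the locking around the real update (fsn_mark->lock) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the mask update; not kernel code.
 * With 'add' set (IN_MASK_ADD) the requested bits are OR-ed into the
 * existing mask, otherwise the existing mask is replaced outright.
 */
static uint32_t apply_watch_mask(uint32_t old_mask, uint32_t mask, bool add)
{
	return add ? (old_mask | mask) : mask;
}

/*
 * Models the 'dropped || do_inode' test: the inode's cumulative mask
 * only needs recalculating when the mark's mask changed and either
 * lost a bit or gained a bit the inode mask does not cover yet.
 */
static bool needs_inode_recalc(uint32_t old_mask, uint32_t new_mask,
			       uint32_t inode_mask)
{
	uint32_t dropped, do_inode;

	if (old_mask == new_mask)
		return false;

	dropped = old_mask & ~new_mask;		/* bits removed from the mark */
	do_inode = new_mask & ~inode_mask;	/* bits the inode lacks */

	return dropped || do_inode;
}
```

In this model, widening a mark with bits the inode mask already covers does not trigger a recalculation, which is the shortcut the kernel code takes.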
|
|
|
|
|
|
|
|
static int inotify_new_watch(struct fsnotify_group *group,
|
|
|
|
struct inode *inode,
|
|
|
|
u32 arg)
|
|
|
|
{
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *tmp_i_mark;
|
2009-08-25 04:03:35 +08:00
|
|
|
__u32 mask;
|
|
|
|
int ret;
|
2009-12-18 09:12:04 +08:00
|
|
|
struct idr *idr = &group->inotify_data.idr;
|
|
|
|
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
|
2009-08-25 04:03:35 +08:00
|
|
|
|
|
|
|
mask = inotify_arg_to_mask(arg);
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
tmp_i_mark = kmem_cache_alloc(inotify_inode_mark_cachep, GFP_KERNEL);
|
|
|
|
if (unlikely(!tmp_i_mark))
|
2009-08-25 04:03:35 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2016-12-22 01:06:12 +08:00
|
|
|
fsnotify_init_mark(&tmp_i_mark->fsn_mark, group);
|
2009-12-18 10:24:24 +08:00
|
|
|
tmp_i_mark->fsn_mark.mask = mask;
|
|
|
|
tmp_i_mark->wd = -1;
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2013-04-30 07:21:21 +08:00
|
|
|
ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
|
2009-12-18 09:12:04 +08:00
|
|
|
if (ret)
|
2009-08-25 04:03:35 +08:00
|
|
|
goto out_err;
|
|
|
|
|
2016-12-14 21:56:33 +08:00
|
|
|
/* increment the number of watches the user has */
|
|
|
|
if (!inc_inotify_watches(group->inotify_data.ucounts)) {
|
|
|
|
inotify_remove_from_idr(group, tmp_i_mark);
|
|
|
|
ret = -ENOSPC;
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
|
2009-08-25 04:03:35 +08:00
|
|
|
/* we are on the idr, now get on the inode */
|
2018-04-21 07:10:55 +08:00
|
|
|
ret = fsnotify_add_inode_mark_locked(&tmp_i_mark->fsn_mark, inode, 0);
|
2009-08-25 04:03:35 +08:00
|
|
|
if (ret) {
|
|
|
|
/* we failed to get on the inode, get off the idr */
|
2009-12-18 10:24:24 +08:00
|
|
|
inotify_remove_from_idr(group, tmp_i_mark);
|
2009-08-25 04:03:35 +08:00
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2009-12-18 10:24:24 +08:00
|
|
|
/* return the watch descriptor for this new mark */
|
|
|
|
ret = tmp_i_mark->wd;
|
2009-08-25 04:03:35 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
out_err:
|
2009-12-18 10:24:24 +08:00
|
|
|
/* match the ref from fsnotify_init_mark() */
|
|
|
|
fsnotify_put_mark(&tmp_i_mark->fsn_mark);
|
2009-08-25 04:03:35 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2013-07-09 06:59:45 +08:00
|
|
|
mutex_lock(&group->mark_mutex);
|
2009-08-25 04:03:35 +08:00
|
|
|
/* try to update an existing watch with the new arg */
|
|
|
|
ret = inotify_update_existing_watch(group, inode, arg);
|
|
|
|
/* no mark present, try to add a new one */
|
|
|
|
if (ret == -ENOENT)
|
|
|
|
ret = inotify_new_watch(group, inode, arg);
|
2013-07-09 06:59:45 +08:00
|
|
|
mutex_unlock(&group->mark_mutex);
|
2009-07-07 22:28:24 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
return ret;
|
|
|
|
}
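The update-or-create flow of inotify_update_watch() can be sketched as a small user-space model. Everything named model_* below is hypothetical and exists only for this illustration: a fixed array stands in for the idr and the inode mark list, an inode number stands in for struct inode, and the stored mask simply strips the two control bits rather than going through inotify_arg_to_mask(). Locking and reference counting are omitted.

```c
#include <errno.h>
#include <stdint.h>

/* Same bit values as IN_MASK_CREATE and IN_MASK_ADD in the inotify UAPI. */
#define MODEL_MASK_CREATE 0x10000000u
#define MODEL_MASK_ADD    0x20000000u
#define MODEL_MAX_WATCHES 8	/* arbitrary table size for the model */

struct model_watch {
	int in_use;
	unsigned long ino;	/* stands in for the watched inode */
	uint32_t mask;
	int wd;
};

struct model_group {
	struct model_watch watches[MODEL_MAX_WATCHES];
	int last_wd;		/* descriptors are handed out from 1 upwards */
};

/* Returns the existing wd, -EEXIST for IN_MASK_CREATE, or -ENOENT. */
static int model_update_existing(struct model_group *g, unsigned long ino,
				 uint32_t arg)
{
	uint32_t mask = arg & ~(MODEL_MASK_ADD | MODEL_MASK_CREATE);
	int i;

	for (i = 0; i < MODEL_MAX_WATCHES; i++) {
		struct model_watch *w = &g->watches[i];

		if (!w->in_use || w->ino != ino)
			continue;
		if (arg & MODEL_MASK_CREATE)
			return -EEXIST;
		if (arg & MODEL_MASK_ADD)
			w->mask |= mask;
		else
			w->mask = mask;
		return w->wd;
	}
	return -ENOENT;
}

/* Claims a free slot and returns a fresh wd, or -ENOSPC when full. */
static int model_new_watch(struct model_group *g, unsigned long ino,
			   uint32_t arg)
{
	int i;

	for (i = 0; i < MODEL_MAX_WATCHES; i++) {
		struct model_watch *w = &g->watches[i];

		if (w->in_use)
			continue;
		w->in_use = 1;
		w->ino = ino;
		w->mask = arg & ~(MODEL_MASK_ADD | MODEL_MASK_CREATE);
		w->wd = ++g->last_wd;
		return w->wd;
	}
	return -ENOSPC;
}

/* Try to update first; only a missing watch falls through to creation. */
static int model_update_watch(struct model_group *g, unsigned long ino,
			      uint32_t arg)
{
	int ret = model_update_existing(g, ino, arg);

	if (ret == -ENOENT)
		ret = model_new_watch(g, ino, arg);
	return ret;
}
```

Only -ENOENT triggers the fallback, so an -EEXIST produced by the create flag is returned to the caller unchanged, as in the kernel function.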
|
|
|
|
|
2011-04-06 05:20:50 +08:00
|
|
|
static struct fsnotify_group *inotify_new_group(unsigned int max_events)
|
2009-05-22 05:02:01 +08:00
|
|
|
{
|
|
|
|
struct fsnotify_group *group;
|
2014-02-22 02:14:11 +08:00
|
|
|
struct inotify_event_info *oevent;
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2009-12-18 10:24:22 +08:00
|
|
|
group = fsnotify_alloc_group(&inotify_fsnotify_ops);
|
2009-05-22 05:02:01 +08:00
|
|
|
if (IS_ERR(group))
|
|
|
|
return group;
|
|
|
|
|
2014-02-22 02:14:11 +08:00
|
|
|
oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
|
|
|
|
if (unlikely(!oevent)) {
|
|
|
|
fsnotify_destroy_group(group);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
group->overflow_event = &oevent->fse;
|
2019-01-11 01:04:31 +08:00
|
|
|
fsnotify_init_event(group->overflow_event, NULL);
|
|
|
|
oevent->mask = FS_Q_OVERFLOW;
|
2014-02-22 02:14:11 +08:00
|
|
|
oevent->wd = -1;
|
|
|
|
oevent->sync_cookie = 0;
|
|
|
|
oevent->name_len = 0;
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
group->max_events = max_events;
|
2018-08-18 06:46:39 +08:00
|
|
|
group->memcg = get_mem_cgroup_from_mm(current->mm);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
spin_lock_init(&group->inotify_data.idr_lock);
|
|
|
|
idr_init(&group->inotify_data.idr);
|
2016-12-14 21:56:33 +08:00
|
|
|
group->inotify_data.ucounts = inc_ucount(current_user_ns(),
|
|
|
|
current_euid(),
|
|
|
|
UCOUNT_INOTIFY_INSTANCES);
|
2011-04-06 05:20:50 +08:00
|
|
|
|
2016-12-14 21:56:33 +08:00
|
|
|
if (!group->inotify_data.ucounts) {
|
2011-06-14 23:29:45 +08:00
|
|
|
fsnotify_destroy_group(group);
|
2011-04-06 05:20:50 +08:00
|
|
|
return ERR_PTR(-EMFILE);
|
|
|
|
}
|
2009-05-22 05:02:01 +08:00
|
|
|
|
|
|
|
return group;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/* inotify syscalls */
|
2018-03-14 04:27:21 +08:00
|
|
|
static int do_inotify_init(int flags)
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group;
|
2010-02-11 15:24:46 +08:00
|
|
|
int ret;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2008-07-24 12:29:42 +08:00
|
|
|
/* Check the IN_* constants for consistency. */
|
|
|
|
BUILD_BUG_ON(IN_CLOEXEC != O_CLOEXEC);
|
|
|
|
BUILD_BUG_ON(IN_NONBLOCK != O_NONBLOCK);
|
|
|
|
|
2008-07-24 12:29:41 +08:00
|
|
|
if (flags & ~(IN_CLOEXEC | IN_NONBLOCK))
|
2008-07-24 12:29:32 +08:00
|
|
|
return -EINVAL;
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* inotify_new_group() returns a referenced group; that reference is dropped when the file is finally released */
|
2011-04-06 05:20:50 +08:00
|
|
|
group = inotify_new_group(inotify_max_queued_events);
|
|
|
|
if (IS_ERR(group))
|
|
|
|
return PTR_ERR(group);
|
2009-08-05 22:35:21 +08:00
|
|
|
|
2010-02-11 15:24:46 +08:00
|
|
|
ret = anon_inode_getfd("inotify", &inotify_fops, group,
|
|
|
|
O_RDONLY | flags);
|
2011-04-06 05:20:50 +08:00
|
|
|
if (ret < 0)
|
2011-06-14 23:29:45 +08:00
|
|
|
fsnotify_destroy_group(group);
|
2009-08-05 22:35:21 +08:00
|
|
|
|
2006-06-02 04:10:59 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-03-14 04:27:21 +08:00
|
|
|
SYSCALL_DEFINE1(inotify_init1, int, flags)
|
|
|
|
{
|
|
|
|
return do_inotify_init(flags);
|
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:30 +08:00
|
|
|
SYSCALL_DEFINE0(inotify_init)
|
2008-07-24 12:29:32 +08:00
|
|
|
{
|
2018-03-14 04:27:21 +08:00
|
|
|
return do_inotify_init(0);
|
2008-07-24 12:29:32 +08:00
|
|
|
}
|
|
|
|
|
2009-01-14 21:14:31 +08:00
|
|
|
SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
|
|
|
|
u32, mask)
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group;
|
2006-06-02 04:10:59 +08:00
|
|
|
struct inode *inode;
|
2008-07-22 21:59:21 +08:00
|
|
|
struct path path;
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
|
|
|
int ret;
|
2006-06-02 04:10:59 +08:00
|
|
|
unsigned flags = 0;
|
|
|
|
|
2015-11-06 10:43:46 +08:00
|
|
|
/*
|
|
|
|
* We share a lot of code with fs/dnotify. We also share
|
|
|
|
* the bit layout between inotify's IN_* and the fsnotify
|
|
|
|
* FS_*. This check ensures that only the inotify IN_*
|
|
|
|
* bits get passed in and set in watches/events.
|
|
|
|
*/
|
|
|
|
if (unlikely(mask & ~ALL_INOTIFY_BITS))
|
|
|
|
return -EINVAL;
|
|
|
|
/*
|
|
|
|
* Require at least one valid bit set in the mask.
|
|
|
|
* Without _something_ set, we would have no events to
|
|
|
|
* watch for.
|
|
|
|
*/
|
2013-05-01 06:26:46 +08:00
|
|
|
if (unlikely(!(mask & ALL_INOTIFY_BITS)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(fd);
|
|
|
|
if (unlikely(!f.file))
|
2006-06-02 04:10:59 +08:00
|
|
|
return -EBADF;
|
|
|
|
|
2018-05-31 17:43:03 +08:00
|
|
|
/* IN_MASK_ADD and IN_MASK_CREATE don't make sense together */
|
2019-01-01 17:54:26 +08:00
|
|
|
if (unlikely((mask & IN_MASK_ADD) && (mask & IN_MASK_CREATE))) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto fput_and_out;
|
|
|
|
}
|
2018-05-31 17:43:03 +08:00
|
|
|
|
2006-06-02 04:10:59 +08:00
|
|
|
/* verify that this is indeed an inotify instance */
|
2012-08-29 00:52:22 +08:00
|
|
|
if (unlikely(f.file->f_op != &inotify_fops)) {
|
2006-06-02 04:10:59 +08:00
|
|
|
ret = -EINVAL;
|
|
|
|
goto fput_and_out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!(mask & IN_DONT_FOLLOW))
|
|
|
|
flags |= LOOKUP_FOLLOW;
|
|
|
|
if (mask & IN_ONLYDIR)
|
|
|
|
flags |= LOOKUP_DIRECTORY;
|
|
|
|
|
2019-08-12 23:20:00 +08:00
|
|
|
ret = inotify_find_inode(pathname, &path, flags,
|
|
|
|
(mask & IN_ALL_EVENTS));
|
2009-05-22 05:02:01 +08:00
|
|
|
if (ret)
|
2006-06-02 04:10:59 +08:00
|
|
|
goto fput_and_out;
|
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* inode held in place by reference to path; group by fget on fd */
|
2008-07-22 21:59:21 +08:00
|
|
|
inode = path.dentry->d_inode;
|
2012-08-29 00:52:22 +08:00
|
|
|
group = f.file->private_data;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-05-22 05:02:01 +08:00
|
|
|
/* create/update an inode mark */
|
|
|
|
ret = inotify_update_watch(group, inode, mask);
|
2008-07-22 21:59:21 +08:00
|
|
|
path_put(&path);
|
2006-06-02 04:10:59 +08:00
|
|
|
fput_and_out:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2006-06-02 04:10:59 +08:00
|
|
|
return ret;
|
|
|
|
}
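The mask checks at the top of inotify_add_watch() can be exercised in isolation. The sketch below is a user-space model, not the kernel implementation: the MODEL_* names and both functions are hypothetical, the event and control bit values are assumed to match the published inotify UAPI header, and the two lookup flag values are placeholders chosen for the model. The file descriptor checks (-EBADF, and -EINVAL for a non-inotify fd) are left out.

```c
#include <errno.h>
#include <stdint.h>

/* Bit values assumed to match the inotify UAPI header. */
#define MODEL_IN_ALL_EVENTS  0x00000fffu	/* IN_ACCESS .. IN_MOVE_SELF */
#define MODEL_IN_UNMOUNT     0x00002000u
#define MODEL_IN_Q_OVERFLOW  0x00004000u
#define MODEL_IN_IGNORED     0x00008000u
#define MODEL_IN_ONLYDIR     0x01000000u
#define MODEL_IN_DONT_FOLLOW 0x02000000u
#define MODEL_IN_EXCL_UNLINK 0x04000000u
#define MODEL_IN_MASK_CREATE 0x10000000u
#define MODEL_IN_MASK_ADD    0x20000000u
#define MODEL_IN_ISDIR       0x40000000u
#define MODEL_IN_ONESHOT     0x80000000u

/* Model of ALL_INOTIFY_BITS: 22 bits, matching the HWEIGHT32 check. */
#define MODEL_ALL_INOTIFY_BITS \
	(MODEL_IN_ALL_EVENTS | MODEL_IN_UNMOUNT | MODEL_IN_Q_OVERFLOW | \
	 MODEL_IN_IGNORED | MODEL_IN_ONLYDIR | MODEL_IN_DONT_FOLLOW | \
	 MODEL_IN_EXCL_UNLINK | MODEL_IN_MASK_CREATE | MODEL_IN_MASK_ADD | \
	 MODEL_IN_ISDIR | MODEL_IN_ONESHOT)

/* Placeholder lookup flags for the model only. */
#define MODEL_LOOKUP_FOLLOW    0x1u
#define MODEL_LOOKUP_DIRECTORY 0x2u

/* Returns 0 if the mask would pass the checks above, else -EINVAL. */
static int model_check_add_watch_mask(uint32_t mask)
{
	/* No bits outside the inotify set. */
	if (mask & ~MODEL_ALL_INOTIFY_BITS)
		return -EINVAL;
	/* At least one valid bit. */
	if (!(mask & MODEL_ALL_INOTIFY_BITS))
		return -EINVAL;
	/* IN_MASK_ADD and IN_MASK_CREATE are mutually exclusive. */
	if ((mask & MODEL_IN_MASK_ADD) && (mask & MODEL_IN_MASK_CREATE))
		return -EINVAL;
	return 0;
}

/* Derives the path lookup flags from the mask, as the syscall does. */
static unsigned int model_lookup_flags(uint32_t mask)
{
	unsigned int flags = 0;

	if (!(mask & MODEL_IN_DONT_FOLLOW))
		flags |= MODEL_LOOKUP_FOLLOW;
	if (mask & MODEL_IN_ONLYDIR)
		flags |= MODEL_LOOKUP_DIRECTORY;
	return flags;
}
```

Note that in this model a mask holding only a control bit such as IN_MASK_ADD passes validation, because the "at least one valid bit" test is made against the full bit set rather than against the event bits alone.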
|
|
|
|
|
2009-01-14 21:14:31 +08:00
|
|
|
SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
|
2006-06-02 04:10:59 +08:00
|
|
|
{
|
2009-05-22 05:02:01 +08:00
|
|
|
struct fsnotify_group *group;
|
2009-12-18 10:24:24 +08:00
|
|
|
struct inotify_inode_mark *i_mark;
|
2012-08-29 00:52:22 +08:00
|
|
|
struct fd f;
|
|
|
|
int ret = 0;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
f = fdget(fd);
|
|
|
|
if (unlikely(!f.file))
|
2006-06-02 04:10:59 +08:00
|
|
|
return -EBADF;
|
|
|
|
|
|
|
|
/* verify that this is indeed an inotify instance */
|
2009-12-18 09:12:04 +08:00
|
|
|
ret = -EINVAL;
|
2012-08-29 00:52:22 +08:00
|
|
|
if (unlikely(f.file->f_op != &inotify_fops))
|
2006-06-02 04:10:59 +08:00
|
|
|
goto out;
|
|
|
|
|
2012-08-29 00:52:22 +08:00
|
|
|
group = f.file->private_data;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
ret = -EINVAL;
|
2009-12-18 10:24:24 +08:00
|
|
|
i_mark = inotify_idr_find(group, wd);
|
|
|
|
if (unlikely(!i_mark))
|
2009-05-22 05:02:01 +08:00
|
|
|
goto out;
|
|
|
|
|
2009-12-18 09:12:04 +08:00
|
|
|
ret = 0;
|
|
|
|
|
2011-06-14 23:29:51 +08:00
|
|
|
fsnotify_destroy_mark(&i_mark->fsn_mark, group);
|
2009-12-18 09:12:04 +08:00
|
|
|
|
|
|
|
/* match ref taken by inotify_idr_find */
|
2009-12-18 10:24:24 +08:00
|
|
|
fsnotify_put_mark(&i_mark->fsn_mark);
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
out:
|
2012-08-29 00:52:22 +08:00
|
|
|
fdput(f);
|
2006-06-02 04:10:59 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2011-03-01 22:06:02 +08:00
|
|
|
* inotify_user_setup - Our initialization function. Note that we cannot return
|
2006-06-02 04:10:59 +08:00
|
|
|
* error because we have compiled-in VFS hooks. So an (unlikely) failure here
|
|
|
|
* must result in panic().
|
|
|
|
*/
|
|
|
|
static int __init inotify_user_setup(void)
|
|
|
|
{
|
2010-07-28 22:18:37 +08:00
|
|
|
BUILD_BUG_ON(IN_ACCESS != FS_ACCESS);
|
|
|
|
BUILD_BUG_ON(IN_MODIFY != FS_MODIFY);
|
|
|
|
BUILD_BUG_ON(IN_ATTRIB != FS_ATTRIB);
|
|
|
|
BUILD_BUG_ON(IN_CLOSE_WRITE != FS_CLOSE_WRITE);
|
|
|
|
BUILD_BUG_ON(IN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
|
|
|
|
BUILD_BUG_ON(IN_OPEN != FS_OPEN);
|
|
|
|
BUILD_BUG_ON(IN_MOVED_FROM != FS_MOVED_FROM);
|
|
|
|
BUILD_BUG_ON(IN_MOVED_TO != FS_MOVED_TO);
|
|
|
|
BUILD_BUG_ON(IN_CREATE != FS_CREATE);
|
|
|
|
BUILD_BUG_ON(IN_DELETE != FS_DELETE);
|
|
|
|
BUILD_BUG_ON(IN_DELETE_SELF != FS_DELETE_SELF);
|
|
|
|
BUILD_BUG_ON(IN_MOVE_SELF != FS_MOVE_SELF);
|
|
|
|
BUILD_BUG_ON(IN_UNMOUNT != FS_UNMOUNT);
|
|
|
|
BUILD_BUG_ON(IN_Q_OVERFLOW != FS_Q_OVERFLOW);
|
|
|
|
BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
|
|
|
|
BUILD_BUG_ON(IN_EXCL_UNLINK != FS_EXCL_UNLINK);
|
2010-10-29 05:21:58 +08:00
|
|
|
BUILD_BUG_ON(IN_ISDIR != FS_ISDIR);
|
2010-07-28 22:18:37 +08:00
|
|
|
BUILD_BUG_ON(IN_ONESHOT != FS_IN_ONESHOT);
|
|
|
|
|
2018-10-04 05:25:36 +08:00
|
|
|
BUILD_BUG_ON(HWEIGHT32(ALL_INOTIFY_BITS) != 22);
|
2010-07-28 22:18:37 +08:00
|
|
|
|
2018-08-18 06:46:39 +08:00
|
|
|
inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark,
|
|
|
|
SLAB_PANIC|SLAB_ACCOUNT);
|
2009-05-22 05:02:01 +08:00
|
|
|
|
2006-06-02 04:10:59 +08:00
|
|
|
inotify_max_queued_events = 16384;
|
2016-12-14 21:56:33 +08:00
|
|
|
init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
|
|
|
|
init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;
|
2006-06-02 04:10:59 +08:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2015-05-02 08:08:20 +08:00
|
|
|
fs_initcall(inotify_user_setup);
|