linux/fs/notify/inotify/inotify_user.c

832 lines
21 KiB
C
Raw Normal View History

/*
* fs/inotify_user.c - inotify support for userspace
*
* Authors:
* John McCutchan <ttb@tentacle.dhs.org>
* Robert Love <rml@novell.com>
*
* Copyright (C) 2005 John McCutchan
* Copyright 2006 Hewlett-Packard Development Company, L.P.
*
* Copyright (C) 2009 Eric Paris <Red Hat Inc>
* inotify was largely rewriten to make use of the fsnotify infrastructure
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2, or (at your option) any
* later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/file.h>
#include <linux/fs.h> /* struct inode */
#include <linux/fsnotify_backend.h>
#include <linux/idr.h>
#include <linux/init.h> /* fs_initcall */
#include <linux/inotify.h>
#include <linux/kernel.h> /* roundup() */
#include <linux/namei.h> /* LOOKUP_FOLLOW */
#include <linux/sched/signal.h>
#include <linux/slab.h> /* struct kmem_cache */
#include <linux/syscalls.h>
#include <linux/types.h>
#include <linux/anon_inodes.h>
#include <linux/uaccess.h>
#include <linux/poll.h>
#include <linux/wait.h>
fs: fsnotify: account fsnotify metadata to kmemcg Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
#include <linux/memcontrol.h>
#include "inotify.h"
fs, notify: add procfs fdinfo helper This allow us to print out fsnotify details such as watchee inode, device, mask and optionally a file handle. For inotify objects if kernel compiled with exportfs support the output will be | pos: 0 | flags: 02000000 | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:11a1000020542153 | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:49b1060023552153 If kernel compiled without exportfs support, the file handle won't be provided but inode and device only. | pos: 0 | flags: 02000000 | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 For fanotify the output is like | pos: 0 | flags: 04002 | fanotify flags:10 event-flags:0 | fanotify mnt_id:12 mask:3b ignored_mask:0 | fanotify ino:50205 sdev:800013 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:05020500fb1d47e7 To minimize impact on general fsnotify code the new functionality is gathered in fs/notify/fdinfo.c file. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: James Bottomley <jbottomley@parallels.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Helsley <matt.helsley@gmail.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 08:05:12 +08:00
#include "../fdinfo.h"
#include <asm/ioctls.h>
/* configurable via /proc/sys/fs/inotify/ */
static int inotify_max_queued_events __read_mostly;
struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
#ifdef CONFIG_SYSCTL
#include <linux/sysctl.h>
static int zero;
struct ctl_table inotify_table[] = {
{
.procname = "max_user_instances",
.data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &zero,
},
{
.procname = "max_user_watches",
.data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &zero,
},
{
.procname = "max_queued_events",
.data = &inotify_max_queued_events,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = &zero
},
{ }
};
#endif /* CONFIG_SYSCTL */
static inline __u32 inotify_arg_to_mask(u32 arg)
{
__u32 mask;
/*
* everything should accept their own ignored, cares about children,
* and should receive events when the inode is unmounted
*/
mask = (FS_IN_IGNORED | FS_EVENT_ON_CHILD | FS_UNMOUNT);
/* mask off the flags used to open the fd */
mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT | IN_EXCL_UNLINK));
return mask;
}
static inline u32 inotify_mask_to_arg(__u32 mask)
{
return mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED |
IN_Q_OVERFLOW);
}
/* intofiy userspace file descriptor functions */
static __poll_t inotify_poll(struct file *file, poll_table *wait)
{
struct fsnotify_group *group = file->private_data;
__poll_t ret = 0;
poll_wait(file, &group->notification_waitq, wait);
spin_lock(&group->notification_lock);
if (!fsnotify_notify_queue_is_empty(group))
ret = EPOLLIN | EPOLLRDNORM;
spin_unlock(&group->notification_lock);
return ret;
}
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
static int round_event_name_len(struct fsnotify_event *fsn_event)
{
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
struct inotify_event_info *event;
event = INOTIFY_E(fsn_event);
if (!event->name_len)
return 0;
return roundup(event->name_len + 1, sizeof(struct inotify_event));
}
2009-01-22 22:29:45 +08:00
/*
* Get an inotify_kernel_event if one exists and is small
* enough to fit in "count". Return an error pointer if
* not large enough.
*
* Called with the group->notification_lock held.
2009-01-22 22:29:45 +08:00
*/
static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
size_t count)
2009-01-22 22:29:45 +08:00
{
size_t event_size = sizeof(struct inotify_event);
struct fsnotify_event *event;
2009-01-22 22:29:45 +08:00
if (fsnotify_notify_queue_is_empty(group))
2009-01-22 22:29:45 +08:00
return NULL;
event = fsnotify_peek_first_event(group);
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
event_size += round_event_name_len(event);
2009-01-22 22:29:45 +08:00
if (event_size > count)
return ERR_PTR(-EINVAL);
/* held the notification_lock the whole time, so this is the
* same event we peeked above */
fsnotify_remove_first_event(group);
return event;
2009-01-22 22:29:45 +08:00
}
/*
* Copy an event to user space, returning how much we copied.
*
* We already checked that the event size is smaller than the
* buffer we had in "get_one_event()" above.
*/
static ssize_t copy_event_to_user(struct fsnotify_group *group,
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
struct fsnotify_event *fsn_event,
2009-01-22 22:29:45 +08:00
char __user *buf)
{
struct inotify_event inotify_event;
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
struct inotify_event_info *event;
2009-01-22 22:29:45 +08:00
size_t event_size = sizeof(struct inotify_event);
size_t name_len;
size_t pad_name_len;
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
pr_debug("%s: group=%p event=%p\n", __func__, group, fsn_event);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
event = INOTIFY_E(fsn_event);
name_len = event->name_len;
/*
* round up name length so it is a multiple of event_size
* plus an extra byte for the terminating '\0'.
*/
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
pad_name_len = round_event_name_len(fsn_event);
inotify_event.len = pad_name_len;
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
inotify_event.mask = inotify_mask_to_arg(fsn_event->mask);
inotify_event.wd = event->wd;
inotify_event.cookie = event->sync_cookie;
2009-01-22 22:29:45 +08:00
/* send the main event */
if (copy_to_user(buf, &inotify_event, event_size))
2009-01-22 22:29:45 +08:00
return -EFAULT;
buf += event_size;
2009-01-22 22:29:45 +08:00
/*
* fsnotify only stores the pathname, so here we have to send the pathname
* and then pad that pathname out to a multiple of sizeof(inotify_event)
* with zeros.
*/
if (pad_name_len) {
/* copy the path name */
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
if (copy_to_user(buf, event->name, name_len))
2009-01-22 22:29:45 +08:00
return -EFAULT;
buf += name_len;
2009-01-22 22:29:45 +08:00
/* fill userspace with 0's */
if (clear_user(buf, pad_name_len - name_len))
return -EFAULT;
event_size += pad_name_len;
2009-01-22 22:29:45 +08:00
}
2009-01-22 22:29:45 +08:00
return event_size;
}
static ssize_t inotify_read(struct file *file, char __user *buf,
size_t count, loff_t *pos)
{
struct fsnotify_group *group;
struct fsnotify_event *kevent;
char __user *start;
int ret;
DEFINE_WAIT_FUNC(wait, woken_wake_function);
start = buf;
group = file->private_data;
add_wait_queue(&group->notification_waitq, &wait);
while (1) {
spin_lock(&group->notification_lock);
kevent = get_one_event(group, count);
spin_unlock(&group->notification_lock);
pr_debug("%s: group=%p kevent=%p\n", __func__, group, kevent);
2009-01-22 22:29:45 +08:00
if (kevent) {
ret = PTR_ERR(kevent);
if (IS_ERR(kevent))
break;
ret = copy_event_to_user(group, kevent, buf);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
fsnotify_destroy_event(group, kevent);
2009-01-22 22:29:45 +08:00
if (ret < 0)
break;
buf += ret;
count -= ret;
continue;
}
2009-01-22 22:29:45 +08:00
ret = -EAGAIN;
if (file->f_flags & O_NONBLOCK)
break;
ret = -ERESTARTSYS;
2009-01-22 22:29:45 +08:00
if (signal_pending(current))
break;
2009-01-22 22:29:45 +08:00
if (start != buf)
break;
wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
remove_wait_queue(&group->notification_waitq, &wait);
2009-01-22 22:29:45 +08:00
if (start != buf && ret != -EFAULT)
ret = buf - start;
return ret;
}
static int inotify_release(struct inode *ignored, struct file *file)
{
struct fsnotify_group *group = file->private_data;
pr_debug("%s: group=%p\n", __func__, group);
/* free this group, matching get was inotify_init->fsnotify_obtain_group */
fsnotify_destroy_group(group);
return 0;
}
static long inotify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
struct fsnotify_group *group;
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
struct fsnotify_event *fsn_event;
void __user *p;
int ret = -ENOTTY;
size_t send_len = 0;
group = file->private_data;
p = (void __user *) arg;
pr_debug("%s: group=%p cmd=%u\n", __func__, group, cmd);
switch (cmd) {
case FIONREAD:
spin_lock(&group->notification_lock);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
list_for_each_entry(fsn_event, &group->notification_list,
list) {
send_len += sizeof(struct inotify_event);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
send_len += round_event_name_len(fsn_event);
}
spin_unlock(&group->notification_lock);
ret = put_user(send_len, (int __user *) p);
break;
inotify: Extend ioctl to allow to request id of new watch descriptor Watch descriptor is id of the watch created by inotify_add_watch(). It is allocated in inotify_add_to_idr(), and takes the numbers starting from 1. Every new inotify watch obtains next available number (usually, old + 1), as served by idr_alloc_cyclic(). CRIU (Checkpoint/Restore In Userspace) project supports inotify files, and restores watched descriptors with the same numbers, they had before dump. Since there was no kernel support, we had to use cycle to add a watch with specific descriptor id: while (1) { int wd; wd = inotify_add_watch(inotify_fd, path, mask); if (wd < 0) { break; } else if (wd == desired_wd_id) { ret = 0; break; } inotify_rm_watch(inotify_fd, wd); } (You may find the actual code at the below link: https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577) The cycle is suboptiomal and very expensive, but since there is no better kernel support, it was the only way to restore that. Happily, we had met mostly descriptors with small id, and this approach had worked somehow. But recent time containers with inotify with big watch descriptors begun to come, and this way stopped to work at all. When descriptor id is something about 0x34d71d6, the restoring process spins in busy loop for a long time, and the restore hungs and delay of migration from node to node could easily be watched. This patch aims to solve this problem. It introduces new ioctl INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created watch descriptor from userspace. It simply calls idr_set_cursor() primitive to populate idr::idr_next, so that next idr_alloc_cyclic() allocation will return this id, if it is not occupied. This is the way which is used to restore some other resources from userspace. For example, /proc/sys/kernel/ns_last_pid works the same for task pids. The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system may exclude it. v2: Use INT_MAX instead of custom definition of max id, as IDR subsystem guarantees id is between 0 and INT_MAX. CC: Jan Kara <jack@suse.cz> CC: Matthew Wilcox <willy@infradead.org> CC: Andrew Morton <akpm@linux-foundation.org> CC: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
2018-02-09 23:04:54 +08:00
#ifdef CONFIG_CHECKPOINT_RESTORE
case INOTIFY_IOC_SETNEXTWD:
ret = -EINVAL;
if (arg >= 1 && arg <= INT_MAX) {
struct inotify_group_private_data *data;
data = &group->inotify_data;
spin_lock(&data->idr_lock);
idr_set_cursor(&data->idr, (unsigned int)arg);
spin_unlock(&data->idr_lock);
ret = 0;
}
break;
#endif /* CONFIG_CHECKPOINT_RESTORE */
}
return ret;
}
static const struct file_operations inotify_fops = {
fs, notify: add procfs fdinfo helper This allow us to print out fsnotify details such as watchee inode, device, mask and optionally a file handle. For inotify objects if kernel compiled with exportfs support the output will be | pos: 0 | flags: 02000000 | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:11a1000020542153 | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:49b1060023552153 If kernel compiled without exportfs support, the file handle won't be provided but inode and device only. | pos: 0 | flags: 02000000 | inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 | inotify wd:2 ino:a111 sdev:800013 mask:800afce ignored_mask:0 | inotify wd:1 ino:6b149 sdev:800013 mask:800afce ignored_mask:0 For fanotify the output is like | pos: 0 | flags: 04002 | fanotify flags:10 event-flags:0 | fanotify mnt_id:12 mask:3b ignored_mask:0 | fanotify ino:50205 sdev:800013 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:05020500fb1d47e7 To minimize impact on general fsnotify code the new functionality is gathered in fs/notify/fdinfo.c file. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: James Bottomley <jbottomley@parallels.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Helsley <matt.helsley@gmail.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18 08:05:12 +08:00
.show_fdinfo = inotify_show_fdinfo,
.poll = inotify_poll,
.read = inotify_read,
.fasync = fsnotify_fasync,
.release = inotify_release,
.unlocked_ioctl = inotify_ioctl,
.compat_ioctl = inotify_ioctl,
llseek: automatically add .llseek fop All file_operations should get a .llseek operation so we can make nonseekable_open the default for future file operations without a .llseek pointer. The three cases that we can automatically detect are no_llseek, seq_lseek and default_llseek. For cases where we can we can automatically prove that the file offset is always ignored, we use noop_llseek, which maintains the current behavior of not returning an error from a seek. New drivers should normally not use noop_llseek but instead use no_llseek and call nonseekable_open at open time. Existing drivers can be converted to do the same when the maintainer knows for certain that no user code relies on calling seek on the device file. The generated code is often incorrectly indented and right now contains comments that clarify for each added line why a specific variant was chosen. In the version that gets submitted upstream, the comments will be gone and I will manually fix the indentation, because there does not seem to be a way to do that using coccinelle. Some amount of new code is currently sitting in linux-next that should get the same modifications, which I will do at the end of the merge window. Many thanks to Julia Lawall for helping me learn to write a semantic patch that does all this. ===== begin semantic patch ===== // This adds an llseek= method to all file operations, // as a preparation for making no_llseek the default. // // The rules are // - use no_llseek explicitly if we do nonseekable_open // - use seq_lseek for sequential files // - use default_llseek if we know we access f_pos // - use noop_llseek if we know we don't access f_pos, // but we still want to allow users to call lseek // @ open1 exists @ identifier nested_open; @@ nested_open(...) { <+... nonseekable_open(...) ...+> } @ open exists@ identifier open_f; identifier i, f; identifier open1.nested_open; @@ int open_f(struct inode *i, struct file *f) { <+... ( nonseekable_open(...) | nested_open(...) ) ...+> } @ read disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ read_no_fpos disable optional_qualifier exists @ identifier read_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off) { ... when != off } @ write @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; expression E; identifier func; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { <+... ( *off = E | *off += E | func(..., off, ...) | E = *off ) ...+> } @ write_no_fpos @ identifier write_f; identifier f, p, s, off; type ssize_t, size_t, loff_t; @@ ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off) { ... when != off } @ fops0 @ identifier fops; @@ struct file_operations fops = { ... }; @ has_llseek depends on fops0 @ identifier fops0.fops; identifier llseek_f; @@ struct file_operations fops = { ... .llseek = llseek_f, ... }; @ has_read depends on fops0 @ identifier fops0.fops; identifier read_f; @@ struct file_operations fops = { ... .read = read_f, ... }; @ has_write depends on fops0 @ identifier fops0.fops; identifier write_f; @@ struct file_operations fops = { ... .write = write_f, ... }; @ has_open depends on fops0 @ identifier fops0.fops; identifier open_f; @@ struct file_operations fops = { ... .open = open_f, ... }; // use no_llseek if we call nonseekable_open //////////////////////////////////////////// @ nonseekable1 depends on !has_llseek && has_open @ identifier fops0.fops; identifier nso ~= "nonseekable_open"; @@ struct file_operations fops = { ... .open = nso, ... +.llseek = no_llseek, /* nonseekable */ }; @ nonseekable2 depends on !has_llseek @ identifier fops0.fops; identifier open.open_f; @@ struct file_operations fops = { ... .open = open_f, ... +.llseek = no_llseek, /* open uses nonseekable */ }; // use seq_lseek for sequential files ///////////////////////////////////// @ seq depends on !has_llseek @ identifier fops0.fops; identifier sr ~= "seq_read"; @@ struct file_operations fops = { ... .read = sr, ... +.llseek = seq_lseek, /* we have seq_read */ }; // use default_llseek if there is a readdir /////////////////////////////////////////// @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier readdir_e; @@ // any other fop is used that changes pos struct file_operations fops = { ... .readdir = readdir_e, ... +.llseek = default_llseek, /* readdir is present */ }; // use default_llseek if at least one of read/write touches f_pos ///////////////////////////////////////////////////////////////// @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read.read_f; @@ // read fops use offset struct file_operations fops = { ... .read = read_f, ... +.llseek = default_llseek, /* read accesses f_pos */ }; @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, ... + .llseek = default_llseek, /* write accesses f_pos */ }; // Use noop_llseek if neither read nor write accesses f_pos /////////////////////////////////////////////////////////// @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; identifier write_no_fpos.write_f; @@ // write fops use offset struct file_operations fops = { ... .write = write_f, .read = read_f, ... +.llseek = noop_llseek, /* read and write both use no f_pos */ }; @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier write_no_fpos.write_f; @@ struct file_operations fops = { ... .write = write_f, ... +.llseek = noop_llseek, /* write uses no f_pos */ }; @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; identifier read_no_fpos.read_f; @@ struct file_operations fops = { ... .read = read_f, ... +.llseek = noop_llseek, /* read uses no f_pos */ }; @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @ identifier fops0.fops; @@ struct file_operations fops = { ... +.llseek = noop_llseek, /* no read or write fn */ }; ===== End semantic patch ===== Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Julia Lawall <julia@diku.dk> Cc: Christoph Hellwig <hch@infradead.org>
2010-08-16 00:52:59 +08:00
.llseek = noop_llseek,
};
/*
* find_inode - resolve a user-given path to a specific inode
*/
static int inotify_find_inode(const char __user *dirname, struct path *path, unsigned flags)
{
int error;
error = user_path_at(AT_FDCWD, dirname, flags, path);
if (error)
return error;
/* you can only watch an inode if you have read permissions on it */
error = inode_permission(path->dentry->d_inode, MAY_READ);
if (error)
path_put(path);
return error;
}
static int inotify_add_to_idr(struct idr *idr, spinlock_t *idr_lock,
struct inotify_inode_mark *i_mark)
{
int ret;
idr_preload(GFP_KERNEL);
spin_lock(idr_lock);
ret = idr_alloc_cyclic(idr, i_mark, 1, 0, GFP_NOWAIT);
if (ret >= 0) {
/* we added the mark to the idr, take a reference */
i_mark->wd = ret;
fsnotify_get_mark(&i_mark->fsn_mark);
}
spin_unlock(idr_lock);
idr_preload_end();
return ret < 0 ? ret : 0;
}
static struct inotify_inode_mark *inotify_idr_find_locked(struct fsnotify_group *group,
int wd)
{
struct idr *idr = &group->inotify_data.idr;
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
struct inotify_inode_mark *i_mark;
assert_spin_locked(idr_lock);
i_mark = idr_find(idr, wd);
if (i_mark) {
struct fsnotify_mark *fsn_mark = &i_mark->fsn_mark;
fsnotify_get_mark(fsn_mark);
/* One ref for being in the idr, one ref we just took */
BUG_ON(refcount_read(&fsn_mark->refcnt) < 2);
}
return i_mark;
}
static struct inotify_inode_mark *inotify_idr_find(struct fsnotify_group *group,
int wd)
{
struct inotify_inode_mark *i_mark;
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
spin_lock(idr_lock);
i_mark = inotify_idr_find_locked(group, wd);
spin_unlock(idr_lock);
return i_mark;
}
/*
* Remove the mark from the idr (if present) and drop the reference
* on the mark because it was in the idr.
*/
static void inotify_remove_from_idr(struct fsnotify_group *group,
struct inotify_inode_mark *i_mark)
{
struct idr *idr = &group->inotify_data.idr;
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
struct inotify_inode_mark *found_i_mark = NULL;
int wd;
spin_lock(idr_lock);
wd = i_mark->wd;
/*
* does this i_mark think it is in the idr? we shouldn't get called
* if it wasn't....
*/
if (wd == -1) {
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
goto out;
}
/* Lets look in the idr to see if we find it */
found_i_mark = inotify_idr_find_locked(group, wd);
if (unlikely(!found_i_mark)) {
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
goto out;
}
/*
* We found an mark in the idr at the right wd, but it's
* not the mark we were told to remove. eparis seriously
* fucked up somewhere.
*/
if (unlikely(found_i_mark != i_mark)) {
WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p "
"found_i_mark=%p found_i_mark->wd=%d "
"found_i_mark->group=%p\n", __func__, i_mark,
i_mark->wd, i_mark->fsn_mark.group, found_i_mark,
found_i_mark->wd, found_i_mark->fsn_mark.group);
goto out;
}
/*
* One ref for being in the idr
* one ref grabbed by inotify_idr_find
*/
if (unlikely(refcount_read(&i_mark->fsn_mark.refcnt) < 2)) {
printk(KERN_ERR "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p\n",
__func__, i_mark, i_mark->wd, i_mark->fsn_mark.group);
/* we can't really recover with bad ref cnting.. */
BUG();
}
idr_remove(idr, wd);
/* Removed from the idr, drop that ref. */
fsnotify_put_mark(&i_mark->fsn_mark);
out:
i_mark->wd = -1;
spin_unlock(idr_lock);
/* match the ref taken by inotify_idr_find_locked() */
if (found_i_mark)
fsnotify_put_mark(&found_i_mark->fsn_mark);
}
/*
* Send IN_IGNORED for this wd, remove this wd from the idr.
*/
void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
inotify: inotify_destroy_mark_entry could get called twice inotify_destroy_mark_entry could get called twice for the same mark since it is called directly in inotify_rm_watch and when the mark is being destroyed for another reason. As an example assume that the file being watched was just deleted so inotify_destroy_mark_entry would get called from the path fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() -> fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry(). If this happened at the same time as userspace tried to remove a watch via inotify_rm_watch we could attempt to remove the mark from the idr twice and could thus double dec the ref cnt and potentially could be in a use after free/double free situation. The fix is to have inotify_rm_watch use the generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the inotify_destroy_mark_entry() function can only be called one. This patch also renames the function to inotify_ingored_remove_idr() so it is clear what is actually going on in the function. Hopefully this fixes: [ 20.342058] idr_remove called for id=20 which is not allocated. [ 20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077 [ 20.353933] Call Trace: [ 20.356410] [<ffffffff811a82b7>] idr_remove+0x115/0x18f [ 20.361737] [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75 [ 20.367061] [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf [ 20.373771] [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf [ 20.380306] [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10 [ 20.386238] [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170 [ 20.393293] [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf [ 20.399829] [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6 [ 20.405850] [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Eric Paris <eparis@redhat.com> Tested-by: Peter Ziljlstra <peterz@infradead.org>
2009-06-13 04:04:26 +08:00
struct fsnotify_group *group)
{
struct inotify_inode_mark *i_mark;
struct fsnotify_iter_info iter_info = { };
fsnotify_iter_set_report_type_mark(&iter_info, FSNOTIFY_OBJ_TYPE_INODE,
fsn_mark);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
/* Queue ignore event for the watch */
inotify_handle_event(group, NULL, FS_IN_IGNORED, NULL,
FSNOTIFY_EVENT_NONE, NULL, 0, &iter_info);
fsnotify: do not share events between notification groups Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-22 07:48:14 +08:00
i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark);
/* remove this mark from the idr */
inotify_remove_from_idr(group, i_mark);
dec_inotify_watches(group->inotify_data.ucounts);
}
static int inotify_update_existing_watch(struct fsnotify_group *group,
struct inode *inode,
u32 arg)
{
struct fsnotify_mark *fsn_mark;
struct inotify_inode_mark *i_mark;
__u32 old_mask, new_mask;
__u32 mask;
int add = (arg & IN_MASK_ADD);
inotify: Add flag IN_MASK_CREATE for inotify_add_watch() The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch() which prevents inotify from modifying any existing watches when invoked. If the pathname specified in the call has a watched inode associated with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST. Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and will return EINVAL. RATIONALE In the current implementation, there is no way to prevent inotify_add_watch() from modifying existing watch descriptors. Even if the caller keeps a record of all watch descriptors collected, this is only sufficient to detect that an existing watch descriptor may have been modified. The assumption that a particular path will map to the same inode over multiple calls to inotify_add_watch() cannot be made as files can be renamed or deleted. It is also not possible to assume that two distinct paths do no map to the same inode, due to hard-links or a dereferenced symbolic link. Further uses of inotify_add_watch() to revert the change may cause other watch descriptors to be modified or created, merely compunding the problem. There is currently no system call such as inotify_modify_watch() to explicity modify a watch descriptor, which would be able to revert unwanted changes. Thus the caller cannot guarantee to be able to revert any changes to existing watch decriptors. Additionally the caller cannot assume that the events that are associated with a watch descriptor are within the set requested, as any future calls to inotify_add_watch() may unintentionally modify a watch descriptor's mask. Thus it cannot currently be guaranteed that a watch descriptor will only generate events which have been requested. The program must filter events which come through its watch descriptor to within its expected range. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Henry Wilson <henry.wilson@acentic.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-31 17:43:03 +08:00
int create = (arg & IN_MASK_CREATE);
int ret;
mask = inotify_arg_to_mask(arg);
fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, group);
if (!fsn_mark)
return -ENOENT;
inotify: Add flag IN_MASK_CREATE for inotify_add_watch() The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch() which prevents inotify from modifying any existing watches when invoked. If the pathname specified in the call has a watched inode associated with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST. Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and will return EINVAL. RATIONALE In the current implementation, there is no way to prevent inotify_add_watch() from modifying existing watch descriptors. Even if the caller keeps a record of all watch descriptors collected, this is only sufficient to detect that an existing watch descriptor may have been modified. The assumption that a particular path will map to the same inode over multiple calls to inotify_add_watch() cannot be made as files can be renamed or deleted. It is also not possible to assume that two distinct paths do no map to the same inode, due to hard-links or a dereferenced symbolic link. Further uses of inotify_add_watch() to revert the change may cause other watch descriptors to be modified or created, merely compunding the problem. There is currently no system call such as inotify_modify_watch() to explicity modify a watch descriptor, which would be able to revert unwanted changes. Thus the caller cannot guarantee to be able to revert any changes to existing watch decriptors. Additionally the caller cannot assume that the events that are associated with a watch descriptor are within the set requested, as any future calls to inotify_add_watch() may unintentionally modify a watch descriptor's mask. Thus it cannot currently be guaranteed that a watch descriptor will only generate events which have been requested. The program must filter events which come through its watch descriptor to within its expected range. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Henry Wilson <henry.wilson@acentic.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-31 17:43:03 +08:00
else if (create)
return -EEXIST;
i_mark = container_of(fsn_mark, struct inotify_inode_mark, fsn_mark);
spin_lock(&fsn_mark->lock);
old_mask = fsn_mark->mask;
if (add)
fsn_mark->mask |= mask;
else
fsn_mark->mask = mask;
new_mask = fsn_mark->mask;
spin_unlock(&fsn_mark->lock);
if (old_mask != new_mask) {
/* more bits in old than in new? */
int dropped = (old_mask & ~new_mask);
/* more bits in this fsn_mark than the inode's mask? */
int do_inode = (new_mask & ~inode->i_fsnotify_mask);
/* update the inode with this new fsn_mark */
if (dropped || do_inode)
fsnotify_recalc_mask(inode->i_fsnotify_marks);
}
/* return the wd */
ret = i_mark->wd;
/* match the get from fsnotify_find_mark() */
fsnotify_put_mark(fsn_mark);
return ret;
}
static int inotify_new_watch(struct fsnotify_group *group,
struct inode *inode,
u32 arg)
{
struct inotify_inode_mark *tmp_i_mark;
__u32 mask;
int ret;
struct idr *idr = &group->inotify_data.idr;
spinlock_t *idr_lock = &group->inotify_data.idr_lock;
mask = inotify_arg_to_mask(arg);
tmp_i_mark = kmem_cache_alloc(inotify_inode_mark_cachep, GFP_KERNEL);
if (unlikely(!tmp_i_mark))
return -ENOMEM;
fsnotify_init_mark(&tmp_i_mark->fsn_mark, group);
tmp_i_mark->fsn_mark.mask = mask;
tmp_i_mark->wd = -1;
ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
if (ret)
goto out_err;
/* increment the number of watches the user has */
if (!inc_inotify_watches(group->inotify_data.ucounts)) {
inotify_remove_from_idr(group, tmp_i_mark);
ret = -ENOSPC;
goto out_err;
}
/* we are on the idr, now get on the inode */
ret = fsnotify_add_inode_mark_locked(&tmp_i_mark->fsn_mark, inode, 0);
if (ret) {
/* we failed to get on the inode, get off the idr */
inotify_remove_from_idr(group, tmp_i_mark);
goto out_err;
}
/* return the watch descriptor for this new mark */
ret = tmp_i_mark->wd;
out_err:
/* match the ref from fsnotify_init_mark() */
fsnotify_put_mark(&tmp_i_mark->fsn_mark);
return ret;
}
static int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg)
{
int ret = 0;
mutex_lock(&group->mark_mutex);
/* try to update and existing watch with the new arg */
ret = inotify_update_existing_watch(group, inode, arg);
/* no mark present, try to add a new one */
if (ret == -ENOENT)
ret = inotify_new_watch(group, inode, arg);
mutex_unlock(&group->mark_mutex);
return ret;
}
static struct fsnotify_group *inotify_new_group(unsigned int max_events)
{
struct fsnotify_group *group;
struct inotify_event_info *oevent;
group = fsnotify_alloc_group(&inotify_fsnotify_ops);
if (IS_ERR(group))
return group;
oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
if (unlikely(!oevent)) {
fsnotify_destroy_group(group);
return ERR_PTR(-ENOMEM);
}
group->overflow_event = &oevent->fse;
fsnotify_init_event(group->overflow_event, NULL, FS_Q_OVERFLOW);
oevent->wd = -1;
oevent->sync_cookie = 0;
oevent->name_len = 0;
group->max_events = max_events;
fs: fsnotify: account fsnotify metadata to kmemcg Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
group->memcg = get_mem_cgroup_from_mm(current->mm);
spin_lock_init(&group->inotify_data.idr_lock);
idr_init(&group->inotify_data.idr);
group->inotify_data.ucounts = inc_ucount(current_user_ns(),
current_euid(),
UCOUNT_INOTIFY_INSTANCES);
if (!group->inotify_data.ucounts) {
fsnotify_destroy_group(group);
return ERR_PTR(-EMFILE);
}
return group;
}
/* inotify syscalls */
static int do_inotify_init(int flags)
{
struct fsnotify_group *group;
int ret;
/* Check the IN_* constants for consistency. */
BUILD_BUG_ON(IN_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(IN_NONBLOCK != O_NONBLOCK);
flag parameters: NONBLOCK in inotify_init This patch adds non-blocking support for inotify_init1. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("inotify_init1(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_NONBLOCK); if (fd == -1) { puts ("inotify_init1(IN_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("inotify_init1(IN_NONBLOCK) set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 12:29:41 +08:00
if (flags & ~(IN_CLOEXEC | IN_NONBLOCK))
flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 12:29:32 +08:00
return -EINVAL;
/* fsnotify_obtain_group took a reference to group, we put this when we kill the file in the end */
group = inotify_new_group(inotify_max_queued_events);
if (IS_ERR(group))
return PTR_ERR(group);
ret = anon_inode_getfd("inotify", &inotify_fops, group,
O_RDONLY | flags);
if (ret < 0)
fsnotify_destroy_group(group);
return ret;
}
SYSCALL_DEFINE1(inotify_init1, int, flags)
{
return do_inotify_init(flags);
}
SYSCALL_DEFINE0(inotify_init)
flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 12:29:32 +08:00
{
return do_inotify_init(0);
flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 12:29:32 +08:00
}
SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
u32, mask)
{
struct fsnotify_group *group;
struct inode *inode;
struct path path;
struct fd f;
int ret;
unsigned flags = 0;
/*
* We share a lot of code with fs/dnotify. We also share
* the bit layout between inotify's IN_* and the fsnotify
* FS_*. This check ensures that only the inotify IN_*
* bits get passed in and set in watches/events.
*/
if (unlikely(mask & ~ALL_INOTIFY_BITS))
return -EINVAL;
/*
* Require at least one valid bit set in the mask.
* Without _something_ set, we would have no events to
* watch for.
*/
if (unlikely(!(mask & ALL_INOTIFY_BITS)))
return -EINVAL;
f = fdget(fd);
if (unlikely(!f.file))
return -EBADF;
inotify: Add flag IN_MASK_CREATE for inotify_add_watch() The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch() which prevents inotify from modifying any existing watches when invoked. If the pathname specified in the call has a watched inode associated with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST. Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and will return EINVAL. RATIONALE In the current implementation, there is no way to prevent inotify_add_watch() from modifying existing watch descriptors. Even if the caller keeps a record of all watch descriptors collected, this is only sufficient to detect that an existing watch descriptor may have been modified. The assumption that a particular path will map to the same inode over multiple calls to inotify_add_watch() cannot be made as files can be renamed or deleted. It is also not possible to assume that two distinct paths do no map to the same inode, due to hard-links or a dereferenced symbolic link. Further uses of inotify_add_watch() to revert the change may cause other watch descriptors to be modified or created, merely compunding the problem. There is currently no system call such as inotify_modify_watch() to explicity modify a watch descriptor, which would be able to revert unwanted changes. Thus the caller cannot guarantee to be able to revert any changes to existing watch decriptors. Additionally the caller cannot assume that the events that are associated with a watch descriptor are within the set requested, as any future calls to inotify_add_watch() may unintentionally modify a watch descriptor's mask. Thus it cannot currently be guaranteed that a watch descriptor will only generate events which have been requested. The program must filter events which come through its watch descriptor to within its expected range. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Henry Wilson <henry.wilson@acentic.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-31 17:43:03 +08:00
/* IN_MASK_ADD and IN_MASK_CREATE don't make sense together */
if (unlikely((mask & IN_MASK_ADD) && (mask & IN_MASK_CREATE))) {
ret = -EINVAL;
goto fput_and_out;
}
inotify: Add flag IN_MASK_CREATE for inotify_add_watch() The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch() which prevents inotify from modifying any existing watches when invoked. If the pathname specified in the call has a watched inode associated with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST. Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and will return EINVAL. RATIONALE In the current implementation, there is no way to prevent inotify_add_watch() from modifying existing watch descriptors. Even if the caller keeps a record of all watch descriptors collected, this is only sufficient to detect that an existing watch descriptor may have been modified. The assumption that a particular path will map to the same inode over multiple calls to inotify_add_watch() cannot be made as files can be renamed or deleted. It is also not possible to assume that two distinct paths do no map to the same inode, due to hard-links or a dereferenced symbolic link. Further uses of inotify_add_watch() to revert the change may cause other watch descriptors to be modified or created, merely compunding the problem. There is currently no system call such as inotify_modify_watch() to explicity modify a watch descriptor, which would be able to revert unwanted changes. Thus the caller cannot guarantee to be able to revert any changes to existing watch decriptors. Additionally the caller cannot assume that the events that are associated with a watch descriptor are within the set requested, as any future calls to inotify_add_watch() may unintentionally modify a watch descriptor's mask. Thus it cannot currently be guaranteed that a watch descriptor will only generate events which have been requested. The program must filter events which come through its watch descriptor to within its expected range. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Henry Wilson <henry.wilson@acentic.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-31 17:43:03 +08:00
/* verify that this is indeed an inotify instance */
if (unlikely(f.file->f_op != &inotify_fops)) {
ret = -EINVAL;
goto fput_and_out;
}
if (!(mask & IN_DONT_FOLLOW))
flags |= LOOKUP_FOLLOW;
if (mask & IN_ONLYDIR)
flags |= LOOKUP_DIRECTORY;
ret = inotify_find_inode(pathname, &path, flags);
if (ret)
goto fput_and_out;
/* inode held in place by reference to path; group by fget on fd */
inode = path.dentry->d_inode;
group = f.file->private_data;
/* create/update an inode mark */
ret = inotify_update_watch(group, inode, mask);
path_put(&path);
fput_and_out:
fdput(f);
return ret;
}
SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
{
struct fsnotify_group *group;
struct inotify_inode_mark *i_mark;
struct fd f;
int ret = 0;
f = fdget(fd);
if (unlikely(!f.file))
return -EBADF;
/* verify that this is indeed an inotify instance */
ret = -EINVAL;
if (unlikely(f.file->f_op != &inotify_fops))
goto out;
group = f.file->private_data;
ret = -EINVAL;
i_mark = inotify_idr_find(group, wd);
if (unlikely(!i_mark))
goto out;
ret = 0;
fsnotify_destroy_mark(&i_mark->fsn_mark, group);
/* match ref taken by inotify_idr_find */
fsnotify_put_mark(&i_mark->fsn_mark);
out:
fdput(f);
return ret;
}
/*
* inotify_user_setup - Our initialization function. Note that we cannot return
* error because we have compiled-in VFS hooks. So an (unlikely) failure here
* must result in panic().
*/
static int __init inotify_user_setup(void)
{
BUILD_BUG_ON(IN_ACCESS != FS_ACCESS);
BUILD_BUG_ON(IN_MODIFY != FS_MODIFY);
BUILD_BUG_ON(IN_ATTRIB != FS_ATTRIB);
BUILD_BUG_ON(IN_CLOSE_WRITE != FS_CLOSE_WRITE);
BUILD_BUG_ON(IN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
BUILD_BUG_ON(IN_OPEN != FS_OPEN);
BUILD_BUG_ON(IN_MOVED_FROM != FS_MOVED_FROM);
BUILD_BUG_ON(IN_MOVED_TO != FS_MOVED_TO);
BUILD_BUG_ON(IN_CREATE != FS_CREATE);
BUILD_BUG_ON(IN_DELETE != FS_DELETE);
BUILD_BUG_ON(IN_DELETE_SELF != FS_DELETE_SELF);
BUILD_BUG_ON(IN_MOVE_SELF != FS_MOVE_SELF);
BUILD_BUG_ON(IN_UNMOUNT != FS_UNMOUNT);
BUILD_BUG_ON(IN_Q_OVERFLOW != FS_Q_OVERFLOW);
BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
BUILD_BUG_ON(IN_EXCL_UNLINK != FS_EXCL_UNLINK);
BUILD_BUG_ON(IN_ISDIR != FS_ISDIR);
BUILD_BUG_ON(IN_ONESHOT != FS_IN_ONESHOT);
BUILD_BUG_ON(HWEIGHT32(ALL_INOTIFY_BITS) != 22);
fs: fsnotify: account fsnotify metadata to kmemcg Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 06:46:39 +08:00
inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark,
SLAB_PANIC|SLAB_ACCOUNT);
inotify_max_queued_events = 16384;
init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;
return 0;
}
fs_initcall(inotify_user_setup);