Commit Graph

597 Commits

Author SHA1 Message Date
Roman Gushchin b87d8cefe4 mm, memcg: rework remote charging API to support nesting
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.

  memalloc_use_memcg(target_memcg);
  <...>
  memalloc_unuse_memcg();

It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting.  On exit from
the inner block the active memcg will be cleared instead of being
restored.

  memalloc_use_memcg(target_memcg);

  memalloc_use_memcg(target_memcg_2);
    <...>
    memalloc_unuse_memcg();

    Error: allocation here are charged to the memcg of the current
    process instead of target_memcg.

  memalloc_unuse_memcg();

This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one.  So a remote charging
block will look like:

  old_memcg = set_active_memcg(target_memcg);
  <...>
  set_active_memcg(old_memcg);

This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Jan Kara 8aed8cebdd fanotify: compare fsid when merging name event
When merging name events, fsids of the two involved events have to
match. Otherwise we could merge events from two different filesystems
and thus effectively loose the second event.

Backporting note: Although the commit cacfb956d4 introducing this bug
was merged for 5.7, the relevant code didn't get used in the end until
7e8283af6e ("fanotify: report parent fid + name + child fid") which
will be merged with this patch. So there's no need for backporting this.

Fixes: cacfb956d4 ("fanotify: record name info for FAN_DIR_MODIFY event")
Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-28 10:58:07 +02:00
Amir Goldstein b9a1b97725 fsnotify: create method handle_inode_event() in fsnotify_operations
The method handle_event() grew a lot of complexity due to the design of
fanotify and merging of ignore masks.

Most backends do not care about this complex functionality, so we can hide
this complexity from them.

Introduce a method handle_inode_event() that serves those backends and
passes a single inode mark and less arguments.

This change converts all backends except fanotify and inotify to use the
simplified handle_inode_event() method.  In pricipal, inotify could have
also used the new method, but that would require passing more arguments
on the simple helper (data, data_type, cookie), so we leave it with the
handle_event() method.

Link: https://lore.kernel.org/r/20200722125849.17418-9-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:25:50 +02:00
Amir Goldstein 691d976352 fanotify: report parent fid + child fid
Add support for FAN_REPORT_FID | FAN_REPORT_DIR_FID.
Internally, it is implemented as a private case of reporting both
parent and child fids and name, the parent and child fids are recorded
in a variable length fanotify_name_event, but there is no name.

It should be noted that directory modification events are recorded
in fixed size fanotify_fid_event when not reporting name, just like
with group flags FAN_REPORT_FID.

Link: https://lore.kernel.org/r/20200716084230.30611-23-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:24:01 +02:00
Amir Goldstein 7e8283af6e fanotify: report parent fid + name + child fid
For a group with fanotify_init() flag FAN_REPORT_DFID_NAME, the parent
fid and name are reported for events on non-directory objects with an
info record of type FAN_EVENT_INFO_TYPE_DFID_NAME.

If the group also has the init flag FAN_REPORT_FID, the child fid
is also reported with another info record that follows the first info
record. The second info record is the same info record that would have
been reported to a group with only FAN_REPORT_FID flag.

When the child fid needs to be recorded, the variable size struct
fanotify_name_event is preallocated with enough space to store the
child fh between the dir fh and the name.

Link: https://lore.kernel.org/r/20200716084230.30611-22-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:24:00 +02:00
Amir Goldstein 929943b38d fanotify: add support for FAN_REPORT_NAME
Introduce a new fanotify_init() flag FAN_REPORT_NAME.  It requires the
flag FAN_REPORT_DIR_FID and there is a constant for setting both flags
named FAN_REPORT_DFID_NAME.

For a group with flag FAN_REPORT_NAME, the parent fid and name are
reported for directory entry modification events (create/detete/move)
and for events on non-directory objects.

Events on directories themselves are reported with their own fid and
"." as the name.

The parent fid and name are reported with an info record of type
FAN_EVENT_INFO_TYPE_DFID_NAME, similar to the way that parent fid is
reported with into type FAN_EVENT_INFO_TYPE_DFID, but with an appended
null terminated name string.

Link: https://lore.kernel.org/r/20200716084230.30611-21-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:24:00 +02:00
Amir Goldstein 5128063739 fanotify: report events with parent dir fid to sb/mount/non-dir marks
In a group with flag FAN_REPORT_DIR_FID, when adding an inode mark with
FAN_EVENT_ON_CHILD, events on non-directory children are reported with
the fid of the parent.

When adding a filesystem or mount mark or mark on a non-dir inode, we
want to report events that are "possible on child" (e.g. open/close)
also with fid of the parent, as if the victim inode's parent is
interested in events "on child".

Some events, currently only FAN_MOVE_SELF, should be reported to a
sb/mount/non-dir mark with parent fid even though they are not
reported to a watching parent.

To get the desired behavior we set the flag FAN_EVENT_ON_CHILD on
all the sb/mount/non-dir mark masks in a group with FAN_REPORT_DIR_FID.

Link: https://lore.kernel.org/r/20200716084230.30611-20-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:24:00 +02:00
Amir Goldstein 83b7a59896 fanotify: add basic support for FAN_REPORT_DIR_FID
For now, the flag is mutually exclusive with FAN_REPORT_FID.
Events include a single info record of type FAN_EVENT_INFO_TYPE_DFID
with a directory file handle.

For now, events are only reported for:
- Directory modification events
- Events on children of a watching directory
- Events on directory objects

Soon, we will add support for reporting the parent directory fid
for events on non-directories with filesystem/mount mark and
support for reporting both parent directory fid and child fid.

Link: https://lore.kernel.org/r/20200716084230.30611-19-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:24:00 +02:00
Amir Goldstein 9b93f33105 fsnotify: send event with parent/name info to sb/mount/non-dir marks
Similar to events "on child" to watching directory, send event
with parent/name info if sb/mount/non-dir marks are interested in
parent/name info.

The FS_EVENT_ON_CHILD flag can be set on sb/mount/non-dir marks to specify
interest in parent/name info for events on non-directory inodes.

Events on "orphan" children (disconnected dentries) are sent without
parent/name info.

Events on directories are sent with parent/name info only if the parent
directory is watching.

After this change, even groups that do not subscribe to events on
children could get an event with mark iterator type TYPE_CHILD and
without mark iterator type TYPE_INODE if fanotify has marks on the same
objects.

dnotify and inotify event handlers can already cope with that situation.
audit does not subscribe to events that are possible on child, so won't
get to this situation. nfsd does not access the marks iterator from its
event handler at the moment, so it is not affected.

This is a bit too fragile, so we should prepare all groups to cope with
mark type TYPE_CHILD preferably using a generic helper.

Link: https://lore.kernel.org/r/20200716084230.30611-16-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:21:02 +02:00
Amir Goldstein 957f7b472c inotify: do not set FS_EVENT_ON_CHILD in non-dir mark mask
FS_EVENT_ON_CHILD has currently no meaning for non-dir inode marks. In
the following patches we want to use that bit to mean that mark's
notification group cares about parent and name information. So stop
setting FS_EVENT_ON_CHILD for non-dir marks.

Link: https://lore.kernel.org/r/20200722125849.17418-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:16:16 +02:00
Amir Goldstein 40a100d3ad fsnotify: pass dir and inode arguments to fsnotify()
The arguments of fsnotify() are overloaded and mean different things
for different event types.

Replace the to_tell argument with separate arguments @dir and @inode,
because we may be sending to both dir and child.  Using the @data
argument to pass the child is not enough, because dirent events pass
this argument (for audit), but we do not report to child.

Document the new fsnotify() function argumenets.

Link: https://lore.kernel.org/r/20200722125849.17418-7-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:15:48 +02:00
Amir Goldstein 82ace1efb3 fsnotify: create helper fsnotify_inode()
Simple helper to consolidate biolerplate code.

Link: https://lore.kernel.org/r/20200722125849.17418-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:13:51 +02:00
Amir Goldstein 497b0c5a7c fsnotify: send event to parent and child with single callback
Instead of calling fsnotify() twice, once with parent inode and once
with child inode, if event should be sent to parent inode, send it
with both parent and child inodes marks in object type iterator and call
the backend handle_event() callback only once.

The parent inode is assigned to the standard "inode" iterator type and
the child inode is assigned to the special "child" iterator type.

In that case, the bit FS_EVENT_ON_CHILD will be set in the event mask,
the dir argument to handle_event will be the parent inode, the file_name
argument to handle_event is non NULL and refers to the name of the child
and the child inode can be accessed with fsnotify_data_inode().

This will allow fanotify to make decisions based on child or parent's
ignored mask.  For example, when a parent is interested in a specific
event on its children, but a specific child wishes to ignore this event,
the event will not be reported.  This is not what happens with current
code, but according to man page, it is the expected behavior.

Link: https://lore.kernel.org/r/20200716084230.30611-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:24:52 +02:00
Amir Goldstein c8f3446c66 inotify: report both events on parent and child with single callback
fsnotify usually calls inotify_handle_event() once for watching parent
to report event with child's name and once for watching child to report
event without child's name.

Do the same thing with a single callback instead of two callbacks when
marks iterator contains both inode and child entries.

Link: https://lore.kernel.org/r/20200716084230.30611-13-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:24:51 +02:00
Amir Goldstein 62cb0af4ce dnotify: report both events on parent and child with single callback
For some events (e.g. DN_ATTRIB on sub-directory) fsnotify may call
dnotify_handle_event() once for watching parent and once again for
the watching sub-directory.

Do the same thing with a single callback instead of two callbacks when
marks iterator contains both inode and child entries.

Link: https://lore.kernel.org/r/20200716084230.30611-12-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:24:51 +02:00
Amir Goldstein f35c415678 fanotify: no external fh buffer in fanotify_name_event
The fanotify_fh struct has an inline buffer of size 12 which is enough
to store the most common local filesystem file handles (e.g. ext4, xfs).
For file handles that do not fit in the inline buffer (e.g. btrfs), an
external buffer is allocated to store the file handle.

When allocating a variable size fanotify_name_event, there is no point
in allocating also an external fh buffer when file handle does not fit
in the inline buffer.

Check required size for encoding fh, preallocate an event buffer
sufficient to contain both file handle and name and store the name after
the file handle.

At this time, when not reporting name in event, we still allocate
the fixed size fanotify_fid_event and an external buffer for large
file handles, but fanotify_alloc_name_event() has already been prepared
to accept a NULL file_name.

Link: https://lore.kernel.org/r/20200716084230.30611-11-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:37 +02:00
Amir Goldstein f454fa610a fanotify: use struct fanotify_info to parcel the variable size buffer
An fanotify event name is always recorded relative to a dir fh.
Encapsulate the name_len member of fanotify_name_event in a new struct
fanotify_info, which describes the parceling of the variable size
buffer of an fanotify_name_event.

The dir_fh member of fanotify_name_event is renamed to _dir_fh and is not
accessed directly, but via the fanotify_info_dir_fh() accessor.
Although the dir_fh len information is already available in struct
fanotify_fh, we store it also in dif_fh_totlen member of fanotify_info,
including the size of fanotify_fh header, so we know the offset of the
name in the buffer without looking inside the dir_fh.

We also add a file_fh_totlen member to allow packing another file handle
in the variable size buffer after the dir_fh and before the name.
We are going to use that space to store the child fid.

Link: https://lore.kernel.org/r/20200716084230.30611-10-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:37 +02:00
Amir Goldstein 85af5d9258 fanotify: use FAN_EVENT_ON_CHILD as implicit flag on sb/mount/non-dir marks
Up to now, fanotify allowed to set the FAN_EVENT_ON_CHILD flag on
sb/mount marks and non-directory inode mask, but the flag was ignored.

Mask out the flag if it is provided by user on sb/mount/non-dir marks
and define it as an implicit flag that cannot be removed by user.

This flag is going to be used internally to request for events with
parent and name info.

Link: https://lore.kernel.org/r/20200716084230.30611-8-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:37 +02:00
Amir Goldstein 4ed6814a91 fanotify: prepare for implicit event flags in mark mask
So far, all flags that can be set in an fanotify mark mask can be set
explicitly by a call to fanotify_mark(2).

Prepare for defining implicit event flags that cannot be set by user with
fanotify_mark(2), similar to how inotify/dnotify implicitly set the
FS_EVENT_ON_CHILD flag.

Implicit event flags cannot be removed by user and mark gets destroyed
when only implicit event flags remain in the mask.

Link: https://lore.kernel.org/r/20200716084230.30611-7-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein 3ef8665366 fanotify: mask out special event flags from ignored mask
The special event flags (FAN_ONDIR, FAN_EVENT_ON_CHILD) never had
any meaning in ignored mask. Mask them out explicitly.

Link: https://lore.kernel.org/r/20200716084230.30611-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein d809daf1b6 fanotify: generalize test for FAN_REPORT_FID
As preparation for new flags that report fids, define a bit set
of flags for a group reporting fids, currently containing the
only bit FAN_REPORT_FID.

Link: https://lore.kernel.org/r/20200716084230.30611-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein 6ad1aadd97 fanotify: distinguish between fid encode error and null fid
In fanotify_encode_fh(), both cases of NULL inode and failure to encode
ended up with fh type FILEID_INVALID.

Distiguish the case of NULL inode, by setting fh type to FILEID_ROOT.
This is just a semantic difference at this point.

Remove stale comment and unneeded check from fid event compare helpers.

Link: https://lore.kernel.org/r/20200716084230.30611-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein 103ff6a554 fanotify: generalize merge logic of events on dir
An event on directory should never be merged with an event on
non-directory regardless of the event struct type.

This change has no visible effect, because currently, with struct
fanotify_path_event, the relevant events will not be merged because
event path of dir will be different than event path of non-dir.

Link: https://lore.kernel.org/r/20200716084230.30611-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein 0badfa029e fanotify: generalize the handling of extra event flags
In fanotify_group_event_mask() there is logic in place to make sure we
are not going to handle an event with no type and just FAN_ONDIR flag.
Generalize this logic to any FANOTIFY_EVENT_FLAGS.

There is only one more flag in this group at the moment -
FAN_EVENT_ON_CHILD. We never report it to user, but we do pass it in to
fanotify_alloc_event() when group is reporting fid as indication that
event happened on child. We will have use for this indication later on.

Link: https://lore.kernel.org/r/20200716084230.30611-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein 08b95c338e fanotify: remove event FAN_DIR_MODIFY
It was never enabled in uapi and its functionality is about to be
superseded by events FAN_CREATE, FAN_DELETE, FAN_MOVE with group
flag FAN_REPORT_NAME.

Keep a place holder variable name_event instead of removing the
name recording code since it will be used by the new events.

Link: https://lore.kernel.org/r/20200708111156.24659-17-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 21:23:36 +02:00
Amir Goldstein b54cecf5e2 fsnotify: pass dir argument to handle_event() callback
The 'inode' argument to handle_event(), sometimes referred to as
'to_tell' is somewhat obsolete.
It is a remnant from the times when a group could only have an inode mark
associated with an event.

We now pass an iter_info array to the callback, with all marks associated
with an event.

Most backends ignore this argument, with two exceptions:
1. dnotify uses it for sanity check that event is on directory
2. fanotify uses it to report fid of directory on directory entry
   modification events

Remove the 'inode' argument and add a 'dir' argument.
The callback function signature is deliberately changed, because
the meaning of the argument has changed and the arguments have
been documented.

The 'dir' argument is set to when 'file_name' is specified and it is
referring to the directory that the 'file_name' entry belongs to.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 18:32:47 +02:00
Amir Goldstein 9c61f3b560 fanotify: break up fanotify_alloc_event()
Break up fanotify_alloc_event() into helpers by event struct type.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:41:33 +02:00
Amir Goldstein b8a6c3a2f0 fanotify: create overflow event type
The special overflow event is allocated as struct fanotify_path_event,
but with a null path.

Use a special event type to identify the overflow event, so the helper
fanotify_has_event_path() will always indicate a non null path.

Allocating the overflow event doesn't need any of the fancy stuff in
fanotify_alloc_event(), so create a simplified helper for allocating the
overflow event.

There is also no need to store and report the pid with an overflow event.

Link: https://lore.kernel.org/r/20200708111156.24659-7-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:37:03 +02:00
Amir Goldstein 956235afd1 inotify: do not use objectid when comparing events
inotify's event->wd is the object identifier.
Compare that instead of the common fsnotidy event objectid, so
we can get rid of the objectid field later.

Link: https://lore.kernel.org/r/20200708111156.24659-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:58 +02:00
Amir Goldstein cbcf47adc8 fsnotify: return non const from fsnotify_data_inode()
Return non const inode pointer from fsnotify_data_inode().
None of the fsnotify hooks pass const inode pointer as data and
callers often need to cast to a non const pointer.

Link: https://lore.kernel.org/r/20200708111156.24659-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:45 +02:00
Amir Goldstein c738fbabb0 fsnotify: fold fsnotify() call into fsnotify_parent()
All (two) callers of fsnotify_parent() also call fsnotify() to notify
the child inode. Move the second fsnotify() call into fsnotify_parent().

This will allow more flexibility in making decisions about which of the
two event falvors should be sent.

Using 'goto notify_child' in the inline helper seems a bit strange, but
it mimics the code in __fsnotify_parent() for clarity and the goto
pattern will become less strage after following patches are applied.

Link: https://lore.kernel.org/r/20200708111156.24659-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 17:36:41 +02:00
Mel Gorman 71d734103e fsnotify: Rearrange fast path to minimise overhead when there is no watcher
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed functional call even if there
are no watchers and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no
watchers.

The test motivating this was "perf bench sched messaging --pipe". Despite
the fact the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.

                              5.7.0                  5.7.0
                            vanilla      fastfsnotify-v1r1
Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)

                       5.7.0       5.7.0
                     vanilla fastfsnotify-v1r1
Duration User         157.05      152.79
Duration System      1279.98     1219.32
Duration Elapsed      182.81      174.52

This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
mostly function entry and the initial check for watchers.  The check for
watchers is complicated enough that inlining it may be controversial.

[Amir] Slightly simplify with mnt_or_sb_mask => marks_mask

Link: https://lore.kernel.org/r/20200708111156.24659-1-amir73il@gmail.com
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 15:29:10 +02:00
Jan Kara 47aaabdedf fanotify: Avoid softlockups when reading many events
When user provides large buffer for events and there are lots of events
available, we can try to copy them all to userspace without scheduling
which can softlockup the kernel (furthermore exacerbated by the
contention on notification_lock). Add a scheduling point after copying
each event.

Note that usually the real underlying problem is the cost of fanotify
event merging and the resulting contention on notification_lock but this
is a cheap way to somewhat reduce the problem until we can properly
address that.

Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Link: https://lore.kernel.org/lkml/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-15 15:23:28 +02:00
Masahiro Yamada a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Linus Torvalds 07c8f3bfef \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl7Y2McACgkQnJ2qBz9k
 QNlHzwf/e4oz9oRCXPqBwh6C318nl6ksQO5ooW+Dhb535cr/Cn99nuZa3GrvW+aq
 eSbypsvZQMguk0/okEc4jcTgLmEw+KubpBXOi/DJZ9dzGQrvjT2nBkQmaTqwp9dO
 WMZcJLmszkrtokjKD4lVjyQArcwqQF/v/moEKIImw5A6CY4R4odTaUOCPnTwF7P6
 OXsDPwRfAccJ25ZUZ8hjc+fRl/Ncex6szciaJ08T4btlaAtc5UIn5Sy/u8BqNNiw
 0VRheD4sJ2c25hLOIQJ5RETIeuYaRcR/BA3vm+k1d2iIiw4ubj9+ppwiaWOryA9U
 5fXnBmXKuUUrwFihzmiLSckIpm3IPg==
 =kghV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "Several smaller fixes and cleanups for fsnotify subsystem"

* tag 'fsnotify_for_v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: fix ignore mask logic for events on child and on dir
  fanotify: don't write with size under sizeof(response)
  fsnotify: Remove proc_fs.h include
  fanotify: remove reference to fill_event_metadata()
  fsnotify: add mutex destroy
  fanotify: prefix should_merge()
  fanotify: Replace zero-length array with flexible-array
  inotify: Fix error return code assignment flow.
  fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()
2020-06-04 13:51:54 -07:00
Linus Torvalds b23c4771ff A fair amount of stuff this time around, dominated by yet another massive
set from Mauro toward the completion of the RST conversion.  I *really*
 hope we are getting close to the end of this.  Meanwhile, those patches
 reach pretty far afield to update document references around the tree;
 there should be no actual code changes there.  There will be, alas, more of
 the usual trivial merge conflicts.
 
 Beyond that we have more translations, improvements to the sphinx
 scripting, a number of additions to the sysctl documentation, and lots of
 fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
 bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
 +NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
 RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
 SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
 oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
 =fDC/
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.8' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "A fair amount of stuff this time around, dominated by yet another
  massive set from Mauro toward the completion of the RST conversion. I
  *really* hope we are getting close to the end of this. Meanwhile,
  those patches reach pretty far afield to update document references
  around the tree; there should be no actual code changes there. There
  will be, alas, more of the usual trivial merge conflicts.

  Beyond that we have more translations, improvements to the sphinx
  scripting, a number of additions to the sysctl documentation, and lots
  of fixes"

* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
  Documentation: fixes to the maintainer-entry-profile template
  zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
  tracing: Fix events.rst section numbering
  docs: acpi: fix old http link and improve document format
  docs: filesystems: add info about efivars content
  Documentation: LSM: Correct the basic LSM description
  mailmap: change email for Ricardo Ribalda
  docs: sysctl/kernel: document unaligned controls
  Documentation: admin-guide: update bug-hunting.rst
  docs: sysctl/kernel: document ngroups_max
  nvdimm: fixes to maintainter-entry-profile
  Documentation/features: Correct RISC-V kprobes support entry
  Documentation/features: Refresh the arch support status files
  Revert "docs: sysctl/kernel: document ngroups_max"
  docs: move locking-specific documents to locking/
  docs: move digsig docs to the security book
  docs: move the kref doc into the core-api book
  docs: add IRQ documentation at the core-api book
  docs: debugging-via-ohci1394.txt: add it to the core-api book
  docs: fix references for ipmi.rst file
  ...
2020-06-01 15:45:27 -07:00
Amir Goldstein f17936993a fanotify: turn off support for FAN_DIR_MODIFY
FAN_DIR_MODIFY has been enabled by commit 44d705b037 ("fanotify:
report name info for FAN_DIR_MODIFY event") in 5.7-rc1. Now we are
planning further extensions to the fanotify API and during that we
realized that FAN_DIR_MODIFY may behave slightly differently to be more
consistent with extensions we plan. So until we finalize these
extensions, let's not bind our hands with exposing FAN_DIR_MODIFY to
userland.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-27 18:55:54 +02:00
Amir Goldstein 2f02fd3fa1 fanotify: fix ignore mask logic for events on child and on dir
The comments in fanotify_group_event_mask() say:

  "If the event is on dir/child and this mark doesn't care about
   events on dir/child, don't send it!"

Specifically, mount and filesystem marks do not care about events
on child, but they can still specify an ignore mask for those events.
For example, a group that has:
- A mount mark with mask 0 and ignore_mask FAN_OPEN
- An inode mark on a directory with mask FAN_OPEN | FAN_OPEN_EXEC
  with flag FAN_EVENT_ON_CHILD

A child file open for exec would be reported to group with the FAN_OPEN
event despite the fact that FAN_OPEN is in ignore mask of mount mark,
because the mark iteration loop skips over non-inode marks for events
on child when calculating the ignore mask.

Move ignore mask calculation to the top of the iteration loop block
before excluding marks for events on dir/child.

Link: https://lore.kernel.org/r/20200524072441.18258-1-amir73il@gmail.com
Reported-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/linux-fsdevel/20200521162443.GA26052@quack2.suse.cz/
Fixes: 55bf882c7f "fanotify: fix merging marks masks with FAN_ONDIR"
Fixes: b469e7e47c "fanotify: fix handling of events on child..."
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-25 10:43:56 +02:00
Fabian Frederick 5e23663b49 fanotify: don't write with size under sizeof(response)
fanotify_write() only aligned copy_from_user size to sizeof(response)
for higher values. This patch avoids all values below as suggested
by Amir Goldstein and set to response size unconditionally.

Link: https://lore.kernel.org/r/20200512181921.405973-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 17:16:57 +02:00
Fabian Frederick 5a449099b9 fsnotify: Remove proc_fs.h include
proc_fs.h was already included in fdinfo.h

Link: https://lore.kernel.org/r/20200512181906.405927-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 17:16:14 +02:00
Fabian Frederick c5e443cb7b fanotify: remove reference to fill_event_metadata()
fill_event_metadata() was removed in commit bb2f7b4542
("fanotify: open code fill_event_metadata()")

Link: https://lore.kernel.org/r/20200512181836.405879-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 17:15:43 +02:00
Fabian Frederick 191e1656d1 fsnotify: add mutex destroy
Call mutex_destroy() before freeing notification group. This only adds
some additional debug checks when mutex debugging is enabled but still
it may be useful.

Link: https://lore.kernel.org/r/20200512181803.405832-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 17:14:20 +02:00
Fabian Frederick ab3c4da0ad fanotify: prefix should_merge()
Prefix function with fanotify_ like others.

Link: https://lore.kernel.org/r/20200512181715.405728-1-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-13 17:13:12 +02:00
Gustavo A. R. Silva 374ad001f7 fanotify: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Link: https://lore.kernel.org/r/20200507185230.GA14229@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-08 10:38:12 +02:00
youngjun 7b26aa243d inotify: Fix error return code assignment flow.
If error code is initialized -EINVAL, there is no need to assign -EINVAL.

Link: https://lore.kernel.org/r/20200426143316.29877-1-her0gyugyu@gmail.com
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-04-27 12:14:56 +02:00
Mauro Carvalho Chehab 0c1bc6b845 docs: filesystems: fix renamed references
Some filesystem references got broken by a previous patch
series I submitted. Address those.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # fs/affs/Kconfig
Link: https://lore.kernel.org/r/57318c53008dbda7f6f4a5a9e5787f4d37e8565a.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-04-20 15:45:22 -06:00
Jules Irenge 00e0afb658 fsnotify: Add missing annotation for fsnotify_finish_user_wait() and for fsnotify_prepare_user_wait()
Sparse reports warnings at fsnotify_prepare_user_wait()
	and at fsnotify_finish_user_wait()

warning: context imbalance in fsnotify_finish_user_wait()
	- wrong count at exit
warning: context imbalance in fsnotify_prepare_user_wait()
	- unexpected unlock

The root cause is the missing annotation at fsnotify_finish_user_wait()
	and at fsnotify_prepare_user_wait()
fsnotify_prepare_user_wait() has an extra annotation __release()
 that only tell Sparse and not GCC to shutdown the warning

Add the missing  __acquires(&fsnotify_mark_srcu) annotation
Add the missing __releases(&fsnotify_mark_srcu) annotation
Add the __release(&fsnotify_mark_srcu) annotation.

Link: https://lore.kernel.org/r/20200413214240.15245-1-jbi.octave@gmail.com
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-04-15 11:44:43 +02:00
Nathan Chancellor 6def1a1d2d fanotify: Fix the checks in fanotify_fsid_equal
Clang warns:

fs/notify/fanotify/fanotify.c:28:23: warning: self-comparison always
evaluates to true [-Wtautological-compare]
        return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1];
                             ^
fs/notify/fanotify/fanotify.c:28:57: warning: self-comparison always
evaluates to true [-Wtautological-compare]
        return fsid1->val[0] == fsid1->val[0] && fsid2->val[1] == fsid2->val[1];
                                                               ^
2 warnings generated.

The intention was clearly to compare val[0] and val[1] in the two
different fsid structs. Fix it otherwise this function always returns
true.

Fixes: afc894c784 ("fanotify: Store fanotify handles differently")
Link: https://github.com/ClangBuiltLinux/linux/issues/952
Link: https://lore.kernel.org/r/20200327171030.30625-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-30 12:40:53 +02:00
Amir Goldstein 44d705b037 fanotify: report name info for FAN_DIR_MODIFY event
Report event FAN_DIR_MODIFY with name in a variable length record similar
to how fid's are reported.  With name info reporting implemented, setting
FAN_DIR_MODIFY in mark mask is now allowed.

When events are reported with name, the reported fid identifies the
directory and the name follows the fid. The info record type for this
event info is FAN_EVENT_INFO_TYPE_DFID_NAME.

For now, all reported events have at most one info record which is
either FAN_EVENT_INFO_TYPE_FID or FAN_EVENT_INFO_TYPE_DFID_NAME (for
FAN_DIR_MODIFY).  Later on, events "on child" will report both records.

There are several ways that an application can use this information:

1. When watching a single directory, the name is always relative to
the watched directory, so application need to fstatat(2) the name
relative to the watched directory.

2. When watching a set of directories, the application could keep a map
of dirfd for all watched directories and hash the map by fid obtained
with name_to_handle_at(2).  When getting a name event, the fid in the
event info could be used to lookup the base dirfd in the map and then
call fstatat(2) with that dirfd.

3. When watching a filesystem (FAN_MARK_FILESYSTEM) or a large set of
directories, the application could use open_by_handle_at(2) with the fid
in event info to obtain dirfd for the directory where event happened and
call fstatat(2) with this dirfd.

The last option scales better for a large number of watched directories.
The first two options may be available in the future also for non
privileged fanotify watchers, because open_by_handle_at(2) requires
the CAP_DAC_READ_SEARCH capability.

Link: https://lore.kernel.org/r/20200319151022.31456-15-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-25 23:17:16 +01:00