License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                             # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                        270
   GPL-2.0+ WITH Linux-syscall-note                       169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)     21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)     17
   LGPL-2.1+ WITH Linux-syscall-note                       15
   GPL-1.0+ WITH Linux-syscall-note                        14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)     5
   LGPL-2.0+ WITH Linux-syscall-note                        4
   LGPL-2.1 WITH Linux-syscall-note                         3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)               3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)              1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
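
The heuristics above can be read as a small decision procedure. The C sketch below is only an illustration of that reading: the real analysis was done by hand in a spreadsheet, and classify(), default_identifier() and the verdict enum are hypothetical names that do not belong to either scanner or to any kernel tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of comparing the two scanner results for one file. */
enum verdict {
	VERDICT_DEFAULT,	/* no license traces: top level COPYING applies */
	VERDICT_CONCLUDED,	/* both scanners agree on the license */
	VERDICT_MANUAL		/* scanners disagree: inspect the file by hand */
};

/*
 * scan_a and scan_b hold the license detected by each scanner, or NULL
 * when that scanner found nothing. Hypothetical helper for illustration.
 */
static enum verdict classify(const char *scan_a, const char *scan_b)
{
	if (!scan_a && !scan_b)
		return VERDICT_DEFAULT;
	if (scan_a && scan_b && strcmp(scan_a, scan_b) == 0)
		return VERDICT_CONCLUDED;
	return VERDICT_MANUAL;
}

/* Identifier applied to a file in which no license traces were found. */
static const char *default_identifier(const char *path)
{
	assert(path != NULL);
	return strstr(path, "/uapi/") ? "GPL-2.0 WITH Linux-syscall-note"
				      : "GPL-2.0";
}
```

Files that end up as VERDICT_MANUAL correspond to the cases that were inspected by hand and, where still unclear, confirmed with lawyers or flagged for later.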
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
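
The last step described in the commit message above, adding the tag in the comment style each file type expects, can be sketched in C as follows. This is a minimal illustration, assuming the kernel convention of a `//` comment for .c files and a `/* */` comment for headers; spdx_line() and has_suffix() are hypothetical helpers and are not the script that was actually used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return nonzero if path ends with the given suffix. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Format the SPDX line for path into buf. Returns 0 on success, or -1 if
 * the file type is not handled by this sketch or buf is too small.
 */
static int spdx_line(const char *path, const char *license,
		     char *buf, size_t size)
{
	const char *fmt;
	int n;

	assert(path && license && buf);
	if (has_suffix(path, ".c"))
		fmt = "// SPDX-License-Identifier: %s";
	else if (has_suffix(path, ".h"))
		fmt = "/* SPDX-License-Identifier: %s */";
	else
		return -1;	/* Makefiles, scripts etc. are left out here */

	n = snprintf(buf, size, fmt, license);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For this header the sketch yields the same line that the blame output shows next.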
/* SPDX-License-Identifier: GPL-2.0 */
2010-07-07 11:24:06 +08:00
#undef TRACE_SYSTEM
#define TRACE_SYSTEM writeback

#if !defined(_TRACE_WRITEBACK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_WRITEBACK_H

2014-02-27 01:52:13 +08:00
#include <linux/tracepoint.h>
2010-07-07 11:24:06 +08:00
#include <linux/backing-dev.h>
#include <linux/writeback.h>

2010-12-02 07:33:37 +08:00
#define show_inode_state(state) \
	__print_flags(state, "|", \
		{I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
		{I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
		{I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
		{I_NEW, "I_NEW"}, \
		{I_WILL_FREE, "I_WILL_FREE"}, \
		{I_FREEING, "I_FREEING"}, \
		{I_CLEAR, "I_CLEAR"}, \
		{I_SYNC, "I_SYNC"}, \
2015-02-02 13:37:00 +08:00
		{I_DIRTY_TIME, "I_DIRTY_TIME"}, \
		{I_DIRTY_TIME_EXPIRED, "I_DIRTY_TIME_EXPIRED"}, \
2010-12-02 07:33:37 +08:00
		{I_REFERENCED, "I_REFERENCED"} \
	)

2015-03-28 05:08:41 +08:00
/* enums need to be exported to user space */
#undef EM
#undef EMe
#define EM(a,b) TRACE_DEFINE_ENUM(a);
#define EMe(a,b) TRACE_DEFINE_ENUM(a);

2011-12-09 06:53:54 +08:00
#define WB_WORK_REASON \
2015-03-28 05:08:41 +08:00
	EM( WB_REASON_BACKGROUND, "background") \
2017-02-25 06:56:14 +08:00
	EM( WB_REASON_VMSCAN, "vmscan") \
2015-03-28 05:08:41 +08:00
	EM( WB_REASON_SYNC, "sync") \
	EM( WB_REASON_PERIODIC, "periodic") \
	EM( WB_REASON_LAPTOP_TIMER, "laptop_timer") \
	EM( WB_REASON_FREE_MORE_MEM, "free_more_memory") \
	EM( WB_REASON_FS_FREE_SPACE, "fs_free_space") \
	EMe(WB_REASON_FORKER_THREAD, "forker_thread")

WB_WORK_REASON

/*
 * Now redefine the EM() and EMe() macros to map the enums to the strings
 * that will be printed in the output.
 */
#undef EM
#undef EMe
#define EM(a,b) { a, b },
#define EMe(a,b) { a, b }
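
The EM()/EMe() definitions above expand a single list twice with different meanings. The userspace C sketch below shows the same pattern in isolation. It is an illustration only: the reason enum, REASON_LIST and the lookup helpers are hypothetical, and the counting pass merely stands in for TRACE_DEFINE_ENUM(), which only exists inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum reason { REASON_BACKGROUND, REASON_SYNC, REASON_PERIODIC };

/* One list of (value, string) pairs; EMe() marks the last entry. */
#define REASON_LIST \
	EM( REASON_BACKGROUND, "background") \
	EM( REASON_SYNC, "sync") \
	EMe(REASON_PERIODIC, "periodic")

/* First pass: count the entries (stand-in for TRACE_DEFINE_ENUM). */
#define EM(a, b)	+ 1
#define EMe(a, b)	+ 1
enum { REASON_COUNT = 0 REASON_LIST };

/* Second pass: redefine the macros to build the value/string table. */
#undef EM
#undef EMe
#define EM(a, b)	{ a, b },
#define EMe(a, b)	{ a, b }

struct reason_name {
	int value;
	const char *name;
};

static const struct reason_name reason_names[] = { REASON_LIST };

/* Map an enum value to its printable string. */
static const char *reason_to_string(int value)
{
	size_t i;

	assert(sizeof(reason_names) / sizeof(reason_names[0]) ==
	       (size_t)REASON_COUNT);
	for (i = 0; i < (size_t)REASON_COUNT; i++)
		if (reason_names[i].value == value)
			return reason_names[i].name;
	return "unknown";
}

/* Map a string back to its enum value, or -1 if it is not in the list. */
static int reason_from_string(const char *name)
{
	size_t i;

	for (i = 0; i < (size_t)REASON_COUNT; i++)
		if (strcmp(reason_names[i].name, name) == 0)
			return reason_names[i].value;
	return -1;
}
```

Because both passes consume the same list, adding an entry in one place keeps the count and the string table in step.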
2011-12-09 06:53:54 +08:00

2010-07-07 11:24:06 +08:00
struct wb_writeback_work;

2019-05-14 08:23:11 +08:00
DECLARE_EVENT_CLASS(writeback_page_template,
2013-01-12 05:06:37 +08:00

	TP_PROTO(struct page *page, struct address_space *mapping),

	TP_ARGS(page, mapping),

	TP_STRUCT__entry (
		__array(char, name, 32)
2019-11-05 07:54:29 +08:00
		__field(ino_t, ino)
2013-01-12 05:06:37 +08:00
		__field(pgoff_t, index)
	),

	TP_fast_assign(
include/trace/events/writeback.h: fix -Wstringop-truncation warnings
There are many of those warnings.
In file included from ./arch/powerpc/include/asm/paca.h:15,
from ./arch/powerpc/include/asm/current.h:13,
from ./include/linux/thread_info.h:21,
from ./include/asm-generic/preempt.h:5,
from ./arch/powerpc/include/generated/asm/preempt.h:1,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:51,
from fs/fs-writeback.c:19:
In function 'strncpy',
inlined from 'perf_trace_writeback_page_template' at
./include/trace/events/writeback.h:56:1:
./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
return __builtin_strncpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix it by replacing strncpy() with the new strscpy_pad(), which was
introduced in "lib/string: Add strscpy_pad() function" and always
NUL-terminates the destination. Also, change strlcpy() to strscpy_pad()
in this file for consistency.
Link: http://lkml.kernel.org/r/1564075099-27750-1-git-send-email-cai@lca.pw
Fixes: 455b2864686d ("writeback: Initial tracing support")
Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nitin Gote <nitin.r.gote@intel.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 07:46:16 +08:00
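
The warning quoted above exists because strncpy() with a bound equal to the destination size leaves the buffer without a NUL terminator whenever the source is at least that long. The sketch below is a userspace stand-in for the behaviour the commit message relies on: always terminate, and zero-fill the remainder. demo_strscpy_pad() is a hypothetical name and is not the kernel implementation. As a simplification it returns -1 on truncation, whereas the kernel function returns -E2BIG.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (size bytes), always NUL-terminating and zero-filling
 * the unused tail. Returns the number of bytes copied, or -1 if src had to
 * be truncated or size is 0.
 *
 * By contrast, strncpy(dst, src, size) writes no terminator at all when
 * strlen(src) >= size.
 */
static long demo_strscpy_pad(char *dst, const char *src, size_t size)
{
	size_t len = 0;

	assert(dst && src);
	if (size == 0)
		return -1;

	/* Copy at most size - 1 bytes so the terminator always fits. */
	while (len < size - 1 && src[len] != '\0')
		len++;
	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator plus padding */

	return src[len] == '\0' ? (long)len : -1;
}
```

The padding matters for trace records in particular: the 32-byte name field is copied out as a whole, so zeroing the tail avoids carrying stale bytes along with a short name.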
		strscpy_pad(__entry->name,
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shut down
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:11:04 +08:00
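
The fix described above comes down to a name lookup that tolerates a missing device. The following userspace sketch shows the idea under stated assumptions: the two structs are minimal stand-ins for the kernel types, demo_bdi_dev_name() and demo_trace_name() are hypothetical names, and the "(unknown)" placeholder string is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins for the kernel structures; only the fields used here. */
struct device {
	const char *name;
};

struct backing_dev_info {
	struct device *dev;	/* NULL once the device has been unregistered */
};

/* Return the device name, or a placeholder if the bdi or its dev is gone. */
static const char *demo_bdi_dev_name(const struct backing_dev_info *bdi)
{
	if (!bdi || !bdi->dev)
		return "(unknown)";
	return bdi->dev->name;
}

/* Fill a fixed-size trace name field without dereferencing a NULL dev. */
static void demo_trace_name(char *dst, size_t size,
			    const struct backing_dev_info *bdi)
{
	assert(dst && size > 0);
	snprintf(dst, size, "%s", demo_bdi_dev_name(bdi));
}
```

With the check in one helper, every tracepoint that copies the name gets the same protection, which is how the call sites below use bdi_dev_name().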
			    bdi_dev_name(mapping ? inode_to_bdi(mapping->host) :
					 NULL), 32);
2013-01-12 05:06:37 +08:00
		__entry->ino = mapping ? mapping->host->i_ino : 0;
		__entry->index = page->index;
	),

	TP_printk("bdi %s: ino=%lu index=%lu",
		__entry->name,
2019-11-15 01:17:41 +08:00
		(unsigned long)__entry->ino,
2013-01-12 05:06:37 +08:00
		__entry->index
	)
);

2019-05-14 08:23:11 +08:00
DEFINE_EVENT(writeback_page_template, writeback_dirty_page,

	TP_PROTO(struct page *page, struct address_space *mapping),

	TP_ARGS(page, mapping)
);

DEFINE_EVENT(writeback_page_template, wait_on_page_writeback,

	TP_PROTO(struct page *page, struct address_space *mapping),

	TP_ARGS(page, mapping)
);

2013-01-12 05:06:37 +08:00
DECLARE_EVENT_CLASS(writeback_dirty_inode_template,

	TP_PROTO(struct inode *inode, int flags),

	TP_ARGS(inode, flags),

	TP_STRUCT__entry (
		__array(char, name, 32)
2019-11-05 07:54:29 +08:00
		__field(ino_t, ino)
2015-02-02 13:37:00 +08:00
		__field(unsigned long, state)
2013-01-12 05:06:37 +08:00
		__field(unsigned long, flags)
	),

	TP_fast_assign(
2015-01-14 17:42:36 +08:00
		struct backing_dev_info *bdi = inode_to_bdi(inode);
2013-01-12 05:06:37 +08:00

		/* may be called for files on pseudo FSes w/ unregistered bdi */
2020-01-31 14:11:04 +08:00
		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
2013-01-12 05:06:37 +08:00
		__entry->ino = inode->i_ino;
2015-02-02 13:37:00 +08:00
		__entry->state = inode->i_state;
2013-01-12 05:06:37 +08:00
		__entry->flags = flags;
	),

2015-02-02 13:37:00 +08:00
	TP_printk("bdi %s: ino=%lu state=%s flags=%s",
2013-01-12 05:06:37 +08:00
		__entry->name,
2019-11-15 01:17:41 +08:00
		(unsigned long)__entry->ino,
2015-02-02 13:37:00 +08:00
		show_inode_state(__entry->state),
2013-01-12 05:06:37 +08:00
		show_inode_state(__entry->flags)
	)
);

2015-02-02 13:37:00 +08:00
DEFINE_EVENT(writeback_dirty_inode_template, writeback_mark_inode_dirty,

	TP_PROTO(struct inode *inode, int flags),

	TP_ARGS(inode, flags)
);

2013-01-12 05:06:37 +08:00
DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,

	TP_PROTO(struct inode *inode, int flags),

	TP_ARGS(inode, flags)
);

DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,

	TP_PROTO(struct inode *inode, int flags),

	TP_ARGS(inode, flags)
);

2015-08-19 05:54:56 +08:00
#ifdef CREATE_TRACE_POINTS
#ifdef CONFIG_CGROUP_WRITEBACK

2019-11-05 07:54:29 +08:00
static inline ino_t __trace_wb_assign_cgroup(struct bdi_writeback *wb)
2015-08-19 05:54:56 +08:00
{
2019-11-05 07:54:30 +08:00
	return cgroup_ino(wb->memcg_css->cgroup);
2015-08-19 05:54:56 +08:00
}

2019-11-05 07:54:29 +08:00
static inline ino_t __trace_wbc_assign_cgroup(struct writeback_control *wbc)
2015-08-19 05:54:56 +08:00
{
	if (wbc->wb)
tracing, writeback: Replace cgroup path to cgroup ino
commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints print out cgroup
path when CGROUP_WRITEBACK is enabled, but it may trigger the below bug on -rt
kernel since kernfs_path and kernfs_path_len are called by tracepoints, which
acquire spin lock that is sleepable on -rt kernel.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[<ffffffc00008d708>] dump_backtrace+0x0/0x200
[<ffffffc00008d92c>] show_stack+0x24/0x30
[<ffffffc0007b0f40>] dump_stack+0x88/0xa8
[<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
[<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
[<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
[<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[<ffffffc000374f90>] wb_writeback+0x620/0x830
[<ffffffc000376224>] wb_workfn+0x61c/0x950
[<ffffffc000110adc>] process_one_work+0x3ac/0xb30
[<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
[<ffffffc00011a9e8>] kthread+0x190/0x1b0
[<ffffffc000086ca0>] ret_from_fork+0x10/0x30
With unlocked kernfs_* functions, synchronize_sched() has to be called in
kernfs_rename which could be called in syscall path, but it is problematic.
So, print out cgroup ino instead of path name, which could be converted to
path name by userland.
Without CGROUP_WRITEBACK enabled, it would just print the root dir. But
the root dir ino varies between filesystems, so print -1U to indicate
an invalid cgroup ino.
Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-03 17:08:57 +08:00
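
The commit message above leaves the conversion from cgroup ino back to a path name to userland. One way to do that is sketched below, assuming the reported number equals the st_ino of the cgroup's directory in a mounted cgroup hierarchy (for example under /sys/fs/cgroup). find_dir_by_ino() is a hypothetical helper, and the depth limit is an arbitrary safety bound chosen for this example.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Depth-first search at and below root for a directory whose inode number
 * is ino. On success, copy its path into out and return 0; otherwise
 * return -1. Inode numbers are only unique within one filesystem, so root
 * should be the mount point of the hierarchy being searched.
 */
static int find_dir_by_ino(const char *root, ino_t ino, char *out,
			   size_t size, int max_depth)
{
	struct stat st;
	struct dirent *de;
	DIR *dir;
	int ret = -1;

	assert(root && out && size > 0);
	if (stat(root, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;
	if (st.st_ino == ino) {
		int n = snprintf(out, size, "%s", root);

		return (n < 0 || (size_t)n >= size) ? -1 : 0;
	}
	if (max_depth <= 0)
		return -1;

	dir = opendir(root);
	if (!dir)
		return -1;

	while (ret != 0 && (de = readdir(dir)) != NULL) {
		char child[4096];
		int n;

		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		n = snprintf(child, sizeof(child), "%s/%s", root, de->d_name);
		if (n < 0 || (size_t)n >= sizeof(child))
			continue;	/* path too long for this sketch */
		ret = find_dir_by_ino(child, ino, out, size, max_depth - 1);
	}
	closedir(dir);
	return ret;
}
```

A caller would pass the cgroup_ino value printed by a writeback tracepoint together with the mount point of the cgroup filesystem.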
		return __trace_wb_assign_cgroup(wbc->wb);
2015-08-19 05:54:56 +08:00
	else
2019-11-05 07:54:29 +08:00
		return 1;
2015-08-19 05:54:56 +08:00
}
#else /* CONFIG_CGROUP_WRITEBACK */

2019-11-05 07:54:29 +08:00
static inline ino_t __trace_wb_assign_cgroup(struct bdi_writeback *wb)
2015-08-19 05:54:56 +08:00
{
2019-11-05 07:54:29 +08:00
	return 1;
2015-08-19 05:54:56 +08:00
}

2019-11-05 07:54:29 +08:00
static inline ino_t __trace_wbc_assign_cgroup(struct writeback_control *wbc)
2015-08-19 05:54:56 +08:00
{
2019-11-05 07:54:29 +08:00
	return 1;
2015-08-19 05:54:56 +08:00
}

#endif /* CONFIG_CGROUP_WRITEBACK */
#endif /* CREATE_TRACE_POINTS */

2019-08-30 06:47:19 +08:00
#ifdef CONFIG_CGROUP_WRITEBACK
TRACE_EVENT(inode_foreign_history,

	TP_PROTO(struct inode *inode, struct writeback_control *wbc,
		 unsigned int history),

	TP_ARGS(inode, wbc, history),

	TP_STRUCT__entry(
		__array(char, name, 32)
2019-11-05 07:54:29 +08:00
		__field(ino_t, ino)
		__field(ino_t, cgroup_ino)
2019-08-30 06:47:19 +08:00
		__field(unsigned int, history)
	),

	TP_fast_assign(
2020-01-31 14:11:04 +08:00
		strncpy(__entry->name, bdi_dev_name(inode_to_bdi(inode)), 32);
2019-08-30 06:47:19 +08:00
		__entry->ino = inode->i_ino;
		__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
		__entry->history = history;
	),

2019-11-05 07:54:29 +08:00
	TP_printk("bdi %s: ino=%lu cgroup_ino=%lu history=0x%x",
2019-08-30 06:47:19 +08:00
		__entry->name,
2019-11-15 01:17:41 +08:00
		(unsigned long)__entry->ino,
		(unsigned long)__entry->cgroup_ino,
2019-08-30 06:47:19 +08:00
		__entry->history
	)
);

TRACE_EVENT(inode_switch_wbs,

	TP_PROTO(struct inode *inode, struct bdi_writeback *old_wb,
		 struct bdi_writeback *new_wb),

	TP_ARGS(inode, old_wb, new_wb),

	TP_STRUCT__entry(
		__array(char, name, 32)
2019-11-05 07:54:29 +08:00
		__field(ino_t, ino)
		__field(ino_t, old_cgroup_ino)
		__field(ino_t, new_cgroup_ino)
2019-08-30 06:47:19 +08:00
	),

	TP_fast_assign(
2020-01-31 14:11:04 +08:00
		strncpy(__entry->name, bdi_dev_name(old_wb->bdi), 32);
2019-08-30 06:47:19 +08:00
		__entry->ino = inode->i_ino;
		__entry->old_cgroup_ino = __trace_wb_assign_cgroup(old_wb);
		__entry->new_cgroup_ino = __trace_wb_assign_cgroup(new_wb);
	),

2019-11-05 07:54:29 +08:00
	TP_printk("bdi %s: ino=%lu old_cgroup_ino=%lu new_cgroup_ino=%lu",
2019-08-30 06:47:19 +08:00
		__entry->name,
2019-11-15 01:17:41 +08:00
		(unsigned long)__entry->ino,
		(unsigned long)__entry->old_cgroup_ino,
		(unsigned long)__entry->new_cgroup_ino
2019-08-30 06:47:19 +08:00
	)
);

TRACE_EVENT(track_foreign_dirty,

	TP_PROTO(struct page *page, struct bdi_writeback *wb),

	TP_ARGS(page, wb),

	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(u64, bdi_id)
2019-11-05 07:54:29 +08:00
		__field(ino_t, ino)
2019-08-30 06:47:19 +08:00
		__field(unsigned int, memcg_id)
2019-11-05 07:54:29 +08:00
		__field(ino_t, cgroup_ino)
		__field(ino_t, page_cgroup_ino)
2019-08-30 06:47:19 +08:00
	),

	TP_fast_assign(
2019-08-31 07:39:54 +08:00
		struct address_space *mapping = page_mapping(page);
		struct inode *inode = mapping ? mapping->host : NULL;

memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:11:04 +08:00
		strncpy(__entry->name, bdi_dev_name(wb->bdi), 32);
		__entry->bdi_id = wb->bdi->id;
		__entry->ino = inode ? inode->i_ino : 0;
		__entry->memcg_id = wb->memcg_css->id;
		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
		__entry->page_cgroup_ino = cgroup_ino(page->mem_cgroup->css.cgroup);
	),

	TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
		__entry->name,
		__entry->bdi_id,
		(unsigned long)__entry->ino,
		__entry->memcg_id,
		(unsigned long)__entry->cgroup_ino,
		(unsigned long)__entry->page_cgroup_ino
	)
);
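
The NULL-tolerant name lookup that the "memcg: fix a crash in wb_workfn" message describes can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the kernel implementation: `struct device_model`, `struct bdi_model`, `bdi_dev_name_model()` and the "(unknown)" fallback string are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct device and struct backing_dev_info. */
struct device_model {
	const char *name;
};

struct bdi_model {
	struct device_model *dev;	/* NULL once the device has been unregistered */
};

/* Assumed fallback label, returned when no device name is available. */
static const char bdi_unknown_name_model[] = "(unknown)";

/*
 * Return a printable device name without ever dereferencing a NULL
 * bdi or a NULL bdi->dev, so a tracepoint can call it at any time.
 */
static const char *bdi_dev_name_model(const struct bdi_model *bdi)
{
	if (!bdi || !bdi->dev)
		return bdi_unknown_name_model;
	return bdi->dev->name;
}
```

With a helper of this shape, a tracepoint that copies the name into its fixed 32-byte `name` array keeps working after a hot removal, because the copy source is always a valid string.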

TRACE_EVENT(flush_foreign,

	TP_PROTO(struct bdi_writeback *wb, unsigned int frn_bdi_id,
		 unsigned int frn_memcg_id),

	TP_ARGS(wb, frn_bdi_id, frn_memcg_id),

	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(ino_t, cgroup_ino)
		__field(unsigned int, frn_bdi_id)
		__field(unsigned int, frn_memcg_id)
	),

	TP_fast_assign(
		strncpy(__entry->name, bdi_dev_name(wb->bdi), 32);
		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
		__entry->frn_bdi_id = frn_bdi_id;
		__entry->frn_memcg_id = frn_memcg_id;
	),

	TP_printk("bdi %s: cgroup_ino=%lu frn_bdi_id=%u frn_memcg_id=%u",
		__entry->name,
		(unsigned long)__entry->cgroup_ino,
		__entry->frn_bdi_id,
		__entry->frn_memcg_id
	)
);
#endif

DECLARE_EVENT_CLASS(writeback_write_inode_template,

	TP_PROTO(struct inode *inode, struct writeback_control *wbc),

	TP_ARGS(inode, wbc),

	TP_STRUCT__entry (
		__array(char, name, 32)
		__field(ino_t, ino)
		__field(int, sync_mode)
		__field(ino_t, cgroup_ino)
	),

	TP_fast_assign(
include/trace/events/writeback.h: fix -Wstringop-truncation warnings
There are many of those warnings.
In file included from ./arch/powerpc/include/asm/paca.h:15,
from ./arch/powerpc/include/asm/current.h:13,
from ./include/linux/thread_info.h:21,
from ./include/asm-generic/preempt.h:5,
from ./arch/powerpc/include/generated/asm/preempt.h:1,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:51,
from fs/fs-writeback.c:19:
In function 'strncpy',
inlined from 'perf_trace_writeback_page_template' at
./include/trace/events/writeback.h:56:1:
./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
return __builtin_strncpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix it by using the new strscpy_pad() which was introduced in "lib/string:
Add strscpy_pad() function" and will always be NUL-terminated instead of
strncpy(). Also, change strlcpy() to use strscpy_pad() in this file for
consistency.
Link: http://lkml.kernel.org/r/1564075099-27750-1-git-send-email-cai@lca.pw
Fixes: 455b2864686d ("writeback: Initial tracing support")
Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nitin Gote <nitin.r.gote@intel.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 07:46:16 +08:00
		strscpy_pad(__entry->name,
			    bdi_dev_name(inode_to_bdi(inode)), 32);
		__entry->ino = inode->i_ino;
		__entry->sync_mode = wbc->sync_mode;
tracing, writeback: Replace cgroup path to cgroup ino
commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made writeback tracepoints print out cgroup
path when CGROUP_WRITEBACK is enabled, but it may trigger the below bug on -rt
kernel since kernfs_path and kernfs_path_len are called by tracepoints, which
acquire spin lock that is sleepable on -rt kernel.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[<ffffffc00008d708>] dump_backtrace+0x0/0x200
[<ffffffc00008d92c>] show_stack+0x24/0x30
[<ffffffc0007b0f40>] dump_stack+0x88/0xa8
[<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
[<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
[<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
[<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[<ffffffc000374f90>] wb_writeback+0x620/0x830
[<ffffffc000376224>] wb_workfn+0x61c/0x950
[<ffffffc000110adc>] process_one_work+0x3ac/0xb30
[<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
[<ffffffc00011a9e8>] kthread+0x190/0x1b0
[<ffffffc000086ca0>] ret_from_fork+0x10/0x30
With unlocked kernfs_* functions, synchronize_sched() has to be called in
kernfs_rename which could be called in syscall path, but it is problematic.
So, print out cgroup ino instead of path name, which could be converted to
path name by userland.
Without CGROUP_WRITEBACK enabled, it just prints out root dir. But the
root dir ino varies between filesystems, so print -1U to indicate
an invalid cgroup ino.
Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-03 17:08:57 +08:00
		__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
	),

	TP_printk("bdi %s: ino=%lu sync_mode=%d cgroup_ino=%lu",
		__entry->name,
		(unsigned long)__entry->ino,
		__entry->sync_mode,
		(unsigned long)__entry->cgroup_ino
	)
);
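
The copy behaviour that the -Wstringop-truncation message attributes to strscpy_pad() (always NUL-terminated, remainder zero-filled) can be approximated in user space as below. This is a hedged sketch, not the kernel's lib/string implementation: `strscpy_pad_model()` is a hypothetical name, and the -E2BIG return on truncation is stated here as an assumption about the kernel helper's convention.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * User-space approximation of the described strscpy_pad() semantics:
 * copy at most size - 1 bytes, always NUL-terminate, and zero-fill the
 * rest of the destination. Returns the number of bytes copied, or
 * -E2BIG if the source had to be truncated (or size is 0).
 */
static ssize_t strscpy_pad_model(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -E2BIG;

	len = strnlen(src, size);	/* never scans past size bytes */
	if (len == size) {
		/* Source does not fit: keep size - 1 bytes plus the NUL. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator and padding */
	return (ssize_t)len;
}
```

By contrast, strncpy() with a bound equal to the destination size leaves the buffer unterminated whenever the source is at least that long, which is the case the compiler warning quoted above points at.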

DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode_start,

	TP_PROTO(struct inode *inode, struct writeback_control *wbc),

	TP_ARGS(inode, wbc)
);

DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode,

	TP_PROTO(struct inode *inode, struct writeback_control *wbc),

	TP_ARGS(inode, wbc)
);

DECLARE_EVENT_CLASS(writeback_work_class,
	TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
	TP_ARGS(wb, work),
	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(long, nr_pages)
		__field(dev_t, sb_dev)
		__field(int, sync_mode)
		__field(int, for_kupdate)
		__field(int, range_cyclic)
		__field(int, for_background)
		__field(int, reason)
		__field(ino_t, cgroup_ino)
	),
	TP_fast_assign(
		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
		__entry->nr_pages = work->nr_pages;
		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
		__entry->sync_mode = work->sync_mode;
		__entry->for_kupdate = work->for_kupdate;
		__entry->range_cyclic = work->range_cyclic;
		__entry->for_background = work->for_background;
		__entry->reason = work->reason;
		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
	),
	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
		  "kupdate=%d range_cyclic=%d background=%d reason=%s cgroup_ino=%lu",
		  __entry->name,
		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
		  __entry->nr_pages,
		  __entry->sync_mode,
		  __entry->for_kupdate,
		  __entry->range_cyclic,
		  __entry->for_background,
		  __print_symbolic(__entry->reason, WB_WORK_REASON),
		  (unsigned long)__entry->cgroup_ino
	)
);
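
A userland consumer could pull the fields of a writeback_work_class record back out of the text produced by the TP_printk format above. The following is a minimal sketch under assumptions: `struct wb_work_rec` and `parse_wb_work()` are hypothetical names, the input is assumed to be only the event payload (without the timestamp and task prefix that the trace buffer adds), and the device name is assumed to contain no whitespace.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed writeback_work_class payload. */
struct wb_work_rec {
	char name[33];		/* up to 31 name bytes, the ':' and a NUL */
	int sb_major, sb_minor;
	long nr_pages;
	int sync_mode, for_kupdate, range_cyclic, for_background;
	char reason[32];	/* symbolic reason as printed */
	unsigned long cgroup_ino;
};

/* Parse one payload line; returns 0 on success, -1 on a malformed line. */
static int parse_wb_work(const char *line, struct wb_work_rec *r)
{
	size_t len;
	int n = sscanf(line,
		"bdi %32s sb_dev %d:%d nr_pages=%ld sync_mode=%d "
		"kupdate=%d range_cyclic=%d background=%d reason=%31s cgroup_ino=%lu",
		r->name, &r->sb_major, &r->sb_minor, &r->nr_pages,
		&r->sync_mode, &r->for_kupdate, &r->range_cyclic,
		&r->for_background, r->reason, &r->cgroup_ino);

	if (n != 10)
		return -1;

	/* %s also swallows the ':' that follows the bdi name; strip it. */
	len = strlen(r->name);
	if (len == 0 || r->name[len - 1] != ':')
		return -1;
	r->name[len - 1] = '\0';
	return 0;
}
```

The parsed `cgroup_ino` is the value that, per the "Replace cgroup path to cgroup ino" message, userland would have to map back to a cgroup path itself.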
#define DEFINE_WRITEBACK_WORK_EVENT(name) \
DEFINE_EVENT(writeback_work_class, name, \
	TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), \
	TP_ARGS(wb, work))
DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);

TRACE_EVENT(writeback_pages_written,
	TP_PROTO(long pages_written),
	TP_ARGS(pages_written),
	TP_STRUCT__entry(
		__field(long, pages)
	),
	TP_fast_assign(
		__entry->pages = pages_written;
	),
	TP_printk("%ld", __entry->pages)
);

DECLARE_EVENT_CLASS(writeback_class,
	TP_PROTO(struct bdi_writeback *wb),
	TP_ARGS(wb),
	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(ino_t, cgroup_ino)
	),
	TP_fast_assign(
		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
	),
	TP_printk("bdi %s: cgroup_ino=%lu",
		  __entry->name,
		  (unsigned long)__entry->cgroup_ino
	)
);
#define DEFINE_WRITEBACK_EVENT(name) \
DEFINE_EVENT(writeback_class, name, \
	TP_PROTO(struct bdi_writeback *wb), \
	TP_ARGS(wb))

DEFINE_WRITEBACK_EVENT(writeback_wake_background);

TRACE_EVENT(writeback_bdi_register,
	TP_PROTO(struct backing_dev_info *bdi),
	TP_ARGS(bdi),
	TP_STRUCT__entry(
		__array(char, name, 32)
	),
	TP_fast_assign(
		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
	),
	TP_printk("bdi %s",
		__entry->name
	)
);

DECLARE_EVENT_CLASS(wbc_class,
	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
	TP_ARGS(wbc, bdi),
	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(long, nr_to_write)
		__field(long, pages_skipped)
		__field(int, sync_mode)
		__field(int, for_kupdate)
		__field(int, for_background)
		__field(int, for_reclaim)
		__field(int, range_cyclic)
		__field(long, range_start)
		__field(long, range_end)
		__field(ino_t, cgroup_ino)
	),

	TP_fast_assign(
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shut down
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shut down. So when
one of these wb's is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:11:04 +08:00
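The NULL-tolerant lookup described in the commit message above can be sketched in user space as follows. This is only an illustration under stated assumptions: the struct layouts, the `_sketch` names and the "(unknown)" placeholder string are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed for the illustration are modelled. */
struct device_sketch {
	const char *name;
};

struct bdi_sketch {
	struct device_sketch *dev;	/* NULL once the bdi has been unregistered */
};

/* Placeholder returned when no device is attached (assumed value). */
static const char *const bdi_unknown_name_sketch = "(unknown)";

/* Never dereferences a missing bdi or a missing bdi->dev. */
static const char *bdi_dev_name_sketch(const struct bdi_sketch *bdi)
{
	if (!bdi || !bdi->dev)
		return bdi_unknown_name_sketch;
	return bdi->dev->name;
}
```

A caller that previously read the device name through bdi->dev directly would go through the helper instead, so a wb that outlives its device gets the placeholder string rather than a NULL dereference.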
|
|
|
strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
|
2010-07-07 11:24:07 +08:00
|
|
|
__entry->nr_to_write = wbc->nr_to_write;
|
|
|
|
__entry->pages_skipped = wbc->pages_skipped;
|
|
|
|
__entry->sync_mode = wbc->sync_mode;
|
|
|
|
__entry->for_kupdate = wbc->for_kupdate;
|
|
|
|
__entry->for_background = wbc->for_background;
|
|
|
|
__entry->for_reclaim = wbc->for_reclaim;
|
|
|
|
__entry->range_cyclic = wbc->range_cyclic;
|
|
|
|
__entry->range_start = (long)wbc->range_start;
|
|
|
|
__entry->range_end = (long)wbc->range_end;
|
tracing, writeback: Replace cgroup path to cgroup ino
commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made the writeback tracepoints print out the
cgroup path when CGROUP_WRITEBACK is enabled, but this may trigger the bug
below on an -rt kernel, since the tracepoints call kernfs_path and
kernfs_path_len, which acquire a spin lock that is sleepable on an -rt kernel.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[<ffffffc00008d708>] dump_backtrace+0x0/0x200
[<ffffffc00008d92c>] show_stack+0x24/0x30
[<ffffffc0007b0f40>] dump_stack+0x88/0xa8
[<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
[<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
[<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
[<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[<ffffffc000374f90>] wb_writeback+0x620/0x830
[<ffffffc000376224>] wb_workfn+0x61c/0x950
[<ffffffc000110adc>] process_one_work+0x3ac/0xb30
[<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
[<ffffffc00011a9e8>] kthread+0x190/0x1b0
[<ffffffc000086ca0>] ret_from_fork+0x10/0x30
With unlocked kernfs_* functions, synchronize_sched() would have to be called
in kernfs_rename, which can be called in the syscall path, and that is
problematic.
So, print out the cgroup ino instead of the path name; userland can convert
it to a path name.
Without CGROUP_WRITEBACK enabled, it would just print out the root dir. But
the root dir ino varies between filesystems, so print out -1U to indicate
an invalid cgroup ino.
Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-03 17:08:57 +08:00
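The convention described above (report an inode number, use -1U when there is no valid cgroup, and cast to unsigned long for the "%lu" format) can be sketched in user space as below. The helper names and the NULL-pointer convention for "no cgroup" are assumptions made for the illustration; they are not the kernel's __trace_wb_assign_cgroup().

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical sentinel following the "-1U means invalid" convention. */
#define INVALID_CGROUP_INO_SKETCH ((ino_t)-1U)

/* Returns the cgroup inode number, or the sentinel when none is available. */
static ino_t assign_cgroup_ino_sketch(const ino_t *cgroup_ino)
{
	return cgroup_ino ? *cgroup_ino : INVALID_CGROUP_INO_SKETCH;
}

/* ino_t has no portable printf specifier, so cast to match "%lu". */
static int format_cgroup_ino_sketch(char *buf, size_t len, ino_t ino)
{
	return snprintf(buf, len, "cgroup_ino=%lu", (unsigned long)ino);
}
```

Reporting a plain number keeps the tracepoint free of path lookups, which is the point of the change; turning the number back into a path is left to userland.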
|
|
|
__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
|
2010-07-07 11:24:07 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
|
2011-05-05 09:54:37 +08:00
|
|
|
"bgrd=%d reclm=%d cyclic=%d "
|
2019-11-05 07:54:29 +08:00
|
|
|
"start=0x%lx end=0x%lx cgroup_ino=%lu",
|
2010-07-07 11:24:07 +08:00
|
|
|
__entry->name,
|
|
|
|
__entry->nr_to_write,
|
|
|
|
__entry->pages_skipped,
|
|
|
|
__entry->sync_mode,
|
|
|
|
__entry->for_kupdate,
|
|
|
|
__entry->for_background,
|
|
|
|
__entry->for_reclaim,
|
|
|
|
__entry->range_cyclic,
|
|
|
|
__entry->range_start,
|
2015-08-19 05:54:56 +08:00
|
|
|
__entry->range_end,
|
2019-11-15 01:17:41 +08:00
|
|
|
(unsigned long)__entry->cgroup_ino
|
2015-08-19 05:54:56 +08:00
|
|
|
)
|
2010-07-07 11:24:07 +08:00
|
|
|
)
|
|
|
|
|
|
|
|
#define DEFINE_WBC_EVENT(name) \
|
|
|
|
DEFINE_EVENT(wbc_class, name, \
|
|
|
|
TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
|
|
|
|
TP_ARGS(wbc, bdi))
|
2010-07-07 11:24:08 +08:00
|
|
|
DEFINE_WBC_EVENT(wbc_writepage);
|
2010-07-07 11:24:07 +08:00
|
|
|
|
2011-04-24 02:27:27 +08:00
|
|
|
TRACE_EVENT(writeback_queue_io,
|
|
|
|
TP_PROTO(struct bdi_writeback *wb,
|
2011-10-08 11:51:56 +08:00
|
|
|
struct wb_writeback_work *work,
|
2011-04-24 02:27:27 +08:00
|
|
|
int moved),
|
2011-10-08 11:51:56 +08:00
|
|
|
TP_ARGS(wb, work, moved),
|
2011-04-24 02:27:27 +08:00
|
|
|
TP_STRUCT__entry(
|
|
|
|
__array(char, name, 32)
|
|
|
|
__field(unsigned long, older)
|
|
|
|
__field(long, age)
|
|
|
|
__field(int, moved)
|
2011-10-08 11:54:10 +08:00
|
|
|
__field(int, reason)
|
2019-11-05 07:54:29 +08:00
|
|
|
__field(ino_t, cgroup_ino)
|
2011-04-24 02:27:27 +08:00
|
|
|
),
|
|
|
|
TP_fast_assign(
|
2014-02-21 18:19:04 +08:00
|
|
|
unsigned long *older_than_this = work->older_than_this;
|
memcg: fix a crash in wb_workfn when a device disappears
2020-01-31 14:11:04 +08:00
|
|
|
strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
|
2014-02-21 18:19:04 +08:00
|
|
|
__entry->older = older_than_this ? *older_than_this : 0;
|
2011-04-24 02:27:27 +08:00
|
|
|
__entry->age = older_than_this ?
|
2014-02-21 18:19:04 +08:00
|
|
|
(jiffies - *older_than_this) * 1000 / HZ : -1;
|
2011-04-24 02:27:27 +08:00
|
|
|
__entry->moved = moved;
|
2011-10-08 11:54:10 +08:00
|
|
|
__entry->reason = work->reason;
|
tracing, writeback: Replace cgroup path to cgroup ino
2016-03-03 17:08:57 +08:00
|
|
|
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
|
2011-04-24 02:27:27 +08:00
|
|
|
),
|
2019-11-05 07:54:29 +08:00
|
|
|
TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s cgroup_ino=%lu",
|
2011-04-24 02:27:27 +08:00
|
|
|
__entry->name,
|
|
|
|
__entry->older, /* older_than_this in jiffies */
|
|
|
|
__entry->age, /* older_than_this in relative milliseconds */
|
2011-10-08 11:54:10 +08:00
|
|
|
__entry->moved,
|
2015-08-19 05:54:56 +08:00
|
|
|
__print_symbolic(__entry->reason, WB_WORK_REASON),
|
2019-11-15 01:17:41 +08:00
|
|
|
(unsigned long)__entry->cgroup_ino
|
2011-12-09 06:53:54 +08:00
|
|
|
)
|
2011-04-24 02:27:27 +08:00
|
|
|
);
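The "age" field of writeback_queue_io above is the distance between the current tick count and the optional older_than_this cutoff, converted to milliseconds, with -1 when no cutoff is set. A minimal user-space sketch of that arithmetic follows; HZ_SKETCH is an assumed tick rate and the function name is hypothetical.

```c
#include <stddef.h>

#define HZ_SKETCH 250UL	/* assumed tick rate, ticks per second */

/* Age of the cutoff in milliseconds, or -1 when there is no cutoff. */
static long queue_io_age_ms_sketch(unsigned long now,
				   const unsigned long *older_than_this)
{
	if (!older_than_this)
		return -1;
	return (long)((now - *older_than_this) * 1000 / HZ_SKETCH);
}
```

With a tick rate of 250, a cutoff that lies 250 ticks in the past gives an age of 1000 ms.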
|
|
|
|
|
2010-12-07 12:34:29 +08:00
|
|
|
TRACE_EVENT(global_dirty_state,
|
|
|
|
|
|
|
|
TP_PROTO(unsigned long background_thresh,
|
|
|
|
unsigned long dirty_thresh
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_ARGS(background_thresh,
|
|
|
|
dirty_thresh
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__field(unsigned long, nr_dirty)
|
|
|
|
__field(unsigned long, nr_writeback)
|
|
|
|
__field(unsigned long, nr_unstable)
|
|
|
|
__field(unsigned long, background_thresh)
|
|
|
|
__field(unsigned long, dirty_thresh)
|
|
|
|
__field(unsigned long, dirty_limit)
|
|
|
|
__field(unsigned long, nr_dirtied)
|
|
|
|
__field(unsigned long, nr_written)
|
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
2016-07-29 06:46:20 +08:00
|
|
|
__entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY);
|
|
|
|
__entry->nr_writeback = global_node_page_state(NR_WRITEBACK);
|
|
|
|
__entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS);
|
2016-07-29 06:46:23 +08:00
|
|
|
__entry->nr_dirtied = global_node_page_state(NR_DIRTIED);
|
|
|
|
__entry->nr_written = global_node_page_state(NR_WRITTEN);
|
2010-12-07 12:34:29 +08:00
|
|
|
__entry->background_thresh = background_thresh;
|
|
|
|
__entry->dirty_thresh = dirty_thresh;
|
2015-05-23 06:23:22 +08:00
|
|
|
__entry->dirty_limit = global_wb_domain.dirty_limit;
|
2010-12-07 12:34:29 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("dirty=%lu writeback=%lu unstable=%lu "
|
|
|
|
"bg_thresh=%lu thresh=%lu limit=%lu "
|
|
|
|
"dirtied=%lu written=%lu",
|
|
|
|
__entry->nr_dirty,
|
|
|
|
__entry->nr_writeback,
|
|
|
|
__entry->nr_unstable,
|
|
|
|
__entry->background_thresh,
|
|
|
|
__entry->dirty_thresh,
|
|
|
|
__entry->dirty_limit,
|
|
|
|
__entry->nr_dirtied,
|
|
|
|
__entry->nr_written
|
|
|
|
)
|
|
|
|
);
|
|
|
|
|
2011-03-03 07:22:49 +08:00
|
|
|
#define KBps(x) ((x) << (PAGE_SHIFT - 10))
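The KBps() macro above converts a rate expressed in pages per second into kilobytes per second: a page is 2^PAGE_SHIFT bytes, so shifting left by (PAGE_SHIFT - 10) multiplies by the number of KiB per page. A small worked sketch, assuming 4 KiB pages (PAGE_SHIFT of 12); the `_SKETCH` names are hypothetical:

```c
#define PAGE_SHIFT_SKETCH 12	/* assumption: 4 KiB pages */
#define KBPS_SKETCH(x) ((x) << (PAGE_SHIFT_SKETCH - 10))

/* Pages per second in, KiB per second out. */
static unsigned long pages_per_sec_to_kbps_sketch(unsigned long pages_per_sec)
{
	return KBPS_SKETCH(pages_per_sec);
}
```

Under that assumption, 256 pages per second corresponds to 1024 KiB per second.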
|
|
|
|
|
|
|
|
TRACE_EVENT(bdi_dirty_ratelimit,
|
|
|
|
|
2015-08-19 05:54:56 +08:00
|
|
|
TP_PROTO(struct bdi_writeback *wb,
|
2011-03-03 07:22:49 +08:00
|
|
|
unsigned long dirty_rate,
|
|
|
|
unsigned long task_ratelimit),
|
|
|
|
|
2015-08-19 05:54:56 +08:00
|
|
|
TP_ARGS(wb, dirty_rate, task_ratelimit),
|
2011-03-03 07:22:49 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__array(char, bdi, 32)
|
|
|
|
__field(unsigned long, write_bw)
|
|
|
|
__field(unsigned long, avg_write_bw)
|
|
|
|
__field(unsigned long, dirty_rate)
|
|
|
|
__field(unsigned long, dirty_ratelimit)
|
|
|
|
__field(unsigned long, task_ratelimit)
|
|
|
|
__field(unsigned long, balanced_dirty_ratelimit)
|
2019-11-05 07:54:29 +08:00
|
|
|
__field(ino_t, cgroup_ino)
|
2011-03-03 07:22:49 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
memcg: fix a crash in wb_workfn when a device disappears
2020-01-31 14:11:04 +08:00
|
|
|
strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
|
2015-08-19 05:54:56 +08:00
|
|
|
__entry->write_bw = KBps(wb->write_bandwidth);
|
|
|
|
__entry->avg_write_bw = KBps(wb->avg_write_bandwidth);
|
2011-03-03 07:22:49 +08:00
|
|
|
__entry->dirty_rate = KBps(dirty_rate);
|
2015-08-19 05:54:56 +08:00
|
|
|
__entry->dirty_ratelimit = KBps(wb->dirty_ratelimit);
|
2011-03-03 07:22:49 +08:00
|
|
|
__entry->task_ratelimit = KBps(task_ratelimit);
|
|
|
|
__entry->balanced_dirty_ratelimit =
|
2015-08-19 05:54:56 +08:00
|
|
|
KBps(wb->balanced_dirty_ratelimit);
|
tracing, writeback: Replace cgroup path to cgroup ino
2016-03-03 17:08:57 +08:00
|
|
|
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
|
2011-03-03 07:22:49 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_printk("bdi %s: "
|
|
|
|
"write_bw=%lu awrite_bw=%lu dirty_rate=%lu "
|
|
|
|
"dirty_ratelimit=%lu task_ratelimit=%lu "
|
2019-11-05 07:54:29 +08:00
|
|
|
"balanced_dirty_ratelimit=%lu cgroup_ino=%lu",
|
2011-03-03 07:22:49 +08:00
|
|
|
__entry->bdi,
|
|
|
|
__entry->write_bw, /* write bandwidth */
|
|
|
|
__entry->avg_write_bw, /* avg write bandwidth */
|
|
|
|
__entry->dirty_rate, /* bdi dirty rate */
|
|
|
|
__entry->dirty_ratelimit, /* base ratelimit */
|
|
|
|
__entry->task_ratelimit, /* ratelimit with position control */
|
2015-08-19 05:54:56 +08:00
|
|
|
__entry->balanced_dirty_ratelimit, /* the balanced ratelimit */
|
2019-11-15 01:17:41 +08:00
|
|
|
(unsigned long)__entry->cgroup_ino
|
2011-03-03 07:22:49 +08:00
|
|
|
)
|
|
|
|
);
|
|
|
|
|
2010-08-30 13:33:20 +08:00
|
|
|
TRACE_EVENT(balance_dirty_pages,
|
|
|
|
|
2015-08-19 05:54:56 +08:00
|
|
|
TP_PROTO(struct bdi_writeback *wb,
|
2010-08-30 13:33:20 +08:00
|
|
|
unsigned long thresh,
|
|
|
|
unsigned long bg_thresh,
|
|
|
|
unsigned long dirty,
|
|
|
|
unsigned long bdi_thresh,
|
|
|
|
unsigned long bdi_dirty,
|
|
|
|
unsigned long dirty_ratelimit,
|
|
|
|
unsigned long task_ratelimit,
|
|
|
|
unsigned long dirtied,
|
2011-06-12 09:25:42 +08:00
|
|
|
unsigned long period,
|
2010-08-30 13:33:20 +08:00
|
|
|
long pause,
|
|
|
|
unsigned long start_time),
|
|
|
|
|
2015-08-19 05:54:56 +08:00
|
|
|
TP_ARGS(wb, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
|
2010-08-30 13:33:20 +08:00
|
|
|
dirty_ratelimit, task_ratelimit,
|
2011-06-12 09:25:42 +08:00
|
|
|
dirtied, period, pause, start_time),
|
2010-08-30 13:33:20 +08:00
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__array( char, bdi, 32)
|
|
|
|
__field(unsigned long, limit)
|
|
|
|
__field(unsigned long, setpoint)
|
|
|
|
__field(unsigned long, dirty)
|
|
|
|
__field(unsigned long, bdi_setpoint)
|
|
|
|
__field(unsigned long, bdi_dirty)
|
|
|
|
__field(unsigned long, dirty_ratelimit)
|
|
|
|
__field(unsigned long, task_ratelimit)
|
|
|
|
__field(unsigned int, dirtied)
|
|
|
|
__field(unsigned int, dirtied_pause)
|
|
|
|
__field(unsigned long, paused)
|
|
|
|
__field( long, pause)
|
2011-06-12 09:25:42 +08:00
|
|
|
__field(unsigned long, period)
|
|
|
|
__field( long, think)
|
2019-11-05 07:54:29 +08:00
|
|
|
__field(ino_t, cgroup_ino)
|
2010-08-30 13:33:20 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
|
|
|
unsigned long freerun = (thresh + bg_thresh) / 2;
|
memcg: fix a crash in wb_workfn when a device disappears
2020-01-31 14:11:04 +08:00
|
|
|
strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
|
2010-08-30 13:33:20 +08:00
|
|
|
|
2015-05-23 06:23:22 +08:00
|
|
|
__entry->limit = global_wb_domain.dirty_limit;
|
|
|
|
__entry->setpoint = (global_wb_domain.dirty_limit +
|
|
|
|
freerun) / 2;
|
2010-08-30 13:33:20 +08:00
|
|
|
__entry->dirty = dirty;
|
|
|
|
__entry->bdi_setpoint = __entry->setpoint *
|
|
|
|
bdi_thresh / (thresh + 1);
|
|
|
|
__entry->bdi_dirty = bdi_dirty;
|
|
|
|
__entry->dirty_ratelimit = KBps(dirty_ratelimit);
|
|
|
|
__entry->task_ratelimit = KBps(task_ratelimit);
|
|
|
|
__entry->dirtied = dirtied;
|
|
|
|
__entry->dirtied_pause = current->nr_dirtied_pause;
|
2011-06-12 09:25:42 +08:00
|
|
|
__entry->think = current->dirty_paused_when == 0 ? 0 :
|
|
|
|
(long)(jiffies - current->dirty_paused_when) * 1000/HZ;
|
|
|
|
__entry->period = period * 1000 / HZ;
|
2010-08-30 13:33:20 +08:00
|
|
|
__entry->pause = pause * 1000 / HZ;
|
|
|
|
__entry->paused = (jiffies - start_time) * 1000 / HZ;
|
tracing, writeback: Replace cgroup path to cgroup ino
2016-03-03 17:08:57 +08:00
|
|
|
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
|
2010-08-30 13:33:20 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
|
|
|
|
TP_printk("bdi %s: "
|
|
|
|
"limit=%lu setpoint=%lu dirty=%lu "
|
|
|
|
"bdi_setpoint=%lu bdi_dirty=%lu "
|
|
|
|
"dirty_ratelimit=%lu task_ratelimit=%lu "
|
|
|
|
"dirtied=%u dirtied_pause=%u "
|
2019-11-05 07:54:29 +08:00
|
|
|
"paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%lu",
|
2010-08-30 13:33:20 +08:00
|
|
|
__entry->bdi,
|
|
|
|
__entry->limit,
|
|
|
|
__entry->setpoint,
|
|
|
|
__entry->dirty,
|
|
|
|
__entry->bdi_setpoint,
|
|
|
|
__entry->bdi_dirty,
|
|
|
|
__entry->dirty_ratelimit,
|
|
|
|
__entry->task_ratelimit,
|
|
|
|
__entry->dirtied,
|
|
|
|
__entry->dirtied_pause,
|
|
|
|
__entry->paused, /* ms */
|
2011-06-12 09:25:42 +08:00
|
|
|
__entry->pause, /* ms */
|
|
|
|
__entry->period, /* ms */
|
2015-08-19 05:54:56 +08:00
|
|
|
__entry->think, /* ms */
|
2019-11-15 01:17:41 +08:00
|
|
|
(unsigned long)__entry->cgroup_ino
|
2010-08-30 13:33:20 +08:00
|
|
|
)
|
|
|
|
);
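The derived fields of balance_dirty_pages above (freerun, setpoint and bdi_setpoint) are plain integer arithmetic on the thresholds passed to the tracepoint. The sketch below restates those three formulas in user space; the struct and function names are hypothetical, and dirty_limit is passed as a parameter here rather than read from global_wb_domain.

```c
/* Derived values recorded by the tracepoint (illustrative container). */
struct bdp_setpoints_sketch {
	unsigned long limit;
	unsigned long setpoint;
	unsigned long bdi_setpoint;
};

static struct bdp_setpoints_sketch
bdp_setpoints_sketch(unsigned long dirty_limit, unsigned long thresh,
		     unsigned long bg_thresh, unsigned long bdi_thresh)
{
	/* Midpoint between the background and foreground thresholds. */
	unsigned long freerun = (thresh + bg_thresh) / 2;
	struct bdp_setpoints_sketch s;

	s.limit = dirty_limit;
	/* Global setpoint: halfway between freerun and the hard limit. */
	s.setpoint = (dirty_limit + freerun) / 2;
	/* Scale by this bdi's share; "+ 1" avoids dividing by zero. */
	s.bdi_setpoint = s.setpoint * bdi_thresh / (thresh + 1);
	return s;
}
```

For example, with dirty_limit = 1000, thresh = 999, bg_thresh = 501 and bdi_thresh = 500, freerun is 750, the setpoint is 875 and the bdi setpoint is 875 * 500 / 1000 = 437 (integer division).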
|
|
|
|
|
2012-05-03 20:47:56 +08:00
|
|
|
TRACE_EVENT(writeback_sb_inodes_requeue,
|
|
|
|
|
|
|
|
TP_PROTO(struct inode *inode),
|
|
|
|
TP_ARGS(inode),
|
|
|
|
|
|
|
|
TP_STRUCT__entry(
|
|
|
|
__array(char, name, 32)
|
2019-11-05 07:54:29 +08:00
|
|
|
__field(ino_t, ino)
|
2012-05-03 20:47:56 +08:00
|
|
|
__field(unsigned long, state)
|
|
|
|
__field(unsigned long, dirtied_when)
|
2019-11-05 07:54:29 +08:00
|
|
|
__field(ino_t, cgroup_ino)
|
2012-05-03 20:47:56 +08:00
|
|
|
),
|
|
|
|
|
|
|
|
TP_fast_assign(
|
include/trace/events/writeback.h: fix -Wstringop-truncation warnings
There are many of those warnings.
In file included from ./arch/powerpc/include/asm/paca.h:15,
from ./arch/powerpc/include/asm/current.h:13,
from ./include/linux/thread_info.h:21,
from ./include/asm-generic/preempt.h:5,
from ./arch/powerpc/include/generated/asm/preempt.h:1,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:51,
from fs/fs-writeback.c:19:
In function 'strncpy',
inlined from 'perf_trace_writeback_page_template' at
./include/trace/events/writeback.h:56:1:
./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
bound 32 equals destination size [-Wstringop-truncation]
return __builtin_strncpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix it by using the new strscpy_pad() which was introduced in "lib/string:
Add strscpy_pad() function" and will always be NUL-terminated instead of
strncpy(). Also, change strlcpy() to use strscpy_pad() in this file for
consistency.
Link: http://lkml.kernel.org/r/1564075099-27750-1-git-send-email-cai@lca.pw
Fixes: 455b2864686d ("writeback: Initial tracing support")
Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()")
Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nitin Gote <nitin.r.gote@intel.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 07:46:16 +08:00
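The warning above comes from strncpy() leaving the destination without a terminator when the source is at least as long as the bound. As a rough userspace illustration of the behaviour the commit message attributes to strscpy_pad(), the sketch below uses `copy_pad()`, a hypothetical stand-in written for this example and not the kernel implementation; its -1 return value on truncation is likewise an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * copy_pad() is a hypothetical userspace stand-in for strscpy_pad().
 * It copies at most size - 1 bytes, always NUL-terminates dst, and
 * zero-fills the rest of dst. It returns the number of bytes copied,
 * or -1 on truncation (the kernel helper returns -E2BIG there).
 * Unlike the kernel helper, it calls strlen() and so reads all of src.
 */
static long copy_pad(char *dst, const char *src, size_t size)
{
	size_t len;

	assert(dst != NULL && src != NULL);
	if (size == 0)
		return -1;

	len = strlen(src);
	if (len >= size) {
		/* Truncate, but keep the terminator. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, size - len);	/* terminator plus padding */
	return (long)len;
}

/* Small self-check: a source that does not fit is cut but terminated. */
static void copy_pad_demo(void)
{
	char name[4];

	assert(copy_pad(name, "sda", sizeof(name)) == 3);
	assert(copy_pad(name, "sdab", sizeof(name)) == -1);
	assert(strcmp(name, "sda") == 0);
}
```

Zero-filling the tail matters for trace records in particular, since the whole fixed-size array is copied out and stale bytes after the terminator would otherwise leak into the buffer.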
		strscpy_pad(__entry->name,
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shut down
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wbs are shut down. So when
one of these wbs is woken up to do delayed work, it tries to
dereference its wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hot removal of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 14:11:04 +08:00
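The fix described above boils down to a name lookup that tolerates a missing device. A minimal sketch of that idea follows; the `fake_device` and `fake_bdi` structures, the `fake_bdi_dev_name()` helper and the "(unknown)" fallback string are all hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct fake_device {
	const char *name;
};

struct fake_bdi {
	struct fake_device *dev;	/* NULL once the device is unregistered */
};

/*
 * Return a printable name even if the device is gone. The fallback
 * string is an assumption made for this example.
 */
static const char *fake_bdi_dev_name(const struct fake_bdi *bdi)
{
	if (!bdi || !bdi->dev)
		return "(unknown)";
	return bdi->dev->name;
}

/* Small self-check showing the intended behaviour. */
static void fake_bdi_dev_name_demo(void)
{
	struct fake_device disk = { .name = "8:0" };
	struct fake_bdi bdi = { .dev = &disk };

	assert(strcmp(fake_bdi_dev_name(&bdi), "8:0") == 0);
	bdi.dev = NULL;		/* simulate the device disappearing */
	assert(strcmp(fake_bdi_dev_name(&bdi), "(unknown)") == 0);
}
```

Routing every tracepoint through one such helper keeps the NULL check in a single place, so callers never dereference the device pointer themselves.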
			    bdi_dev_name(inode_to_bdi(inode)), 32);
		__entry->ino = inode->i_ino;
		__entry->state = inode->i_state;
		__entry->dirtied_when = inode->dirtied_when;
tracing, writeback: Replace cgroup path to cgroup ino
Commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback
tracepoints to report cgroup") made the writeback tracepoints print the cgroup
path when CGROUP_WRITEBACK is enabled. This may trigger the bug below on -rt
kernels, because the tracepoints call kernfs_path and kernfs_path_len, which
acquire a spin lock that is sleepable on -rt kernels.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3
INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Workqueue: writeback wb_workfn (flush-7:0)
Call trace:
[<ffffffc00008d708>] dump_backtrace+0x0/0x200
[<ffffffc00008d92c>] show_stack+0x24/0x30
[<ffffffc0007b0f40>] dump_stack+0x88/0xa8
[<ffffffc000127d74>] ___might_sleep+0x2ec/0x300
[<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8
[<ffffffc0003e0548>] kernfs_path_len+0x30/0x90
[<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8
[<ffffffc000374f90>] wb_writeback+0x620/0x830
[<ffffffc000376224>] wb_workfn+0x61c/0x950
[<ffffffc000110adc>] process_one_work+0x3ac/0xb30
[<ffffffc0001112fc>] worker_thread+0x9c/0x7a8
[<ffffffc00011a9e8>] kthread+0x190/0x1b0
[<ffffffc000086ca0>] ret_from_fork+0x10/0x30
With unlocked kernfs_* functions, synchronize_sched() would have to be called
in kernfs_rename, which can be called in the syscall path, and that is
problematic. So, print the cgroup ino instead of the path name; userland can
convert it to a path name.
Without CGROUP_WRITEBACK enabled, it would just print the root dir. But the
root dir ino varies between filesystems, so print -1U to indicate an invalid
cgroup ino.
Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-03 17:08:57 +08:00
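The approach in the commit message, recording an inode number and falling back to -1U when no valid cgroup is available, can be sketched as below. The `fake_cgroup` type, the `FAKE_INVALID_CGROUP_INO` macro and the `fake_assign_cgroup_ino()` helper are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup object that carries an inode number. */
struct fake_cgroup {
	unsigned long ino;
};

/* Sentinel for "no valid cgroup inode", i.e. the -1U described above. */
#define FAKE_INVALID_CGROUP_INO ((unsigned int)-1)

/* Record a plain integer rather than building a path string. */
static unsigned long fake_assign_cgroup_ino(const struct fake_cgroup *cg)
{
	if (!cg)
		return FAKE_INVALID_CGROUP_INO;
	return cg->ino;
}

/* Small self-check showing both branches. */
static void fake_assign_cgroup_ino_demo(void)
{
	struct fake_cgroup cg = { .ino = 42 };

	assert(fake_assign_cgroup_ino(&cg) == 42);
	assert(fake_assign_cgroup_ino(NULL) == FAKE_INVALID_CGROUP_INO);
}
```

Copying an integer needs no lock, which is what makes it safe to do from a tracepoint running in atomic context.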
		__entry->cgroup_ino = __trace_wb_assign_cgroup(inode_to_wb(inode));
	),

	TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu cgroup_ino=%lu",
		  __entry->name,
		  (unsigned long)__entry->ino,
		  show_inode_state(__entry->state),
		  __entry->dirtied_when,
		  (jiffies - __entry->dirtied_when) / HZ,
		  (unsigned long)__entry->cgroup_ino
	)
);

DECLARE_EVENT_CLASS(writeback_congest_waited_template,

	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),

	TP_ARGS(usec_timeout, usec_delayed),

	TP_STRUCT__entry(
		__field( unsigned int, usec_timeout )
		__field( unsigned int, usec_delayed )
	),

	TP_fast_assign(
		__entry->usec_timeout = usec_timeout;
		__entry->usec_delayed = usec_delayed;
	),

	TP_printk("usec_timeout=%u usec_delayed=%u",
			__entry->usec_timeout,
			__entry->usec_delayed)
);

DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,

	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),

	TP_ARGS(usec_timeout, usec_delayed)
);

DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,

	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),

	TP_ARGS(usec_timeout, usec_delayed)
);

DECLARE_EVENT_CLASS(writeback_single_inode_template,

	TP_PROTO(struct inode *inode,
		 struct writeback_control *wbc,
		 unsigned long nr_to_write
	),

	TP_ARGS(inode, wbc, nr_to_write),

	TP_STRUCT__entry(
		__array(char, name, 32)
		__field(ino_t, ino)
		__field(unsigned long, state)
		__field(unsigned long, dirtied_when)
		__field(unsigned long, writeback_index)
		__field(long, nr_to_write)
		__field(unsigned long, wrote)
		__field(ino_t, cgroup_ino)
	),

	TP_fast_assign(
		strscpy_pad(__entry->name,
			    bdi_dev_name(inode_to_bdi(inode)), 32);
		__entry->ino = inode->i_ino;
		__entry->state = inode->i_state;
		__entry->dirtied_when = inode->dirtied_when;
		__entry->writeback_index = inode->i_mapping->writeback_index;
		__entry->nr_to_write = nr_to_write;
		__entry->wrote = nr_to_write - wbc->nr_to_write;
		__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
	),

	TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu "
		  "index=%lu to_write=%ld wrote=%lu cgroup_ino=%lu",
		  __entry->name,
		  (unsigned long)__entry->ino,
		  show_inode_state(__entry->state),
		  __entry->dirtied_when,
		  (jiffies - __entry->dirtied_when) / HZ,
		  __entry->writeback_index,
		  __entry->nr_to_write,
		  __entry->wrote,
		  (unsigned long)__entry->cgroup_ino
	)
);

DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_start,
	TP_PROTO(struct inode *inode,
		 struct writeback_control *wbc,
		 unsigned long nr_to_write),
	TP_ARGS(inode, wbc, nr_to_write)
);

DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
	TP_PROTO(struct inode *inode,
		 struct writeback_control *wbc,
		 unsigned long nr_to_write),
	TP_ARGS(inode, wbc, nr_to_write)
);

DECLARE_EVENT_CLASS(writeback_inode_template,
	TP_PROTO(struct inode *inode),

	TP_ARGS(inode),

	TP_STRUCT__entry(
		__field( dev_t, dev )
		__field( ino_t, ino )
		__field(unsigned long, state )
		__field( __u16, mode )
		__field(unsigned long, dirtied_when )
	),

	TP_fast_assign(
		__entry->dev = inode->i_sb->s_dev;
		__entry->ino = inode->i_ino;
		__entry->state = inode->i_state;
		__entry->mode = inode->i_mode;
		__entry->dirtied_when = inode->dirtied_when;
	),

	TP_printk("dev %d,%d ino %lu dirtied %lu state %s mode 0%o",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long)__entry->ino, __entry->dirtied_when,
		  show_inode_state(__entry->state), __entry->mode)
);

DEFINE_EVENT(writeback_inode_template, writeback_lazytime,
	TP_PROTO(struct inode *inode),

	TP_ARGS(inode)
);

DEFINE_EVENT(writeback_inode_template, writeback_lazytime_iput,
	TP_PROTO(struct inode *inode),

	TP_ARGS(inode)
);

DEFINE_EVENT(writeback_inode_template, writeback_dirty_inode_enqueue,

	TP_PROTO(struct inode *inode),

	TP_ARGS(inode)
);

/*
 * Inode writeback list tracking.
 */

DEFINE_EVENT(writeback_inode_template, sb_mark_inode_writeback,
	TP_PROTO(struct inode *inode),
	TP_ARGS(inode)
);

DEFINE_EVENT(writeback_inode_template, sb_clear_inode_writeback,
	TP_PROTO(struct inode *inode),
	TP_ARGS(inode)
);

#endif /* _TRACE_WRITEBACK_H */

/* This part must be outside protection */
#include <trace/define_trace.h>