Pull perf fixes from Thomas Gleixner:
"This is much bigger than typical fixes, but Peter found a category of
races that spurred more fixes and more debugging enhancements. Work
started before the merge window, but got finished only now.
Aside of that this contains the usual small fixes to perf and tools.
Nothing particular exciting"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
perf: Remove/simplify lockdep annotation
perf: Synchronously clean up child events
perf: Untangle 'owner' confusion
perf: Add flags argument to perf_remove_from_context()
perf: Clean up sync_child_event()
perf: Robustify event->owner usage and SMP ordering
perf: Fix STATE_EXIT usage
perf: Update locking order
perf: Remove __free_event()
perf/bpf: Convert perf_event_array to use struct file
perf: Fix NULL deref
perf/x86: De-obfuscate code
perf/x86: Fix uninitialized value usage
perf: Fix race in perf_event_exit_task_context()
perf: Fix orphan hole
perf stat: Do not clean event's private stats
perf hists: Fix HISTC_MEM_DCACHELINE width setting
perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
perf tests: Remove wrong semicolon in while loop in CQM test
perf: Synchronously free aux pages in case of allocation failure
...
When a max stack trace is discovered, the stack dump is saved. In order to
not record the overhead of the stack tracer, the ip of the traced function
is looked for within the dump. The trace is started from the location of
that function. But if for some reason the ip is not found, the entire stack
trace is then truncated. That's not very useful. Instead, print everything
if the ip of the traced function is not found within the trace.
This issue showed up on s390.
Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home
Fixes: 72ac426a5b ("tracing: Clean up stack tracing and fix fentry updates")
Cc: stable@vger.kernel.org # v4.3+
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The first is a cut and paste issue that changed the amount of stack
to skip when tracing a stack dump from 0 to 6, which basically made
the stack disappear for small stack traces.
The second fix is just removing an unused field in a struct that is no
longer used, and currently just wastes space.
The third is another cut-and-paste fix that had a tracepoint recording
the wrong field (it was recording the previous field a second time).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWqibPAAoJEKKk/i67LK/8/NkH/3M6WB7RIiMMd4O403imbKcs
yIH0j9vH6Z5hwoAUUr0bEw+gHVgzsiRky5z+fP0f1J3QdVAdgEig6RgQtIbWRynu
i7fohNAiSMBob0wOIHTohQDKkQjvgoO9gO5S8nY6Axgpf4iqOTy3RF2a/gcltULY
qdgy9A0vLk6yMbP6c0P+kEzg4y+Q90DsUh8YzQKW7F1EJPneDmNdug3VM16gefTR
4yrodSBHxr8NV3kAhN8G7FjWmK5cBDFwD66vsti64mKVCW00hjYRCQ+5BrgQ7h0V
EDC7kHisckLb415SQxe8XdF4fKbfE1PuQYZhjTo02hx9XCMeyxDWbjTF2PrZCHw=
=gab6
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull minor tracing fixes from Steven Rostedt:
"This includes three minor fixes, mostly due to cut-and-paste issues.
The first is a cut and paste issue that changed the amount of stack to
skip when tracing a stack dump from 0 to 6, which basically made the
stack disappear for small stack traces.
The second fix is just removing an unused field in a struct that is no
longer used, and currently just wastes space.
The third is another cut-and-paste fix that had a tracepoint recording
the wrong field (it was recording the previous field a second time)"
* tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/dma-buf/fence: Fix timeline str value on fence_annotate_wait_on
ftrace: Remove unused nr_trampolines var
tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
minor fixes.
Here's what else is new:
o A new TRACE_EVENT_FN_COND macro, combining both _FN and _COND for
those that want both.
o New selftest to test the instance create and delete
o Better debug output when ftrace fails
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWlU8tAAoJEKKk/i67LK/8JckH/2XIhjwMunm35uCg1308sDqy
d44G3+p0pm8ztjBf8iD8wH2nP3m7z+nC8JBmSPIUgAHsKOYHWsBy2A/36OVWv5lK
1hVXvBwOuZXnyWXr7bC2RO9S9f9acSFaabZXWDi1BCJRJSgEcknz32V7ZAL4jOCO
SfBWBNrWJfUsURbfbElfVxPLArvyUg9Bb5dW5B+QFf6PuoJaORYzNLYXHlbsq++T
WlrlnD+mFZ/DKFZ/gl3FMSGMPaGimw09/3eqMzv/tLQobp6PbCWlJTwjUoxJ/9dO
XOY4sWUrUUZilU8qCk0i0ZSEumWmE+SWS3eq+Ef18B/5haIj/LkoM4UQD3h2Rc4=
=FDR+
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Not much new with tracing for this release. Mostly just clean ups and
minor fixes.
Here's what else is new:
- A new TRACE_EVENT_FN_COND macro, combining both _FN and _COND for
those that want both.
- New selftest to test the instance create and delete
- Better debug output when ftrace fails"
* tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits)
ftrace: Fix the race between ftrace and insmod
ftrace: Add infrastructure for delayed enabling of module functions
x86: ftrace: Fix the comments for ftrace_modify_code_direct()
tracing: Fix comment to use tracing_on over tracing_enable
metag: ftrace: Fix the comments for ftrace_modify_code
sh: ftrace: Fix the comments for ftrace_modify_code()
ia64: ftrace: Fix the comments for ftrace_modify_code()
ftrace: Clean up ftrace_module_init() code
ftrace: Join functions ftrace_module_init() and ftrace_init_module()
tracing: Introduce TRACE_EVENT_FN_COND macro
tracing: Use seq_buf_used() in seq_buf_to_user() instead of len
bpf: Constify bpf_verifier_ops structure
ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too
ftrace: Remove use of control list and ops
ftrace: Fix output of enabled_functions for showing tramp
ftrace: Fix a typo in comment
ftrace: Show all tramps registered to a record on ftrace_bug()
ftrace: Add variable ftrace_expected for archs to show expected code
ftrace: Add new type to distinguish what kind of ftrace_bug()
tracing: Update cond flag when enabling or disabling a trigger
...
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
We hit ftrace_bug report when booting Android on a 64bit ATOM SOC chip.
Basically, there is a race between insmod and ftrace_run_update_code.
After load_module=>ftrace_module_init, another thread jumps in to call
ftrace_run_update_code=>ftrace_arch_code_modify_prepare
=>set_all_modules_text_rw, to change all modules
as RW. Since the new module is at MODULE_STATE_UNFORMED, the text attribute
is not changed. Then, the 2nd thread goes ahead to change codes.
However, load_module continues to call complete_formation=>set_section_ro_nx,
then 2nd thread would fail when probing the module's TEXT.
The patch fixes it by using notifier to delay the enabling of ftrace
records to the time when module is at state MODULE_STATE_COMING.
Link: http://lkml.kernel.org/r/567CE628.3000609@intel.com
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Qiu Peiyang pointed out that there's a race when enabling function tracing
and loading a module. In order to make the modifications of converting nops
in the prologue of functions into callbacks, the text needs to be converted
from read-only to read-write. When enabling function tracing, the text
permission is updated, the functions are modified, and then they are put
back.
When loading a module, the updates to convert function calls to mcount is
done before the module text is set to read-only. But after it is done, the
module text is visible by the function tracer. Thus we have the following
race:
CPU 0 CPU 1
----- -----
start function tracing
set text to read-write
load_module
add functions to ftrace
set module text read-only
update all functions to callbacks
modify module functions too
< Can't it's read-only >
When this happens, ftrace detects the issue and disables itself till the
next reboot.
To fix this, a new DISABLED flag is added for ftrace records, which all
module functions get when they are added. Then later, after the module code
is all set, the records will have the DISABLED flag cleared, and they will
be enabled if any callback wants all functions to be traced.
Note, this doesn't add the delay to later. It simply changes the
ftrace_module_init() to do both the setting of DISABLED records, and then
immediately calls the enable code. This helps with testing this new code as
it has the same behavior as previously. Another change will come after this
to have the ftrace_module_enable() called after the text is set to
read-only.
Cc: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit kernel
panic at t_show.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
[<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS: 0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
[<ffffffff811dc076>] seq_read+0x2f6/0x3e0
[<ffffffff811b749b>] vfs_read+0x9b/0x160
[<ffffffff811b7f69>] SyS_read+0x49/0xb0
[<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception
When the first time find_next calls find_next_mod_format, it should
iterate the trace_bprintk_fmt_list to find the first print format of
the module. However in current code, start_index is smaller than *pos
at first, and code will not iterate the list. Latter container_of will
get the wrong address with former v, which will cause mod_fmt be a
meaningless object and so is the returned mod_fmt->fmt.
This patch will fix it by correcting the start_index. After fixed,
when the first time calls find_next_mod_format, start_index will be
equal to *pos, and code will iterate the trace_bprintk_fmt_list to
get the right module printk format, so is the returned mod_fmt->fmt.
Link: http://lkml.kernel.org/r/5684B900.9000309@intel.com
Cc: stable@vger.kernel.org # 3.12+
Fixes: 102c9323c3 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The file tracing_enable is obsolete and does not exist anymore. Replace
the comment that references it with the proper tracing_on file.
Link: http://lkml.kernel.org/r/1450787141-45544-1-git-send-email-chuhu@redhat.com
Signed-off-by: Chuyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The start and end variables were only used when ftrace_module_init() was
split up into multiple functions. No need to keep them around after the
merger.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This bpf_verifier_ops structure is never modified, like the other
bpf_verifier_ops structures, so declare it as const.
Done with the help of Coccinelle.
Link: http://lkml.kernel.org/r/1449855359-13724-1-git-send-email-Julia.Lawall@lip6.fr
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jiri Olsa noted that the change to replace the control_ops did not update
the trampoline for when running perf on a single CPU and with CONFIG_PREEMPT
disabled (where dynamic ops, like perf, can use trampolines directly). The
result was that perf function could be called when RCU is not watching as
well as not handle the ftrace_local_disable().
Modify the ftrace_ops_get_func() to also check the RCU and PER_CPU ops flags
and use the recursive function if they are set. The recursive function is
modified to check those flags and execute the appropriate checks if they are
set.
Link: http://lkml.kernel.org/r/20151201134213.GA14155@krava.brq.redhat.com
Reported-by: Jiri Olsa <jolsa@redhat.com>
Patch-fixed-up-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently perf has its own list function within the ftrace infrastructure
that seems to be used only to allow for it to have per-cpu disabling as well
as a check to make sure that it's not called while RCU is not watching. It
uses something called the "control_ops" which is used to iterate over ops
under it with the control_list_func().
The problem is that this control_ops and control_list_func unnecessarily
complicates the code. By replacing FTRACE_OPS_FL_CONTROL with two new flags
(FTRACE_OPS_FL_RCU and FTRACE_OPS_FL_PER_CPU) we can remove all the code
that is special with the control ops and add the needed checks within the
generic ftrace_list_func().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When showing all tramps registered to a ftrace record in the file
enabled_functions, it exits the loop with ops == NULL. But then it is
suppose to show the function on the ops->trampoline and
add_trampoline_func() is called with the given ops. But because ops is now
NULL (to exit the loop), it always shows the static trampoline instead of
the one that is really registered to the record.
The call to add_trampoline_func() that shows the trampoline for the given
ops needs to be called at every iteration.
Fixes: 39daa7b9e8 "ftrace: Show all tramps registered to a record on ftrace_bug()"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"This tree includes four core perf fixes for misc bugs, three fixes to
x86 PMU drivers, and two updates to old email addresses"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not send exit event twice
perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
perf: Fix PERF_EVENT_IOC_PERIOD deadlock
treewide: Remove old email address
perf/x86: Fix LBR call stack save/restore
perf: Update email address in MAINTAINERS
perf/core: Robustify the perf_cgroup_from_task() RCU checks
perf/core: Fix RCU problem with cgroup context switching code
The set_event_pid filter relies on attaching to the sched_switch and
sched_wakeup tracepoints to see if it should filter the tracing on schedule
tracepoints. By adding the callbacks to sched_wakeup, pids in the
set_event_pid file will trace the wakeups of those tasks with those pids.
But sched_wakeup_new and sched_waking were missed. These two should also be
traced. Luckily, these tracepoints share the same class as sched_wakeup
which means they can use the same pre and post callbacks as sched_wakeup
does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an anomaly is detected in the function call modification code,
ftrace_bug() is called to disable function tracing as well as give any
information that may help debug the problem. Currently, only the first found
trampoline that is attached to the failed record is reported. Instead, show
all trampolines that are hooked to it.
Also, not only show the ops pointer but also report the function it calls.
While at it, add this info to the enabled_functions debug file too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an anomaly is found while modifying function code, ftrace_bug() is
called which disables the function tracing infrastructure and reports
information about what failed. If the code that is to be replaced does not
match what is expected, then actual code is shown. Currently there is no
arch generic way to show what was expected.
Add a new variable pointer calld ftrace_expected that the arch code can set
to point to what it expected so that ftrace_bug() can report the actual text
as well as the text that was expected to be there.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace function hook utility has several internal checks to make sure
that whatever it modifies is exactly what it expects to be modifying. This
is essential as modifying running code can be extremely dangerous to the
system.
When an anomaly is detected, ftrace_bug() is called which sends a splat to
the console and disables function tracing. There's some extra information
that is printed to help diagnose the issue.
One thing that is missing though is output of what ftrace was doing at the
time of the crash. Was it updating a call site or perhaps converting a call
site to a nop? A new global enum variable is created to state what ftrace
was doing at the time of the anomaly, and this is reported in ftrace_bug().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a trigger is enabled, the cond flag should be set beforehand,
otherwise a trigger that's expecting to process a trace record
(e.g. one with post_trigger set) could be invoked without one.
Likewise a trigger's cond flag should be reset after it's disabled,
not before.
Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When crossing over to a new page, commit the current work. This will allow
readers to get data with less latency, and also simplifies the work to get
timestamps working for interrupted events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The first commit of a buffer page updates the timestamp of that page. No
need to have the update to the next page add the timestamp too. It will only
be replaced by the first commit on that page anyway.
Only update to a page if it contains an event.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As cpu_buffer->tail_page may be modified by interrupts at almost any time,
the flow of logic is very important. Do not let gcc get smart with
re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
accesses.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit fcc742eaad "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function and the code was designed not to have
an effect on that modified parameter. Now that the parameter is a
descriptor and any modifications to it are non-volatile, the size of the
data could be unnecessarily expanded.
Remove the extra space added if a timestamp was added and the event went
across the page.
Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742eaad "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.
rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.
Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
One patch is needed to make tracing work without debugfs now that tracing
uses its own tracefs.
The second is removing an unused variable.
The third is fixing a warning about unused variables when MAX_TRACER is
not configured. Note, this warning shows up in gcc 6.0, but does not show
up in gcc 4.9, as it seems that gcc does not complain about constants
not being used.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWRKA+AAoJEKKk/i67LK/8BZ4H/0UgA8+OpbUmBHQk/RdMRZW2
BSwO/GuErq3nSKJUjP4f3b4BGdMoFgThIOxR9dvHKwJvGfwpbTivpgR0OqUdFKAZ
1FyUT/oNM0ySNGpZFGuMQ9hvR+deeUvvyPhosjPWlxZLSnRSsEzTlePUqikX1VY6
hEQj+ulK8J0iLlHMDYLJ5FkQ10+owrI2vmT8BU1O8KZN8EBvhoUx4EfJewD63/pa
Gzjm6C9RpVb6ks555NmwFzLeN1Yd5X4n5m2CQFrIpzBgRlibWIGCPMVe7OgNawMu
wONNvhq9zLfo3pftS092lek4iPgEw7ypl6iEgBxDzqTIAjSxUqf+N+1xIbOCcOE=
=GikR
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace cleanups from Steven Rostedt:
"This contains three more clean up patches.
One patch is needed to make tracing work without debugfs now that
tracing uses its own tracefs.
The second is removing an unused variable.
The third is fixing a warning about unused variables when MAX_TRACER
is not configured. Note, this warning shows up in gcc 6.0, but does
not show up in gcc 4.9, as it seems that gcc does not complain about
constants not being used"
* tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: #ifdef out uses of max trace when CONFIG_TRACER_MAX_TRACE is not set
tracing: Remove unused ftrace_cpu_disabled per cpu variable
tracing: Make tracing work when debugfs is not configured in
Pull networking fixes from David Miller:
1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.
2) Several spots need to get to the original listner for SYN-ACK
packets, most spots got this ok but some were not. Whilst covering
the remaining cases, create a helper to do this. From Eric Dumazet.
3) Missiing check of return value from alloc_netdev() in CAIF SPI code,
from Rasmus Villemoes.
4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.
5) Use after free in mvneta driver, from Justin Maggard.
6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.
7) Add missing ZLIB_INFLATE dependency for new qed driver. From Arnd
Bergmann.
8) Fix multicast getsockopt deadlock, from WANG Cong.
9) Fix deadlock in btusb, from Kuba Pawlak.
10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
counter state. From Sabrina Dubroca.
11) Fix packet_bind() race, which can cause lost notifications, from
Francesco Ruggeri.
12) Fix MAC restoration in qlcnic driver during bonding mode changes,
from Jarod Wilson.
13) Revert bridging forward delay change which broke libvirt and other
userspace things, from Vlad Yasevich.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
Revert "bridge: Allow forward delay to be cfgd when STP enabled"
bpf_trace: Make dependent on PERF_EVENTS
qed: select ZLIB_INFLATE
net: fix a race in dst_release()
net: mvneta: Fix memory use after free.
net: Documentation: Fix default value tcp_limit_output_bytes
macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
mvneta: add FIXED_PHY dependency
net: caif: check return value of alloc_netdev
net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
drivers: net: xgene: fix RGMII 10/100Mb mode
netfilter: nft_meta: use skb_to_full_sk() helper
net_sched: em_meta: use skb_to_full_sk() helper
sched: cls_flow: use skb_to_full_sk() helper
netfilter: xt_owner: use skb_to_full_sk() helper
smack: use skb_to_full_sk() helper
net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
bpf: doc: correct arch list for supported eBPF JIT
dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
bonding: fix panic on non-ARPHRD_ETHER enslave failure
...
Arnd Bergmann reported:
In my ARM randconfig tests, I'm getting a build error for
newly added code in bpf_perf_event_read and bpf_perf_event_output
whenever CONFIG_PERF_EVENTS is disabled:
kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
if (event->oncpu != smp_processor_id() ||
^
kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
event->pmu->count)
This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
is disabled. I'm not sure if that is a configuration we care
about, otherwise we could prevent this case from occuring by
adding Kconfig dependencies.
Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.
Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfel
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tracing_max_lat_fops is used only when TRACER_MAX_TRACE enabled, so also
swith the related code. The related warning with defconfig under x86_64:
CC kernel/trace/trace.o
kernel/trace/trace.c:5466:37: warning: ‘tracing_max_lat_fops’ defined but not used [-Wunused-const-variable]
static const struct file_operations tracing_max_lat_fops = {
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the ring buffer is lockless, there is no need to disable ftrace on
CPU. And no one doing so: after commit 68179686ac ("tracing: Remove
ftrace_disable/enable_cpu()") ftrace_cpu_disabled stays the same after
initialization, nothing changes it.
ftrace_cpu_disabled shouldn't be used by any external module since it
disables only function and graph_function tracers but not any other
tracer.
Link: http://lkml.kernel.org/r/1446836846-22239-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
stable tags to them. I searched through my INBOX just as the merge window
opened and found lots of patches to pull. I ran them through all my tests
and they were in linux-next for a few days.
Features added this release:
----------------------------
o Module globbing. You can now filter function tracing to several
modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)
o Tracer specific options are now visible even when the tracer is not
active. It was rather annoying that you can only see and modify tracer
options after enabling the tracer. Now they are in the options/ directory
even when the tracer is not active. Although they are still only visible
when the tracer is active in the trace_options file.
o Trace options are now per instance (although some of the tracer specific
options are global)
o New tracefs file: set_event_pid. If any pid is added to this file, then
all events in the instance will filter out events that are not part of
this pid. sched_switch and sched_wakeup events handle next and the wakee
pids.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWPLQ5AAoJEKKk/i67LK/8CTYIAI1u8DE5QCzv3J0p54jVpNVR
J5FqEU3eXIzd6FS4JXD4nxCeMpUZAy21YnhlZpsnrbJJM5bc9bUsBCwiKKM+MuSZ
ztmy2sgYKkO0h/KUdhNgYJrzis3/Ojquyx9iAqK5ST/Fr+nKYx81akFKjNK53iur
RJRut45sSa8rv11LaL8sgJ6hAWQTc+YkybUdZ5xaMdJmZ6A61T7Y6VzTjbUexuvL
hntCfTjYLtVd8dbfknAnf3B7n/VOO3IFF85wr7ciYR5oEVfPrF8tHmJBlhHExPpX
kaXAiDDRY/UTg/5DQqnp4zmxJoR5BQ2l4pT5PwiLcnwhcphIDNYS8EYUmOYAWjU=
=TjOE
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracking updates from Steven Rostedt:
"Most of the changes are clean ups and small fixes. Some of them have
stable tags to them. I searched through my INBOX just as the merge
window opened and found lots of patches to pull. I ran them through
all my tests and they were in linux-next for a few days.
Features added this release:
----------------------------
- Module globbing. You can now filter function tracing to several
modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)
- Tracer specific options are now visible even when the tracer is not
active. It was rather annoying that you can only see and modify
tracer options after enabling the tracer. Now they are in the
options/ directory even when the tracer is not active. Although
they are still only visible when the tracer is active in the
trace_options file.
- Trace options are now per instance (although some of the tracer
specific options are global)
- New tracefs file: set_event_pid. If any pid is added to this file,
then all events in the instance will filter out events that are not
part of this pid. sched_switch and sched_wakeup events handle next
and the wakee pids"
* tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
tracefs: Fix refcount imbalance in start_creating()
tracing: Put back comma for empty fields in boot string parsing
tracing: Apply tracer specific options from kernel command line.
tracing: Add some documentation about set_event_pid
ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark
tracing: Allow dumping traces without tracking trace started cpus
ring_buffer: Fix more races when terminating the producer in the benchmark
ring_buffer: Do no not complete benchmark reader too early
tracing: Remove redundant TP_ARGS redefining
tracing: Rename max_stack_lock to stack_trace_max_lock
tracing: Allow arch-specific stack tracer
recordmcount: arm64: Replace the ignored mcount call into nop
recordmcount: Fix endianness handling bug for nop_mcount
tracepoints: Fix documentation of RCU lockdep checks
tracing: ftrace_event_is_function() can return boolean
tracing: is_legal_op() can return boolean
ring-buffer: rb_event_is_commit() can return boolean
ring-buffer: rb_per_cpu_empty() can return boolean
ring_buffer: ring_buffer_empty{cpu}() can return boolean
ring-buffer: rb_is_reader_page() can return boolean
...
Currently tracing_init_dentry() returns -ENODEV when debugfs is not
configured in, which causes tracefs not populated with tracing files and
directories, so we will get an empty directory even after we manually
mount tracefs.
We can make tracing_init_dentry() return NULL if debugfs is not
configured in and can manually mount tracefs. But return -ENODEV
if debugfs is configured in but not initialized or failed to create
automount point as that would break backward compatibility with older
tools.
Link: http://lkml.kernel.org/r/1446797056-11683-1-git-send-email-hello.wjx@gmail.com
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block updates from Jens Axboe:
"This is the core block pull request for 4.4. I've got a few more
topic branches this time around, some of them will layer on top of the
core+drivers changes and will come in a separate round. So not a huge
chunk of changes in this round.
This pull request contains:
- Enable blk-mq page allocation tracking with kmemleak, from Catalin.
- Unused prototype removal in blk-mq from Christoph.
- Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
xchg()'s, from Davidlohr.
- A plug flush fix from Jeff.
- Also from Jeff, a fix that means we don't have to update shared tag
sets at init time unless we do a state change. This cuts down boot
times on thousands of devices a lot with scsi/blk-mq.
- blk-mq waitqueue barrier fix from Kosuke.
- Various fixes from Ming:
- Fixes for segment merging and splitting, and checks, for
the old core and blk-mq.
- Potential blk-mq speedup by marking ctx pending at the end
of a plug insertion batch in blk-mq.
- direct-io no page dirty on kernel direct reads.
- A WRITE_SYNC fix for mpage from Roman"
* 'for-4.4/core' of git://git.kernel.dk/linux-block:
blk-mq: avoid excessive boot delays with large lun counts
blktrace: re-write setting q->blk_trace
blk-mq: mark ctx as pending at batch in flush plug path
blk-mq: fix for trace_block_plug()
block: check bio_mergeable() early before merging
blk-mq: check bio_mergeable() early before merging
block: avoid to merge splitted bio
block: setup bi_phys_segments after splitting
block: fix plug list flushing for nomerge queues
blk-mq: remove unused blk_mq_clone_flush_request prototype
blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
block: kmemleak: Track the page allocations for struct request
Pull networking updates from David Miller:
Changes of note:
1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.
2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
David Ahern.
3) Allow the user to ask for the statistics to be filtered out of
ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.
4) More work to pass the network namespace context around deep into
various packet path APIs, starting with the netfilter hooks. From
Eric W Biederman.
5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
Richter.
6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.
7) Support Very High Throughput in wireless MESH code, from Bob
Copeland.
8) Allow setting the ageing_time in switchdev/rocker. From Scott
Feldman.
9) Properly autoload L2TP type modules, from Stephen Hemminger.
10) Fix and enable offload features by default in 8139cp driver, from
David Woodhouse.
11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
Jiri Benc.
12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
Opstad.
13) Fix IPSEC flowcache overflows on large systems, from Steffen
Klassert.
14) Convert bridging to track VLANs using rhashtable entries rather than
a bitmap. From Nikolay Aleksandrov.
15) Make TCP listener handling completely lockless, this is a major
accomplishment. Incoming request sockets now live in the
established hash table just like any other socket too.
From Eric Dumazet.
15) Provide more bridging attributes to netlink, from Nikolay
Aleksandrov.
16) Use hash based algorithm for ipv4 multipath routing, this was very
long overdue. From Peter Nørlund.
17) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.
18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.
19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
influences the port binding selection logic used by SO_REUSEPORT.
20) Add ipv6 support to VRF, from David Ahern.
21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.
22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.
23) Implement RACK loss recovery in TCP, from Yuchung Cheng.
24) Support multipath routes in MPLS, from Roopa Prabhu.
25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
Dumazet.
26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
Sudarsana Kalluru.
27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
Sowa.
28) Support ipv6 geneve tunnels, from John W Linville.
29) Add flood control support to switchdev layer, from Ido Schimmel.
30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
Hannes Frederic Sowa.
31) Support persistent maps and progs in bpf, from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
sh_eth: use DMA barriers
switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
net: sched: kill dead code in sch_choke.c
irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
net: dsa: mv88e6xxx: include DSA ports in VLANs
net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
net/core: fix for_each_netdev_feature
vlan: Invoke driver vlan hooks only if device is present
arcnet/com20020: add LEDS_CLASS dependency
bpf, verifier: annotate verbose printer with __printf
dp83640: Only wait for timestamps for packets with timestamping enabled.
ptp: Change ptp_class to a proper bitmask
dp83640: Prune rx timestamp list before reading from it
dp83640: Delay scheduled work.
dp83640: Include hash in timestamp/packet matching
ipv6: fix tunnel error handling
net/mlx5e: Fix LSO vlan insertion
net/mlx5e: Re-eanble client vlan TX acceleration
net/mlx5e: Return error in case mlx5e_set_features() fails
net/mlx5e: Don't allow more than max supported channels
...
Both early_enable_events() and apply_trace_boot_options() parse a boot
string that may get parsed later on. They both use strsep() which converts a
comma into a nul character. To still allow the boot string to be parsed
again the same way, the nul character gets converted back to a comma after
the token is processed.
The problem is that these two functions check for an empty parameter (two
commas in a row ",,"), and continue the loop if the parameter is empty, but
fails to place the comma back. In this case, the second parsing will end at
this blank field, and not process fields afterward.
In most cases, users should not have an empty field, but if its going to be
checked, the code might as well be correct.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the trace_options parameter is only applied in
tracer_alloc_buffers() when global_trace.current_trace is nop_trace,
so a tracer specific option will not be applied even when the specific
tracer is also enabled from kernel command line. For example, the
'func_stack_trace' option can't be enabled with the following kernel
parameter:
ftrace=function ftrace_filter=kfree trace_options=func_stack_trace
We can enable tracer specific options by simply apply the options again
if the specific tracer is also supplied from command line and started
in register_tracer().
To make trace_boot_options_buf can be parsed again, a comma and a space
is put back if they were replaced by strsep and strstrip respectively.
Also make register_tracer() be __init to access the __init data, and
in fact register_tracer is only called from __init code.
Link: http://lkml.kernel.org/r/1446599669-9294-1-git-send-email-hello.wjx@gmail.com
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:
- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)
- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)
- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
wake_up_process() has a memory barrier before doing anything, thus adding a
memory barrier before calling it is redundant.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We don't init iter->started when dumping the ftrace buffer, and there's no
real need to do so - so allow skipping that check if the iter doesn't have
an initialized ->started cpumask.
Link: http://lkml.kernel.org/r/1441385156-27279-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit b44754d826 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It improved
the situation a lot. It stopped the kthread in most cases because
the producer spent most of the time in the patched while cycle.
But there are still few possible races when kthread_should_stop()
is set outside of the cycle. Then we do not set @kill_test and
some other checks pass.
This patch adds a better fix. It renames @test_kill/TEST_KILL() into
a better descriptive @test_error/TEST_ERROR(). Also it introduces
break_test() function that checks for both @test_error and
kthread_should_stop().
The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state
is manipulated by the producer via the "reader_finish" variable.
Also we add a missing check into ring_buffer_producer_thread()
between setting TASK_INTERRUPTIBLE and calling schedule_timeout().
Otherwise, we might miss a wakeup from kthread_stop().
Link: http://lkml.kernel.org/r/1441629518-32712-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that complete(&read_done) might be called too early
in some situations.
1st scenario:
-------------
CPU0 CPU1
ring_buffer_producer_thread()
wake_up_process(consumer);
wait_for_completion(&read_start);
ring_buffer_consumer_thread()
complete(&read_start);
ring_buffer_producer()
# producing data in
# the do-while cycle
ring_buffer_consumer();
# reading data
# got error
# set kill_test = 1;
set_current_state(
TASK_INTERRUPTIBLE);
if (reader_finish) # false
schedule();
# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
wake_up_process(consumer);
# spurious wakeup
while (!reader_finish &&
!kill_test)
# leaving because
# kill_test == 1
reader_finish = 0;
complete(&read_done);
1st BANG: We might access uninitialized "read_done" if this is the
the first round.
# producer finally leaving
# the do-while cycle because kill_test == 1;
if (consumer) {
reader_finish = 1;
wake_up_process(consumer);
wait_for_completion(&read_done);
2nd BANG: This will never complete because consumer already did
the completion.
2nd scenario:
-------------
CPU0 CPU1
ring_buffer_producer_thread()
wake_up_process(consumer);
wait_for_completion(&read_start);
ring_buffer_consumer_thread()
complete(&read_start);
ring_buffer_producer()
# CPU3 removes the module <--- difference from
# and stops producer <--- the 1st scenario
if (kthread_should_stop())
kill_test = 1;
ring_buffer_consumer();
while (!reader_finish &&
!kill_test)
# kill_test == 1 => we never go
# into the top level while()
reader_finish = 0;
complete(&read_done);
# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
wake_up_process(consumer);
# spurious wakeup
while (!reader_finish &&
!kill_test)
# leaving because kill_test == 1
reader_finish = 0;
complete(&read_done);
BANG: We are in the same "bang" situations as in the 1st scenario.
Root of the problem:
--------------------
ring_buffer_consumer() must complete "read_done" only when "reader_finish"
variable is set. It must not be skipped due to other conditions.
Note that we still must keep the check for "reader_finish" in a loop
because there might be spurious wakeups as described in the
above scenarios.
Solution:
----------
The top level cycle in ring_buffer_consumer() will finish only when
"reader_finish" is set. The data will be read in "while-do" cycle
so that they are not read after an error (kill_test == 1)
or a spurious wake up.
In addition, "reader_finish" is manipulated by the producer thread.
Therefore we add READ_ONCE() to make sure that the fresh value is
read in each cycle. Also we add the corresponding barrier
to synchronize the sleep check.
Next we set the state back to TASK_RUNNING for the situation where we
did not sleep.
Just from paranoid reasons, we initialize both completions statically.
This is safer, in case there are other races that we are unaware of.
As a side effect we could remove the memory barrier from
ring_buffer_producer_thread(). IMHO, this was the reason for
the barrier. ring_buffer_reset() uses spin locks that should
provide the needed memory barrier for using the buffer.
Link: http://lkml.kernel.org/r/1441629518-32712-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TP_ARGS is not used anywhere in trace.h nor trace_entries.h
Firstly, I left just #undef TP_ARGS and had no errors - remove it.
Link: http://lkml.kernel.org/r/1446576560-14085-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that max_stack_lock is a global variable, it requires a naming
convention that is unlikely to collide. Rename it to the same naming
convention that the other stack_trace variables have.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A stack frame may be used in a different way depending on cpu architecture.
Thus it is not always appropriate to slurp the stack contents, as current
check_stack() does, in order to calcurate a stack index (height) at a given
function call. At least not on arm64.
In addition, there is a possibility that we will mistakenly detect a stale
stack frame which has not been overwritten.
This patch makes check_stack() a weak function so as to later implement
arch-specific version.
Link: http://lkml.kernel.org/r/1446182741-31019-5-git-send-email-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make ftrace_event_is_function() return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-9-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make is_legal_op() return bool to improve readability due to this particular
function only using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-8-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make rb_event_is_commit() return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-7-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make rb_is_reader_page() return bool to improve readability due to this
particular function only using either true or false as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-4-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes report_latency return bool due to this
particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-3-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes report_latency return bool to improve readability,
indicating whether this new latency should be reported/recorded.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-2-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Update instancd_rmdir to use tracefs_remove_recursive instead of
debugfs_remove_recursive.This was left in the transition from debugfs
to tracefs.
Link: http://lkml.kernel.org/r/1445169490-18315-2-git-send-email-hello.wjx@gmail.com
Cc: stable@vger.kernel.org # 4.1+
Fixes: 8434dc9340 ("tracing: Convert the tracing facility over to use tracefs")
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no need to record the time tracepoints take when tracing is off.
This is because:
1) We cannot see these records since ring_buffer record is off at that
moment.
2) If tracing is off and benchmark tracepoint is enabled, the time
tracepoint takes is fewer than the same situation when tracing is on,
since the tracepoints need to be wrote into ring_buffer, it would
take more time. If turn on tracing at this moment, the average and
standard deviation cannot exactly present the time that tracepoints
take to write data into ring_buffer.
Link: http://lkml.kernel.org/r/1445947933-27955-1-git-send-email-zhang.chunyan@linaro.org
Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For the case where pids are already in set_event_pid, and one is added or
removed then each CPU should be checked to make sure that the new or old pid
is on or not on a CPU.
For example:
# echo 123 >> set_event_pid
or
# echo '!123' >> set_event_pid
Link: http://lkml.kernel.org/r/20151030061643.GA19480@cac
Suggested-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is really about simplifying the double xchg patterns into
a single cmpxchg, with the same logic. Other than the immediate
cleanup, there are some subtleties this change deals with:
(i) While the load of the old bt is fully ordered wrt everything,
ie:
old_bt = xchg(&q->blk_trace, bt); [barrier]
if (old_bt)
(void) xchg(&q->blk_trace, old_bt); [barrier]
blk_trace could still be changed between the xchg and the old_bt
load. Note that this description is merely theoretical and afaict
very small, but doing everything in a single context with cmpxchg
closes this potential race.
(ii) Ordering guarantees are obviously kept with cmpxchg.
(iii) Gets rid of the hacky-by-nature (void)xchg pattern.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
eviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
exported perf symbols are GPL only, mark eBPF helper functions
used in tracing as GPL only as well.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
(do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.
Also fix error path after perf_event_attrs()
and remove redundant 'extern'.
Fixes: 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
p_start() and p_stop() are seq_file functions that match. Teach sparse to
know that rcu_read_lock_sched() that is taken by p_start() is released by
p_stop.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
My tests found that if a task is running but not filtered when set_event_pid
is modified, then it can still be traced.
Call on_each_cpu() to check if the current running task should be filtered
and update the per cpu flags of tr->data appropriately.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the necessary hooks to use the pids loaded in set_event_pid to filter
all the events enabled in the tracing instance that match the pids listed.
Two probes are added to both sched_switch and sched_wakeup tracepoints to be
called before other probes are called and after the other probes are called.
The first is used to set the necessary flags to let the probes know to test
if they should be traced or not.
The sched_switch pre probe will set the "ignore_pid" flag if neither the
previous or next task has a matching pid.
The sched_switch probe will set the "ignore_pid" flag if the next task
does not match the matching pid.
The pre probe allows for probes tracing sched_switch to be traced if
necessary.
The sched_wakeup pre probe will set the "ignore_pid" flag if neither the
current task nor the wakee task has a matching pid.
The sched_wakeup post probe will set the "ignore_pid" flag if the current
task does not have a matching pid.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a tracing directory called set_event_pid, which currently has no
function, but will be used to filter all events for the tracing instance or
the pids that are added to the file.
The reason no functionality is added with this commit is that this commit
focuses on the creation and removal of the pids in a safe manner. And tests
can be made against this change to make sure things are correct before
hooking features to the list of pids.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This helper is used to send raw data from eBPF program into
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all cpus) and
store FD into perf_event_array (similar to bpf_perf_event_read() helper)
before eBPF program can send data into it.
Today the programs triggered by kprobe collect the data and either store
it into the maps or print it via bpf_trace_printk() where latter is the debug
facility and not suitable to stream the data. This new helper replaces
such bpf_trace_printk() usage and allows programs to have dedicated
channel into user space for post-processing of the raw data collected.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both start_branch_trace() and stop_branch_trace() are used in only one
location, and are both static. As they are small functions there is no
need to keep them separated out.
Link: http://lkml.kernel.org/r/1445000689-32596-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new options to trace Kconfig, CONFIG_TRACING_EVENTS_GPIO, that is
used for enabling/disabling compilation of gpio function trace events.
Link: http://lkml.kernel.org/r/1438432079-11704-4-git-send-email-tal.shorer@gmail.com
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Tal Shorer <tal.shorer@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code in stack tracer should not be executed within an NMI as it grabs
spinlocks and stack tracing an NMI gives the possibility of causing a
deadlock. Although this is safe on x86_64, because it does not perform stack
traces when the task struct stack is not in use (interrupts and NMIs), it
may be an issue for NMIs on i386 and other archs that use the same stack as
the NMI.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Extend module command for function filter selection with globbing.
It uses the same globbing as function filter.
sh# echo '*alloc*:mod:*' > set_ftrace_filter
Will trace any function with the letters 'alloc' in the name in any
module but not in kernel.
sh# echo '!*alloc*:mod:ipv6' >> set_ftrace_filter
Will prevent from tracing functions with 'alloc' in the name from module
ipv6 (do not forget to append to set_ftrace_filter file).
sh# echo '*alloc*:mod:!ipv6' > set_ftrace_filter
Will trace functions with 'alloc' in the name from kernel and any
module except ipv6.
sh# echo '*alloc*:mod:!*' > set_ftrace_filter
Will trace any function with the letters 'alloc' in the name only from
kernel, but not from any module.
sh# echo '*:mod:!*' > set_ftrace_filter
or
sh# echo ':mod:!' > set_ftrace_filter
Will trace every function in the kernel, but will not trace functions
from any module.
sh# echo '*:mod:*' > set_ftrace_filter
or
sh# echo ':mod:' > set_ftrace_filter
As the opposite will trace all functions from all modules, but not from
kernel.
sh# echo '*:mod:*snd*' > set_ftrace_filter
Will trace your sound drivers only (if any).
Link: http://lkml.kernel.org/r/1443545176-3215-4-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ Made format changes ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_match parameters are very related and I reduce the number of local
variables & parameters with it.
This is also preparation for module globbing as it would introduce more
realated variables & parameters.
Link: http://lkml.kernel.org/r/1443545176-3215-3-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ Made some formatting changes ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stack tracer was triggering the WARN_ON() in module.c:
static void module_assert_mutex_or_preempt(void)
{
#ifdef CONFIG_LOCKDEP
if (unlikely(!debug_locks))
return;
WARN_ON(!rcu_read_lock_sched_held() &&
!lockdep_is_held(&module_mutex));
#endif
}
The reason is that the stack tracer traces all function calls, and some of
those calls happen while exiting or entering user space and idle. Some of
these functions are called after RCU had already stopped watching, as RCU
does not watch userspace or idle CPUs.
If a max stack is hit, then the save_stack_trace() is called, which will
check module addresses and call module_assert_mutex_or_preempt(), and then
trigger the warning. Sad part is, the warning itself will also do a stack
trace and tigger the same warning. That probably should be fixed.
The warning was added by 0be964be0d "module: Sanitize RCU usage and
locking" but this bug has probably been around longer. But it's unlikely to
cause much harm, but the new warning causes the system to lock up.
Cc: stable@vger.kernel.org # 4.2+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc:"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
"Not" is too abstract variable name - changed to clear_filter.
Removed ftrace_match_module_records function: comparison with !* or *
not does the general code in filter_parse_regex() as it works without
mod command for
sh# echo '!*' > /sys/kernel/debug/tracing/set_ftrace_filter
Link: http://lkml.kernel.org/r/1443545176-3215-2-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By now there isn't any subcommand for mod.
Before:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
sh$ echo '*:mod:ipv6' > set_ftrace_filter
had the same results, but now first will result in:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
-bash: echo: write error: Invalid argument
Also, I clarified ftrace_mod_callback code a little.
Link: http://lkml.kernel.org/r/1443545176-3215-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ converted 'if (ret == 0)' to 'if (!ret)' ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE
user, move trace_sched_switch() from prepare_task_switch() to
__schedule() and propagate the @preempt argument.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To cover the common case of sorting an array of pointers, Daniel
Wagner recently modified the library sort() to use a specific swap
function for size==8, in addition to the size==4 case which was
already handled. Since sizeof(long) is either 4 or 8,
ftrace_swap_ips() is redundant and we can just let sort() pick an
appropriate and fast swap callback.
Link: http://lkml.kernel.org/r/1441834023-13130-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The kernel now has kstrdup_const/kfree_const for reusing .rodata
(typically string literals) when possible; there's no reason to
duplicate that logic in the tracing system. Moreover, as the comment
above core_kernel_data states, it may not always return true for
.rodata - that is for example the case on x86_64, where we thus end up
kstrdup'ing all the passed-in strings.
Arguably, testing for .rodata explicitly (as kstrdup_const does) is
also more correct: I don't think one is supposed to be able to change
the name after creating the event_subsystem by passing the address of
a static char (but non-const) array.
Link: http://lkml.kernel.org/r/1441833841-12955-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the tracer options to instances options directory as well. Only add the
options for tracers that are allowed to be enabled by an instance. But note,
that tracer options are global. That is, tracer options enabled in an
instance, also take affect at the top level and in other instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow instances to have their own options, at least for the core options
(non tracer specific ones). There are a few global options that should not
be added to instances, like enabling of trace_printk, and the sched comm
recording, which do not have a specific trace instance associated to them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for the multi buffer instances to have their own trace_flags,
the check in ftrace_trace_stack() needs to test the trace_array descriptor
flag that is for the current event, not the global_trace descriptor.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation of having the multi buffer instances having their own trace
option flags, the trace option files needs a way to not only pass in the
flag they represent, but also the trace_array descriptor.
A new field is added to the trace_array descriptor called trace_flags_index,
which is a 32 byte character array representing a bit. This array is simply
filled with the index of the array, where
index_array[n] = n;
Then the address of this array is passed to the file callbacks instead of
the index of the flag index. Then to retrieve both the flag index and the
trace_array descriptor:
data is the passed in argument.
index = *(unsigned char *)data;
data -= index;
/* Now data points to the address of the array in the trace_array */
tr = container_of(data, struct trace_array, trace_flags_index);
Suggested-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation to make trace options per instance, the global trace_flags
needs to be moved from being a global variable to a field within the trace
instance trace_array structure.
There's still more work to do, as there's some functions that use
trace_flags without passing in a way to get to the current_trace array. For
those, the global_trace is used directly (from trace.c). This includes
setting and clearing the trace_flags. This means that when a new instance is
created, it just gets the trace_flags of the global_trace and will not be
able to modify them. Depending on the functions that have access to the
trace_array, the flags of an instance may not affect parts of its trace,
where the global_trace is used. These will be fixed in future changes.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The sleep-time and graph-time options are only for the function graph tracer
and are not used by anything else. As tracer options are now visible when
the tracer is not activated, its better to move the function graph specific
tracer options into the function graph tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the effort to move the global trace_flags to the tracing instances, the
direct access to trace_flags must be removed from trace_printk.c
Instead, add a new trace_printk_enabled boolean that is set by a new access
function trace_printk_control(), that will enable or disable trace_printk.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a enum that denotes the last bit of the trace_flags and have a
BUILD_BUG_ON(last_bit > 32).
If we add more bits than we have in trace_flags, the kernel wont build.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are options that are unique to a specific tracer (like function and
function graph). Currently, these options are only visible in the options
directory when the tracer is enabled.
This has been a pain, especially for something like the func_stack_trace
option that if used inappropriately, could bring the system to a crawl. But
the only way to see it, is to enable the function tracer.
For example, if one had done:
# cd /sys/kernel/tracing
# echo __schedule > set_ftrace_filter
# echo 1 > options/func_stack_trace
# echo function > current_tracer
The __schedule call will be traced and a stack trace will also be recorded
there. Now when you were done, you may do...
# echo nop > current_tracer
# echo > set_ftrace_filter
But you forgot to disable the func_stack_trace. The only way to disable it
is to re-enable function tracing first. If you do not add a filter to
set_ftrace_filter and just do:
# echo function > current_tracer
Now you would be performing a stack trace on *every* function! On some
systems, that causes a live lock. Others may take a few minutes to fix your
mistake.
Having the func_stack_trace option visible allows you to check it and
disable it before enabling the funtion tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Only create the stacktrace trace option when CONFIG_STACKTRACE is
configured.
Cleaned up the ftrace_trace_stack() function call a little to allow better
encapsulation of the stacktrace trace flag.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function tracer is not compiled in, do not create the option files
for it.
Fix up both the sched_wakeup and irqsoff tracers to handle the change.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use a cute little macro trick to keep the names of the trace flags file
guaranteed to match the corresponding masks.
The macro TRACE_FLAGS is defined as a serious of enum names followed by
the string name of the file that matches it. For example:
#define TRACE_FLAGS \
C(PRINT_PARENT, "print-parent"), \
C(SYM_OFFSET, "sym-offset"), \
C(SYM_ADDR, "sym-addr"), \
C(VERBOSE, "verbose"),
Now we can define the following:
#undef C
#define C(a, b) TRACE_ITER_##a##_BIT
enum trace_iterator_bits { TRACE_FLAGS };
The above creates:
enum trace_iterator_bits {
TRACE_ITER_PRINT_PARENT_BIT,
TRACE_ITER_SYM_OFFSET_BIT,
TRACE_ITER_SYM_ADDR_BIT,
TRACE_ITER_VERBOSE_BIT,
};
Then we can redefine C as:
#undef C
#define C(a, b) TRACE_ITER_##a = (1 << TRACE_ITER_##a##_BIT)
enum trace_iterator_flags { TRACE_FLAGS };
Which creates:
enum trace_iterator_flags {
TRACE_ITER_PRINT_PARENT = (1 << TRACE_ITER_PRINT_PARENT_BIT),
TRACE_ITER_SYM_OFFSET = (1 << TRACE_ITER_SYM_OFFSET_BIT),
TRACE_ITER_SYM_ADDR = (1 << TRACE_ITER_SYM_ADDR_BIT),
TRACE_ITER_VERBOSE = (1 << TRACE_ITER_VERBOSE_BIT),
};
Then finally we can create the list of file names:
#undef C
#define C(a, b) b
static const char *trace_options[] = {
TRACE_FLAGS
NULL
};
Which creates:
static const char *trace_options[] = {
"print-parent",
"sym-offset",
"sym-addr",
"verbose",
NULL
};
The importance of this is that the strings match the bit index.
trace_options[TRACE_ITER_SYM_ADDR_BIT] == "sym-addr"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using enums with FLAG_BIT and then defining a FLAG = (1 << FLAG_BIT), is a
bit more robust as we require that there are no bits out of order or skipped
to match the file names that represent the bits.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There was a time where the function tracing would disable interrupts unless
specifically told not to, where it would only disable preemption. With the
new lockless code, the function tracing never disalbes interrupts and just
uses disabling of preemption. Remove the option "ftrace_preempt" as it does
nothing anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to facilitate making all tracer options visible even when the
tracer is not active, we need to get rid of duplicate options. Any option
that is shared between multiple tracers really should be a main option.
As the wakeup and irqsoff tracers both use the "display-graph" option, and
use it exactly the same way, move that option from the tracer options to the
main options and consolidate them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_user_ip() is used in only one location in one file. Turn it into a
static function. We could inject its code into the caller, but that would
make the code a bit too complex. Keep the code separate.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_userip_objs() is used only in one location, in one file. Instead
of having it as an external function, go one further than making it static,
but inject is code into its only user. It doesn't make the calling function
much more complex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having trace options be per instance, the trace_array
needs to be passed to the trace_buffer_unlock_commit(). The
trace_event_buffer_lock_reserve() already passes in the trace_event_file
where the trace_array can be derived from.
Also added a "__init" to the boot up test event plus function tracing
function function_test_events_call().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_trace_stack_regs() is used in only one place, and because that is
such a simple function, just move its code into the location that it was
used in (trace_buffer_unlock_commit_regs()).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes is_good_name return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1442929393-4753-2-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The changes with more meat are:
o Allowing the trace event filters to filter on CPU number and process ids
o Two new markers for trace output latency were added
(10 and 100 msec latencies)
o Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future
work, and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression
as well as the code to revert that change without touching the other
changes that were made on top of it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV6aZEAAoJEEjnJuOKh9ldrR4H/A1RcQf1prLLoUibPP4w3lat
dmQcdpS1NY+cqyiKuKPAOkFDGQL7qWzRqZ8whcPSJIsHq57ufqNSLf+0bbQYPzg9
g3CgGL7OApmGi5ulj0sNxhadvc9TFm/SAN0nVJlNuUWdm8e1UWHLsrJZaMfopu2r
RDEtkOhg619mhDL4rktNdS6rk0B92Fhu2o2PwLZPVlUl1NNEt4WJU+ejitXUVO1A
Nb70/rTGGJKtyHbW+74on4LnEN5Uu0Viu6rMwGfYyIgRmC2otdBDvE4xfKMiTUKr
SzBjzrhIoMIRn4Vl0vElfulkpYaw7pcC2BdpZ4d9VpIOiLSlZs0x/TgCtpFEv5M=
=baZ3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"Mostly this is just clean ups and micro optimizations.
The changes with more meat are:
- Allowing the trace event filters to filter on CPU number and
process ids
- Two new markers for trace output latency were added (10 and 100
msec latencies)
- Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future work,
and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression as
well as the code to revert that change without touching the other
changes that were made on top of it"
* tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
tracing: Don't make assumptions about length of string on task rename
tracing: Allow triggers to filter for CPU ids and process names
ftrace: Format MCOUNT_ADDR address as type unsigned long
tracing: Introduce two additional marks for delay
ftrace: Fix function_graph duration spacing with 7-digits
ftrace: add tracing_thresh to function profile
tracing: Clean up stack tracing and fix fentry updates
ring-buffer: Reorganize function locations
ring-buffer: Make sure event has enough room for extend and padding
ring-buffer: Get timestamp after event is allocated
ring-buffer: Move the adding of the extended timestamp out of line
ring-buffer: Add event descriptor to simplify passing data
ftrace: correct the counter increment for trace_buffer data
tracing: Fix for non-continuous cpu ids
tracing: Prefer kcalloc over kzalloc with multiply
Pull networking updates from David Miller:
"Another merge window, another set of networking changes. I've heard
rumblings that the lightweight tunnels infrastructure has been voted
networking change of the year. But what do I know?
1) Add conntrack support to openvswitch, from Joe Stringer.
2) Initial support for VRF (Virtual Routing and Forwarding), which
allows the segmentation of routing paths without using multiple
devices. There are some semantic kinks to work out still, but
this is a reasonably strong foundation. From David Ahern.
3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov.
4) Ignore route nexthops with a link down state in ipv6, just like
ipv4. From Andy Gospodarek.
5) Remove spinlock from fast path of act_gact and act_mirred, from
Eric Dumazet.
6) Document the DSA layer, from Florian Fainelli.
7) Add netconsole support to bcmgenet, systemport, and DSA. Also
from Florian Fainelli.
8) Add Mellanox Switch Driver and core infrastructure, from Jiri
Pirko.
9) Add support for "light weight tunnels", which allow for
encapsulation and decapsulation without bearing the overhead of a
full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
others.
10) Add Identifier Locator Addressing support for ipv6, from Tom
Herbert.
11) Support fragmented SKBs in iwlwifi, from Johannes Berg.
12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.
13) Add BQL support to 3c59x driver, from Loganaden Velvindron.
14) Stop using a zero TX queue length to mean that a device shouldn't
have a qdisc attached, use an explicit flag instead. From Phil
Sutter.
15) Use generic geneve netdevice infrastructure in openvswitch, from
Pravin B Shelar.
16) Add infrastructure to avoid re-forwarding a packet in software
that was already forwarded by a hardware switch. From Scott
Feldman.
17) Allow AF_PACKET fanout function to be implemented in a bpf
program, from Willem de Bruijn"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
net: fec: clear receive interrupts before processing a packet
ipv6: fix exthdrs offload registration in out_rt path
xen-netback: add support for multicast control
bgmac: Update fixed_phy_register()
sock, diag: fix panic in sock_diag_put_filterinfo
flow_dissector: Use 'const' where possible.
flow_dissector: Fix function argument ordering dependency
ixgbe: Resolve "initialized field overwritten" warnings
ixgbe: Remove bimodal SR-IOV disabling
ixgbe: Add support for reporting 2.5G link speed
ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
ixgbe: support for ethtool set_rxfh
ixgbe: Avoid needless PHY access on copper phys
ixgbe: cleanup to use cached mask value
ixgbe: Remove second instance of lan_id variable
ixgbe: use kzalloc for allocating one thing
flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
ixgbe: Remove unused PCI bus types
...
The commit a4543a2fa9 "ring-buffer: Get timestamp after event is
allocated" is needed for some future work. But after adding it, there is a
race somewhere that causes the saved timestamp to have a slight shift, and
get ahead of the actual timestamp and make it look like time goes backwards.
I'm still looking into why this happens, but in the mean time, this is
holding up other work to get in. I'm reverting the change for now (which
makes the problem go away), and will add it back after I know what is wrong
and fix it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
Pull scheduler updates from Ingo Molnar:
"The biggest change in this cycle is the rewrite of the main SMP load
balancing metric: the CPU load/utilization. The main goal was to make
the metric more precise and more representative - see the changelog of
this commit for the gory details:
9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
It is done in a way that significantly reduces complexity of the code:
5 files changed, 249 insertions(+), 494 deletions(-)
and the performance testing results are encouraging. Nevertheless we
need to keep an eye on potential regressions, since this potentially
affects every SMP workload in existence.
This work comes from Yuyang Du.
Other changes:
- SCHED_DL updates. (Andrea Parri)
- Simplify architecture callbacks by removing finish_arch_switch().
(Peter Zijlstra et al)
- cputime accounting: guarantee stime + utime == rtime. (Peter
Zijlstra)
- optimize idle CPU wakeups some more - inspired by Facebook server
loads. (Mike Galbraith)
- stop_machine fixes and updates. (Oleg Nesterov)
- Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)
- sched/numa tweaks. (Srikar Dronamraju)
- misc fixes and small cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched/deadline: Fix comment in enqueue_task_dl()
sched/deadline: Fix comment in push_dl_tasks()
sched: Change the sched_class::set_cpus_allowed() calling context
sched: Make sched_class::set_cpus_allowed() unconditional
sched: Fix a race between __kthread_bind() and sched_setaffinity()
sched: Ensure a task has a non-normalized vruntime when returning back to CFS
sched/numa: Fix NUMA_DIRECT topology identification
tile: Reorganize _switch_to()
sched, sparc32: Update scheduler comments in copy_thread()
sched: Remove finish_arch_switch()
sched, tile: Remove finish_arch_switch
sched, sh: Fold finish_arch_switch() into switch_to()
sched, score: Remove finish_arch_switch()
sched, avr32: Remove finish_arch_switch()
sched, MIPS: Get rid of finish_arch_switch()
sched, arm: Remove finish_arch_switch()
sched/fair: Clean up load average references
sched/fair: Provide runnable_load_avg back to cfs_rq
sched/fair: Remove task and group entity load when they are dead
sched/fair: Init cfs_rq's sched_entity load average
...
generalize FETCH_FUNC_NAME(memory, string) into
strncpy_from_unsafe() and fix sparse warnings that were
present in original implementation.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By extending the filter rules by more generic fields
we can write triggers filters like
echo 'stacktrace if cpu == 1' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
or
echo 'stacktrace if comm == sshd' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
CPU and COMM are not part of struct trace_entry. We could add the two
new fields to ftrace_common_field list and fix up all depending
sides. But that looks pretty ugly. Another thing I would like to
avoid that the 'format' file contents changes.
All this can be avoided by introducing another list which contains
non field members of struct trace_entry.
Link: http://lkml.kernel.org/r/1439210146-24707-1-git-send-email-daniel.wagner@bmw-carit.de
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By copying BPF related operation to uprobe processing path, this patch
allow users attach BPF programs to uprobes like what they are already
doing on kprobes.
After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
uprobe perf event. Which make it possible to profile user space programs
and kernel events together using BPF.
Because of this patch, CONFIG_BPF_EVENTS should be selected by
CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
KPROBE_EVENT is not set.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Mathieu reported that since 317f394160 ("sched: Move the second half
of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of
context of the waker.
This is a problem when you want to analyse wakeup paths because it is
now very hard to correlate the wakeup event to whoever issued the
wakeup.
OTOH trace_sched_wakeup() is issued at the point where we set
p->state = TASK_RUNNING, which is right were we hand the task off to
the scheduler, so this is an important point when looking at
scheduling behaviour, up to here its been the wakeup path everything
hereafter is due to scheduler policy.
To bridge this gap, introduce a second tracepoint: trace_sched_waking.
It is guaranteed to be called in the waker context.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Francis Giraldeau <francis.giraldeau@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150609091336.GQ3644@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 4104d326b6 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.
Add back the proper hooks to have pid filtering working once again.
Cc: stable@vger.kernel.org # 3.16+
Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A fine granulity support for delay would be very useful when profiling
VM logics, such as page allocation including page reclaim and memory
compaction with function graph.
Thus, this patch adds two additional marks with two changes.
- An equal sign in mark selection function is removed to align code
behavior with comments and documentation.
- The function graph example related to delay in ftrace.txt is updated
to cover all supported marks.
Link: http://lkml.kernel.org/r/1436626300-1679-3-git-send-email-jungseoklee85@gmail.com
Cc: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jungseok Lee noticed the following:
Currently, row's width of 7-digit duration numbers not aligned with
other cases like the following example.
3) $ 3999884 us | }
3) | finish_task_switch() {
3) 0.365 us | _raw_spin_unlock_irq();
3) 3.333 us | }
3) $ 3999976 us | }
3) $ 3999979 us | } /* schedule */
As adding a single white space in case of 7-digit numbers, the format
could be unified easily as follows.
3) $ 2237472 us | }
3) | finish_task_switch() {
3) 0.364 us | _raw_spin_unlock_irq();
3) 3.125 us | }
3) $ 2237556 us | }
3) $ 2237559 us | } /* schedule */
Instead of making a special case for 7-digit numbers, the logic
of the len and the space loop is slightly modified to make the
two cases have the same format.
Link: http://lkml.kernel.org/r/1436626300-1679-2-git-send-email-jungseoklee85@gmail.com
Reported-by: Jungseok Lee <jungseoklee85@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch extends tracing_thresh functionality to function profile tracer.
If tracing_thresh is set, print those entries only,
whose average is > tracing thresh.
Link: http://lkml.kernel.org/r/1434972488-8571-1-git-send-email-umesh.t@samsung.com
Signed-off-by: Umesh Tiwari <umesh.t@samsung.com>
[ Removed unnecessary 'moved' comment ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Akashi Takahiro was porting the stack tracer to arm64 and found some
issues with it. One was that it repeats the top function, due to the
stack frame added by the mcount caller and added by itself. This
was added when fentry came in, and before fentry created its own stack
frame. But x86's fentry now creates its own stack frame, and there's
no need to insert the function again.
This also cleans up the code a bit, where it doesn't need to do something
special for fentry, and doesn't include insertion of a duplicate
entry for the called function being traced.
Link: http://lkml.kernel.org/r/55A646EE.6030402@linaro.org
Some-suggestions-by: Jungseok Lee <jungseoklee85@gmail.com>
Some-suggestions-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Functions in ring-buffer.c have gotten interleaved between different
use cases. Move the functions around to get like functions closer
together. This may or may not help gcc keep cache locality, but it
makes it a little easier to work with the code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that events only add time extends after it is committed, in case
an event comes in before it can discard the allocated event, the time
extend needs to be stored within the event. If the event is bigger
than then size needed for the time extend, padding must be added.
The minimum padding size is 8 bytes. Thus if the event is 12 bytes
(size of time extend + 4), there will not be enough room to add both
the time extend and padding. Make sure all events are either 8 bytes
or 16 or more bytes.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the capturing of the timestamp to after an event is allocated.
If the event is not a commit (where it is an event that preempted
another event), then no timestamp is needed, because the delta of
nested events is always zero.
If the event starts on a new page, no delta needs to be calculated
as the full timestamp will be added to the page header, and the
event will have a delta of zero.
Now if the event requires a time extend (the delta does not fit
in the 27 bit delta slot in the header), then the event is discarded,
the length is extended to hold the TIME_EXTEND event that allows for
a 59 bit delta, and the commit is tried again.
If the event can't be discarded (another event came in after it),
then the TIME_EXTEND is added directly to the allocated event and
the rest of the event is given padding.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Requiring a extended time stamp is an uncommon occurrence, and it is
best to do it out of line when needed.
Add a noinline function that handles the extended timestamp and
have it called with an unlikely to completely move it out of the
fast path.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add rb_event_info descriptor to pass event info to functions a bit
easier than using a bunch of parameters. This will also allow for
changing the code around a bit to find better fast paths.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In ftrace_dump, for disabling buffer, iter.tr->trace_buffer.data is used.
But for enabling, iter.trace_buffer->data is used.
Even though, both point to same buffer, for readability, same convention
should be used.
Link: http://lkml.kernel.org/r/1434972306-20043-1-git-send-email-umesh.t@samsung.com
Signed-off-by: Umesh Tiwari <umesh.t@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently exception occures due to access beyond buffer_iter
range while using index of cpu bigger than num_possible_cpus().
Below there is an example for such exception when we use
cpus 0,1,16,17.
In order to fix buffer allocation size for non-continuous cpu ids
we allocate according to the max cpu id and not according to the
amount of possible cpus.
Example:
$ cat /sys/kernel/debug/tracing/per_cpu/cpu1/trace
Path: /bin/busybox
CPU: 0 PID: 82 Comm: cat Not tainted 4.0.0 #29
task: 80734c80 ti: 80012000 task.ti: 80012000
[ECR ]: 0x00220100 => Invalid Read @ 0x00000000 by insn @ 0x800abafc
[EFA ]: 0x00000000
[BLINK ]: ring_buffer_read_finish+0x24/0x64
[ERET ]: rb_check_pages+0x20/0x188
[STAT32]: 0x00001a00 :
BTA: 0x800abafc SP: 0x80013f0c FP: 0x57719cf8
LPS: 0x200036b4 LPE: 0x200036b8 LPC: 0x00000000
r00: 0x8002aca0 r01: 0x00001606 r02: 0x00000000
r03: 0x00000001 r04: 0x00000000 r05: 0x804b4954
r06: 0x00030003 r07: 0x8002a260 r08: 0x00000286
r09: 0x00080002 r10: 0x00001006 r11: 0x807351a4
r12: 0x00000001
Stack Trace:
rb_check_pages+0x20/0x188
ring_buffer_read_finish+0x24/0x64
tracing_release+0x4e/0x170
__fput+0x62/0x158
task_work_run+0xa2/0xd4
do_notify_resume+0x52/0x7c
resume_user_mode_begin+0xdc/0xe0
Link: http://lkml.kernel.org/r/1433835155-6894-3-git-send-email-gilf@ezchip.com
Signed-off-by: Noam Camus <noamc@ezchip.com>
Signed-off-by: Gil Fruchter <gilf@ezchip.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use kcalloc for allocating an array instead of kzalloc with multiply,
as that is what kcalloc is used for.
Found with checkpatch.
Link: http://lkml.kernel.org/r/1433835155-6894-2-git-send-email-gilf@ezchip.com
Signed-off-by: Gil Fruchter <gilf@ezchip.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.
The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.
Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.
There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.
This prevents the recursion and the inevitable crash that follows.
Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
"monitonic raw". Also some enhancements to make the ring buffer even
faster. But the biggest and most noticeable change is the renaming of
the ftrace* files, structures and variables that have to deal with
trace events.
Over the years I've had several developers tell me about their confusion
with what ftrace is compared to events. Technically, "ftrace" is the
infrastructure to do the function hooks, which include tracing and also
helps with live kernel patching. But the trace events are a separate
entity altogether, and the files that affect the trace events should
not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled()-> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not heard
a single complaint about this rename breaking anything. Mostly because
these functions, variables and structures are mostly internal to the
tracing system and are seldom (if ever) used by anything external to that.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJViYhVAAoJEEjnJuOKh9ldcJ0IAI+mytwoMAN/CWDE8pXrTrgs
aHlcr1zorSzZ0Lq6lKsWP+V0VGVhP8KWO16vl35HaM5ZB9U+cDzWiGobI8JTHi/3
eeTAPTjQdgrr/L+ZO1ApzS1jYPhN3Xi5L7xublcYMJjKfzU+bcYXg/x8gRt0QbG3
S9QN/kBt0JIIjT7McN64m5JVk2OiU36LxXxwHgCqJvVCPHUrriAdIX7Z5KRpEv13
zxgCN4d7Jiec/FsMW8dkO0vRlVAvudZWLL7oDmdsvNhnLy8nE79UOeHos2c1qifQ
LV4DeQ+2Hlu7w9wxixHuoOgNXDUEiQPJXzPc/CuCahiTL9N/urQSGQDoOVMltR4=
=hkdz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This patch series contains several clean ups and even a new trace
clock "monitonic raw". Also some enhancements to make the ring buffer
even faster. But the biggest and most noticeable change is the
renaming of the ftrace* files, structures and variables that have to
deal with trace events.
Over the years I've had several developers tell me about their
confusion with what ftrace is compared to events. Technically,
"ftrace" is the infrastructure to do the function hooks, which include
tracing and also helps with live kernel patching. But the trace
events are a separate entity altogether, and the files that affect the
trace events should not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not
heard a single complaint about this rename breaking anything. Mostly
because these functions, variables and structures are mostly internal
to the tracing system and are seldom (if ever) used by anything
external to that"
* tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
ring_buffer: Allow to exit the ring buffer benchmark immediately
ring-buffer-benchmark: Fix the wrong type
ring-buffer-benchmark: Fix the wrong param in module_param
ring-buffer: Add enum names for the context levels
ring-buffer: Remove useless unused tracing_off_permanent()
ring-buffer: Give NMIs a chance to lock the reader_lock
ring-buffer: Add trace_recursive checks to ring_buffer_write()
ring-buffer: Allways do the trace_recursive checks
ring-buffer: Move recursive check to per_cpu descriptor
ring-buffer: Add unlikelys to make fast path the default
tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
tracing: Rename ftrace_event_name() to trace_event_name()
tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
...
I could not come up with a situation where the operand counter (cnt)
could go below zero, so I added a WARN_ON_ONCE(cnt < 0). Vince was
able to trigger that warn on with his fuzzer test, but didn't have
a filter input that caused it.
Later, Sasha Levin was able to trigger that same warning, and was
able to give me the filter string that triggered it. It was simply
a single operation ">".
I wrapped the filtering code in a userspace program such that I could
single step through the logic. With a single operator the operand
counter can legitimately go below zero, and should be reported to the
user as an error, but should not produce a kernel warning. The
WARN_ON_ONCE(cnt < 0) should be just a "if (cnt < 0) break;" and the
code following it will produce the error message for the user.
While debugging this, I found that there was another bug that let
the pointer to the filter string go beyond the filter string.
This too was fixed.
Finally, there was a typo in a stub function that only gets compiled
if trace events is disabled but tracing is enabled (I'm not even sure
that's possible).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVjWh2AAoJEEjnJuOKh9ldOn0IANHPW82++0O87U1pEe3hHnKv
gSTKiNPVNC3GBt9DHnawA0EuyPfPa+Wj5X2xgrstWA+KRADZErZzdWpzbh/iHosJ
0kaUFqFcaKBheOSqhHfz3WQshD16pb1lQYbV7vbdzMjpcIpYT3VcuKQq3zQVb5Pr
njPmgZXK+I9ITYQ8E+DysnTg0+Mo+l/2P/tqnBoIkAVmuZitfJS5okTtVw1GNzyR
7VRMGBE3G0GxB++57T/xILXjFc9sSGQH5lZgLHQhEh36YgWuDvc0R2FfxDKROmeq
b/xw68uCO1Hv8oEng6r/UceVtUoaXhf+JamSJqxztBTsjsqR/iXCV78Jac1vnPY=
=cmr8
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This isn't my 4.2 pull request (yet). I found a few more bugs that I
would have sent to fix 4.1, but since 4.1 is already out, I'm sending
this before sending my 4.2 request (which is ready to go).
After fixing the previous filter issue reported by Vince Weaver, I
could not come up with a situation where the operand counter (cnt)
could go below zero, so I added a WARN_ON_ONCE(cnt < 0). Vince was
able to trigger that warn on with his fuzzer test, but didn't have a
filter input that caused it.
Later, Sasha Levin was able to trigger that same warning, and was able
to give me the filter string that triggered it. It was simply a
single operation ">".
I wrapped the filtering code in a userspace program such that I could
single step through the logic. With a single operator the operand
counter can legitimately go below zero, and should be reported to the
user as an error, but should not produce a kernel warning. The
WARN_ON_ONCE(cnt < 0) should be just a "if (cnt < 0) break;" and the
code following it will produce the error message for the user.
While debugging this, I found that there was another bug that let the
pointer to the filter string go beyond the filter string. This too
was fixed.
Finally, there was a typo in a stub function that only gets compiled
if trace events is disabled but tracing is enabled (I'm not even sure
that's possible)"
* tag 'trace-fixes-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix typo from "static inlin" to "static inline"
tracing/filter: Do not allow infix to exceed end of string
tracing/filter: Do not WARN on operand count going below zero
Part of the disassembly of do_blk_trace_setup:
231b: e8 00 00 00 00 callq 2320 <do_blk_trace_setup+0x50>
231c: R_X86_64_PC32 strlen+0xfffffffffffffffc
2320: eb 0a jmp 232c <do_blk_trace_setup+0x5c>
2322: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
2328: 48 83 c3 01 add $0x1,%rbx
232c: 48 39 d8 cmp %rbx,%rax
232f: 76 47 jbe 2378 <do_blk_trace_setup+0xa8>
2331: 41 80 3c 1c 2f cmpb $0x2f,(%r12,%rbx,1)
2336: 75 f0 jne 2328 <do_blk_trace_setup+0x58>
2338: 41 c6 04 1c 5f movb $0x5f,(%r12,%rbx,1)
233d: 4c 89 e7 mov %r12,%rdi
2340: e8 00 00 00 00 callq 2345 <do_blk_trace_setup+0x75>
2341: R_X86_64_PC32 strlen+0xfffffffffffffffc
2345: eb e1 jmp 2328 <do_blk_trace_setup+0x58>
Yep, that's right: gcc isn't smart enough to realize that replacing '/' by
'_' cannot change the strlen(), so we call it again and again (at least
when a '/' is found). Even if gcc were that smart, this construction
would still loop over the string twice, once for the initial strlen() call
and then the open-coded loop.
Let's simply use strreplace() instead.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Liked-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no point in starting over every time we see a ','...
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The trace.h header when called without CONFIG_EVENT_TRACING enabled
(seldom done), will not compile because of a typo in the protocol
of trace_event_enum_update().
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has
ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];
The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.
Luckily, only root can write to the filter.
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.
Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
When the following filter is used it causes a warning to trigger:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
Modules linked in: bnep lockd grace bluetooth ...
CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
Call Trace:
[<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
[<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
[<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
[<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81159065>] replace_preds+0x3c5/0x990
[<ffffffff811596b2>] create_filter+0x82/0xb0
[<ffffffff81159944>] apply_event_filter+0xd4/0x180
[<ffffffff81152bbf>] event_filter_write+0x8f/0x120
[<ffffffff811db2a8>] __vfs_write+0x28/0xe0
[<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
[<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
[<ffffffff811dc408>] vfs_write+0xb8/0x1b0
[<ffffffff811dc72f>] SyS_write+0x4f/0xb0
[<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
---[ end trace e11028bd95818dcd ]---
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression
And give no kernel warning.
Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: stable@vger.kernel.org # 2.6.31+
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's useful to do per-cpu histograms.
Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:
u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid
u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid
bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf
They can be used from the programs attached to TC as well to classify packets
based on current task fields.
Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It takes a while until the ring_buffer_benchmark module is removed
when the ring buffer hammer is running. It is because it takes
few seconds and kthread_should_stop() is not being checked.
This patch adds the check for kthread termination into the producer.
It uses the existing @kill_test flag to finish the kthreads as
cleanly as possible.
It disables printing the "ERROR" message when the kthread is going.
It makes sure that producer does not go into the 10sec sleep
when it is being killed.
Finally, it does not call wait_to_die() when kthread_should_stop()
already returns true.
Link: http://lkml.kernel.org/r/20150615155428.GD3135@pathway.suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The macro 'module_param' shows that the type of the
variable disable_reader and write_iteration is unsigned
integer. so, we change their type form int to unsigned int.
Link: http://lkml.kernel.org/r/1433923927-67782-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The {producer|consumer}_{nice|fifo} parameters are integer
type, we should use 'int' as the second param in module_param.
For example(consumer_fifo):
the default value of consumer_fifo is -1.
Without this patch:
# cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo
4294967295
With this patch:
# cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo
-1
Link: http://lkml.kernel.org/r/1433923873-67712-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As this is already exported from tracing side via commit d9847d310a
("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
as well want to move it to the core, so also networking users can make
use of it, e.g. to measure diffs for certain flows from ingress/egress.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tracing_off_permanent() call is a way to disable all ring_buffers.
Nothing uses it and nothing should use it, as tracing_off() and
friends are better, as they disable the ring buffers related to
tracing. The tracing_off_permanent() even disabled non tracing
ring buffers. This is a bit drastic, and was added to handle NMIs
doing outputs that could corrupt the ring buffer when only tracing
used them. It is now obsolete and adds a little overhead, it should
be removed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, if an NMI does a dump of a ring buffer, it disables
all ring buffers from ever doing any writes again. This is because
it wont take the locks for the cpu_buffer and this can cause
corruption if it preempted a read, or a read happens on another
CPU for the current cpu buffer. This is a bit overkill.
First, it should at least try to take the lock, and if it fails
then disable it. Also, there's no need to disable all ring
buffers, even those that are unrelated to what is being read.
Only disable the per cpu ring buffer that is being read if
it can not get the lock for it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring_buffer_write() function isn't protected by the trace recursive
writes. Luckily, this function is not used as much and is unlikely
to ever recurse. But it should still have the protection, because
even a call to ring_buffer_lock_reserve() could cause ring buffer
corruption if called when ring_buffer_write() is being used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace_recursive checks are only done if CONFIG_TRACING
is enabled. That was because there use to be a dependency with tracing
for the recursive checks (it used the task_struct trace recursive
variable). But now it uses its own variable and there is no dependency.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of using a global per_cpu variable to perform the recursive
checks into the ring buffer, use the already existing per_cpu descriptor
that is part of the ring buffer itself.
Not only does this simplify the code, it also allows for one ring buffer
to be used within the guts of the use of another ring buffer. For example
trace_printk() can now be used within the ring buffer to record changes
done by an instance into the main ring buffer. The recursion checks
will prevent the trace_printk() itself from causing recursive issues
with the main ring buffer (it is just ignored), but the recursive
checks wont prevent the trace_printk() from recording other ring buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_raw_##call structures are built
by macros for trace events. They have nothing to do with function tracing.
Rename them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_trigger_soft_disabled() tests if a
trace_event is soft disabled (called but not traced), and returns true if
it is. It has nothing to do with function tracing and should be renamed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The FTRACE_EVENT_FL_* flags are flags to
do with the trace_event files in the tracefs directory. They are not related
to function tracing. Rename them to a more descriptive name.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structure ftrace_subsystem_dir holds
the information about trace event subsystems. It should not be named
ftrace, rename it to trace_subsystem_dir.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. ftrace_event_name() returns the name of
an event tracepoint, has nothing to do with function tracing. Rename it
to trace_event_name().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. Rename the max trace_event type size to
something more descriptive and appropriate.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_output_*() and ftrace_raw_output_*()
functions represent the trace_event code. Rename them to just trace_output
or trace_raw_output.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_event_buffer functions and data
structures are for trace_events and not for function hooks. Rename them
to trace_event_buffer*.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structures ftrace_event_call and
ftrace_event_class have nothing to do with the function hooks, and are
really trace_event structures. Rename ftrace_event_* to trace_event_*.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structure ftrace_event_file is really
about trace events and not "ftrace". Rename it to trace_event_file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The functions (un)register_ftrace_event() is
really about trace_events, and the name should be register_trace_event()
instead.
Also renamed ftrace_event_reg() to trace_event_reg() for the same reason.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The functions ftrace_print_*() are not part of
the function infrastructure, and the names can be confusing. Rename them
to be trace_print_*().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The term "ftrace" is really the infrastructure of the function hooks,
and not the trace events. Rename ftrace_event.h to trace_events.h to
represent the trace_event infrastructure and decouple the term ftrace
from it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Expose the NMI safe accessor to the monotonic raw clock to the
tracer. The mono clock was added with commit
1b3e5c0936. The advantage of the
monotonic raw clock is that it will advance more constantly than the
monotonic clock.
Imagine someone is trying to optimize a particular program to reduce
instructions executed for a given workload while minimizing the effect
on runtime. Also suppose that NTP is running and potentially making
larger adjustments to the monotonic clock. If NTP is adjusting the
monotonic clock to advance more rapidly, the program will appear to
use fewer instructions per second but run longer than if the monotonic
raw clock had been used. The total number of instructions observed
would be the same regardless of the clock source used, but how it's
attributed to time would be affected.
Conversely if NTP is adjusting the monotonic clock to advance more
slowly, the program will appear to use more instructions per second
but run more quickly. Of course there are many sources that can cause
jitter in performance measurements on modern processors, but let's
remove NTP from the list.
The monotonic raw clock can also be useful for tracing early boot,
e.g. when debugging issues with NTP.
Link: http://lkml.kernel.org/r/20150508143037.GB1276@dreric01-Precision-T1650
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Critical tracepoint hooks should never call anything that takes a lock,
so they are unable to call getrawmonotonic() or ktime_get().
Export the rest of the tracing clock functions so can be used in
tracepoint hooks.
Background: We have a customer that adds their own module and registers
a tracepoint hook to sched_wakeup. They were using ktime_get() for a
time source, but it grabs a seq lock and caused a deadlock to occur.
Link: http://lkml.kernel.org/r/1430406624-22609-1-git-send-email-jsnitsel@redhat.com
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only caller to this function (__print_array) was getting it wrong by
passing the array length instead of buffer length. As the element size
was already being passed for other reasons it seems reasonable to push
the calculation of buffer length into the function.
Link: http://lkml.kernel.org/r/1430320727-14582-1-git-send-email-alex.bennee@linaro.org
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
The first is a bug when ftrace_dump_on_oops is triggered in atomic context
and function graph tracer is the tracer that is being reported.
The second fix is bad parsing of the trace_events from the kernel
command line, where it would ignore specific events if the system
name is used with defining the event(it enables all events within the
system).
The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing
to see if the ptr was incremented to the end of the string, but the loop
increments it again and can miss the nul delimiter to stop processing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVN7g7AAoJEEjnJuOKh9ldLCkIAJj6wsj59wR8aIxtxnD5bJZZ
Y9dKkas6BOaUCGp0MVBCkpm3uMgh0uPO10jOuthgc3LgQy0piwKJrIzGkI5gcRuA
JvZw2X08/jCSNu8BHydes2A0XMkZobMuWFAeWz3CSzNBfbI3sDDqgnRQ9eyMM66R
+sPV7ZELQLK/ZFs93gFoaZe/OKGYJcamhMtG2v7p9h1qBJZUDakRFo478bAwu5SB
Kh6xpXA76WnmkeQekAAsWGdqIQzoNy3IoVePmdhVpZvoiKLUrf1JWVYFoNkO39Ik
kC1gqJro0EmHQFOo5rt8yQmqXjvSU0sS/sAW3HZ0Szl5OtiUNnag+Wt3ku7lr8U=
=VORa
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This adds three fixes for the tracing code.
The first is a bug when ftrace_dump_on_oops is triggered in atomic
context and function graph tracer is the tracer that is being
reported.
The second fix is bad parsing of the trace_events from the kernel
command line, where it would ignore specific events if the system name
is used with defining the event(it enables all events within the
system).
The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was
missing to see if the ptr was incremented to the end of the string,
but the loop increments it again and can miss the nul delimiter to
stop processing"
* tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible out of bounds memory access when parsing enums
tracing: Fix incorrect enabling of trace events by boot cmdline
tracing: Handle ftrace_dump() atomic context in graph_trace_open()
The code that replaces the enum names with the enum values in the
tracepoints' format files could possible miss the end of string nul
character. This was caused by processing things like backslashes, quotes
and other tokens. After processing the tokens, a check for the nul
character needed to be done before continuing the loop, because the loop
incremented the pointer before doing the check, which could bypass the nul
character.
Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan
Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Fixes: 0c564a538a "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a problem that trace events are not properly enabled with
boot cmdline. The problem is that if we pass "trace_event=kmem:mm_page_alloc"
to the boot cmdline, it enables all kmem trace events, and not just
the page_alloc event.
This is caused by the parsing mechanism. When we parse the cmdline, the buffer
contents is modified due to tokenization. And, if we use this buffer
again, we will get the wrong result.
Unfortunately, this buffer is be accessed three times to set trace events
properly at boot time. So, we need to handle this situation.
There is already code handling ",", but we need another for ":".
This patch adds it.
Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ added missing return ret; ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.
See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Miscellanea:
o Remove unused return value from trace_lookup_stack
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
relayfs and tracefs are dealing with inodes of their own;
those two act as filesystem drivers
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull perf changes from Ingo Molnar:
"Core kernel changes:
- One of the more interesting features in this cycle is the ability
to attach eBPF programs (user-defined, sandboxed bytecode executed
by the kernel) to kprobes.
This allows user-defined instrumentation on a live kernel image
that can never crash, hang or interfere with the kernel negatively.
(Right now it's limited to root-only, but in the future we might
allow unprivileged use as well.)
(Alexei Starovoitov)
- Another non-trivial feature is per event clockid support: this
allows, amongst other things, the selection of different clock
sources for event timestamps traced via perf.
This feature is sought by people who'd like to merge perf generated
events with external events that were measured with different
clocks:
- cluster wide profiling
- for system wide tracing with user-space events,
- JIT profiling events
etc. Matching perf tooling support is added as well, available via
the -k, --clockid <clockid> parameter to perf record et al.
(Peter Zijlstra)
Hardware enablement kernel changes:
- x86 Intel Processor Trace (PT) support: which is a hardware tracer
on steroids, available on Broadwell CPUs.
The hardware trace stream is directly output into the user-space
ring-buffer, using the 'AUX' data format extension that was added
to the perf core to support hardware constraints such as the
necessity to have the tracing buffer physically contiguous.
This patch-set was developed for two years and this is the result.
A simple way to make use of this is to use BTS tracing, the PT
driver emulates BTS output - available via the 'intel_bts' PMU.
More explicit PT specific tooling support is in the works as well -
will probably be ready by 4.2.
(Alexander Shishkin, Peter Zijlstra)
- x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
feature of Intel Xeon CPUs that allows the measurement and
allocation/partitioning of caches to individual workloads.
These kernel changes expose the measurement side as a new PMU
driver, which exposes various QoS related PMU events. (The
partitioning change is work in progress and is planned to be merged
as a cgroup extension.)
(Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
Waskiewicz Jr)
- x86 Intel Haswell LBR call stack support: this is a new Haswell
feature that allows the hardware recording of call chains, plus
tooling support. To activate this feature you have to enable it
via the new 'lbr' call-graph recording option:
perf record --call-graph lbr
perf report
or:
perf top --call-graph lbr
This hardware feature is a lot faster than stack walk or dwarf
based unwinding, but has some limitations:
- It reuses the current LBR facility, so LBR call stack and
branch record can not be enabled at the same time.
- It is only available for user-space callchains.
(Yan, Zheng)
- x86 Intel Broadwell CPU support and various event constraints and
event table fixes for earlier models.
(Andi Kleen)
- x86 Intel HT CPUs event scheduling workarounds. This is a complex
CPU bug affecting the SNB,IVB,HSW families that results in counter
value corruption. The mitigation code is automatically enabled and
is transparent.
(Maria Dimakopoulou, Stephane Eranian)
The perf tooling side had a ton of changes in this cycle as well, so
I'm only able to list the user visible changes here, in addition to
the tooling changes outlined above:
User visible changes affecting all tools:
- Improve support of compressed kernel modules (Jiri Olsa)
- Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
- Bash completion for subcommands (Yunlong Song)
- Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
- Support missing -f to override perf.data file ownership. (Yunlong Song)
- Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)
User visible changes in individual tools:
'perf data':
New tool for converting perf.data to other formats, initially
for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
Sebastian Siewior)
'perf diff':
Add --kallsyms option (David Ahern)
'perf list':
Allow listing events with 'tracepoint' prefix (Yunlong Song)
Sort the output of the command (Yunlong Song)
'perf kmem':
Respect -i option (Jiri Olsa)
Print big numbers using thousands' group (Namhyung Kim)
Allow -v option (Namhyung Kim)
Fix alignment of slab result table (Namhyung Kim)
'perf probe':
Support multiple probes on different binaries on the same command line (Masami Hiramatsu)
Support unnamed union/structure members data collection. (Masami Hiramatsu)
Check kprobes blacklist when adding new events. (Masami Hiramatsu)
'perf record':
Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)
Support recording running/enabled time (Andi Kleen)
'perf sched':
Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)
'perf report' and 'perf top':
Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)
Indicate which callchain entries are annotated in the
TUI hists browser (Arnaldo Carvalho de Melo)
Add pid/tid filtering to 'report' and 'script' commands (David Ahern)
Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
events (Arnaldo Carvalho de Melo)
'perf stat':
Report unsupported events properly (Suzuki K. Poulose)
Output running time and run/enabled ratio in CSV mode (Andi Kleen)
'perf trace':
Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)
Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)
Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)
Dump stack on segfaults (Arnaldo Carvalho de Melo)
No need to explicitely enable evsels for workload started from perf, let it
be enabled via perf_event_attr.enable_on_exec, removing some events that take
place in the 'perf trace' before a workload is really started by it.
(Arnaldo Carvalho de Melo)
Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)
There's also been a ton of infrastructure work done, such as the
split-out of perf's build system into tools/build/ and other changes -
see the shortlog and changelog for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
perf evlist: Fix type for references to data_head/tail
perf probe: Check the orphaned -x option
perf probe: Support multiple probes on different binaries
perf buildid-list: Fix segfault when show DSOs with hits
perf tools: Fix cross-endian analysis
perf tools: Fix error path to do closedir() when synthesizing threads
perf tools: Fix synthesizing fork_event.ppid for non-main thread
perf tools: Add 'I' event modifier for exclude_idle bit
perf report: Don't call map__kmap if map is NULL.
perf tests: Fix attr tests
perf probe: Fix ARM 32 building error
perf tools: Merge all perf_event_attr print functions
perf record: Add clockid parameter
perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
perf sched replay: Support using -f to override perf.data file ownership
perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
...
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that
user space tools like perf and trace-cmd can parse the binary data
and express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values
by the C pre-processor. Thus, the enum string is exported to the
format file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings
in the TP_printk() format into their number, and that is what is
shown to user space. For example, the tracepoint tlb_flush currently
has this in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVLBTuAAoJEEjnJuOKh9ldjHMIALdRS755TXCZGOf0r7O2akOR
wMPeum7C+ae1mH+jCsJKUC0/jUfQKaMt/UxoHlipDgcGg8kD2jtGnGCw4Xlwvdsr
y4rFmcTRSl1mo0zDSsg6ujoupHlVYN0+JPjrd7S3cv/llJoY49zcanNLF7S2XLeM
dZCtWRLWYpBiWO68ai6AqJTnE/eGFIqBI048qb5Eg8dbK243SSeSIf9Ywhb+VsA+
aq6F7cWI/H6j4tbeza8tAN19dcwenDro5EfCDY8ARQHJu1f6Y3+DLf2imjkd6Aiu
JVAoGIjHIpI+djwCZC1u4gi4urjfOqYartrM3Q54tb3YWYqHeNqP2ASI2a4EpYk=
=Ixwt
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Some clean ups and small fixes, but the biggest change is the addition
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that user
space tools like perf and trace-cmd can parse the binary data and
express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values by
the C pre-processor. Thus, the enum string is exported to the format
file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
the TP_printk() format into their number, and that is what is shown to
user space. For example, the tracepoint tlb_flush currently has this
in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })"
* tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
tracing: Add enum_map file to show enums that have been mapped
writeback: Export enums used by tracepoint to user space
v4l: Export enums used by tracepoints to user space
SUNRPC: Export enums in tracepoints to user space
mm: tracing: Export enums in tracepoints to user space
irq/tracing: Export enums in tracepoints to user space
f2fs: Export the enums in the tracepoints to userspace
net/9p/tracing: Export enums in tracepoints to userspace
x86/tlb/trace: Export enums in used by tlb_flush tracepoint
tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
tracing: Allow for modules to convert their enums to values
tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
tracing: Give system name a pointer
brcmsmac: Move each system tracepoints to their own header
iwlwifi: Move each system tracepoints to their own header
mac80211: Move message tracepoints to their own header
tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
tracing: Add TRACE_SYSTEM_VAR to kvm-s390
tracing: Add TRACE_SYSTEM_VAR to intel-sst
...
more than one release, as I had it ready for the 4.0 merge window, but
a last minute thing that needed to go into Linux first had to be done.
That was that perf hard coded the file system number when reading
/sys/kernel/debugfs/tracing directory making sure that the path had
the debugfs mount # before it would parse the tracing file. This broke
other use cases of perf, and the check is removed.
Now when mounting /sys/kernel/debug, tracefs is automatically mounted
in /sys/kernel/debug/tracing such that old tools will still see that
path as expected. But now system admins can mount tracefs directly
and not need to mount debugfs, which can expose security issues.
A new directory is created when tracefs is configured such that
system admins can now mount it separately (/sys/kernel/tracing).
This branch is based off of Al Viro's vfs debugfs_automount branch
at commit 163f9eb95a
debugfs: Provide a file creation function that also takes an initial size
to get the debugfs_create_automount() operation.
I just noticed that Al rebased the pull to add his Signed-off-by to
that commit, and the commit is now e59b4e9187.
I did a git diff of those two and see they are the same. Only the
latter has Al's SOB.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVLA6YAAoJEEjnJuOKh9ldv6AH/1JUINDQwV+M0VTwzbLogloo
Sco0byLhskmx5KLVD7Vs8BJAGrgHTdit32kzBGmLGJvVCKBa+c8lwmRw6rnXg3uX
K4kGp7BIyn1/geXoGpCmDKaLGXhDcw49hRzejKDg/OqFtxKTsSeQtG8fo29ps9Do
0VaF6UDp8gYplC2N2BfpB59LVndrITQ3mSsBBeFPvS7IxFJXAhDBOq2yi0aI6HyJ
ICo2L/bA9HLxMuceWrXbsun+RP68+AQlnFfAtok7AcuBzUYPCKY0shT2VMOUtpTt
1dGxMxq6q1ACfmY7gbp47WMX9aKjWcSEr0V+IYx/xex6Maf0Xsujsy99bKYUWvs=
=OcgU
-----END PGP SIGNATURE-----
Merge tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracefs from Steven Rostedt:
"This adds the new tracefs file system.
This has been in linux-next for more than one release, as I had it
ready for the 4.0 merge window, but a last minute thing that needed to
go into Linux first had to be done. That was that perf hard coded the
file system number when reading /sys/kernel/debugfs/tracing directory
making sure that the path had the debugfs mount # before it would
parse the tracing file. This broke other use cases of perf, and the
check is removed.
Now when mounting /sys/kernel/debug, tracefs is automatically mounted
in /sys/kernel/debug/tracing such that old tools will still see that
path as expected. But now system admins can mount tracefs directly
and not need to mount debugfs, which can expose security issues. A
new directory is created when tracefs is configured such that system
admins can now mount it separately (/sys/kernel/tracing)"
* tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have mkdir and rmdir be part of tracefs
tracefs: Add directory /sys/kernel/tracing
tracing: Automatically mount tracefs on debugfs/tracing
tracing: Convert the tracing facility over to use tracefs
tracefs: Add new tracefs file system
tracing: Create cmdline tracer options on tracing fs init
tracing: Only create tracer options files if directory exists
debugfs: Provide a file creation function that also takes an initial size
Add a enum_map file in the tracing directory to see what enums have been
saved to convert in the print fmt files.
As this requires the enum mapping to be persistent in memory, it is only
created if the new config option CONFIG_TRACE_ENUM_MAP_FILE is enabled.
This is for debugging and will increase the persistent memory footprint
of the kernel.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Update the infrastructure such that modules that declare TRACE_DEFINE_ENUM()
will have those enums converted into their values in the tracepoint
print fmt strings.
Link: http://lkml.kernel.org/r/87vbhjp74q.fsf@rustcorp.com.au
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Several tracepoints use the helper functions __print_symbolic() or
__print_flags() and pass in enums that do the mapping between the
binary data stored and the value to print. This works well for reading
the ASCII trace files, but when the data is read via userspace tools
such as perf and trace-cmd, the conversion of the binary value to a
human string format is lost if an enum is used, as userspace does not
have access to what the ENUM is.
For example, the tracepoint trace_tlb_flush() has:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
Which maps the enum values to the strings they represent. But perf and
trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would
not be able to map it.
With TRACE_DEFINE_ENUM(), developers can place these in the event header
files and ftrace will convert the enums to their values:
By adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
$ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format
[...]
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
The above is what userspace expects to see, and tools do not need to
be modified to parse them.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Guilherme Cox <cox@computer.org>
Cc: Tony Luck <tony.luck@gmail.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Dynamically allocated trampolines call ftrace_ops_get_func to get the
function which they should call. For dynamic fops (FTRACE_OPS_FL_DYNAMIC
flag is set) ftrace_ops_list_func is always returned. This is reasonable
for static trampolines but goes against the main advantage of dynamic
ones, that is avoidance of going through the list of all registered
callbacks for functions that are only being traced by a single callback.
We can fix it by returning ops->func (or recursion safe version) from
ftrace_ops_get_func whenever it is possible for dynamic trampolines.
Note that dynamic trampolines are not allowed for dynamic fops if
CONFIG_PREEMPT=y.
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1501291023000.25445@pobox.suse.cz
Link: http://lkml.kernel.org/r/1424357773-13536-1-git-send-email-mbenes@suse.cz
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency, it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:
In file included from kernel/trace/bpf_trace.c:14:0:
kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
unsigned int val = current->trace_recursion;
[...]
It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
more configurable, not just under CONFIG_EXPERT.
If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
gateway as well.
We might want to make this an interactive option later on, although
I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
an indicator that the user wants BPF support.
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Debugging of BPF programs needs some form of printk from the
program, so let programs call limited trace_printk() with %d %u
%x %p modifiers only.
Similar to kernel modules, during program load verifier checks
whether program is calling bpf_trace_printk() and if so, kernel
allocates trace_printk buffers and emits big 'this is debug
only' banner.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
bpf_ktime_get_ns() is used by programs to compute time delta
between events or as a timestamp
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of
tracepoints, since bpf programs can only be attached to kprobe
type of PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A clean up of the recursive protection code changed
val = this_cpu_read(current_context);
val--;
val &= this_cpu_read(current_context);
to
val = this_cpu_read(current_context);
val &= val & (val - 1);
Which has a duplicate use of '&' as the above is the same as
val = val & (val - 1);
Actually, it would be best to remove that line altogether and
just add it to where it is used.
And Christoph even mentioned that it can be further compacted to
just a single line:
__this_cpu_and(current_context, __this_cpu_read(current_context) - 1);
Link: http://lkml.kernel.org/alpine.DEB.2.11.1503271423580.23114@gentwo.org
Suggested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit that added a check for this to checkpatch says:
"Using weak declarations can have unintended link defects. The __weak on
the declaration causes non-weak definitions to become weak."
In this case, when a PowerPC kernel is built with CONFIG_KPROBE_EVENT
but not CONFIG_UPROBE_EVENT, it generates the following warning:
WARNING: 1 bad relocations
c0000000014f2190 R_PPC64_ADDR64 uprobes_fetch_type_table
This is fixed by passing the fetch_table arrays to
traceprobe_parse_probe_arg() which also means that they can never be NULL.
Link: http://lkml.kernel.org/r/20150312165834.4482cb48@canb.auug.org.au
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TRACE_EVENT_FL_USE_CALL_FILTER flag in ftrace:functon event can be
removed. This flag was first introduced in commit
f306cc82a9 ("tracing: Update event filters for multibuffer").
Now, the only place uses this flag is ftrace:function, but the filter of
ftrace:function has a different code path with events/syscalls and
events/tracepoints. It uses ftrace_filter_write() and perf's
ftrace_profile_set_filter() to set the filter, the functionality of file
'tracing/events/ftrace/function/filter' is bypassed in function
init_pred(), in which case, neither call->filter nor file->filter is
used.
So we can safely remove TRACE_EVENT_FL_USE_CALL_FILTER flag from
ftrace:function events.
Link: http://lkml.kernel.org/r/1425367294-27852-1-git-send-email-hekuang@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It has come to my attention that this_cpu_read/write are horrible on
architectures other than x86. Worse yet, they actually disable
preemption or interrupts! This caused some unexpected tracing results
on ARM.
101.356868: preempt_count_add <-ring_buffer_lock_reserve
101.356870: preempt_count_sub <-ring_buffer_lock_reserve
The ring_buffer_lock_reserve has recursion protection that requires
accessing a per cpu variable. But since preempt_disable() is traced, it
too got traced while accessing the variable that is suppose to prevent
recursion like this.
The generic version of this_cpu_read() and write() are:
#define this_cpu_generic_read(pcp) \
({ typeof(pcp) ret__; \
preempt_disable(); \
ret__ = *this_cpu_ptr(&(pcp)); \
preempt_enable(); \
ret__; \
})
#define this_cpu_generic_to_op(pcp, val, op) \
do { \
unsigned long flags; \
raw_local_irq_save(flags); \
*__this_cpu_ptr(&(pcp)) op val; \
raw_local_irq_restore(flags); \
} while (0)
Which is unacceptable for locations that know they are within preempt
disabled or interrupt disabled locations.
Paul McKenney stated that __this_cpu_() versions produce much better code on
other architectures than this_cpu_() does, if we know that the call is done in
a preempt disabled location.
I also changed the recursive_unlock() to use two local variables instead
of accessing the per_cpu variable twice.
Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
Cc: stable@vger.kernel.org
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some archs (specifically PowerPC), are sensitive with the ordering of
the enabling of the calls to function tracing and setting of the
function to use to be traced.
That is, update_ftrace_function() sets what function the ftrace_caller
trampoline should call. Some archs require this to be set before
calling ftrace_run_update_code().
Another bug was discovered, that ftrace_startup_sysctl() called
ftrace_run_update_code() directly. If the function the ftrace_caller
trampoline changes, then it will not be updated. Instead a call
to ftrace_startup_enable() should be called because it tests to see
if the callback changed since the code was disabled, and will
tell the arch to update appropriately. Most archs do not need this
notification, but PowerPC does.
The problem could be seen by the following commands:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# cat /sys/kernel/debug/tracing/trace
The trace will show that function tracing was not active.
Cc: stable@vger.kernel.org # 2.6.27+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When ftrace is enabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
ftrace is disabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
Consider the following situation.
# echo 0 > /proc/sys/kernel/ftrace_enabled
After this ftrace_enabled = 0.
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
called.
# echo 1 > /proc/sys/kernel/ftrace_enabled
Now ftrace_enabled will be set to true, but still
ftrace_enable_ftrace_graph_caller() will not be called, which is not
desired.
Further if we execute the following after this:
# echo nop > /sys/kernel/debug/tracing/current_tracer
Now since ftrace_enabled is set it will call
ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
the ARM platform.
On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
it checks whether the old instruction is a nop or not. If it's not a nop,
then it returns an error. If it is a nop then it replaces instruction at
that address with a branch to ftrace_graph_caller.
ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
then it will return an error, which will cause the generic ftrace code to
raise a warning.
Note, x86 does not have an issue with this because the architecture
specific code for ftrace_enable_ftrace_graph_caller() and
ftrace_disable_ftrace_graph_caller() does not check the previous state,
and calling either of these functions twice in a row has no ill effect.
Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Pratyush Anand <panand@redhat.com>
[
removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
if CONFIG_FUNCTION_GRAPH_TRACER is not set.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When /proc/sys/kernel/ftrace_enabled is set to zero, all function
tracing is disabled. But the records that represent the functions
still hold information about the ftrace_ops that are hooked to them.
ftrace_ops may request "REGS" (have a full set of pt_regs passed to
the callback), or "TRAMP" (the ops has its own trampoline to use).
When the record is updated to represent the state of the ops hooked
to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
points to the correct trampoline (REGS has its own trampoline).
When ftrace_enabled is set to zero, all ftrace locations are a nop,
so they do not point to any trampoline. But the _EN flags are still
set. This can cause the accounting to go wrong when ftrace_enabled
is cleared and an ops that has a trampoline is registered or unregistered.
For example, the following will cause ftrace to crash:
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As function_graph uses a trampoline, when ftrace_enabled is set to zero
the updates to the record are not done. When enabling function_graph
again, the record will still have the TRAMP_EN flag set, and it will
look for an op that has a trampoline other than the function_graph
ops, and fail to find one.
Cc: stable@vger.kernel.org # 3.17+
Reported-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways to
make trace events. Lots of features have been added since the sample
code was made, and these features are mostly unknown. Developers
have been making their own hacks to do things that are already available.
o Performance improvements. Most notably, I found a performance bug where
a waiter that is waiting for a full page from the ring buffer will
see that a full page is not available, and go to sleep. The sched
event caused by it going to sleep would cause it to wake up again.
It would see that there was still not a full page, and go back to sleep
again, and that would wake it up again, until finally it would see a
full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3M+GAAoJEEjnJuOKh9ldpWQIAJTUzeVXlU0cf3bVn768VW7e
XS41WHF34l1tNevmKTh6fCPiw8+U0UMGLQt5WKtyaaARsZn2MlefLVuvHPKFlK2w
+qcI4OEVHH97Qgf9HWJSsYgnZaOnOE+TENqnokEgXMimRMuVcd/S4QaGxwJVDcjm
iBF5j2TaG4aGbx4a3J7KueoZ3K+39r3ut15hIGi/IZBZldQ1pt26ytafD/KA3CU3
BLRM2HLttAMsV1ds0EDLgZjSGICVetFcdOmI5Gwj7Qr3KrOTRPYJMNc8NdDL7Js9
v8VhujhFGvcCrhO/IKpVvd9yluz3RCF+Z7ihc+D/+1B3Nsm0PTwN3Fl5J+f89AA=
=u2Mm
-----END PGP SIGNATURE-----
Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The updates included in this pull request for ftrace are:
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways
to make trace events. Lots of features have been added since the
sample code was made, and these features are mostly unknown.
Developers have been making their own hacks to do things that are
already available.
o Performance improvements. Most notably, I found a performance bug
where a waiter that is waiting for a full page from the ring buffer
will see that a full page is not available, and go to sleep. The
sched event caused by it going to sleep would cause it to wake up
again. It would see that there was still not a full page, and go
back to sleep again, and that would wake it up again, until finally
it would see a full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths"
* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not wake up a splice waiter when page is not full
tracing: Fix unmapping loop in tracing_mark_write
tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
tracing: Add TRACE_EVENT_FN example
tracing: Add TRACE_EVENT_CONDITION sample
tracing: Update the TRACE_EVENT fields available in the sample code
tracing: Separate out initializing top level dir from instances
tracing: Make tracing_init_dentry_tr() static
trace: Use 64-bit timekeeping
tracing: Add array printing helper
tracing: Remove newline from trace_printk warning banner
tracing: Use IS_ERR() check for return value of tracing_init_dentry()
tracing: Remove unneeded includes of debugfs.h and fs.h
tracing: Remove taking of trace_types_lock in pipe files
tracing: Add ref count to tracer for when they are being read by pipe
Pull s390 updates from Martin Schwidefsky:
- The remaining patches for the z13 machine support: kernel build
option for z13, the cache synonym avoidance, SMT support,
compare-and-delay for spinloops and the CES5S crypto adapater.
- The ftrace support for function tracing with the gcc hotpatch option.
This touches common code Makefiles, Steven is ok with the changes.
- The hypfs file system gets an extension to access diagnose 0x0c data
in user space for performance analysis for Linux running under z/VM.
- The iucv hvc console gets wildcard spport for the user id filtering.
- The cacheinfo code is converted to use the generic infrastructure.
- Cleanup and bug fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/process: free vx save area when releasing tasks
s390/hypfs: Eliminate hypfs interval
s390/hypfs: Add diagnose 0c support
s390/cacheinfo: don't use smp_processor_id() in preemptible context
s390/zcrypt: fixed domain scanning problem (again)
s390/smp: increase maximum value of NR_CPUS to 512
s390/jump label: use different nop instruction
s390/jump label: add sanity checks
s390/mm: correct missing space when reporting user process faults
s390/dasd: cleanup profiling
s390/dasd: add locking for global_profile access
s390/ftrace: hotpatch support for function tracing
ftrace: let notrace function attribute disable hotpatching if necessary
ftrace: allow architectures to specify ftrace compile options
s390: reintroduce diag 44 calls for cpu_relax()
s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
s390/zcrypt: Number of supported ap domains is not retrievable.
s390/spinlock: add compare-and-delay to lock wait loops
s390/tape: remove redundant if statement
s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
...
When an application connects to the ring buffer via splice, it can only
read full pages. Splice does not work with partial pages. If there is
not enough data to fill a page, the splice command will either block
or return -EAGAIN (if set to nonblock).
Code was added where if the page is not full, to just sleep again.
The problem is, it will get woken up again on the next event. That
is, when something is written into the ring buffer, if there is a waiter
it will wake it up. The waiter would then check the buffer, see that
it still does not have enough data to fill a page and go back to sleep.
To make matters worse, when the waiter goes back to sleep, it could
cause another event, which would wake it back up again to see it
doesn't have enough data and sleep again. This produces a tremendous
overhead and fills the ring buffer with noise.
For example, recording sched_switch on an idle system for 10 seconds
produces 25,350,475 events!!!
Create another wait queue for those waiters wanting full pages.
When an event is written, it only wakes up waiters if there's a full
page of data. It does not wake up the waiter if the page is not yet
full.
After this change, recording sched_switch on an idle system for 10
seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9969% of events!!), is something to take seriously.
Cc: stable@vger.kernel.org # 3.16+
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad2 "tracing: Do not busy wait in buffer splice"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Rework of the core ACPI resources parsing code to fix issues
in it and make using resource offsets more convenient and
consolidation of some resource-handing code in a couple of places
that have grown analagous data structures and code to cover the
the same gap in the core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus,
Jarkko Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states
to make the code more straightforward and less bloated overall
(Rafael J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki,
Yaowei Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in
the right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist,
Pavel Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJU2neOAAoJEILEb/54YlRx51QP/jrv1Wb5eMaemzMksPIWI5Zn
I8IbxzToxu7wDDsrTBRv+LuyllMPrnppFOHHvB35gUYu7Y6I066s3ErwuqeFlbmy
+VicmyGMahv3yN74qg49MXzWtaJZa8hrFXn8ItujiUIcs08yELi0vBQFlZImIbTB
PdQngO88VfiOVjDvmKkYUU//9Sc9LCU0ZcdUQXSnA1oNOxuUHjiARz98R03hhSqu
BWR+7M0uaFbu6XeK+BExMXJTpKicIBZ1GAF6hWrS8V4aYg+hH1cwjf2neDAzZkcU
UkXieJlLJrCq+ZBNcy7WEhkWQkqJNWei5WYiy6eoQeQpNoliY2V+2OtSMJaKqDye
PIiMwXstyDc5rgyULN0d1UUzY6mbcUt2rOL0VN2bsFVIJ1HWCq8mr8qq689pQUYv
tcH18VQ2/6r2zW28sTO/ByWLYomklD/Y6bw2onMhGx3Knl0D8xYJKapVnTGhr5eY
d4k41ybHSWNKfXsZxdJc+RxndhPwj9rFLfvY/CZEhLcW+2pAiMarRDOPXDoUI7/l
aJpmPzy/6mPXGBnTfr6jKDSY3gXNazRIvfPbAdiGayKcHcdRM4glbSbNH0/h1Iq6
HKa8v9Fx87k1X5r4ZbhiPdABWlxuKDiM7725rfGpvjlWC3GNFOq7YTVMOuuBA225
Mu9PRZbOsZsnyNkixBpX
=zZER
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"We have a few new features this time, including a new SFI-based
cpufreq driver, a new devfreq driver for Tegra Activity Monitor, a new
devfreq class for providing its governors with raw utilization data
and a new ACPI driver for AMD SoCs.
Still, the majority of changes here are reworks of existing code to
make it more straightforward or to prepare it for implementing new
features on top of it. The primary example is the rework of ACPI
resources handling from Jiang Liu, Thomas Gleixner and Lv Zheng with
support for IOAPIC hotplug implemented on top of it, but there is
quite a number of changes of this kind in the cpufreq core, ACPICA,
ACPI EC driver, ACPI processor driver and the generic power domains
core code too.
The most active developer is Viresh Kumar with his cpufreq changes.
Specifics:
- Rework of the core ACPI resources parsing code to fix issues in it
and make using resource offsets more convenient and consolidation
of some resource-handing code in a couple of places that have grown
analagous data structures and code to cover the the same gap in the
core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus, Jarkko
Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states to
make the code more straightforward and less bloated overall (Rafael
J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki, Yaowei
Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in the
right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist, Pavel
Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan)"
* tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (151 commits)
tools/power turbostat: relax dependency on APERF_MSR
tools/power turbostat: relax dependency on invariant TSC
Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources
tools/power turbostat: decode MSR_*_PERF_LIMIT_REASONS
tools/power turbostat: relax dependency on root permission
ACPI / video: Add disable_native_backlight quirk for Samsung 510R
ACPI / PM: Remove unneeded nested #ifdef
USB / PM: Remove unneeded #ifdef and associated dead code
intel_pstate: provide option to only use intel_pstate with HWP
ACPI / EC: Add GPE reference counting debugging messages
ACPI / EC: Add query flushing support
ACPI / EC: Refine command storm prevention support
ACPI / EC: Add command flushing support.
ACPI / EC: Introduce STARTED/STOPPED flags to replace BLOCKED flag
ACPI: add AMD ACPI2Platform device support for x86 system
ACPI / table: remove duplicate NULL check for the handler of acpi_table_parse()
ACPI / EC: Update revision due to raw handler mode.
ACPI / EC: Reduce ec_poll() by referencing the last register access timestamp.
ACPI / EC: Fix several GPE handling issues by deploying ACPI_GPE_DISPATCH_RAW_HANDLER mode.
ACPICA: Events: Enable APIs to allow interrupt/polling adaptive request based GPE handling model
...
Commit 6edb2a8a38 introduced
an array map_pages that contains the addresses returned by
kmap_atomic. However, when unmapping those pages, map_pages[0]
is unmapped before map_pages[1], breaking the nesting requirement
as specified in the documentation for kmap_atomic/kunmap_atomic.
This was caught by the highmem debug code present in kunmap_atomic.
Fix the loop to do the unmapping properly.
Link: http://lkml.kernel.org/r/1418871056-6614-1-git-send-email-markivx@codeaurora.org
Cc: stable@vger.kernel.org # 3.5+
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Reported-by: Lime Yang <limey@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing "instances" directory can create sub tracing buffers
with mkdir, and remove them with rmdir. As a mkdir will also create
all the files and directories that control the sub buffer the inode
mutexes need to be released before this is done, to avoid deadlocks.
It is better to let the tracing system unlock the inode mutexes before
calling the functions that create the files within the new directory
(or deletes the files from the one being destroyed).
Now that tracing has been converted over to tracefs, the tracefs file
system can be modified to accommodate this feature. It still releases
the locks, but the filesystem itself can take care of the ugly
business and let the user just do what it needs.
The tracing system now attaches a descriptor to the directory dentry
that can have userspace create or remove sub directories. If this
descriptor does not exist for a dentry, then that dentry can not be
used to create other directories. This descriptor holds a mkdir and
rmdir method that only takes a character string as an argument.
The tracefs file system will first make a copy of the dentry name
before releasing the locks. Then it will pass the copied name to the
methods. It is up to the tracing system that supplied the methods to
handle races with duplicate names and such as all the inode mutexes
would be released when the functions are called.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As tools currently rely on the tracing directory in debugfs, we can not
just created a tracefs infrastructure and expect sysadmins to mount
the new tracefs to have their old tools work.
Instead, the debugfs tracing directory is still created and the tracefs
file system is mounted there when the debugfs filesystem is mounted.
No longer does the tracing infrastructure update the debugfs file system,
but instead interacts with the tracefs file system. But now, it still
appears to the user like nothing changed, except you also have the feature
of mounting just the tracing system without needing all of debugfs!
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
debugfs was fine for the tracing facility as a quick way to get
an interface. Now that tracing has matured, it should separate itself
from debugfs such that it can be mounted separately without needing
to mount all of debugfs with it. That is, users resist using tracing
because it requires mounting debugfs. Having tracing have its own file
system lets users get the features of tracing without needing to bring
in the rest of the kernel's debug infrastructure.
Another reason for tracefs is that debubfs does not support mkdir.
Currently, to create instances, one does a mkdir in the tracing/instance
directory. This is implemented via a hack that forces debugfs to do
something it is not intended on doing. By converting over to tracefs, this
hack can be removed and mkdir can be properly implemented. This patch does
not address this yet, but it lays the ground work for that to be done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The options for cmdline tracers are not created if the debugfs system
is not ready yet. If tracing has started before debugfs is up, then the
option files for the tracer are not created. Create them when creating
the tracing directory if the current tracer requires option files.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not bother creating tracer options if no tracing directory
exists. If a tracer is enabled via the command line, and is
started before the tracing directory is created, then it wont have
its tracer specific options created.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull in Al Viro's changes to debugfs that implement the new primitive:
debugfs_create_automount(), that creates a directory in debugfs that will
safely mount another file system automatically when debugfs is mounted.
This will let tracefs automount itself on top of debugfs/tracing directory.
The top level trace array is treated a little different than the
instances, as it has to deal with more of the general tracing.
The tr->dir is the tracing directory, which is an immutable
dentry, where as the tr->dir of instances are the dentry that
was created, and can be destroyed later. These should have different
functions accessing them.
As only tracing_init_dentry() deals with the top level array, fold
the code for it into that function, and remove the trace_init_dentry_tr()
that was also used by the instances to get their directory dentry.
Add a tracing_get_dentry() to just get the tracing dir entry for
instances as well as the top level array.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Export the suspend_resume tracepoint so it can be used
in loadable modules.
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.
This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The ring_buffer_producer uses 'struct timeval' to measure
its start and end times. 'struct timeval' on 32-bit systems
will have its tv_sec value overflow in year 2038 and beyond.
This patch replaces struct timeval with 'ktime_t' which uses
64-bit representation for nanoseconds.
Link: http://lkml.kernel.org/r/20150128141611.GA2701@tinar
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a trace event contains an array, there is currently no standard
way to format this for text output. Drivers are currently hacking
around this by a) local hacks that use the trace_seq functionailty
directly, or b) just not printing that information. For fixed size
arrays, formatting of the elements can be open-coded, but this gets
cumbersome for arrays of non-trivial size.
These approaches result in non-standard content of the event format
description delivered to userspace, so userland tools needs to be
taught to understand and parse each array printing method
individually.
This patch implements a __print_array() helper that tracepoint
implementations can use instead of reinventing it. A simple C-style
syntax is used to delimit the array and its elements {like,this}.
So that the helpers can be used with large static arrays as well as
dynamic arrays, they take a pointer and element count: they can be
used with __get_dynamic_array() for use with dynamic arrays.
Link: http://lkml.kernel.org/r/1422449335-8289-2-git-send-email-javi.merino@arm.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove the output-confusing newline below:
[ 0.191328]
**********************************************************
[ 0.191493] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.191586] ** **
...
Link: http://lkml.kernel.org/r/1422375440-31970-1-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
[ added an extra '\n' by itself, to keep what it was suppose to do ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_init_dentry() will soon return NULL as a valid pointer for the
top level tracing directroy. NULL can not be used as an error value.
Instead, switch to ERR_PTR() and check the return status with
IS_ERR().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The creation of tracing files and directories is for the most part
encapsulated in helper functions in trace.c. Other files do not need to
include debugfs.h or fs.h, as they may have needed to in the past.
Remove them from the files that do not need them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
the mixture of function graph tracing and kprobes.
When jprobes and function graph tracing is enabled at the same time
it will crash the system.
# modprobe jprobe_example
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
After the first fork (jprobe_example probes it), the system will crash.
This is due to the way jprobes copies the stack frame and does not
do a normal function return. This messes up with the function graph
tracing accounting which hijacks the return address from the stack
and replaces it with a hook function. It saves the return addresses in
a separate stack to put back the correct return address when done.
But because the jprobe functions do not do a normal return, their
stack addresses are not put back until the function they probe is called,
which means that the probed function will get the return address of
the jprobe handler instead of its own.
The simple fix here was to disable function graph tracing while the
jprobe handler is being called.
While debugging this I found two minor bugs with the function graph
tracing.
The first was about the function graph tracer sharing its function hash
with the function tracer (they both get filtered by the same input).
The changing of the set_ftrace_filter would not sync the function recording
records after a change if the function tracer was disabled but the
function graph tracer was enabled. This was due to the update only checking
one of the ops instead of the shared ops to see if they were enabled and
should perform the sync. This caused the ftrace accounting to break and
a ftrace_bug() would be triggered, disabling ftrace until a reboot.
The second was that the check to update records only checked one of the
filter hashes. It needs to test both the "filter" and "notrace" hashes.
The "filter" hash determines what functions to trace where as the "notrace"
hash determines what functions not to trace (trace all but these).
Both hashes need to be passed to the update code to find out what change
is being done during the update. This also broke the ftrace record
accounting and triggered a ftrace_bug().
This patch set also include two more fixes that were reported separately
from the kprobe issue.
One was that init_ftrace_syscalls() was called twice at boot up.
This is not a major bug, but that call performed a rather large kmalloc
(NR_syscalls * sizeof(*syscalls_metadata)). The second call made the first
one a memory leak, and wastes memory.
The other fix is a regression caused by an update in the v3.19 merge window.
The moving to enable events early, moved the enabling before PID 1 was
created. The syscall events require setting the TIF_SYSCALL_TRACEPOINT
for all tasks. But for_each_process_thread() does not include the swapper
task (PID 0), and ended up being a nop. A suggested fix was to add
the init_task() to have its flag set, but I didn't really want to mess
with PID 0 for this minor bug. Instead I disable and re-enable events again
at early_initcall() where it use to be enabled. This also handles any other
event that might have its own reg function that could break at early
boot up.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUt9vmAAoJEEjnJuOKh9ldLHEIAJ9XrPW2xMIY5yI69jT1F7pv
PkSRqENnOK0l4UulD52SvIBecQTTBcEEjao4yVGkc7DCJBOws/1LZ5gW8OfNlKjq
rMB8yaosL1tXJ1ARVPMjcQVy+228zkgTXznwEZCjku1g7LuScQ28qyXsXO7B6yiK
xKoHqKjygmM/a2aVn+8tdiVKiDp6jdmkbYicbaFT4xP7XB5DaMmIiXRHxdvW6xdR
azKrVfYiMyJqTZNt/EVSWUk2WjeaYhoXyNtvgPx515wTo/llCnzhjcsocXBtH2P/
YOtwl+1L7Z89ukV9oXqrtrUJZ6Ps7+g7I1flJuL7/1FlNGnklcP9JojD+t6HeT8=
=vkec
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fixes from Steven Rostedt:
"This holds a few fixes to the ftrace infrastructure as well as the
mixture of function graph tracing and kprobes.
When jprobes and function graph tracing is enabled at the same time it
will crash the system:
# modprobe jprobe_example
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
After the first fork (jprobe_example probes it), the system will
crash.
This is due to the way jprobes copies the stack frame and does not do
a normal function return. This messes up with the function graph
tracing accounting which hijacks the return address from the stack and
replaces it with a hook function. It saves the return addresses in a
separate stack to put back the correct return address when done. But
because the jprobe functions do not do a normal return, their stack
addresses are not put back until the function they probe is called,
which means that the probed function will get the return address of
the jprobe handler instead of its own.
The simple fix here was to disable function graph tracing while the
jprobe handler is being called.
While debugging this I found two minor bugs with the function graph
tracing.
The first was about the function graph tracer sharing its function
hash with the function tracer (they both get filtered by the same
input). The changing of the set_ftrace_filter would not sync the
function recording records after a change if the function tracer was
disabled but the function graph tracer was enabled. This was due to
the update only checking one of the ops instead of the shared ops to
see if they were enabled and should perform the sync. This caused the
ftrace accounting to break and a ftrace_bug() would be triggered,
disabling ftrace until a reboot.
The second was that the check to update records only checked one of
the filter hashes. It needs to test both the "filter" and "notrace"
hashes. The "filter" hash determines what functions to trace where as
the "notrace" hash determines what functions not to trace (trace all
but these). Both hashes need to be passed to the update code to find
out what change is being done during the update. This also broke the
ftrace record accounting and triggered a ftrace_bug().
This patch set also include two more fixes that were reported
separately from the kprobe issue.
One was that init_ftrace_syscalls() was called twice at boot up. This
is not a major bug, but that call performed a rather large kmalloc
(NR_syscalls * sizeof(*syscalls_metadata)). The second call made the
first one a memory leak, and wastes memory.
The other fix is a regression caused by an update in the v3.19 merge
window. The moving to enable events early, moved the enabling before
PID 1 was created. The syscall events require setting the
TIF_SYSCALL_TRACEPOINT for all tasks. But for_each_process_thread()
does not include the swapper task (PID 0), and ended up being a nop.
A suggested fix was to add the init_task() to have its flag set, but I
didn't really want to mess with PID 0 for this minor bug. Instead I
disable and re-enable events again at early_initcall() where it use to
be enabled. This also handles any other event that might have its own
reg function that could break at early boot up"
* tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix enabling of syscall events on the command line
tracing: Remove extra call to init_ftrace_syscalls()
ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
ftrace: Check both notrace and filter for old hash
ftrace: Fix updating of filters for shared global_ops filters
Commit 5f893b2639 "tracing: Move enabling tracepoints to just after
rcu_init()" broke the enabling of system call events from the command
line. The reason was that the enabling of command line trace events
was moved before PID 1 started, and the syscall tracepoints require
that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the
swapper task (pid 0) is not part of that. Since the swapper task is the
only task that is running at this early in boot, no task gets the
flag set, and the tracepoint never gets reached.
Instead of setting the swapper task flag (there should be no reason to
do that), re-enabled trace events again after the init thread (PID 1)
has been started. It requires disabling all command line events and
re-enabling them, as just enabling them again will not reset the logic
to set the TIF_SYSCALL_TRACEPOINT flag, as the syscall tracepoint will
be fooled into thinking that it was already set, and wont try setting
it again. For this reason, we must first disable it and re-enable it.
Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_init() calls init_ftrace_syscalls() and then calls trace_event_init()
which also calls init_ftrace_syscalls(). It makes more sense to only
call it from trace_event_init().
Calling it twice wastes memory, as it allocates the syscall events twice,
and loses the first copy of it.
Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com
Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org
Reported-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using just the filter for checking for trampolines or regs is not enough
when updating the code against the records that represent all functions.
Both the filter hash and the notrace hash need to be checked.
To trigger this bug (using trace-cmd and perf):
# perf probe -a do_fork
# trace-cmd start -B foo -e probe
# trace-cmd record -p function_graph -n do_fork sleep 1
The trace-cmd record at the end clears the filter before it disables
function_graph tracing and then that causes the accounting of the
ftrace function records to become incorrect and causes ftrace to bug.
Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org
Cc: stable@vger.kernel.org
[ still need to switch old_hash_ops to old_ops_hash ]
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the set_ftrace_filter affects both the function tracer as well as the
function graph tracer, the ops that represent each have a shared
ftrace_ops_hash structure. This allows both to be updated when the filter
files are updated.
But if function graph is enabled and the global_ops (function tracing) ops
is not, then it is possible that the filter could be changed without the
update happening for the function graph ops. This will cause the changes
to not take place and may even cause a ftrace_bug to occur as it could mess
with the trampoline accounting.
The solution is to check if the ops uses the shared global_ops filter and
if the ops itself is not enabled, to check if there's another ops that is
enabled and also shares the global_ops filter. In that case, the
modification still needs to be executed.
Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org
Cc: stable@vger.kernel.org # 3.17+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.
The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.
Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.
This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.
Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.
We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).
One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cleanups
kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
kgdb/kdb: Allow access on a single core, if a CPU round up is deemed
impossible, which will allow inspection of the now "trashed" kernel
kdb: Add enable mask for the command groups
kdb: access controls to restrict sensitive commands
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUrq8WAAoJEIciOldedpOj+C8P/AjSUVBZdBLWzCU2VG150sQ0
UacwFVLve9heoColHBF7VqIDCRkZokIKJmCbHUBPZTbs22auLRpNI+D6CY5lZD17
jEHxrkKY4ragRRc/W3Y1MSc3aeGnS0i5AR8PJermMWxyUBfN3FBxgFHzTaLB2ZTT
8A+tvmwiG4mHue52gSiYZPCl/52WWOh+NjDe7T9OZ+mNmQKwZ5ssQZmmyUkxrs3b
LKXVXVtTUXxfEgB2x+lYTYAztcTsM5h+NbkT74FpSmwPjvU/p81Ptqveh+3JTdmX
H+Jz/SqD1/NfxC1Eenh5Mc++p/UVxeRbBulV9jwqjOyJqDjw3qHs1cjm8tZZj1qG
J3LODKi3GWhujMCfwdu5EJRnrFxgHCPiWInc2708oLbRi5SyOe6P6hNQ3K3Y4JtF
VkYa62wSaI0fDNQUFRc3bXUOUdMOCXjuzw3BtTi93tcUNcQwCXuYCmWtVvBgmK1h
LTrFCJmzbopiwpomxCwZ4BQm8id9HxP5pod95ypYb8K5aheXHCuSgibqj0nswWMm
ix0YTd4UNTn79r6p4d0fXFjOOYpXZA80ojeVI27D9zW7dBYc5CGVA1IDNH0ZfiPo
qySPUNUMXIjiTSOGZdUehByEC7tliLZczelRPnNh/9fmhJkJ745S7zs3DNQ7Ypg4
xDKthlRGNjn6cXOPl7gX
=cf1c
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb/kdb fixes from Jason Wessel:
"These have been around since 3.17 and in kgdb-next for the last 9
weeks and some will go back to -stable.
Summary of changes:
Cleanups
- kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
- kgdb/kdb: Allow access on a single core, if a CPU round up is
deemed impossible, which will allow inspection of the now "trashed"
kernel
- kdb: Add enable mask for the command groups
- kdb: access controls to restrict sensitive commands"
* tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kernel/debug/debug_core.c: Logging clean-up
kgdb: timeout if secondary CPUs ignore the roundup
kdb: Allow access to sensitive commands to be restricted by default
kdb: Add enable mask for groups of commands
kdb: Categorize kdb commands (similar to SysRq categorization)
kdb: Remove KDB_REPEAT_NONE flag
kdb: Use KDB_REPEAT_* values as flags
kdb: Rename kdb_register_repeat() to kdb_register_flags()
kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
kdb: Remove currently unused kdbtab_t->cmd_flags
Taking the global mutex "trace_types_lock" in the trace_pipe files
causes a bottle neck as most the pipe files can be read per cpu
and there's no reason to serialize them.
The current_trace variable was given a ref count and it can not
change when the ref count is not zero. Opening the trace_pipe
files will up the ref count (and decremented on close), so that
the lock no longer needs to be taken when accessing the
current_trace variable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When one of the trace pipe files are being read (by either the trace_pipe
or trace_pipe_raw), do not allow the current_trace to change. By adding
a ref count that is incremented when the pipe files are opened, will
prevent the current_trace from being changed.
This will allow for the removal of the global trace_types_lock from
reading the pipe buffers (which is currently a bottle neck).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by
the driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken
into account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should
have used from the start and drop some other unuseful KERN_ERR
messages printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool
(Prarit Bhargava).
- Fix two regressions introduced by recent commits in the OPP
library and clean up some existing minor issues in that code
(Viresh Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout
the tree (or drop it where that can be done) in order to make
it possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki,
Ulf Hansson, Ludovic Desroches). There will be one more
"CONFIG_PM_RUNTIME removal" batch after this one, because some
new uses of it have been introduced during the current merge
window, but that should be sufficient to finally get rid of it.
- Make the ACPI EC driver more robust against race conditions
related to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to
disable GPEs that it has not enabled which confuses ACPICA
and makes it report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver
to make it possible to override the blacklisting of some
systems in that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS
entry for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces
witn names that reflect the IDs of the ACPI device objects
they are associated with, except for "generic" ACPI fans
(PNP ID "PNP0C0B"). That's necessary for user space thermal
management tools to be able to connect the fans with the
parts of the system they are supposed to be cooling properly.
From Srinivas Pandruvada.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUk0IDAAoJEILEb/54YlRx7fgP/3+yF/0TnEW93j2ALDAQFiLF
tSv2A2vQC8vtMJjjWx0z/HqPh86gfaReEFZmUJD/Q/e2LXEnxNZJ+QMjcekPVkDM
mTvcIMc2MR8vOA/oMkgxeaKregrrx7RkCfojd+NWZhVukkjl+mvBHgAnYjXRL+NZ
unDWGlbHG97vq/3kGjPYhDS00nxHblw8NHFBu5HL5RxwABdWoeZJITwqxXWyuPLw
nlqNWlOxmwvtSbw2VMKz0uof1nFHyQLykYsMG0ZsyayCRdWUZYkEqmE7GGpCLkLu
D6yfmlpen6ccIOsEAae0eXBt50IFY9Tihk5lovx1mZmci2SNRg29BqMI105wIn0u
8b8Ej7MNHp7yMxRpB5WfU90p/y7ioJns9guFZxY0CKaRnrI2+BLt3RscMi3MPI06
Cu2/WkSSa09fhDPA+pk+VDYsmWgyVawigesNmMP5/cvYO/yYywVRjOuO1k77qQGp
4dSpFYEHfpxinejZnVZOk2V9MkvSLoSMux6wPV0xM0IE1iD0ulVpHjTJrwp80ph4
+bfUFVr/vrD1y7EKbf1PD363ZKvJhWhvQWDgETsM1vgLf21PfWO7C2kflIAsWsdQ
1ukD5nCBRlP4K73hG7bdM6kRztXhUdR0SHg85/t0KB/ExiVqtcXIzB60D0G1lENd
QlKbq3O4lim1WGuhazQY
=5fo2
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are regression fixes (leds-gpio, ACPI backlight driver,
operating performance points library, ACPI device enumeration
messages, cpupower tool), other bug fixes (ACPI EC driver, ACPI device
PM), some cleanups in the operating performance points (OPP)
framework, continuation of CONFIG_PM_RUNTIME elimination, a couple of
minor intel_pstate driver changes, a new MAINTAINERS entry for it and
an ACPI fan driver change needed for better support of thermal
management in user space.
Specifics:
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by the
driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken into
account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should have
used from the start and drop some other unuseful KERN_ERR messages
printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool (Prarit
Bhargava).
- Fix two regressions introduced by recent commits in the OPP library
and clean up some existing minor issues in that code (Viresh
Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout the
tree (or drop it where that can be done) in order to make it
possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki, Ulf
Hansson, Ludovic Desroches).
There will be one more "CONFIG_PM_RUNTIME removal" batch after this
one, because some new uses of it have been introduced during the
current merge window, but that should be sufficient to finally get
rid of it.
- Make the ACPI EC driver more robust against race conditions related
to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to disable
GPEs that it has not enabled which confuses ACPICA and makes it
report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver to
make it possible to override the blacklisting of some systems in
that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS entry
for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces witn
names that reflect the IDs of the ACPI device objects they are
associated with, except for "generic" ACPI fans (PNP ID "PNP0C0B").
That's necessary for user space thermal management tools to be able
to connect the fans with the parts of the system they are supposed
to be cooling properly. From Srinivas Pandruvada"
* tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
MAINTAINERS: add entry for intel_pstate
ACPI / video: update the skip case for acpi_video_device_in_dod()
power / PM: Eliminate CONFIG_PM_RUNTIME
NFC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
SCSI / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / EC: Fix unexpected ec_remove_handlers() invocations
Revert "tools: cpupower: fix return checks for sysfs_get_idlestate_count()"
tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
x86 / PM: Replace CONFIG_PM_RUNTIME in io_apic.c
PM: Remove the SET_PM_RUNTIME_PM_OPS() macro
mmc: atmel-mci: use SET_RUNTIME_PM_OPS() macro
PM / Kconfig: Replace PM_RUNTIME with PM in dependencies
ARM / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
sound / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
phy / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
video / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
tty / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / PM: Do not disable wakeup GPEs that have not been enabled
ACPI / utils: Drop error messages from acpi_evaluate_reference()
...
as I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init(). By passing
in the trace_event= kernel command line parameter, tracepoints can be
enabled at boot up. For debugging things like the initialization of
interrupts, it is needed to have tracepoints enabled very early. People
have asked about this before and this has been on my todo list. As it
can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
pushing this now. This way he can add tracepoints into the IRQ set up
and have users enable them when things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered. If the irq code locks up or reboots the box, having the
tracepoint output go into the kernel ring buffer is useless for
debugging. But being able to add the tp_printk kernel command line
option along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas tried
them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjv3kAAoJEEjnJuOKh9ldLNsIANAe5EmDCBw0WjR72n+G3qOH
NC8calXfkjqHU0bv8Q3dRv20KH4MHOy6l4+EiV9/ovt71LOF3NEyUJ3HuShf9a8b
sWcUhYbX3D1hViQe5sOzv9AWhBCFlKQGoNmQnydX9xa8ivRsBaTGJIGktWlHcwBE
jF1i3fj3l3vRQSS8qZFXp3bzreunlGyPoSHcT6eWQeos+utj4sKwQWTLXTLQeM+6
oQtFKRx7E5yX04qO1qFczS8qIEC6JH2C2jIRYEKUGepaELlnGkb8O7jQV/RaLF4/
6P8VhZFG9YLS7fn7vWu0SnAN+Zwz5LzgjXAZt0FhGtIhLc18Oj8ouHH1UORsdQM=
=Z4Un
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"As the merge window is still open, and this code was not as complex as
I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init().
By passing in the trace_event= kernel command line parameter,
tracepoints can be enabled at boot up. For debugging things like
the initialization of interrupts, it is needed to have tracepoints
enabled very early. People have asked about this before and this
has been on my todo list. As it can be helpful for Thomas to debug
his upcoming 3.20 IRQ work, I'm pushing this now. This way he can
add tracepoints into the IRQ set up and have users enable them when
things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered.
If the irq code locks up or reboots the box, having the tracepoint
output go into the kernel ring buffer is useless for debugging.
But being able to add the tp_printk kernel command line option
along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas
tried them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"
* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add tp_printk cmdline to have tracepoints go to printk()
tracing: Move enabling tracepoints to just after rcu_init()
Add the kernel command line tp_printk option that will have tracepoints
that are active sent to printk() as well as to the trace buffer.
Passing "tp_printk" will activate this. To turn it off, the sysctl
/proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
this only works if the cmdline option is used. Echoing 1 into the sysctl
file without the cmdline option will have no affect.
Note, this is a dangerous option. Having high frequency tracepoints send
their data to printk() can possibly cause a live lock. This is another
reason why this is only active if the command line option is used.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().
This also allows trace_printk() to be used early too.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so files that are
build conditionally if CONFIG_PM_RUNTIME is set may now be build
if CONFIG_PM is set.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
for this reason.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org.
clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
o Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
o The seq_buf code was change to model the seq_file code. I have
a patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was done
to make sure that seq_buf.c is compatible with seq_file.c. I may
try to get that patch in for 3.20.
o The seq_buf.c file was moved to lib/ to remove it from being dependent
on CONFIG_TRACING.
o The printk() was updated to allow for a per_cpu "override" of
the internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow the
NMI to change what printk() does in order to call dump_stack() without
needing to update that code as well.
o Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would wait
till all the NMIs finished, and then it would print the seq_buf
data to the console safely from a non NMI context.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbrnAAoJEEjnJuOKh9ldsCoIAJ3sKIJ5B3jxJJTCHPAx/lZD
GVbV1J1mu4kTAZuhJZOAxW8D6PZGZMyEjg0y6ScDEnBGcjAZ9gTiWCdakPktf9EX
GfaPPqwiL9dZ18J9Qc6uR+7M1Ffpzzwbcc6lJrpoTcjRgkoH9wCiLS9ozFQyYzWb
/7m5UbUM/PIk9WAjLYXPW6UUVtPTPT0RdEQKofMGTeah+vgqj4TXCOROdlxsXXWF
77vqBvPd5TUPWFH9ftzJGDtZS8SroXVKCu3fZIqHgzAU0yqwVtH/JzDTy9u2UYhX
GzDEPeAIdp6m6Uyc406VuIf1QW0gfBgmA0ir80vFoP27uFMM6j5HlF7azgQfx34=
=YBgA
-----END PGP SIGNATURE-----
Merge tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull nmi-safe seq_buf printk update from Steven Rostedt:
"This code is a fork from the trace-3.19 pull as it needed the
trace_seq clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in
practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
- Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
- The seq_buf code was change to model the seq_file code. I have a
patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was
done to make sure that seq_buf.c is compatible with seq_file.c. I
may try to get that patch in for 3.20.
- The seq_buf.c file was moved to lib/ to remove it from being
dependent on CONFIG_TRACING.
- The printk() was updated to allow for a per_cpu "override" of the
internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow
the NMI to change what printk() does in order to call dump_stack()
without needing to update that code as well.
- Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would
wait till all the NMIs finished, and then it would print the
seq_buf data to the console safely from a non NMI context
One added bonus is that this code also makes the NMI dump stack work
on PREEMPT_RT kernels. As printk() includes sleeping locks on
PREEMPT_RT, printk() only writes to console if the console does not
use any rt_mutex converted spin locks. Which a lot do"
* tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
x86/nmi: Fix use of unallocated cpumask_var_t
printk/percpu: Define printk_func when printk is not defined
x86/nmi: Perform a safe NMI stack trace on all CPUs
printk: Add per_cpu printk func to allow printk to be diverted
seq_buf: Move the seq_buf code to lib/
seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF
tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions
tracing: Have seq_buf use full buffer
seq_buf: Add seq_buf_can_fit() helper function
tracing: Add paranoid size check in trace_printk_seq()
tracing: Use trace_seq_used() and seq_buf_used() instead of len
tracing: Clean up tracing_fill_pipe_page()
seq_buf: Create seq_buf_used() to find out how much was written
tracing: Add a seq_buf_clear() helper and clear len and readpos in init
tracing: Convert seq_buf fields to be like seq_file fields
tracing: Convert seq_buf_path() to be like seq_path()
tracing: Create seq_buf layer in trace_seq
to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbLGAAoJEEjnJuOKh9ldRV4H/3NcLbgGB2iu96la1zdYE6pG
Q7cDJMxXK80YIIL70h9G0IItcD4t62LMb72lfBnMGRj3msgFb3AgISW57EuI0Pxk
xk24wuIPoTG2S7v9sc3SboNFwO8qbtIjxD2OBmqIUrGo2sZIiGjyj3gX7mCY3uzL
WB2bUOSFz/22OgaANinR5EELHA3pZZCf54Vz1K9ndmtK0xp0j1a7xJShD6TrMdYv
mZ3zH5ViIhW4A3mdcMceh6fy2JLQAiEKF0uPTvcMMz7NlVul0mxyL/+10P7AE/3R
Ehw4fzmm4NDshPDtBOkKH0LsppgXzuItFuQUTpact3JlqTg++bV6onSsrkt1hlY=
=Z7Cm
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
Currently, blktrace can be started/stopped via its ioctl-based interface
(used by the userspace blktrace tool) or via its ftrace interface. The
function blk_trace_remove_queue(), called each time an "enable" tunable
of the ftrace interface transitions to zero, removes the trace from the
running list, even if no function from the sysfs interface adds it to
such a list. This leads to a null pointer dereference. This commit
changes the blk_trace_remove_queue() function so that it does not remove
the blk_trace from the running list.
v2:
- Now the patch removes the invocation of list_del() instead of
adding an useless if branch, as suggested by Namhyung Kim.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The initial reason for this patch is that I noticed that:
if (len > TRACE_BUF_SIZE)
is off by one. In this code, if len == TRACE_BUF_SIZE, then it means we
have truncated the last character off the output string. If we truncate
two or more characters then we exit without printing.
After some discussion, we decided that printing truncated data is better
than not printing at all so we should just use vscnprintf() and remove
the test entirely. Also I have updated memcpy() to copy the NUL char
instead of setting the NUL in a separate step.
Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, function graph tracer prints "!" or "+" just before
function execution time to signal a function overhead, depending
on the time. And some tracers tracing latency also print "!" or
"+" just after time to signal overhead, depending on the interval
between events. Even it is usually enough to do that, we sometimes
need to signal for bigger execution time than 100 micro seconds.
For example, I used function graph tracer to detect if there is
any case that exit_mm() takes too much time. I did following steps
in /sys/kernel/debug/tracing. It was easier to detect very large
excution time with patched kernel than with original kernel.
$ echo exit_mm > set_graph_function
$ echo function_graph > current_tracer
$ echo > trace
$ cat trace_pipe > $LOGFILE
... (do something and terminate logging)
$ grep "\\$" $LOGFILE
3) $ 22082032 us | } /* kernel_map_pages */
3) $ 22082040 us | } /* free_pages_prepare */
3) $ 22082113 us | } /* free_hot_cold_page */
3) $ 22083455 us | } /* free_hot_cold_page_list */
3) $ 22083895 us | } /* release_pages */
3) $ 22177873 us | } /* free_pages_and_swap_cache */
3) $ 22178929 us | } /* unmap_single_vma */
3) $ 22198885 us | } /* unmap_vmas */
3) $ 22206949 us | } /* exit_mmap */
3) $ 22207659 us | } /* mmput */
3) $ 22207793 us | } /* exit_mm */
And then, it was easy to find out that a schedule-out occured by
sub_preempt_count() within kernel_map_pages().
To detect very large function exection time caused by either problematic
function implementation or scheduling issues, this patch can be useful.
Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support to allow not "!" for and (&&) and (||). That is:
!(field1 == X && field2 == Y)
Where the value of the full clause will be notted.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ted noticed that he could not filter on an event for a bit being cleared.
That's because the filtering logic only tests event fields with a limited
number of comparisons which, for bit logic, only include "&", which can
test if a bit is set, but there's no good way to see if a bit is clear.
This adds a way to do: !(field & 2048)
Which returns true if the bit is not set, and false otherwise.
Note, currently !(field1 == 10 && field2 == 15) is not supported.
That is, the 'not' only works for direct comparisons, not for the
AND and OR logic.
Link: http://lkml.kernel.org/r/20141202021912.GA29096@thunk.org
Link: http://lkml.kernel.org/r/20141202120430.71979060@gandalf.local.home
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
ftrace users who may modify regs->ip to change the execution
path. If two or more users modify the regs->ip on the same
function entry, one of them will be broken. So they must add
IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
flag to have notrace hash, and the ftrace_ops must have a
filter hash (so that the ftrace_ops can hook only specific
entries), because it strongly depends on the address and
must be allowed for only few selected functions.
Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ fixed up some of the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix up a few typos in comments and convert an int into a bool in
update_traceon_count().
Link: http://lkml.kernel.org/r/546DD445.5080108@hitachi.com
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.
The seq_buf utility is similar to the seq_file utility, but instead of
writing sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.
It is currently used by the tracing facility for such things like trace
events to convert its binary saved data in the ring buffer into an
ASCII human readable context to be displayed in /sys/kernel/debug/trace.
It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.
Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function bstr_printf() from lib/vsprnintf.c is only available if
CONFIG_BINARY_PRINTF is defined. This is due to the only user currently
being the tracing infrastructure, which needs to select this config
when tracing is configured. Until there is another user of the binary
printf formats, this will continue to be the case.
Since seq_buf.c is now lives in lib/ and is compiled even without
tracing, it must encompass its use of bstr_printf() which is used
by seq_buf_printf(). This too is only used by the tracing infrastructure
and is still encapsulated by the CONFIG_BINARY_PRINTF.
Link: http://lkml.kernel.org/r/20141104160222.969013383@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add two helper functions; seq_buf_get_buf() and seq_buf_commit() that
are used by seq_buf_path(). This makes the code similar to the
seq_file: seq_path() function, and will help to be able to consolidate
the functions between seq_file and trace_seq.
Link: http://lkml.kernel.org/r/20141104160222.644881406@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.977571447@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently seq_buf is full when all but one byte of the buffer is
filled. Change it so that the seq_buf is full when all of the
buffer is filled.
Some of the functions would fill the buffer completely and report
everything was fine. This was inconsistent with the max of size - 1.
Changing this to be max of size makes all functions consistent.
Link: http://lkml.kernel.org/r/20141104160222.502133196@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.811957882@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a seq_buf_can_fit() helper function that removes the possible mistakes
of comparing the seq_buf length plus added data compared to the size of
the buffer.
Link: http://lkml.kernel.org/r/20141118164025.GL23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To be really paranoid about writing out of bound data in
trace_printk_seq(), add another check of len compared to size.
Link: http://lkml.kernel.org/r/20141119144004.GB2332@dhcp128.suse.cz
Suggested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the seq_buf->len will soon be +1 size when there's an overflow, we
must use trace_seq_used() or seq_buf_used() methods to get the real
length. This will prevent buffer overflow issues if just the len
of the seq_buf descriptor is used to copy memory.
Link: http://lkml.kernel.org/r/20141114121911.09ba3d38@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing_fill_pipe_page() logic is a little confusing with the
use of count saving the seq.len and reusing it.
Instead of subtracting a number that is calculated from the saved
value of the seq.len from seq.len, just save the seq.len at the start
and if we need to reset it, just assign it again.
When the seq_buf overflow is len == size + 1, the current logic will
break. Changing it to use a saved length for resetting back to the
original value is more robust and will work when we change the way
seq_buf sets the overflow.
Link: http://lkml.kernel.org/r/20141118161546.GJ23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rewrite seq_buf_path() like it is done in seq_path() and allow
it to accept any escape character instead of just "\n".
Making seq_buf_path() like seq_path() will help prevent problems
when converting seq_file to use the seq_buf logic.
Link: http://lkml.kernel.org/r/20141104160222.048795666@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.338523371@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.
Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/5468F875.7080907@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the trace_seq of ftrace_raw_output_prep() is full this function
returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero.
The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero!
The thing is, the caller of ftrace_raw_output_prep() expects a
success to be zero. Change that to expect it to be
TRACE_TYPE_HANDLED.
Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz
Reminded-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_printf() and friends are used to store strings into a buffer
that can be passed around from function to function. If the trace_seq buffer
fills up, it will not print any more. The return values were somewhat
inconsistant and using trace_seq_has_overflowed() was a better way to know
if the write to the trace_seq buffer succeeded or not.
Now that all users have removed reading the return value of the printf()
type functions, they can safely return void and keep future users of them
from reading the inconsistent values as well.
Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will not be returning values
soon and will be void functions. To know if they succeeded or not, the
functions trace_seq_has_overflowed() and trace_handle_return() should be
used instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon no longer have
return values. Using trace_seq_has_overflowed() and trace_handle_return()
should be used instead.
Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org
Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon not have a return
value and will only be a void function. Use trace_seq_has_overflowed()
instead to know if the trace_seq operations succeeded or not.
Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The return values for trace_seq_printf() and friends are going to be
removed and they will become void functions. The mmio tracer checked
their return and even did so incorrectly.
Some of the funtions which returned the values were never checked
themselves. Removing all the checks simplifies the code.
Use trace_seq_has_overflowed() and trace_handle_return() where
necessary instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of checking the return value of trace_seq_printf() and friends
for overflowing of the buffer, use the trace_seq_has_overflowed() helper
function.
This cleans up the code quite a bit and also takes us a step closer to
changing the return values of trace_seq_printf() and friends to void.
Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of doing individual checks all over the place that makes the code
very messy. Just check trace_seq_has_overflowed() at the end or in
strategic places.
This makes the code much cleaner and also helps with getting closer
to removing the return values of trace_seq_printf() and friends.
Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch tracer should not be checking the trace_seq_printf() return value
as that will soon be void. There's a new trace_handle_return() helper function
that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed
and TRACE_TYPE_HANDLED otherwise.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.
Just return !trace_seq_has_overflowed() instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Checking the return code of every trace_seq_printf() operation and having
to return early if it overflowed makes the code messy.
Using the new trace_seq_has_overflowed() and trace_handle_return() functions
allows us to clean up the code.
In the future, trace_seq_printf() and friends will be turning into void
functions and not returning a value. The trace_seq_has_overflowed() is to
be used instead. This cleanup allows that change to take place.
Cc: Jens Axboe <axboe@fb.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a trace_seq_has_overflowed() which returns true if the trace_seq
had too much written into it allows us to simplify the code.
Instead of checking the return value of every call to trace_seq_printf()
and friends, they can all be called normally, and at the end we can
return !trace_seq_has_overflowed() instead.
Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq
overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function
was created called trace_handle_return() which takes a trace_seq and
returns these enums. Using this helper function also simplifies the
code.
This change also makes it possible to remove the return values of
trace_seq_printf() and friends. They should instead just be
void functions.
Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In trace_seq_bitmask() it calls bitmap_scnprintf() not from the current
position of the trace_seq buffer (s->buffer + s->len), but instead from
the beginning of the buffer (s->buffer).
Luckily, the only user of this "ipi_raise tracepoint" uses it as the
first parameter, and as such, the start of the temp buffer in
include/trace/ftrace.h (see __get_bitmask()).
Reported-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Stack traces that happen from function tracing check if the address
on the stack is a __kernel_text_address(). That is, is the address
kernel code. This calls core_kernel_text() which returns true
if the address is part of the builtin kernel code. It also calls
is_module_text_address() which returns true if the address belongs
to module code.
But what is missing is ftrace dynamically allocated trampolines.
These trampolines are allocated for individual ftrace_ops that
call the ftrace_ops callback functions directly. But if they do a
stack trace, the code checking the stack wont detect them as they
are neither core kernel code nor module address space.
Adding another field to ftrace_ops that also stores the size of
the trampoline assigned to it we can create a new function called
is_ftrace_trampoline() that returns true if the address is a
dynamically allocate ftrace trampoline. Note, it ignores trampolines
that are not dynamically allocated as they will return true with
the core_kernel_text() function.
Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function probe counting for traceon and traceoff suffered a race
condition where if the probe was executing on two or more CPUs at the
same time, it could decrement the counter by more than one when
disabling (or enabling) the tracer only once.
The way the traceon and traceoff probes are suppose to work is that
they disable (or enable) tracing once per count. If a user were to
echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
schedule function was called, it would disable tracing. But the count
should only be decremented once (to 2). Then if the user enabled tracing
again (via tracing_on file), the next call to schedule would disable
tracing again and the count would be decremented to 1.
But if multiple CPUS called schedule at the same time, it is possible
that the count would be decremented more than once because of the
simple "count--" used.
By reading the count into a local variable and using memory barriers
we can guarantee that the count would only be decremented once per
disable (or enable).
The stack trace probe had a similar race, but here the stack trace will
decrement for each time it is called. But this had the read-modify-
write race, where it could stack trace more than the number of times
that was specified. This case we use a cmpxchg to stack trace only the
number of times specified.
The dump probes can still use the old "update_count()" function as
they only run once, and that is controlled by the dump logic
itself.
Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Usually, "msecs" notation means milli-seconds, and "usecs" notation
means micro-seconds. Since the unit used in the code is micro-seconds,
the notation should be replaced from msecs to usecs.
Link: http://lkml.kernel.org/r/1415171926-9782-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the function_graph tracer, the print_graph_irq() function prints a
trace line with the flag ==========> on an irq handler entry, and the
flag <========== on an irq handler return.
But when the latency-format is enable, it is not printing the
latency-format flags, causing the following error in the trace output:
0) ==========> |
0) d... | smp_apic_timer_interrupt() {
This patch fixes this issue by printing the latency-format flags when
it is enable.
Link: http://lkml.kernel.org/r/7c2e226dac20c940b6242178fab7f0e3c9b5ce58.1415233316.git.bristot@redhat.com
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Printing a single character to a seqfile might as well be done with
seq_putc instead of seq_puts; this avoids a strlen() call and a memory
access. It also shaves another few bytes off the generated code.
Link: http://lkml.kernel.org/r/1415479332-25944-4-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Consecutive seq_puts calls with literal strings can be merged to a
single call. This reduces the size of the generated code, and can also
lead to slight .rodata reduction (because of fewer nul and padding
bytes). It should also shave a off a few clock cycles.
Link: http://lkml.kernel.org/r/1415479332-25944-3-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using seq_printf to print a simple string or a single character is a
lot more expensive than it needs to be, since seq_puts and seq_putc
exist.
These patches do
seq_printf(m, s) -> seq_puts(m, s)
seq_printf(m, "%s", s) -> seq_puts(m, s)
seq_printf(m, "%c", c) -> seq_putc(m, c)
Subsequent patches will simplify further.
Link: http://lkml.kernel.org/r/1415479332-25944-2-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command will livelock by constantly printk'ing
the empty string at KERN_EMERG level if it run when the ftrace system is
not in use. This occurs because trace_empty() never returns false when
the ring buffers are left at the start of a non-consuming read [launched
by ring_buffer_read_start()].
This patch changes the loop exit condition to use the result of
trace_find_next_entry_inc(). Effectively this switches the non-consuming
kdb dumper to follow the approach of the non-consuming userspace
interface [s_next()] rather than the consuming ftrace_dump().
Link: http://lkml.kernel.org/r/1415277716-19419-3-git-send-email-daniel.thompson@linaro.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command unconditionally crashes due to a null
pointer de-reference whenever the command is run. This in turn causes
the kernel to panic.
The abridged stacktrace (gathered with ARCH=arm) is:
--- cut here ---
[<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
[<c02132dc>] (die) from [<c0952eb8>]
(__do_kernel_fault.part.11+0x74/0x84)
[<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
(do_page_fault+0x1d0/0x3c4)
[<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)
[<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
Exception stack(0xc0deba88 to 0xc0debad0)
ba80: e8c29180 00000001 e9854304 e9854300 c0f567d8
c0df2580
baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
c0debad0
bac0: 0000672e c02f4d60 60000193 ffffffff
[<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
[<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
[<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
[<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
--- cut here ---
The NULL deref occurs due to the initialized use of struct trace_iter's
buffer_iter member.
This is a regression, albeit a fairly elderly one. It was introduced
by commit 6d158a813e ("tracing: Remove NR_CPUS array from
trace_iterator").
This patch solves this by providing a collection of ring_buffer_iter(s)
and using this to initialize buffer_iter. Note that static allocation
is used solely because the trace_iter itself is also static allocated.
Static allocation also means that we have to NULL-ify the pointer during
cleanup to avoid use-after-free problems.
Link: http://lkml.kernel.org/r/1415277716-19419-2-git-send-email-daniel.thompson@linaro.org
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to the documentation, adding "traceoff_on_warning" to the boot
command line should be enough to enable the feature. But right now it is
necessary to specify "traceoff_on_warning=". Along with fixing that, also
verify if the value passed, if any, is either "0" or "off".
Link: http://lkml.kernel.org/r/20141112231400.GL12281@uudg.org
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new logic, if only a single user of ftrace function hooks is
used, it will get its own trampoline assigned to it.
The problem is that the control_ops is an indirect ops that perf ops
uses. What that means is that when perf registers its ops with
register_ftrace_function(), it has the CONTROL flag set and gets added
to the control list instead of the global ftrace list. The control_ops
gets added to that instead and the mcount trampoline calls the control_ops
function. The control_ops function will iterate the control list and
call the ops functions that are attached to it.
But currently the trampoline is added to the perf ops and not the
control ops, and when ftrace tries to find a trampoline hook for it,
it fails to find one and gives the following splat:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 10133 at kernel/trace/ftrace.c:2033 ftrace_get_addr_new+0x6f/0xc0()
Modules linked in: [...]
CPU: 0 PID: 10133 Comm: perf Tainted: P 3.18.0-rc1-test+ #388
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
00000000000007f1 ffff8800c2643bc8 ffffffff814fca6e ffff88011ea0ed01
0000000000000000 ffff8800c2643c08 ffffffff81041ffd 0000000000000000
ffffffff810c388c ffffffff81a5a350 ffff880119b00000 ffffffff810001c8
Call Trace:
[<ffffffff814fca6e>] dump_stack+0x46/0x58
[<ffffffff81041ffd>] warn_slowpath_common+0x81/0x9b
[<ffffffff810c388c>] ? ftrace_get_addr_new+0x6f/0xc0
[<ffffffff810001c8>] ? 0xffffffff810001c8
[<ffffffff81042031>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810c388c>] ftrace_get_addr_new+0x6f/0xc0
[<ffffffff8102e938>] ftrace_replace_code+0xd6/0x334
[<ffffffff810c4116>] ftrace_modify_all_code+0x41/0xc5
[<ffffffff8102eba6>] arch_ftrace_update_code+0x10/0x19
[<ffffffff810c293c>] ftrace_run_update_code+0x21/0x42
[<ffffffff810c298f>] ftrace_startup_enable+0x32/0x34
[<ffffffff810c3049>] ftrace_startup+0x14e/0x15a
[<ffffffff810c307c>] register_ftrace_function+0x27/0x40
[<ffffffff810dc118>] perf_ftrace_event_register+0x3e/0xee
[<ffffffff810dbfbe>] perf_trace_init+0x29d/0x2a9
[<ffffffff810eb422>] perf_tp_event_init+0x27/0x3a
[<ffffffff810f18bc>] perf_init_event+0x9e/0xed
[<ffffffff810f1ba4>] perf_event_alloc+0x299/0x330
[<ffffffff810f236b>] SYSC_perf_event_open+0x3ee/0x816
[<ffffffff8115a066>] ? mntput+0x2d/0x2f
[<ffffffff81142b00>] ? __fput+0xa7/0x1b2
[<ffffffff81091300>] ? do_gettimeofday+0x22/0x3a
[<ffffffff810f279c>] SyS_perf_event_open+0x9/0xb
[<ffffffff81502a92>] system_call_fastpath+0x12/0x17
---[ end trace 81a53565150e4982 ]---
Bad trampoline accounting at: ffffffff810001c8 (run_init_process+0x0/0x2d) (10000001)
Update the control_ops trampoline instead of the perf ops one.
Reported-by: lkp@01.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only code that references tracing_sched_switch_trace() and
tracing_sched_wakeup_trace() is the wakeup latency tracer. Those
two functions use to belong to the sched_switch tracer which has
long been removed. These functions were left behind because the
wakeup latency tracer used them. But since the wakeup latency tracer
is the only one to use them, they should be static functions inside
that code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous patch it is clear that "tracer_enabled" can never be
true, we can remove the "if (tracer_enabled)" code in probe_sched_switch()
and probe_sched_wakeup(). Plus we can obviously remove tracer_enabled,
ctx_trace, and sched_stopped as well.
Link: http://lkml.kernel.org/p/20140723193503.GA30217@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_{start,stop}_sched_switch_record() have no callers since
87d80de280 "tracing: Remove obsolete sched_switch tracer".
The last caller of tracing_sched_switch_assign_trace() was removed
by 30dbb20e68 "tracing: Remove boot tracer".
Link: http://lkml.kernel.org/p/20140723193501.GA30214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the introduction of the dynamic trampolines, it is useful that if
things go wrong that ftrace_bug() produces more information about what
the current state is. This can help debug issues that may arise.
Ftrace has lots of checks to make sure that the state of the system it
touchs is exactly what it expects it to be. When it detects an abnormality
it calls ftrace_bug() and disables itself to prevent any further damage.
It is crucial that ftrace_bug() produces sufficient information that
can be used to debug the situation.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the static ftrace_ops (like function tracer) enables tracing, and it
is the only callback that is referencing a function, a trampoline is
dynamically allocated to the function that calls the callback directly
instead of calling a loop function that iterates over all the registered
ftrace ops (if more than one ops is registered).
But when it comes to dynamically allocated ftrace_ops, where they may be
freed, on a CONFIG_PREEMPT kernel there's no way to know when it is safe
to free the trampoline. If a task was preempted while executing on the
trampoline, there's currently no way to know when it will be off that
trampoline.
But this is not true when it comes to !CONFIG_PREEMPT. The current method
of calling schedule_on_each_cpu() will force tasks off the trampoline,
becaues they can not schedule while on it (kernel preemption is not
configured). That means it is safe to free a dynamically allocated
ftrace ops trampoline when CONFIG_PREEMPT is not configured.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch introduces several new flags to collect kdb commands into
groups (later allowing them to be optionally disabled).
This follows similar prior art to enable/disable magic sysrq
commands.
The commands have been categorized as follows:
Always on: go (w/o args), env, set, help, ?, cpu (w/o args), sr,
dmesg, disable_nmi, defcmd, summary, grephelp
Mem read: md, mdr, mdp, mds, ef, bt (with args), per_cpu
Mem write: mm
Reg read: rd
Reg write: go (with args), rm
Inspect: bt (w/o args), btp, bta, btc, btt, ps, pid, lsmod
Flow ctrl: bp, bl, bph, bc, be, bd, ss
Signal: kill
Reboot: reboot
All: cpu, kgdb, (and all of the above), nmi_console
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Since we now treat KDB_REPEAT_* as flags, there is no need to
pass KDB_REPEAT_NONE. It's just the default behaviour when no
flags are specified.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
We're about to add more options for commands behaviour, so let's give
a more generic name to the low-level kdb command registration function.
There are just various renames, no functional changes.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
If the read loop in trace_buffers_splice_read() keeps failing due to
memory allocation failures without reading even a single page then this
function will keep busy looping.
Remove the risk for that by exiting the function if memory allocation
failures are seen.
Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
lockup:
# trace-cmd record -e raw_syscalls:* -F false
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
...
Call Trace:
[<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
[<ffffffff81092e25>] wait_on_pipe+0x35/0x40
[<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
[<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
[<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
[<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff8112415d>] do_splice_to+0x6d/0x90
[<ffffffff81126971>] SyS_splice+0x7c1/0x800
[<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
The problem is this: tracing_buffers_splice_read() calls
ring_buffer_wait() to wait for data in the ring buffers. The buffers
are not empty so ring_buffer_wait() returns immediately. But
tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
meaning it only wants to read a full page. When the full page is not
available, tracing_buffers_splice_read() tries to wait again with
ring_buffer_wait(), which again returns immediately, and so on.
Fix this by adding a "full" argument to ring_buffer_wait() which will
make ring_buffer_wait() wait until the writer has left the reader's
page, i.e. until full-page reads will succeed.
Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
Cc: stable@vger.kernel.org # 3.16+
Fixes: b1169cc69b ("tracing: Remove mock up poll wait function")
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The file /sys/kernel/debug/tracing/eneabled_functions is used to debug
ftrace function hooks. Add to the output what function is being called
by the trampoline if the arch supports it.
Add support for this feature in x86_64.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current method of handling multiple function callbacks is to register
a list function callback that calls all the other callbacks based on
their hash tables and compare it to the function that the callback was
called on. But this is very inefficient.
For example, if you are tracing all functions in the kernel and then
add a kprobe to a function such that the kprobe uses ftrace, the
mcount trampoline will switch from calling the function trace callback
to calling the list callback that will iterate over all registered
ftrace_ops (in this case, the function tracer and the kprobes callback).
That means for every function being traced it checks the hash of the
ftrace_ops for function tracing and kprobes, even though the kprobes
is only set at a single function. The kprobes ftrace_ops is checked
for every function being traced!
Instead of calling the list function for functions that are only being
traced by a single callback, we can call a dynamically allocated
trampoline that calls the callback directly. The function graph tracer
already uses a direct call trampoline when it is being traced by itself
but it is not dynamically allocated. It's trampoline is static in the
kernel core. The infrastructure that called the function graph trampoline
can also be used to call a dynamically allocated one.
For now, only ftrace_ops that are not dynamically allocated can have
a trampoline. That is, users such as function tracer or stack tracer.
kprobes and perf allocate their ftrace_ops, and until there's a safe
way to free the trampoline, it can not be used. The dynamically allocated
ftrace_ops may, although, use the trampoline if the kernel is not
compiled with CONFIG_PREEMPT. But that will come later.
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When modifying code, ftrace has several checks to make sure things
are being done correctly. One of them is to make sure any code it
modifies is exactly what it expects it to be before it modifies it.
In order to do so with the new trampoline logic, it must be able
to find out what trampoline a function is hooked to in order to
see if the code that hooks to it is what's expected.
The logic to find the trampoline from a record (accounting descriptor
for a function that is hooked) needs to only look at the "old_hash"
of an ops that is being modified. The old_hash is the list of function
an ops is hooked to before its update. Since a record would only be
pointing to an ops that is being modified if it was already hooked
before.
Currently, it can pick a modified ops based on its new functions it
will be hooked to, and this picks the wrong trampoline and causes
the check to fail, disabling ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace: squash into ordering of ops for modification
The code that checks for trampolines when modifying function hooks
tests against a modified ops "old_hash". But the ops old_hash pointer
is not being updated before the changes are made, making it possible
to not find the right hash to the callback and possibly causing
ftrace to break in accounting and disable itself.
Have the ops set its old_hash before the modifying takes place.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
code screem nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Modules linked in:
CPU: 0 PID: 91 Comm: test-events Not tainted 3.17.0-rc7-00109-g2f85d18 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 ffff880010ec3c80 ffffffff8c696943 ffff880010ec3cb8
ffffffff8be7cae5 ffffffff8bead236 0000000000000001 ffff88001161fa01
0000000000000001 0000000000000000 ffff880010ec3d20 ffffffff8be7cb46
Call Trace:
[<ffffffff8c696943>] dump_stack+0x19/0x1b
[<ffffffff8be7cae5>] warn_slowpath_common+0x8f/0xa8
[<ffffffff8bead236>] ? __might_sleep+0x9a/0x378
[<ffffffff8be7cb46>] warn_slowpath_fmt+0x48/0x50
[<ffffffff8be0dd55>] ? sched_clock+0x9/0xd
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8bead236>] __might_sleep+0x9a/0x378
[<ffffffff8c6a0227>] down_read+0x26/0x98
[<ffffffff8be8f503>] exit_signals+0x27/0x1c2
[<ffffffff8be7fedd>] do_exit+0x193/0x10bd
[<ffffffff8bfd1969>] ? kfree+0x4a0/0x4d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8bea4b65>] kthread+0x156/0x156
[<ffffffff8c69c0f8>] ? wait_for_common+0x3e/0x224
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
[<ffffffff8c6a353a>] ret_from_fork+0x7a/0xb0
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
---[ end trace 14d02ef17adbc114 ]---
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a little
sloppy and not very robust. They work now because it runs at boot up
and the tests do not call anything that might trigger a spurious
wake up. But that doesn't mean those tests wont change in the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUNrcAAAoJEKQekfcNnQGu+oYH/3NaLEKOwQU+x0aL/rfSFB86
qtIq3X4iHGekGjrlN38N2Z36izI9AoYuGrWYReMFy1VcvnRxPAl1mc0y0dZfdW/C
KRLwKTAu0t78Ab8vzyXVDxS+Bs/zEi6V/8HykBFbCthiDz5IbTvxCoeS19O/X9CU
ptVKllUlywjKQD5UMiJwk7eOB5GspOeBgNu9MOh61ZfbYBVsl1hPqmD0gEaSH2Me
wLyDlIyc0P9dfeYeaqYblkiBaXLk2urZDU2Enffi1aueEwwWuN5x+DPGc6d6nGQW
fnworqoiYzz+maQoASwaLdCfJAP3cX5Ye7qWQk7QEtp4Ypdh5j7EacAf9pKEJg8=
=goKt
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Seems that Peter Zijlstra added a new check that is making old code
scream nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Call Trace:
__might_sleep+0x9a/0x378
down_read+0x26/0x98
exit_signals+0x27/0x1c2
do_exit+0x193/0x10bd
kthread+0x156/0x156
ret_from_fork+0x7a/0xb0
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a
little sloppy and not very robust. They work now because it runs at
boot up and the tests do not call anything that might trigger a
spurious wake up. But that doesn't mean those tests wont change in
the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks"
* tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Clean up scheduling in trace_wakeup_test_thread()
tracing: Robustify wait loop
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were
to trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a bit
of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function graph
tracer) hash tables. The hash tables tell ftrace what the tracer
wants to trace or doesn't want to trace. There's two of them: one
that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash table
tells us what not to trace, and if it is empty, all functions may be
traced, and if there's any listed, then those should not be traced
even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to
keep track of what has a trampoline and what does not, and allows
the ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUMwp6AAoJEKQekfcNnQGuIoAIAIsqvTYAnULyzKKCweEZYUfb
XJzz6cN5FPGSXkoeda1ZvnfOlHjFRrWNXzXMB0jZYR2hU++pe3xjtghaNzvbRcyV
wlwDUTsnz235OcOuFEspIwBamhtah96Coiwf/2z/2q6srXlHd/1TrqXB+Fpj1tEK
BkAViGDUEdq/eLZX7nGen36cTb5gpNqV9NjY1CVAK6bSkU/xXk/ArqFy1qy0MPnc
z/9bXdIf+Z6VnG/IzwRc2rwiMFuD1+lpjLuHEqagoHp1D4teCjWPSJl1EKCVAS40
GaCOTUZi92zIVgx8Bb28TglSla9MN65CO3E8dw6hlXUIsNz1p0eavpctnC6ac/Y=
=vDpP
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This set has a few minor updates, but the big change is the redesign
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were to
trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a
bit of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function
graph tracer) hash tables. The hash tables tell ftrace what the
tracer wants to trace or doesn't want to trace. There's two of them:
one that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash
table tells us what not to trace, and if it is empty, all functions
may be traced, and if there's any listed, then those should not be
traced even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to keep
track of what has a trampoline and what does not, and allows the
ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced"
* tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Only disable ftrace_enabled to test buffer in selftest
ftrace: Add sanity check when unregistering last ftrace_ops
kernel: trace_syscalls: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
tracing: generate RCU warnings even when tracepoints are disabled
ftrace: Replace tramp_hash with old_*_hash to save space
ftrace: Annotate the ops operation on update
ftrace: Grab any ops for a rec for enabled_functions output
ftrace: Remove freeing of old_hash from ftrace_hash_move()
ftrace: Set callback to ftrace_stub when no ops are registered
ftrace: Add helper function ftrace_ops_get_func()
ftrace: Add separate function for non recursive callbacks
Peter's new debugging tool triggers when tasks exit with !TASK_RUNNING.
The code in trace_wakeup_test_thread() also has a single schedule() call
that should be encompassed by a loop.
This cleans up the code a little to make it a bit more robust and
also makes the return exit properly with TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008135216.76142204@gandalf.local.home
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infreadead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The pending nested sleep debugging triggered on the potential stale
TASK_INTERRUPTIBLE in this code.
While there, fix the loop such that we won't revert to a while(1)
yield() 'spin' loop if we ever get a spurious wakeup.
And fix the actual issue by properly terminating the 'wait' loop by
setting TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008165110.GA14547@worktop.programming.kicks-ass.net
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 651e22f270 "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.
A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.
The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.
Commit 651e22f270 changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.
Fixes: 651e22f270 "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.
Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:
http://marc.info/?l=linux-kernel&m=127144305403241&w=2
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ftrace_enabled variable is set to zero in the self tests to keep
delayed functions from being traced and messing with the checks. This
only needs to be done when the checks are being performed, otherwise,
if ftrace_enabled is off when calls back to the utility that is being
tested, it can cause errors to happen and the tests can fail with
false positives.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the last ftrace_ops is unregistered, all the function records should
have a zeroed flags value. Make sure that is the case when the last ftrace_ops
is unregistered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The uses of "rcu_assign_pointer()" are NULLing out the pointers.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Link: http://lkml.kernel.org/p/20140822142822.GA32391@ada
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allowing function callbacks to declare their own trampolines requires
that each ftrace_ops that has a trampoline must have some sort of
accounting that keeps track of which ops has a trampoline attached
to a record.
The easy way to solve this was to add a "tramp_hash" that created a
hash entry for every function that a ops uses with a trampoline.
But since we can have literally tens of thousands of functions being
traced, that means we need tens of thousands of descriptors to map
the ops to the function in the hash. This is quite expensive and
can cause enabling and disabling the function graph tracer to take
some time to start and stop. It can take up to several seconds to
disable or enable all functions in the function graph tracer for this
reason.
The better approach albeit more complex, is to keep track of how ops
are being enabled and disabled, and use that along with the counting
of the number of ops attached to records, to determive what ops has
a trampoline attached to a record at enabling and disabling of
tracing.
To do this, the tramp_hash has been replaced with an old_filter_hash
and old_notrace_hash, which get the copy of the ops filter_hash and
notrace_hash respectively. The old hashes is kept until the ops has
been modified or removed and the old hashes are used with the logic
of the accounting to determine the ops that have the trampoline of
a record. The reason this has less of a footprint is due to the trick
that an "empty" hash in the filter_hash means "all functions" and
an empty hash in the notrace hash means "no functions" in the hash.
This is much more efficienct, doesn't have the delay, and takes up
much less memory, as we do not need to map all the functions but
just figure out which functions are mapped at the time it is
enabled or disabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add three new flags for ftrace_ops:
FTRACE_OPS_FL_ADDING
FTRACE_OPS_FL_REMOVING
FTRACE_OPS_FL_MODIFYING
These will be set for the ftrace_ops when they are first added
to the function tracing, being removed from function tracing
or just having their functions changed from function tracing,
respectively.
This will be needed to remove the tramp_hash, which can grow quite
big. The tramp_hash is used to note what functions a ftrace_ops
is using a trampoline for. Denoting which ftrace_ops is being
modified, will allow us to use the ftrace_ops hashes themselves,
which are much smaller as they have a global flag to denote if
a ftrace_ops is tracing all functions, as well as a notrace hash
if the ftrace_ops is tracing all but a few. The tramp_hash just
creates a hash item for every function, which can go into the 10s
of thousands if all functions are using the ftrace_ops trampoline.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When dumping the enabled_functions, use the first op that is
found with a trampoline to the record, as there should only be
one, as only one ops can be registered to a function that has
a trampoline.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_hash_move() currently frees the old hash that is passed to it
after replacing the pointer with the new hash. Instead of having the
function do that chore, have the caller perform the free.
This lets the ftrace_hash_move() be used a bit more freely, which
is needed for changing the way the trampoline logic is done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The clean up that adds the helper function ftrace_ops_get_func()
caused the default function to not change when DYNAMIC_FTRACE was not
set and no ftrace_ops were registered. Although static tracing is
not very useful (not having DYNAMIC_FTRACE set), it is still supported
and we don't want to break it.
Clean up the if statement even more to specifically have the default
function call ftrace_stub when no ftrace_ops are registered. This
fixes the small bug for static tracing as well as makes the code a
bit more understandable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the helper function to what the mcount trampoline is to call
for a ftrace_ops function. This helper will be used by arch code
in the future to set up dynamic trampolines. But as this does the
same tests that are performed in choosing what function to call for
the default mcount trampoline, might as well use it to clean up
the existing code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of using the generic list function for callbacks that
are not recursive, call a new helper function from the mcount
trampoline called ftrace_ops_recur_func() that will do the recursion
checking for the callback.
This eliminates an indirection as well as will help in future code
that will use dynamically allocated trampolines.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Epoll on trace_pipe can sometimes hang in a weird case. If the ring buffer is
empty when we set waiters_pending but an event shows up exactly at that moment
we can miss being woken up by the ring buffers irq work. Since
ring_buffer_empty() is inherently racey we will sometimes think that the buffer
is not empty. So we don't get woken up and we don't think there are any events
even though there were some ready when we added the watch, which makes us hang.
This patch fixes this by making sure that we are actually on the wait list
before we set waiters_pending, and add a memory barrier to make sure
ring_buffer_empty() is going to be correct.
Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com
Cc: stable@vger.kernel.org # 3.10+
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In __ftrace_replace_code(), when converting the call to a nop in a function
it needs to compare against the "curr" (current) value of the ftrace ops, and
not the "new" one. It currently does not affect x86 which is the only arch
to do the trampolines with function graph tracer, but when other archs that do
depend on this code implement the function graph trampoline, it can crash.
Here's an example when ARM uses the trampolines (in the future):
------------[ cut here ]------------
WARNING: CPU: 0 PID: 9 at kernel/trace/ftrace.c:1716 ftrace_bug+0x17c/0x1f4()
Modules linked in: omap_rng rng_core ipv6
CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.16.0-test-10959-gf0094b28f303-dirty #52
[<c02188f4>] (unwind_backtrace) from [<c021343c>] (show_stack+0x20/0x24)
[<c021343c>] (show_stack) from [<c095a674>] (dump_stack+0x78/0x94)
[<c095a674>] (dump_stack) from [<c02532a0>] (warn_slowpath_common+0x7c/0x9c)
[<c02532a0>] (warn_slowpath_common) from [<c02532ec>] (warn_slowpath_null+0x2c/0x34)
[<c02532ec>] (warn_slowpath_null) from [<c02cbac4>] (ftrace_bug+0x17c/0x1f4)
[<c02cbac4>] (ftrace_bug) from [<c02cc44c>] (ftrace_replace_code+0x80/0x9c)
[<c02cc44c>] (ftrace_replace_code) from [<c02cc658>] (ftrace_modify_all_code+0xb8/0x164)
[<c02cc658>] (ftrace_modify_all_code) from [<c02cc718>] (__ftrace_modify_code+0x14/0x1c)
[<c02cc718>] (__ftrace_modify_code) from [<c02c7244>] (multi_cpu_stop+0xf4/0x134)
[<c02c7244>] (multi_cpu_stop) from [<c02c6e90>] (cpu_stopper_thread+0x54/0x130)
[<c02c6e90>] (cpu_stopper_thread) from [<c0271cd4>] (smpboot_thread_fn+0x1ac/0x1bc)
[<c0271cd4>] (smpboot_thread_fn) from [<c026ddf0>] (kthread+0xe0/0xfc)
[<c026ddf0>] (kthread) from [<c020f318>] (ret_from_fork+0x14/0x20)
---[ end trace dc9ce72c5b617d8f ]---
[ 65.047264] ftrace failed to modify [<c0208580>] asm_do_IRQ+0x10/0x1c
[ 65.054070] actual: 85:1b:00:eb
Fixes: 7413af1fb7 "ftrace: Make get_ftrace_addr() and get_ftrace_addr_old() global"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latest rewrite of ftrace removed the separate ftrace_ops of
the function tracer and the function graph tracer and had them
share the same ftrace_ops. This simplified the accounting by removing
the multiple layers of functions called, where the global_ops func
would call a special list that would iterate over the other ops that
were registered within it (like function and function graph), which
itself was registered to the ftrace ops list of all functions
currently active. If that sounds confusing, the code that implemented
it was also confusing and its removal is a good thing.
The problem with this change was that it assumed that the function
and function graph tracer can never be used at the same time.
This is mostly true, but there is an exception. That is when the
function profiler uses the function graph tracer to profile.
The function profiler can be activated the same time as the function
tracer, and this breaks the assumption and the result is that ftrace
will crash (it detects the error and shuts itself down, it does not
cause a kernel oops).
To solve this issue, a previous change allowed the hash tables
for the functions traced by a ftrace_ops to be a pointer and let
multiple ftrace_ops share the same hash. This allows the function
and function_graph tracer to have separate ftrace_ops, but still
share the hash, which is what is done.
Now the function and function graph tracers have separate ftrace_ops
again, and the function tracer can be run while the function_profile
is active.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that a ftrace_hash can be shared by multiple ftrace_ops, they can dec
the rec->flags by more than once (one per those that share the ftrace_hash).
This means that the tramp_hash may not have a hash item when it was added.
For example, if two ftrace_ops share a hash for a ftrace record, and the
first ops has a trampoline, when it adds itself it will set the rec->flags
TRAMP flag and increments its nr_trampolines counter. When the second ops
is added, it must clear that tramp flag but also decrement the other ops
that shares its hash. As the update to the function callbacks has not yet
been performed, the other ops will not have the tramp hash set yet and it
can not be used to know to decrement its nr_trampolines.
Luckily, the tramp_hash does not need to be used. As the ftrace_mutex is
held, a ops with a trampoline to a record during an update of another ops
that shares the record will have its func_hash pointing to it. Since a
trampoline can only be set for a record if only one ops is attached to it,
we can just check if the record has a trampoline (the FTRACE_FL_TRAMP flag
is set) and then find the ops that has this record in its hashes.
Also added some output to help debug when things go wrong.
Cc: stable@vger.kernel.org # 3.16+ (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When updating what an ftrace_ops traces, if it is registered (that is,
actively tracing), and that ftrace_ops uses the shared global_ops
local_hash, then we need to update all tracers that are active and
also share the global_ops' ftrace_hash_ops.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the top level debug file system function tracer shares its
ftrace_ops with the function graph tracer. This was thought to be fine
because the tracers are not used together, as one can only enable
function or function_graph tracer in the current_tracer file.
But that assumption proved to be incorrect. The function profiler
can use the function graph tracer when function tracing is enabled.
Since all function graph users uses the function tracing ftrace_ops
this causes a conflict and when a user enables both function profiling
as well as the function tracer it will crash ftrace and disable it.
The quick solution so far is to move them as separate ftrace_ops like
it was earlier. The problem though is to synchronize the functions that
are traced because both function and function_graph tracer are limited
by the selections made in the set_ftrace_filter and set_ftrace_notrace
files.
To handle this, a new structure is made called ftrace_ops_hash. This
structure will now hold the filter_hash and notrace_hash, and the
ftrace_ops will point to this structure. That will allow two ftrace_ops
to share the same hashes.
Since most ftrace_ops do not share the hashes, and to keep allocation
simple, the ftrace_ops structure will include both a pointer to the
ftrace_ops_hash called func_hash, as well as the structure itself,
called local_hash. When the ops are registered, the func_hash pointer
will be initialized to point to the local_hash within the ftrace_ops
structure. Some of the ftrace internal ftrace_ops will be initialized
statically. This will allow for the function and function_graph tracer
to have separate ops but still share the same hash tables that determine
what functions they trace.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
rarely ever hit, and requires the user to do something that users rarely
do. It took a few special test cases to even trigger this bug,
and one of them was just one test in the process of finishing up as another
one started.
Both bugs have to do with the ring buffer iterator rb_iter_peek(), but one
is more indirect than the other.
The fist bug fix is simply an increase in the safety net loop counter.
The counter makes sure that the rb_iter_peek() only iterates the number
of times we expect it can, and no more. Well, there was one way it could
iterate one more than we expected, and that caused the ring buffer
to shutdown with a nasty warning. The fix was simply to up that counter by
one.
The other bug has to be with rb_iter_reset() (called by rb_iter_peek()).
This happens when a user reads both the trace_pipe and trace files.
The trace_pipe is a consuming read and does not use the ring buffer
iterator, but the trace file is not a consuming read and does use the
ring buffer iterator. When the trace file is being read, if it detects
that a consuming read occurred, it resets the iterator and starts over.
But the reset code that does this (rb_iter_reset()), checks if the
reader_page is linked to the ring buffer or not, and will look into
the ring buffer itself if it is not. This is wrong, as it should always
try to read the reader page first. Not to mention, the code that looked
into the ring buffer did it wrong, and used the header_page "read" offset
to start reading on that page. That offset is bogus for pages in the
writable ring buffer, and was corrupting the iterator, and it would start
returning bogus events.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT44tRAAoJEKQekfcNnQGuMVIH/3evbjKT+w6Kon4S0LfLSejl
YDsXYkeO/lGiO3MnUveqq1jfw2+yHtyBVUipvfG0A61urMUhyUvjveph8Ec2cQ4Q
qHl0J28vDfF5tpKiYzygRN01wHD6GXYh+XZSChkA4ItzzD8K51lsZT1Yd+I2pTy2
DgH01EEEYiwYJcih+T4vlbKqYju6pwgxqKNCTv0RdVXUPya/tG9X2Nf8VGQTbmKS
FIO73qObYR+P9iXGIuPfyOxk2EvBiWS15WownZmfhZZxOIKx9IrDYwTsiV1+AJw+
sJFER1ulobYGDtGDa9yyrNJQr6wgbrfCDELnNKmdLUTlSwgZjLXpE2HEmlelY/s=
=5mQl
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace file read iterator fixes from Steven Rostedt:
"This contains a fix for two long standing bugs. Both of which are
rarely ever hit, and requires the user to do something that users
rarely do. It took a few special test cases to even trigger this bug,
and one of them was just one test in the process of finishing up as
another one started.
Both bugs have to do with the ring buffer iterator rb_iter_peek(), but
one is more indirect than the other.
The fist bug fix is simply an increase in the safety net loop counter.
The counter makes sure that the rb_iter_peek() only iterates the
number of times we expect it can, and no more. Well, there was one
way it could iterate one more than we expected, and that caused the
ring buffer to shutdown with a nasty warning. The fix was simply to
up that counter by one.
The other bug has to be with rb_iter_reset() (called by
rb_iter_peek()). This happens when a user reads both the trace_pipe
and trace files. The trace_pipe is a consuming read and does not use
the ring buffer iterator, but the trace file is not a consuming read
and does use the ring buffer iterator. When the trace file is being
read, if it detects that a consuming read occurred, it resets the
iterator and starts over. But the reset code that does this
(rb_iter_reset()), checks if the reader_page is linked to the ring
buffer or not, and will look into the ring buffer itself if it is not.
This is wrong, as it should always try to read the reader page first.
Not to mention, the code that looked into the ring buffer did it
wrong, and used the header_page "read" offset to start reading on that
page. That offset is bogus for pages in the writable ring buffer, and
was corrupting the iterator, and it would start returning bogus
events"
* tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Always reset iterator to reader page
ring-buffer: Up rb_iter_peek() loop count to 3
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).
When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.
Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.
My tests would sometimes trigger this bug on my i386 box:
WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
[<c18796b3>] dump_stack+0x4b/0x75
[<c103a0e3>] warn_slowpath_common+0x7e/0x95
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c103a185>] warn_slowpath_fmt+0x33/0x35
[<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
[<c10bed04>] trace_find_cmdline+0x40/0x64
[<c10c3c16>] trace_print_context+0x27/0xec
[<c10c4360>] ? trace_seq_printf+0x37/0x5b
[<c10c0b15>] print_trace_line+0x319/0x39b
[<c10ba3fb>] ? ring_buffer_read+0x47/0x50
[<c10c13b1>] s_show+0x192/0x1ab
[<c10bfd9a>] ? s_next+0x5a/0x7c
[<c112e76e>] seq_read+0x267/0x34c
[<c1115a25>] vfs_read+0x8c/0xef
[<c112e507>] ? seq_lseek+0x154/0x154
[<c1115ba2>] SyS_read+0x54/0x7f
[<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####
Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.
After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.
After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().
As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.
I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid. The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.
There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.
Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.
Cc: stable@vger.kernel.org # 2.6.28+
Fixes: d769041f86 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:
WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
Modules linked in: ipt_MASQUERADE sunrpc [...]
CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
Call Trace:
[<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
[<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
[<ffffffff810c74a3>] ? s_start+0xd7/0x17b
[<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
[<ffffffff8114cf94>] ? seq_read+0x148/0x361
[<ffffffff81132d98>] ? vfs_read+0x93/0xf1
[<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
[<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!
rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.
That is, we have this:
1. iter->head > iter->head_page->page->commit
(rb_inc_iter() which moves the iter to the next page)
try again
2. event = rb_iter_head_event()
event->type_len == RINGBUF_TYPE_TIME_EXTEND
rb_advance_iter()
try again
3. read the event.
But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.
Up the counter to 3.
Cc: stable@vger.kernel.org # 2.6.37+
Fixes: 69d1b839f7 "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co
- Core timekeeping code is year-2038 safe now for 32bit machines.
Now we just need to fix all in kernel users and the gazillion of
user space interfaces which rely on timespec/timeval :)
- Better cache layout for the timekeeping internal data structures.
- Proper nanosecond based interfaces for in kernel users.
- Tree wide cleanup of code which wants nanoseconds but does hoops
and loops to convert back and forth from timespecs. Some of it
definitely belongs into the ugly code museum.
- Consolidation of the timekeeping interface zoo.
- A fast NMI safe accessor to clock monotonic for tracing. This is a
long standing request to support correlated user/kernel space
traces. With proper NTP frequency correction it's also suitable
for correlation of traces accross separate machines.
- Checkpoint/restart support for timerfd.
- A few NOHZ[_FULL] improvements in the [hr]timer code.
- Code move from kernel to kernel/time of all time* related code.
- New clocksource/event drivers from the ARM universe. I'm really
impressed that despite an architected timer in the newer chips SoC
manufacturers insist on inventing new and differently broken SoC
specific timers.
[ Ed. "Impressed"? I don't think that word means what you think it means ]
- Another round of code move from arch to drivers. Looks like most
of the legacy mess in ARM regarding timers is sorted out except for
a few obnoxious strongholds.
- The usual updates and fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
timekeeping: Fixup typo in update_vsyscall_old definition
clocksource: document some basic timekeeping concepts
timekeeping: Use cached ntp_tick_length when accumulating error
timekeeping: Rework frequency adjustments to work better w/ nohz
timekeeping: Minor fixup for timespec64->timespec assignment
ftrace: Provide trace clocks monotonic
timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
seqcount: Add raw_write_seqcount_latch()
seqcount: Provide raw_read_seqcount()
timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
timekeeping: Create struct tk_read_base and use it in struct timekeeper
timekeeping: Restructure the timekeeper some more
clocksource: Get rid of cycle_last
clocksource: Move cycle_last validation to core code
clocksource: Make delta calculation a function
wireless: ath9k: Get rid of timespec conversions
drm: vmwgfx: Use nsec based interfaces
drm: i915: Use nsec based interfaces
timekeeping: Provide ktime_get_raw()
hangcheck-timer: Use ktime_get_ns()
...
Pull perf changes from Ingo Molnar:
"Kernel side changes:
- Consolidate the PMU interrupt-disabled code amongst architectures
(Vince Weaver)
- misc fixes
Tooling changes (new features, user visible changes):
- Add support for pagefault tracing in 'trace', please see multiple
examples in the changeset messages (Stanislav Fomichev).
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add header for columns in 'top' and 'report' TUI browsers (Jiri
Olsa)
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add IO mode into timechart command (Stanislav Fomichev)
- Fallback to syscalls:* when raw_syscalls:* is not available in the
perl and python perf scripts. (Daniel Bristot de Oliveira)
- Add --repeat global option to 'perf bench' to be used in benchmarks
such as the existing 'futex' one, that was modified to use it
instead of a local option. (Davidlohr Bueso)
- Fix fd -> pathname resolution in 'trace', be it using /proc or a
vfs_getname probe point. (Arnaldo Carvalho de Melo)
- Add suggestion of how to set perf_event_paranoid sysctl, to help
non-root users trying tools like 'trace' to get a working
environment. (Arnaldo Carvalho de Melo)
- Updates from trace-cmd for traceevent plugin_kvm plus args cleanup
(Steven Rostedt, Jan Kiszka)
- Support S/390 in 'perf kvm stat' (Alexander Yarygin)
Tooling infrastructure changes:
- Allow reserving a row for header purposes in the hists browser
(Arnaldo Carvalho de Melo)
- Various fixes and prep work related to supporting Intel PT (Adrian
Hunter)
- Introduce multiple debug variables control (Jiri Olsa)
- Add callchain and additional sample information for python scripts
(Joseph Schuchart)
- More prep work to support Intel PT: (Adrian Hunter)
- Polishing 'script' BTS output
- 'inject' can specify --kallsym
- VDSO is per machine, not a global var
- Expose data addr lookup functions previously private to 'script'
- Large mmap fixes in events processing
- Include standard stringify macros in power pc code (Sukadev
Bhattiprolu)
Tooling cleanups:
- Convert open coded equivalents to asprintf() (Andy Shevchenko)
- Remove needless reassignments in 'trace' (Arnaldo Carvalho de Melo)
- Cache the is_exit syscall test in 'trace) (Arnaldo Carvalho de
Melo)
- No need to reimplement err() in 'perf bench sched-messaging', drop
barf(). (Davidlohr Bueso).
- Remove ev_name argument from perf_evsel__hists_browse, can be
obtained from the other parameters. (Jiri Olsa)
Tooling fixes:
- Fix memory leak in the 'sched-messaging' perf bench test.
(Davidlohr Bueso)
- The -o and -n 'perf bench mem' options are mutually exclusive, emit
error when both are specified. (Davidlohr Bueso)
- Fix scrollbar refresh row index in the ui browser, problem exposed
now that headers will be added and will be allowed to be switched
on/off. (Jiri Olsa)
- Handle the num array type in python properly (Sebastian Andrzej
Siewior)
- Fix wrong condition for allocation failure (Jiri Olsa)
- Adjust callchain based on DWARF debug info on powerpc (Sukadev
Bhattiprolu)
- Fix a risk for doing free on uninitialized pointer in traceevent
lib (Rickard Strandqvist)
- Update attr test with PERF_FLAG_FD_CLOEXEC flag (Jiri Olsa)
- Enable close-on-exec flag on perf file descriptor (Yann Droneaud)
- Fix build on gcc 4.4.7 (Arnaldo Carvalho de Melo)
- Event ordering fixes (Jiri Olsa)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (123 commits)
Revert "perf tools: Fix jump label always changing during tracing"
perf tools: Fix perf usage string leftover
perf: Check permission only for parent tracepoint event
perf record: Store PERF_RECORD_FINISHED_ROUND only for nonempty rounds
perf record: Always force PERF_RECORD_FINISHED_ROUND event
perf inject: Add --kallsyms parameter
perf tools: Expose 'addr' functions so they can be reused
perf session: Fix accounting of ordered samples queue
perf powerpc: Include util/util.h and remove stringify macros
perf tools: Fix build on gcc 4.4.7
perf tools: Add thread parameter to vdso__dso_findnew()
perf tools: Add dso__type()
perf tools: Separate the VDSO map name from the VDSO dso name
perf tools: Add vdso__new()
perf machine: Fix the lifetime of the VDSO temporary file
perf tools: Group VDSO global variables into a structure
perf session: Add ability to skip 4GiB or more
perf session: Add ability to 'skip' a non-piped event stream
perf tools: Pass machine to vdso__dso_findnew()
perf tools: Add dso__data_size()
...
As he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than
my main development branch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT3533AAoJEKQekfcNnQGuaNkIALM9uf6gvIXgvyAlhK1qfjJL
0pxAlgppukA7hZT9gcubGuStPxKnwtVSvd/GwGI/mc7Ls57vOvdA61TEHN48l2MW
SUev1Jeu1Nx3Wh4B2ivRq+dJadrwU9NwG507yvuXn2L8Gqzwal133dg3AHnsTYgz
yu7oe8baxMniFFYlAmSELFp2cA3kli/WrRuQB7weudbT12jZlpOMqz7OUN9o7dfo
E4XfwkkDUuo9EiKD1QwVMk2cIkCaHFI8lJDYJbrM3L5XJKfnKzZNRMWfwDG2L7hS
DGjMl6T0SjU+ZIoTidHdbyF/Q+I0o9kKTnUrErGu88+2Z1gfEKN5O8zovNMbPUM=
=dT/x
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter cleanups from Steven Rostedt:
"Oleg Nesterov did several clean ups with the tracing filter code. As
he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than my
main development branch.
This was based on work that was already pulled into 3.16, and is a
separate pull request to keep from having local merges in my pull
request"
* tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Kill "filter_string" arg of replace_preds()
tracing: Change apply_subsystem_event_filter() paths to check file->system == dir
tracing: Kill ftrace_event_call->files
tracing/uprobes: Kill the dead TRACE_EVENT_FL_USE_CALL_FILTER logic
tracing: Kill call_filter_disable()
tracing: Kill destroy_call_preds()
tracing: Kill destroy_preds() and destroy_file_preds()
to the ftrace function callback infrastructure. It's introducing a
way to allow different functions to call directly different trampolines
instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which always
had a different trampoline, but the function tracer trampoline was called
and did basically nothing, and then the function graph tracer trampoline
was called. The difference now, is that the function graph tracer
trampoline can be called directly if a function is only being traced by
the function graph trampoline. If function tracing is also happening on
the same function, the old way is still done.
The accounting for this takes up more memory when function graph tracing
is activated, as it needs to keep track of which functions it uses.
I have a new way that wont take as much memory, but it's not ready yet
for this merge window, and will have to wait for the next one.
Another big change was the removal of the ftrace_start/stop() calls that
were used by the suspend/resume code that stopped function tracing when
entering into suspend and resume paths. The stop of ftrace was done
because there was some function that would crash the system if one called
smp_processor_id()! The stop/start was a big hammer to solve the issue
at the time, which was when ftrace was first introduced into Linux.
Now ftrace has better infrastructure to debug such issues, and I found
the problem function and labeled it with "notrace" and function tracing
can now safely be activated all the way down into the guts of suspend
and resume.
Other changes include clean ups of uprobe code.
Clean up of the trace_seq() code.
And other various small fixes and clean ups to ftrace and tracing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT35zXAAoJEKQekfcNnQGuOz0H/38zqM0nLFhrgvz3EPk2UOjn
xqpX8qyb2V7TJZL+IqeXU2a5cQZl5ba0D4WtBGpxbTae3CJYiuQ87iKUNFoH0om5
FDpn80igb368k8V3qRdRsziKVCCf0XBd/NkHJXc0ZkfXGyzB2Ga4bBxALxp2gj9y
bnO+vKo6+tWYKG4hyQb4P3LRXUrK8/LWEsPr39cH2QH1Rdj69Lx9CgrCdUVJmwcb
Bj8hEiLXL/RYCFNn79A3wNTUvW0rG/AOIf4SLqXtasSRZ0ToaU0ZyDnrNv+0Ol47
rX8tSk+LfXchL9hpIvjCf1vlAYq3pO02favteR/jip3lx/dTjEDE4RJ9qtJzZ4Q=
=fwQY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a lot of work done. The main thing is the
changes to the ftrace function callback infrastructure. It's
introducing a way to allow different functions to call directly
different trampolines instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which
always had a different trampoline, but the function tracer trampoline
was called and did basically nothing, and then the function graph
tracer trampoline was called. The difference now, is that the
function graph tracer trampoline can be called directly if a function
is only being traced by the function graph trampoline. If function
tracing is also happening on the same function, the old way is still
done.
The accounting for this takes up more memory when function graph
tracing is activated, as it needs to keep track of which functions it
uses. I have a new way that wont take as much memory, but it's not
ready yet for this merge window, and will have to wait for the next
one.
Another big change was the removal of the ftrace_start/stop() calls
that were used by the suspend/resume code that stopped function
tracing when entering into suspend and resume paths. The stop of
ftrace was done because there was some function that would crash the
system if one called smp_processor_id()! The stop/start was a big
hammer to solve the issue at the time, which was when ftrace was first
introduced into Linux. Now ftrace has better infrastructure to debug
such issues, and I found the problem function and labeled it with
"notrace" and function tracing can now safely be activated all the way
down into the guts of suspend and resume
Other changes include clean ups of uprobe code, clean up of the
trace_seq() code, and other various small fixes and clean ups to
ftrace and tracing"
* tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
ftrace: Add warning if tramp hash does not match nr_trampolines
ftrace: Fix trampoline hash update check on rec->flags
ring-buffer: Use rb_page_size() instead of open coded head_page size
ftrace: Rename ftrace_ops field from trampolines to nr_trampolines
tracing: Convert local function_graph functions to static
ftrace: Do not copy old hash when resetting
tracing: let user specify tracing_thresh after selecting function_graph
ring-buffer: Always run per-cpu ring buffer resize with schedule_work_on()
tracing: Remove function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST
s390/ftrace: remove check of obsolete variable function_trace_stop
arm64, ftrace: Remove check of obsolete variable function_trace_stop
Blackfin: ftrace: Remove check of obsolete variable function_trace_stop
metag: ftrace: Remove check of obsolete variable function_trace_stop
microblaze: ftrace: Remove check of obsolete variable function_trace_stop
MIPS: ftrace: Remove check of obsolete variable function_trace_stop
parisc: ftrace: Remove check of obsolete variable function_trace_stop
sh: ftrace: Remove check of obsolete variable function_trace_stop
sparc64,ftrace: Remove check of obsolete variable function_trace_stop
tile: ftrace: Remove check of obsolete variable function_trace_stop
ftrace: x86: Remove check of obsolete variable function_trace_stop
...
There's no need to check cloned event's permission once the
parent was already checked.
Also the code is checking 'current' process permissions, which
is not owner process for cloned events, thus could end up with
wrong permission check result.
Reported-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Tested-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1405079782-8139-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After adding all the records to the tramp_hash, add a check that makes
sure that the number of records added matches the number of records
expected to match and do a WARN_ON and disable ftrace if they do
not match.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the loop of ftrace_save_ops_tramp_hash(), it adds all the recs
to the ops hash if the rec has only one callback attached and the
ops is connected to the rec. It gives a nasty warning and shuts down
ftrace if the rec doesn't have a trampoline set for it. But this
can happen with the following scenario:
# cd /sys/kernel/debug/tracing
# echo schedule do_IRQ > set_ftrace_filter
# mkdir instances/foo
# echo schedule > instances/foo/set_ftrace_filter
# echo function_graph > current_function
# echo function > instances/foo/current_function
# echo nop > instances/foo/current_function
The above would then trigger the following warning and disable
ftrace:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3145 at kernel/trace/ftrace.c:2212 ftrace_run_update_code+0xe4/0x15b()
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ip [...]
CPU: 1 PID: 3145 Comm: bash Not tainted 3.16.0-rc3-test+ #136
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81808a88 ffffffff81502130 0000000000000000
ffffffff81040ca1 ffff880077c08000 ffffffff810bd286 0000000000000001
ffffffff81a56830 ffff88007a041be0 ffff88007a872d60 00000000000001be
Call Trace:
[<ffffffff81502130>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bda1a>] ? ftrace_shutdown+0x11c/0x16b
[<ffffffff810bda87>] ? unregister_ftrace_function+0x1e/0x38
[<ffffffff810cc7e1>] ? function_trace_reset+0x1a/0x28
[<ffffffff810c924f>] ? tracing_set_tracer+0xc1/0x276
[<ffffffff810c9477>] ? tracing_set_trace_write+0x73/0x91
[<ffffffff81132383>] ? __sb_start_write+0x9a/0xcc
[<ffffffff8120478f>] ? security_file_permission+0x1b/0x31
[<ffffffff81130e49>] ? vfs_write+0xac/0x11c
[<ffffffff8113115d>] ? SyS_write+0x60/0x8e
[<ffffffff81508112>] ? system_call_fastpath+0x16/0x1b
---[ end trace 938c4415cbc7dc96 ]---
------------[ cut here ]------------
Link: http://lkml.kernel.org/r/20140723120805.GB21376@redhat.com
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a helper function to get a ring buffer page size (the number
of bytes of data recorded on the page), called rb_page_size().
Use that instead of open coding it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Expose the new NMI safe accessor to clock monotonic to the tracer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Having two fields within the same struct that is off by one character
can be confusing and error prone. Rename the counter "trampolines"
to "nr_trampolines" to explicitly show it is a counter and not to
be confused by the "trampoline" field.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "uptime" trace clock added in:
commit 8aacf017b0
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b0 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not waste time copying the old hash if the hash is going to be
reset. Just allocate a new hash and free the old one, as that is
the same result as copying te old one and then resetting it.
Link: http://lkml.kernel.org/p/1405384820-48837-1-git-send-email-wangnan0@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
[ SDR: Removed unused ftrace_filter_reset() function ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, tracing_thresh works only if we specify it before selecting
function_graph tracer. If we do the opposite, tracing_thresh will change
it's value, but it will not be applied.
To fix it, we add update_thresh callback which is called whenever
tracing_thresh is updated and for function_graph tracer we register
handler which reinitializes tracer depending on tracing_thresh.
Link: http://lkml.kernel.org/p/20140718111727.GA3206@stfomichev-desktop.yandex.net
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code for resizing the trace ring buffers has to run the per-cpu
resize on the CPU itself. The code was using preempt_off() and
running the code for the current CPU directly, otherwise calling
schedule_work_on().
At least on RT this could result in the following:
|BUG: sleeping function called from invalid context at kernel/rtmutex.c:673
|in_atomic(): 1, irqs_disabled(): 0, pid: 607, name: bash
|3 locks held by bash/607:
|CPU: 0 PID: 607 Comm: bash Not tainted 3.12.15-rt25+ #124
|(rt_spin_lock+0x28/0x68)
|(free_hot_cold_page+0x84/0x3b8)
|(free_buffer_page+0x14/0x20)
|(rb_update_pages+0x280/0x338)
|(ring_buffer_resize+0x32c/0x3dc)
|(free_snapshot+0x18/0x38)
|(tracing_set_tracer+0x27c/0x2ac)
probably via
|cd /sys/kernel/debug/tracing/
|echo 1 > events/enable ; sleep 2
|echo 1024 > buffer_size_kb
If we just always use schedule_work_on(), there's no need for the
preempt_off(). So do that.
Link: http://lkml.kernel.org/p/1405537633-31518-1-git-send-email-cminyard@mvista.com
Reported-by: Stanislav Meduna <stano@meduna.org>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>