mirror of https://gitee.com/openkylin/linux.git
doc: Update stallwarn.txt to make causes more prominent
This commit rearranges the Documentation/RCU/stallwarn.txt file to put the list of issues that can cause RCU CPU stall warnings near the beginning of the document.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
parent b4553f0cfe
commit 8e2a439753
@@ -1,9 +1,102 @@
 Using RCU's CPU Stall Detector

-The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
-detector, which detects conditions that unduly delay RCU grace periods.
-This module parameter enables CPU stall detection by default, but
-may be overridden via boot-time parameter or at runtime via sysfs.
+This document first discusses what sorts of issues RCU's CPU stall
+detector can locate, and then discusses kernel parameters and Kconfig
+options that can be used to fine-tune the detector's operation. Finally,
+this document explains the stall detector's "splat" format.
+
+
+What Causes RCU CPU Stall Warnings?
+
+So your kernel printed an RCU CPU stall warning. The next question is
+"What caused it?" The following problems can result in RCU CPU stall
+warnings:
+
+o	A CPU looping in an RCU read-side critical section.
+
+o	A CPU looping with interrupts disabled.
+
+o	A CPU looping with preemption disabled. This condition can
+	result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
+	stalls.
+
+o	A CPU looping with bottom halves disabled. This condition can
+	result in RCU-sched and RCU-bh stalls.
+
+o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
+	kernel without invoking schedule(). Note that cond_resched()
+	does not necessarily prevent RCU CPU stall warnings. Therefore,
+	if the looping in the kernel is really expected and desirable
+	behavior, you might need to replace some of the cond_resched()
+	calls with calls to cond_resched_rcu_qs().
+
+o	Booting Linux using a console connection that is too slow to
+	keep up with the boot-time console-message rate. For example,
+	a 115Kbaud serial console can be -way- too slow to keep up
+	with boot-time message rates, and will frequently result in
+	RCU CPU stall warning messages, especially if you have added
+	debug printk()s.
+
+o	Anything that prevents RCU's grace-period kthreads from running.
+	This can result in the "All QSes seen" console-log message.
+	This message will include information on when the kthread last
+	ran and how often it should be expected to run.
+
+o	A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
+	happen to preempt a low-priority task in the middle of an RCU
+	read-side critical section. This is especially damaging if
+	that low-priority task is not permitted to run on any other CPU,
+	in which case the next RCU grace period can never complete, which
+	will eventually cause the system to run out of memory and hang.
+	While the system is in the process of running itself out of
+	memory, you might see stall-warning messages.
+
+o	A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
+	is running at a higher priority than the RCU softirq threads.
+	This will prevent RCU callbacks from ever being invoked,
+	and in a CONFIG_PREEMPT_RCU kernel will further prevent
+	RCU grace periods from ever completing. Either way, the
+	system will eventually run out of memory and hang. In the
+	CONFIG_PREEMPT_RCU case, you might see stall-warning
+	messages.
+
+o	A hardware or software issue shuts off the scheduler-clock
+	interrupt on a CPU that is not in dyntick-idle mode. This
+	problem really has happened, and seems to be most likely to
+	result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
+
+o	A bug in the RCU implementation.
+
+o	A hardware failure. This is quite unlikely, but has occurred
+	at least once in real life. A CPU failed in a running system,
+	becoming unresponsive, but not causing an immediate crash.
+	This resulted in a series of RCU CPU stall warnings, eventually
+	leading to the realization that the CPU had failed.
+
+The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
+warnings. Note that SRCU does -not- have CPU stall warnings. Please note
+that RCU only detects CPU stalls when there is a grace period in progress.
+No grace period, no CPU stall warnings.
+
+To diagnose the cause of the stall, inspect the stack traces.
+The offending function will usually be near the top of the stack.
+If you have a series of stall warnings from a single extended stall,
+comparing the stack traces can often help determine where the stall
+is occurring, which will usually be in the function nearest the top of
+that portion of the stack which remains the same from trace to trace.
+If you can reliably trigger the stall, ftrace can be quite helpful.
+
+RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
+and with RCU's event tracing. For information on RCU's event tracing,
+see include/trace/events/rcu.h.
+
+
+Fine-Tuning the RCU CPU Stall Detector
+
+The rcupdate.rcu_cpu_stall_suppress module parameter disables RCU's
+CPU stall detector, which detects conditions that unduly delay RCU grace
+periods. CPU stall detection is enabled by default, but this default
+may be overridden via boot-time parameter or at runtime via sysfs.
 The stall detector's idea of what constitutes "unduly delayed" is
 controlled by a set of kernel configuration variables and cpp macros:

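The fine-tuning text in the hunk above says that stall detection can be overridden at boot time or at runtime via sysfs. A minimal sketch of the runtime side follows. It assumes that the parameter is exposed as rcupdate.rcu_cpu_stall_suppress (the same prefix as rcupdate.rcu_task_stall_timeout in this file), that it therefore appears under /sys/module/rcupdate/parameters/, and that a nonzero value suppresses warnings. The helper names are hypothetical and are not part of the kernel or of this commit. Under the same assumption, the boot-time form would be rcupdate.rcu_cpu_stall_suppress=1 on the kernel command line.

```python
from pathlib import Path

# Assumed sysfs location of the parameter on a running kernel.
DEFAULT_PARAM = Path("/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress")


def stall_warnings_suppressed(param: Path = DEFAULT_PARAM) -> bool:
    """Return True if the parameter file holds a nonzero integer."""
    return int(param.read_text().strip()) != 0


def suppress_stall_warnings(param: Path = DEFAULT_PARAM,
                            suppress: bool = True) -> None:
    """Write 1 or 0 to the parameter file (needs root on a real system)."""
    param.write_text("1\n" if suppress else "0\n")
```

Passing the path as an argument keeps the helpers testable against an ordinary file, so nothing here has to touch a live system.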
@@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout
 And continues with the output of sched_show_task() for each
 task stalling the current RCU-tasks grace period.

+
+Interpreting RCU's CPU Stall-Detector "Splats"
+
 For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
 it will print a message similar to the following:

@@ -178,89 +274,3 @@ grace period is in flight.

 It is entirely possible to see stall warnings from normal and from
 expedited grace periods at about the same time from the same run.
-
-
-What Causes RCU CPU Stall Warnings?
-
-So your kernel printed an RCU CPU stall warning. The next question is
-"What caused it?" The following problems can result in RCU CPU stall
-warnings:
-
-o	A CPU looping in an RCU read-side critical section.
-
-o	A CPU looping with interrupts disabled. This condition can
-	result in RCU-sched and RCU-bh stalls.
-
-o	A CPU looping with preemption disabled. This condition can
-	result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
-	stalls.
-
-o	A CPU looping with bottom halves disabled. This condition can
-	result in RCU-sched and RCU-bh stalls.
-
-o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
-	kernel without invoking schedule(). Note that cond_resched()
-	does not necessarily prevent RCU CPU stall warnings. Therefore,
-	if the looping in the kernel is really expected and desirable
-	behavior, you might need to replace some of the cond_resched()
-	calls with calls to cond_resched_rcu_qs().
-
-o	Booting Linux using a console connection that is too slow to
-	keep up with the boot-time console-message rate. For example,
-	a 115Kbaud serial console can be -way- too slow to keep up
-	with boot-time message rates, and will frequently result in
-	RCU CPU stall warning messages, especially if you have added
-	debug printk()s.
-
-o	Anything that prevents RCU's grace-period kthreads from running.
-	This can result in the "All QSes seen" console-log message.
-	This message will include information on when the kthread last
-	ran and how often it should be expected to run.
-
-o	A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
-	happen to preempt a low-priority task in the middle of an RCU
-	read-side critical section. This is especially damaging if
-	that low-priority task is not permitted to run on any other CPU,
-	in which case the next RCU grace period can never complete, which
-	will eventually cause the system to run out of memory and hang.
-	While the system is in the process of running itself out of
-	memory, you might see stall-warning messages.
-
-o	A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
-	is running at a higher priority than the RCU softirq threads.
-	This will prevent RCU callbacks from ever being invoked,
-	and in a CONFIG_PREEMPT_RCU kernel will further prevent
-	RCU grace periods from ever completing. Either way, the
-	system will eventually run out of memory and hang. In the
-	CONFIG_PREEMPT_RCU case, you might see stall-warning
-	messages.
-
-o	A hardware or software issue shuts off the scheduler-clock
-	interrupt on a CPU that is not in dyntick-idle mode. This
-	problem really has happened, and seems to be most likely to
-	result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
-
-o	A bug in the RCU implementation.
-
-o	A hardware failure. This is quite unlikely, but has occurred
-	at least once in real life. A CPU failed in a running system,
-	becoming unresponsive, but not causing an immediate crash.
-	This resulted in a series of RCU CPU stall warnings, eventually
-	leading to the realization that the CPU had failed.
-
-The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
-warnings. Note that SRCU does -not- have CPU stall warnings. Please note
-that RCU only detects CPU stalls when there is a grace period in progress.
-No grace period, no CPU stall warnings.
-
-To diagnose the cause of the stall, inspect the stack traces.
-The offending function will usually be near the top of the stack.
-If you have a series of stall warnings from a single extended stall,
-comparing the stack traces can often help determine where the stall
-is occurring, which will usually be in the function nearest the top of
-that portion of the stack which remains the same from trace to trace.
-If you can reliably trigger the stall, ftrace can be quite helpful.
-
-RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
-and with RCU's event tracing. For information on RCU's event tracing,
-see include/trace/events/rcu.h.
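The relocated section advises comparing stack traces from a single extended stall and looking at the function nearest the top of the portion that stays the same from trace to trace. A minimal sketch of that comparison follows. It assumes that each trace has already been parsed into a list of function names with the top of the stack first. The helper name and the frame names are hypothetical, and parsing real stall-warning output is not attempted.

```python
from typing import List, Optional


def stable_top_frame(traces: List[List[str]]) -> Optional[str]:
    """Return the topmost frame of the stack portion shared by every trace.

    Each trace lists function names with the top of the stack first, so the
    portion that remains the same from trace to trace is the longest common
    suffix of the lists.
    """
    if not traces:
        return None
    shortest = min(len(t) for t in traces)
    shared = 0
    # Walk up from the bottom of the stack while all traces agree.
    while shared < shortest and len({t[-1 - shared] for t in traces}) == 1:
        shared += 1
    if shared == 0:
        return None
    return traces[0][-shared]


# Hypothetical frames from three warnings of one extended stall.
traces = [
    ["delay_loop", "poll_device", "driver_work", "worker_thread", "kthread"],
    ["read_status", "poll_device", "driver_work", "worker_thread", "kthread"],
    ["poll_device", "driver_work", "worker_thread", "kthread"],
]
print(stable_top_frame(traces))  # prints "poll_device"
```

In this made-up example the frames above poll_device differ between warnings, which points at poll_device as the function that is looping.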