2742 lines
104 KiB
Plaintext
2742 lines
104 KiB
Plaintext
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
"http://www.w3.org/TR/html4/loose.dtd">
|
|
<html>
|
|
<head><title>A Tour Through RCU's Requirements [LWN.net]</title>
|
|
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
|
|
|
|
<h1>A Tour Through RCU's Requirements</h1>
|
|
|
|
<p>Copyright IBM Corporation, 2015</p>
|
|
<p>Author: Paul E. McKenney</p>
|
|
<p><i>The initial version of this document appeared in the
|
|
<a href="https://lwn.net/">LWN</a> articles
|
|
<a href="https://lwn.net/Articles/652156/">here</a>,
|
|
<a href="https://lwn.net/Articles/652677/">here</a>, and
|
|
<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>
|
|
|
|
<h2>Introduction</h2>
|
|
|
|
<p>
|
|
Read-copy update (RCU) is a synchronization mechanism that is often
|
|
used as a replacement for reader-writer locking.
|
|
RCU is unusual in that updaters do not block readers,
|
|
which means that RCU's read-side primitives can be exceedingly fast
|
|
and scalable.
|
|
In addition, updaters can make useful forward progress concurrently
|
|
with readers.
|
|
However, all this concurrency between RCU readers and updaters does raise
|
|
the question of exactly what RCU readers are doing, which in turn
|
|
raises the question of exactly what RCU's requirements are.
|
|
|
|
<p>
|
|
This document therefore summarizes RCU's requirements, and can be thought
|
|
of as an informal, high-level specification for RCU.
|
|
It is important to understand that RCU's specification is primarily
|
|
empirical in nature;
|
|
in fact, I learned about many of these requirements the hard way.
|
|
This situation might cause some consternation, however, not only
|
|
has this learning process been a lot of fun, but it has also been
|
|
a great privilege to work with so many people willing to apply
|
|
technologies in interesting new ways.
|
|
|
|
<p>
|
|
All that aside, here are the categories of currently known RCU requirements:
|
|
</p>
|
|
|
|
<ol>
|
|
<li> <a href="#Fundamental Requirements">
|
|
Fundamental Requirements</a>
|
|
<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
|
|
<li> <a href="#Parallelism Facts of Life">
|
|
Parallelism Facts of Life</a>
|
|
<li> <a href="#Quality-of-Implementation Requirements">
|
|
Quality-of-Implementation Requirements</a>
|
|
<li> <a href="#Linux Kernel Complications">
|
|
Linux Kernel Complications</a>
|
|
<li> <a href="#Software-Engineering Requirements">
|
|
Software-Engineering Requirements</a>
|
|
<li> <a href="#Other RCU Flavors">
|
|
Other RCU Flavors</a>
|
|
<li> <a href="#Possible Future Changes">
|
|
Possible Future Changes</a>
|
|
</ol>
|
|
|
|
<p>
|
|
This is followed by a <a href="#Summary">summary</a>,
|
|
which is in turn followed by the inevitable
|
|
<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
|
|
|
|
<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>
|
|
|
|
<p>
|
|
RCU's fundamental requirements are the closest thing RCU has to hard
|
|
mathematical requirements.
|
|
These are:
|
|
|
|
<ol>
|
|
<li> <a href="#Grace-Period Guarantee">
|
|
Grace-Period Guarantee</a>
|
|
<li> <a href="#Publish-Subscribe Guarantee">
|
|
Publish-Subscribe Guarantee</a>
|
|
<li> <a href="#Memory-Barrier Guarantees">
|
|
Memory-Barrier Guarantees</a>
|
|
<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally">
|
|
RCU Primitives Guaranteed to Execute Unconditionally</a>
|
|
<li> <a href="#Guaranteed Read-to-Write Upgrade">
|
|
Guaranteed Read-to-Write Upgrade</a>
|
|
</ol>
|
|
|
|
<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>
|
|
|
|
<p>
|
|
RCU's grace-period guarantee is unusual in being premeditated:
|
|
Jack Slingwine and I had this guarantee firmly in mind when we started
|
|
work on RCU (then called “rclock”) in the early 1990s.
|
|
That said, the past two decades of experience with RCU have produced
|
|
a much more detailed understanding of this guarantee.
|
|
|
|
<p>
|
|
RCU's grace-period guarantee allows updaters to wait for the completion
|
|
of all pre-existing RCU read-side critical sections.
|
|
An RCU read-side critical section
|
|
begins with the marker <tt>rcu_read_lock()</tt> and ends with
|
|
the marker <tt>rcu_read_unlock()</tt>.
|
|
These markers may be nested, and RCU treats a nested set as one
|
|
big RCU read-side critical section.
|
|
Production-quality implementations of <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
|
|
fact have exactly zero overhead in Linux kernels built for production
|
|
use with <tt>CONFIG_PREEMPT=n</tt>.
|
|
|
|
<p>
|
|
This guarantee allows ordering to be enforced with extremely low
|
|
overhead to readers, for example:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int x, y;
|
|
2
|
|
3 void thread0(void)
|
|
4 {
|
|
5 rcu_read_lock();
|
|
6 r1 = READ_ONCE(x);
|
|
7 r2 = READ_ONCE(y);
|
|
8 rcu_read_unlock();
|
|
9 }
|
|
10
|
|
11 void thread1(void)
|
|
12 {
|
|
13 WRITE_ONCE(x, 1);
|
|
14 synchronize_rcu();
|
|
15 WRITE_ONCE(y, 1);
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Because the <tt>synchronize_rcu()</tt> on line 14 waits for
|
|
all pre-existing readers, any instance of <tt>thread0()</tt> that
|
|
loads a value of zero from <tt>x</tt> must complete before
|
|
<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
|
|
also load a value of zero from <tt>y</tt>.
|
|
Similarly, any instance of <tt>thread0()</tt> that loads a value of
|
|
one from <tt>y</tt> must have started after the
|
|
<tt>synchronize_rcu()</tt> started, and must therefore also load
|
|
a value of one from <tt>x</tt>.
|
|
Therefore, the outcome:
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 0 && r2 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
cannot happen.
|
|
|
|
<p>@@QQ@@
|
|
Wait a minute!
|
|
You said that updaters can make useful forward progress concurrently
|
|
with readers, but pre-existing readers will block
|
|
<tt>synchronize_rcu()</tt>!!!
|
|
Just who are you trying to fool???
|
|
<p>@@QQA@@
|
|
First, if updaters do not wish to be blocked by readers, they can use
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
|
|
be discussed later.
|
|
Second, even when using <tt>synchronize_rcu()</tt>, the other
|
|
update-side code does run concurrently with readers, whether pre-existing
|
|
or not.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
This scenario resembles one of the first uses of RCU in
|
|
<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
|
|
which managed a distributed lock manager's transition into
|
|
a state suitable for handling recovery from node failure,
|
|
more or less as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 #define STATE_NORMAL 0
|
|
2 #define STATE_WANT_RECOVERY 1
|
|
3 #define STATE_RECOVERING 2
|
|
4 #define STATE_WANT_NORMAL 3
|
|
5
|
|
6 int state = STATE_NORMAL;
|
|
7
|
|
8 void do_something_dlm(void)
|
|
9 {
|
|
10 int state_snap;
|
|
11
|
|
12 rcu_read_lock();
|
|
13 state_snap = READ_ONCE(state);
|
|
14 if (state_snap == STATE_NORMAL)
|
|
15 do_something();
|
|
16 else
|
|
17 do_something_carefully();
|
|
18 rcu_read_unlock();
|
|
19 }
|
|
20
|
|
21 void start_recovery(void)
|
|
22 {
|
|
23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
|
|
24 synchronize_rcu();
|
|
25 WRITE_ONCE(state, STATE_RECOVERING);
|
|
26 recovery();
|
|
27 WRITE_ONCE(state, STATE_WANT_NORMAL);
|
|
28 synchronize_rcu();
|
|
29 WRITE_ONCE(state, STATE_NORMAL);
|
|
30 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The RCU read-side critical section in <tt>do_something_dlm()</tt>
|
|
works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
|
|
to guarantee that <tt>do_something()</tt> never runs concurrently
|
|
with <tt>recovery()</tt>, but with little or no synchronization
|
|
overhead in <tt>do_something_dlm()</tt>.
|
|
|
|
<p>@@QQ@@
|
|
Why is the <tt>synchronize_rcu()</tt> on line 28 needed?
|
|
<p>@@QQA@@
|
|
Without that extra grace period, memory reordering could result in
|
|
<tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
|
|
concurrently with the last bits of <tt>recovery()</tt>.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
In order to avoid fatal problems such as deadlocks,
|
|
an RCU read-side critical section must not contain calls to
|
|
<tt>synchronize_rcu()</tt>.
|
|
Similarly, an RCU read-side critical section must not
|
|
contain anything that waits, directly or indirectly, on completion of
|
|
an invocation of <tt>synchronize_rcu()</tt>.
|
|
|
|
<p>
|
|
Although RCU's grace-period guarantee is useful in and of itself, with
|
|
<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
|
|
it would be good to be able to use RCU to coordinate read-side
|
|
access to linked data structures.
|
|
For this, the grace-period guarantee is not sufficient, as can
|
|
be seen in function <tt>add_gp_buggy()</tt> below.
|
|
We will look at the reader's code later, but in the meantime, just think of
|
|
the reader as locklessly picking up the <tt>gp</tt> pointer,
|
|
and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
|
|
<tt>->a</tt> and <tt>->b</tt> fields.
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp_buggy(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 p->a = a;
|
|
12 p->b = a;
|
|
13 gp = p; /* ORDERING BUG */
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The problem is that both the compiler and weakly ordered CPUs are within
|
|
their rights to reorder this code as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp_buggy_optimized(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
<b>11 gp = p; /* ORDERING BUG */
|
|
12 p->a = a;
|
|
13 p->b = a;</b>
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If an RCU reader fetches <tt>gp</tt> just after
|
|
<tt>add_gp_buggy_optimized</tt> executes line 11,
|
|
it will see garbage in the <tt>->a</tt> and <tt>->b</tt>
|
|
fields.
|
|
And this is but one of many ways in which compiler and hardware optimizations
|
|
could cause trouble.
|
|
Therefore, we clearly need some way to prevent the compiler and the CPU from
|
|
reordering in this manner, which brings us to the publish-subscribe
|
|
guarantee discussed in the next section.
|
|
|
|
<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3>
|
|
|
|
<p>
|
|
RCU's publish-subscribe guarantee allows data to be inserted
|
|
into a linked data structure without disrupting RCU readers.
|
|
The updater uses <tt>rcu_assign_pointer()</tt> to insert the
|
|
new data, and readers use <tt>rcu_dereference()</tt> to
|
|
access data, whether new or old.
|
|
The following shows an example of insertion:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool add_gp(int a, int b)
|
|
2 {
|
|
3 p = kmalloc(sizeof(*p), GFP_KERNEL);
|
|
4 if (!p)
|
|
5 return -ENOMEM;
|
|
6 spin_lock(&gp_lock);
|
|
7 if (rcu_access_pointer(gp)) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 p->a = a;
|
|
12 p->b = a;
|
|
13 rcu_assign_pointer(gp, p);
|
|
14 spin_unlock(&gp_lock);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually
|
|
equivalent to a simple assignment statement, but also guarantees
|
|
that its assignment will
|
|
happen after the two assignments in lines 11 and 12,
|
|
similar to the C11 <tt>memory_order_release</tt> store operation.
|
|
It also prevents any number of “interesting” compiler
|
|
optimizations, for example, the use of <tt>gp</tt> as a scratch
|
|
location immediately preceding the assignment.
|
|
|
|
<p>@@QQ@@
|
|
But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
|
|
two assignments to <tt>p->a</tt> and <tt>p->b</tt>
|
|
from being reordered.
|
|
Can't that also cause problems?
|
|
<p>@@QQA@@
|
|
No, it cannot.
|
|
The readers cannot see either of these two fields until
|
|
the assignment to <tt>gp</tt>, by which time both fields are
|
|
fully initialized.
|
|
So reordering the assignments
|
|
to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly
|
|
cause any problems.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
It is tempting to assume that the reader need not do anything special
|
|
to control its accesses to the RCU-protected data,
|
|
as shown in <tt>do_something_gp_buggy()</tt> below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp_buggy(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 p = gp; /* OPTIMIZATIONS GALORE!!! */
|
|
5 if (p) {
|
|
6 do_something(p->a, p->b);
|
|
7 rcu_read_unlock();
|
|
8 return true;
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 return false;
|
|
12 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
However, this temptation must be resisted because there are a
|
|
surprisingly large number of ways that the compiler
|
|
(to say nothing of
|
|
<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
|
|
can trip this code up.
|
|
For but one example, if the compiler were short of registers, it
|
|
might choose to refetch from <tt>gp</tt> rather than keeping
|
|
a separate copy in <tt>p</tt> as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp_buggy_optimized(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
|
|
<b> 5 do_something(gp->a, gp->b);</b>
|
|
6 rcu_read_unlock();
|
|
7 return true;
|
|
8 }
|
|
9 rcu_read_unlock();
|
|
10 return false;
|
|
11 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If this function ran concurrently with a series of updates that
|
|
replaced the current structure with a new one,
|
|
the fetches of <tt>gp->a</tt>
|
|
and <tt>gp->b</tt> might well come from two different structures,
|
|
which could cause serious confusion.
|
|
To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
|
|
<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool do_something_gp(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 p = rcu_dereference(gp);
|
|
5 if (p) {
|
|
6 do_something(p->a, p->b);
|
|
7 rcu_read_unlock();
|
|
8 return true;
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 return false;
|
|
12 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
|
|
memory barriers in the Linux kernel.
|
|
Should a
|
|
<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
|
|
ever appear, then <tt>rcu_dereference()</tt> could be implemented
|
|
as a <tt>memory_order_consume</tt> load.
|
|
Regardless of the exact implementation, a pointer fetched by
|
|
<tt>rcu_dereference()</tt> may not be used outside of the
|
|
outermost RCU read-side critical section containing that
|
|
<tt>rcu_dereference()</tt>, unless protection of
|
|
the corresponding data element has been passed from RCU to some
|
|
other synchronization mechanism, most commonly locking or
|
|
<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.
|
|
|
|
<p>
|
|
In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
|
|
use <tt>rcu_dereference()</tt>, and these two RCU API elements
|
|
work together to ensure that readers have a consistent view of
|
|
newly added data elements.
|
|
|
|
<p>
|
|
Of course, it is also necessary to remove elements from RCU-protected
|
|
data structures, for example, using the following process:
|
|
|
|
<ol>
|
|
<li> Remove the data element from the enclosing structure.
|
|
<li> Wait for all pre-existing RCU read-side critical sections
|
|
to complete (because only pre-existing readers can possibly have
|
|
a reference to the newly removed data element).
|
|
<li> At this point, only the updater has a reference to the
|
|
newly removed data element, so it can safely reclaim
|
|
the data element, for example, by passing it to <tt>kfree()</tt>.
|
|
</ol>
|
|
|
|
This process is implemented by <tt>remove_gp_synchronous()</tt>:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool remove_gp_synchronous(void)
|
|
2 {
|
|
3 struct foo *p;
|
|
4
|
|
5 spin_lock(&gp_lock);
|
|
6 p = rcu_access_pointer(gp);
|
|
7 if (!p) {
|
|
8 spin_unlock(&gp_lock);
|
|
9 return false;
|
|
10 }
|
|
11 rcu_assign_pointer(gp, NULL);
|
|
12 spin_unlock(&gp_lock);
|
|
13 synchronize_rcu();
|
|
14 kfree(p);
|
|
15 return true;
|
|
16 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
This function is straightforward, with line 13 waiting for a grace
|
|
period before line 14 frees the old data element.
|
|
This waiting ensures that readers will reach line 7 of
|
|
<tt>do_something_gp()</tt> before the data element referenced by
|
|
<tt>p</tt> is freed.
|
|
The <tt>rcu_access_pointer()</tt> on line 6 is similar to
|
|
<tt>rcu_dereference()</tt>, except that:
|
|
|
|
<ol>
|
|
<li> The value returned by <tt>rcu_access_pointer()</tt>
|
|
cannot be dereferenced.
|
|
If you want to access the value pointed to as well as
|
|
the pointer itself, use <tt>rcu_dereference()</tt>
|
|
instead of <tt>rcu_access_pointer()</tt>.
|
|
<li> The call to <tt>rcu_access_pointer()</tt> need not be
|
|
protected.
|
|
In contrast, <tt>rcu_dereference()</tt> must either be
|
|
within an RCU read-side critical section or in a code
|
|
segment where the pointer cannot change, for example, in
|
|
code protected by the corresponding update-side lock.
|
|
</ol>
|
|
|
|
<p>@@QQ@@
|
|
Without the <tt>rcu_dereference()</tt> or the
|
|
<tt>rcu_access_pointer()</tt>, what destructive optimizations
|
|
might the compiler make use of?
|
|
<p>@@QQA@@
|
|
Let's start with what happens to <tt>do_something_gp()</tt>
|
|
if it fails to use <tt>rcu_dereference()</tt>.
|
|
It could reuse a value formerly fetched from this same pointer.
|
|
It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
|
|
manner, resulting in <i>load tearing</i>, in turn resulting a bytewise
|
|
mash-up of two distince pointer values.
|
|
It might even use value-speculation optimizations, where it makes a wrong
|
|
guess, but by the time it gets around to checking the value, an update
|
|
has changed the pointer to match the wrong guess.
|
|
Too bad about any dereferences that returned pre-initialization garbage
|
|
in the meantime!
|
|
|
|
<p>
|
|
For <tt>remove_gp_synchronous()</tt>, as long as all modifications
|
|
to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
|
|
the above optimizations are harmless.
|
|
However,
|
|
with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>,
|
|
<tt>sparse</tt> will complain if you
|
|
define <tt>gp</tt> with <tt>__rcu</tt> and then
|
|
access it without using
|
|
either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
In short, RCU's publish-subscribe guarantee is provided by the combination
|
|
of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
|
|
This guarantee allows data elements to be safely added to RCU-protected
|
|
linked data structures without disrupting RCU readers.
|
|
This guarantee can be used in combination with the grace-period
|
|
guarantee to also allow data elements to be removed from RCU-protected
|
|
linked data structures, again without disrupting RCU readers.
|
|
|
|
<p>
|
|
This guarantee was only partially premeditated.
|
|
DYNIX/ptx used an explicit memory barrier for publication, but had nothing
|
|
resembling <tt>rcu_dereference()</tt> for subscription, nor did it
|
|
have anything resembling the <tt>smp_read_barrier_depends()</tt>
|
|
that was later subsumed into <tt>rcu_dereference()</tt>.
|
|
The need for these operations made itself known quite suddenly at a
|
|
late-1990s meeting with the DEC Alpha architects, back in the days when
|
|
DEC was still a free-standing company.
|
|
It took the Alpha architects a good hour to convince me that any sort
|
|
of barrier would ever be needed, and it then took me a good <i>two</i> hours
|
|
to convince them that their documentation did not make this point clear.
|
|
More recent work with the C and C++ standards committees have provided
|
|
much education on tricks and traps from the compiler.
|
|
In short, compilers were much less tricky in the early 1990s, but in
|
|
2015, don't even think about omitting <tt>rcu_dereference()</tt>!
|
|
|
|
<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>
|
|
|
|
<p>
|
|
The previous section's simple linked-data-structure scenario clearly
|
|
demonstrates the need for RCU's stringent memory-ordering guarantees on
|
|
systems with more than one CPU:
|
|
|
|
<ol>
|
|
<li> Each CPU that has an RCU read-side critical section that
|
|
begins before <tt>synchronize_rcu()</tt> starts is
|
|
guaranteed to execute a full memory barrier between the time
|
|
that the RCU read-side critical section ends and the time that
|
|
<tt>synchronize_rcu()</tt> returns.
|
|
Without this guarantee, a pre-existing RCU read-side critical section
|
|
might hold a reference to the newly removed <tt>struct foo</tt>
|
|
after the <tt>kfree()</tt> on line 14 of
|
|
<tt>remove_gp_synchronous()</tt>.
|
|
<li> Each CPU that has an RCU read-side critical section that ends
|
|
after <tt>synchronize_rcu()</tt> returns is guaranteed
|
|
to execute a full memory barrier between the time that
|
|
<tt>synchronize_rcu()</tt> begins and the time that the RCU
|
|
read-side critical section begins.
|
|
Without this guarantee, a later RCU read-side critical section
|
|
running after the <tt>kfree()</tt> on line 14 of
|
|
<tt>remove_gp_synchronous()</tt> might
|
|
later run <tt>do_something_gp()</tt> and find the
|
|
newly deleted <tt>struct foo</tt>.
|
|
<li> If the task invoking <tt>synchronize_rcu()</tt> remains
|
|
on a given CPU, then that CPU is guaranteed to execute a full
|
|
memory barrier sometime during the execution of
|
|
<tt>synchronize_rcu()</tt>.
|
|
This guarantee ensures that the <tt>kfree()</tt> on
|
|
line 14 of <tt>remove_gp_synchronous()</tt> really does
|
|
execute after the removal on line 11.
|
|
<li> If the task invoking <tt>synchronize_rcu()</tt> migrates
|
|
among a group of CPUs during that invocation, then each of the
|
|
CPUs in that group is guaranteed to execute a full memory barrier
|
|
sometime during the execution of <tt>synchronize_rcu()</tt>.
|
|
This guarantee also ensures that the <tt>kfree()</tt> on
|
|
line 14 of <tt>remove_gp_synchronous()</tt> really does
|
|
execute after the removal on
|
|
line 11, but also in the case where the thread executing the
|
|
<tt>synchronize_rcu()</tt> migrates in the meantime.
|
|
</ol>
|
|
|
|
<p>@@QQ@@
|
|
Given that multiple CPUs can start RCU read-side critical sections
|
|
at any time without any ordering whatsoever, how can RCU possibly tell whether
|
|
or not a given RCU read-side critical section starts before a
|
|
given instance of <tt>synchronize_rcu()</tt>?
|
|
<p>@@QQA@@
|
|
If RCU cannot tell whether or not a given
|
|
RCU read-side critical section starts before a
|
|
given instance of <tt>synchronize_rcu()</tt>,
|
|
then it must assume that the RCU read-side critical section
|
|
started first.
|
|
In other words, a given instance of <tt>synchronize_rcu()</tt>
|
|
can avoid waiting on a given RCU read-side critical section only
|
|
if it can prove that <tt>synchronize_rcu()</tt> started first.
|
|
<p>@@QQE@@
|
|
|
|
<p>@@QQ@@
|
|
The first and second guarantees require unbelievably strict ordering!
|
|
Are all these memory barriers <i> really</i> required?
|
|
<p>@@QQA@@
|
|
Yes, they really are required.
|
|
To see why the first guarantee is required, consider the following
|
|
sequence of events:
|
|
|
|
<ol>
|
|
<li> CPU 1: <tt>rcu_read_lock()</tt>
|
|
<li> CPU 1: <tt>q = rcu_dereference(gp);
|
|
/* Very likely to return p. */</tt>
|
|
<li> CPU 0: <tt>list_del_rcu(p);</tt>
|
|
<li> CPU 0: <tt>synchronize_rcu()</tt> starts.
|
|
<li> CPU 1: <tt>do_something_with(q->a);
|
|
/* No smp_mb(), so might happen after kfree(). */</tt>
|
|
<li> CPU 1: <tt>rcu_read_unlock()</tt>
|
|
<li> CPU 0: <tt>synchronize_rcu()</tt> returns.
|
|
<li> CPU 0: <tt>kfree(p);</tt>
|
|
</ol>
|
|
|
|
<p>
|
|
Therefore, there absolutely must be a full memory barrier between the
|
|
end of the RCU read-side critical section and the end of the
|
|
grace period.
|
|
|
|
<p>
|
|
The sequence of events demonstrating the necessity of the second rule
|
|
is roughly similar:
|
|
|
|
<ol>
|
|
<li> CPU 0: <tt>list_del_rcu(p);</tt>
|
|
<li> CPU 0: <tt>synchronize_rcu()</tt> starts.
|
|
<li> CPU 1: <tt>rcu_read_lock()</tt>
|
|
<li> CPU 1: <tt>q = rcu_dereference(gp);
|
|
/* Might return p if no memory barrier. */</tt>
|
|
<li> CPU 0: <tt>synchronize_rcu()</tt> returns.
|
|
<li> CPU 0: <tt>kfree(p);</tt>
|
|
<li> CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt>
|
|
<li> CPU 1: <tt>rcu_read_unlock()</tt>
|
|
</ol>
|
|
|
|
<p>
|
|
And similarly, without a memory barrier between the beginning of the
|
|
grace period and the beginning of the RCU read-side critical section,
|
|
CPU 1 might end up accessing the freelist.
|
|
|
|
<p>
|
|
The “as if” rule of course applies, so that any implementation
|
|
that acts as if the appropriate memory barriers were in place is a
|
|
correct implementation.
|
|
That said, it is much easier to fool yourself into believing that you have
|
|
adhered to the as-if rule than it is to actually adhere to it!
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
Note that these memory-barrier requirements do not replace the fundamental
|
|
RCU requirement that a grace period wait for all pre-existing readers.
|
|
On the contrary, the memory barriers called out in this section must operate in
|
|
such a way as to <i>enforce</i> this fundamental requirement.
|
|
Of course, different implementations enforce this requirement in different
|
|
ways, but enforce it they must.
|
|
|
|
<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>
|
|
|
|
<p>
|
|
The common-case RCU primitives are unconditional.
|
|
They are invoked, they do their job, and they return, with no possibility
|
|
of error, and no need to retry.
|
|
This is a key RCU design philosophy.
|
|
|
|
<p>
|
|
However, this philosophy is pragmatic rather than pigheaded.
|
|
If someone comes up with a good justification for a particular conditional
|
|
RCU primitive, it might well be implemented and added.
|
|
After all, this guarantee was reverse-engineered, not premeditated.
|
|
The unconditional nature of the RCU primitives was initially an
|
|
accident of implementation, and later experience with synchronization
|
|
primitives with conditional primitives caused me to elevate this
|
|
accident to a guarantee.
|
|
Therefore, the justification for adding a conditional primitive to
|
|
RCU would need to be based on detailed and compelling use cases.
|
|
|
|
<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>
|
|
|
|
<p>
|
|
As far as RCU is concerned, it is always possible to carry out an
|
|
update within an RCU read-side critical section.
|
|
For example, that RCU read-side critical section might search for
|
|
a given data element, and then might acquire the update-side
|
|
spinlock in order to update that element, all while remaining
|
|
in that RCU read-side critical section.
|
|
Of course, it is necessary to exit the RCU read-side critical section
|
|
before invoking <tt>synchronize_rcu()</tt>, however, this
|
|
inconvenience can be avoided through use of the
|
|
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
|
|
described later in this document.
|
|
|
|
<p>@@QQ@@
|
|
But how does the upgrade-to-write operation exclude other readers?
|
|
<p>@@QQA@@
|
|
It doesn't, just like normal RCU updates, which also do not exclude
|
|
RCU readers.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
This guarantee allows lookup code to be shared between read-side
|
|
and update-side code, and was premeditated, appearing in the earliest
|
|
DYNIX/ptx RCU documentation.
|
|
|
|
<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>
|
|
|
|
<p>
|
|
RCU provides extremely lightweight readers, and its read-side guarantees,
|
|
though quite useful, are correspondingly lightweight.
|
|
It is therefore all too easy to assume that RCU is guaranteeing more
|
|
than it really is.
|
|
Of course, the list of things that RCU does not guarantee is infinitely
|
|
long, however, the following sections list a few non-guarantees that
|
|
have caused confusion.
|
|
Except where otherwise noted, these non-guarantees were premeditated.
|
|
|
|
<ol>
|
|
<li> <a href="#Readers Impose Minimal Ordering">
|
|
Readers Impose Minimal Ordering</a>
|
|
<li> <a href="#Readers Do Not Exclude Updaters">
|
|
Readers Do Not Exclude Updaters</a>
|
|
<li> <a href="#Updaters Only Wait For Old Readers">
|
|
Updaters Only Wait For Old Readers</a>
|
|
<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections">
|
|
Grace Periods Don't Partition Read-Side Critical Sections</a>
|
|
<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods">
|
|
Read-Side Critical Sections Don't Partition Grace Periods</a>
|
|
<li> <a href="#Disabling Preemption Does Not Block Grace Periods">
|
|
Disabling Preemption Does Not Block Grace Periods</a>
|
|
</ol>
|
|
|
|
<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>
|
|
|
|
<p>
|
|
Reader-side markers such as <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
|
|
except through their interaction with the grace-period APIs such as
|
|
<tt>synchronize_rcu()</tt>.
|
|
To see this, consider the following pair of threads:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(x, 1);
|
|
5 rcu_read_unlock();
|
|
6 rcu_read_lock();
|
|
7 WRITE_ONCE(y, 1);
|
|
8 rcu_read_unlock();
|
|
9 }
|
|
10
|
|
11 void thread1(void)
|
|
12 {
|
|
13 rcu_read_lock();
|
|
14 r1 = READ_ONCE(y);
|
|
15 rcu_read_unlock();
|
|
16 rcu_read_lock();
|
|
17 r2 = READ_ONCE(x);
|
|
18 rcu_read_unlock();
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
After <tt>thread0()</tt> and <tt>thread1()</tt> execute
|
|
concurrently, it is quite possible to have
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 0)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
|
|
which would not be possible if <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> had much in the way of ordering
|
|
properties.
|
|
But they do not, so the CPU is within its rights
|
|
to do significant reordering.
|
|
This is by design: Any significant ordering constraints would slow down
|
|
these fast-path APIs.
|
|
|
|
<p>@@QQ@@
|
|
Can't the compiler also reorder this code?
|
|
<p>@@QQA@@
|
|
No, the volatile casts in <tt>READ_ONCE()</tt> and
|
|
<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
|
|
this particular case.
|
|
<p>@@QQE@@
|
|
|
|
<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>
|
|
|
|
<p>
|
|
Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
|
|
exclude updates.
|
|
All they do is to prevent grace periods from ending.
|
|
The following example illustrates this:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 r1 = READ_ONCE(y);
|
|
5 if (r1) {
|
|
6 do_something_with_nonzero_x();
|
|
7 r2 = READ_ONCE(x);
|
|
8 WARN_ON(!r2); /* BUG!!! */
|
|
9 }
|
|
10 rcu_read_unlock();
|
|
11 }
|
|
12
|
|
13 void thread1(void)
|
|
14 {
|
|
15 spin_lock(&my_lock);
|
|
16 WRITE_ONCE(x, 1);
|
|
17 WRITE_ONCE(y, 1);
|
|
18 spin_unlock(&my_lock);
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
|
|
excluded the <tt>thread1()</tt> function's update,
|
|
the <tt>WARN_ON()</tt> could never fire.
|
|
But the fact is that <tt>rcu_read_lock()</tt> does not exclude
|
|
much of anything aside from subsequent grace periods, of which
|
|
<tt>thread1()</tt> has none, so the
|
|
<tt>WARN_ON()</tt> can and does fire.
|
|
|
|
<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>
|
|
|
|
<p>
|
|
It might be tempting to assume that after <tt>synchronize_rcu()</tt>
|
|
completes, there are no readers executing.
|
|
This temptation must be avoided because
|
|
new readers can start immediately after <tt>synchronize_rcu()</tt>
|
|
starts, and <tt>synchronize_rcu()</tt> is under no
|
|
obligation to wait for these new readers.
|
|
|
|
<p>@@QQ@@
|
|
Suppose that synchronize_rcu() did wait until all readers had completed.
|
|
Would the updater be able to rely on this?
|
|
<p>@@QQA@@
|
|
No.
|
|
Even if <tt>synchronize_rcu()</tt> were to wait until
|
|
all readers had completed, a new reader might start immediately after
|
|
<tt>synchronize_rcu()</tt> completed.
|
|
Therefore, the code following
|
|
<tt>synchronize_rcu()</tt> cannot rely on there being no readers
|
|
in any case.
|
|
<p>@@QQE@@
|
|
|
|
<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
|
|
Grace Periods Don't Partition Read-Side Critical Sections</a></h3>
|
|
|
|
<p>
|
|
It is tempting to assume that if any part of one RCU read-side critical
|
|
section precedes a given grace period, and if any part of another RCU
|
|
read-side critical section follows that same grace period, then all of
|
|
the first RCU read-side critical section must precede all of the second.
|
|
However, this just isn't the case: A single grace period does not
|
|
partition the set of RCU read-side critical sections.
|
|
An example of this situation can be illustrated as follows, where
|
|
<tt>x</tt>, <tt>y</tt>, and <tt>z</tt> are initially all zero:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 rcu_read_lock();
|
|
19 r2 = READ_ONCE(b);
|
|
20 r3 = READ_ONCE(c);
|
|
21 rcu_read_unlock();
|
|
22 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
It turns out that the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 0 && r3 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
is entirely possible.
|
|
The following figure show how this can happen, with each circled
|
|
<tt>QS</tt> indicating the point at which RCU recorded a
|
|
<i>quiescent state</i> for each thread, that is, a state in which
|
|
RCU knows that the thread cannot be in the midst of an RCU read-side
|
|
critical section that started before the current grace period:
|
|
|
|
<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>
|
|
|
|
<p>
|
|
If it is necessary to partition RCU read-side critical sections in this
|
|
manner, it is necessary to use two grace periods, where the first
|
|
grace period is known to end before the second grace period starts:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 r2 = READ_ONCE(c);
|
|
19 synchronize_rcu();
|
|
20 WRITE_ONCE(d, 1);
|
|
21 }
|
|
22
|
|
23 void thread3(void)
|
|
24 {
|
|
25 rcu_read_lock();
|
|
26 r3 = READ_ONCE(b);
|
|
27 r4 = READ_ONCE(d);
|
|
28 rcu_read_unlock();
|
|
29 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Here, if <tt>(r1 == 1)</tt>, then
|
|
<tt>thread0()</tt>'s write to <tt>b</tt> must happen
|
|
before the end of <tt>thread1()</tt>'s grace period.
|
|
If in addition <tt>(r4 == 1)</tt>, then
|
|
<tt>thread3()</tt>'s read from <tt>b</tt> must happen
|
|
after the beginning of <tt>thread2()</tt>'s grace period.
|
|
If it is also the case that <tt>(r2 == 1)</tt>, then the
|
|
end of <tt>thread1()</tt>'s grace period must precede the
|
|
beginning of <tt>thread2()</tt>'s grace period.
|
|
This mean that the two RCU read-side critical sections cannot overlap,
|
|
guaranteeing that <tt>(r3 == 1)</tt>.
|
|
As a result, the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
cannot happen.
|
|
|
|
<p>
|
|
This non-requirement was also non-premeditated, but became apparent
|
|
when studying RCU's interaction with memory ordering.
|
|
|
|
<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
|
|
Read-Side Critical Sections Don't Partition Grace Periods</a></h3>
|
|
|
|
<p>
|
|
It is also tempting to assume that if an RCU read-side critical section
|
|
happens between a pair of grace periods, then those grace periods cannot
|
|
overlap.
|
|
However, this temptation leads nowhere good, as can be illustrated by
|
|
the following, with all variables initially zero:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 void thread0(void)
|
|
2 {
|
|
3 rcu_read_lock();
|
|
4 WRITE_ONCE(a, 1);
|
|
5 WRITE_ONCE(b, 1);
|
|
6 rcu_read_unlock();
|
|
7 }
|
|
8
|
|
9 void thread1(void)
|
|
10 {
|
|
11 r1 = READ_ONCE(a);
|
|
12 synchronize_rcu();
|
|
13 WRITE_ONCE(c, 1);
|
|
14 }
|
|
15
|
|
16 void thread2(void)
|
|
17 {
|
|
18 rcu_read_lock();
|
|
19 WRITE_ONCE(d, 1);
|
|
20 r2 = READ_ONCE(c);
|
|
21 rcu_read_unlock();
|
|
22 }
|
|
23
|
|
24 void thread3(void)
|
|
25 {
|
|
26 r3 = READ_ONCE(d);
|
|
27 synchronize_rcu();
|
|
28 WRITE_ONCE(e, 1);
|
|
29 }
|
|
30
|
|
31 void thread4(void)
|
|
32 {
|
|
33 rcu_read_lock();
|
|
34 r4 = READ_ONCE(b);
|
|
35 r5 = READ_ONCE(e);
|
|
36 rcu_read_unlock();
|
|
37 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
In this case, the outcome:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
|
|
</pre>
|
|
</blockquote>
|
|
|
|
is entirely possible, as illustrated below:
|
|
|
|
<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>
|
|
|
|
<p>
|
|
Again, an RCU read-side critical section can overlap almost all of a
|
|
given grace period, just so long as it does not overlap the entire
|
|
grace period.
|
|
As a result, an RCU read-side critical section cannot partition a pair
|
|
of RCU grace periods.
|
|
|
|
<p>@@QQ@@
|
|
How long a sequence of grace periods, each separated by an RCU read-side
|
|
critical section, would be required to partition the RCU read-side
|
|
critical sections at the beginning and end of the chain?
|
|
<p>@@QQA@@
|
|
In theory, an infinite number.
|
|
In practice, an unknown number that is sensitive to both implementation
|
|
details and timing considerations.
|
|
Therefore, even in practice, RCU users must abide by the theoretical rather
|
|
than the practical answer.
|
|
<p>@@QQE@@
|
|
|
|
<h3><a name="Disabling Preemption Does Not Block Grace Periods">
|
|
Disabling Preemption Does Not Block Grace Periods</a></h3>
|
|
|
|
<p>
|
|
There was a time when disabling preemption on any given CPU would block
|
|
subsequent grace periods.
|
|
However, this was an accident of implementation and is not a requirement.
|
|
And in the current Linux-kernel implementation, disabling preemption
|
|
on a given CPU in fact does not block grace periods, as Oleg Nesterov
|
|
<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.
|
|
|
|
<p>
|
|
If you need a preempt-disable region to block grace periods, you need to add
|
|
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
|
|
as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 preempt_disable();
|
|
2 rcu_read_lock();
|
|
3 do_something();
|
|
4 rcu_read_unlock();
|
|
5 preempt_enable();
|
|
6
|
|
7 /* Spinlocks implicitly disable preemption. */
|
|
8 spin_lock(&mylock);
|
|
9 rcu_read_lock();
|
|
10 do_something();
|
|
11 rcu_read_unlock();
|
|
12 spin_unlock(&mylock);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
In theory, you could enter the RCU read-side critical section first,
|
|
but it is more efficient to keep the entire RCU read-side critical
|
|
section contained in the preempt-disable region as shown above.
|
|
Of course, RCU read-side critical sections that extend outside of
|
|
preempt-disable regions will work correctly, but such critical sections
|
|
can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
|
|
more work.
|
|
And no, this is <i>not</i> an invitation to enclose all of your RCU
|
|
read-side critical sections within preempt-disable regions, because
|
|
doing so would degrade real-time response.
|
|
|
|
<p>
|
|
This non-requirement appeared with preemptible RCU.
|
|
If you need a grace period that waits on non-preemptible code regions, use
|
|
<a href="#Sched Flavor">RCU-sched</a>.
|
|
|
|
<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>
|
|
|
|
<p>
|
|
These parallelism facts of life are by no means specific to RCU, but
|
|
the RCU implementation must abide by them.
|
|
They therefore bear repeating:
|
|
|
|
<ol>
|
|
<li> Any CPU or task may be delayed at any time,
|
|
and any attempts to avoid these delays by disabling
|
|
preemption, interrupts, or whatever are completely futile.
|
|
This is most obvious in preemptible user-level
|
|
environments and in virtualized environments (where
|
|
a given guest OS's VCPUs can be preempted at any time by
|
|
the underlying hypervisor), but can also happen in bare-metal
|
|
environments due to ECC errors, NMIs, and other hardware
|
|
events.
|
|
Although a delay of more than about 20 seconds can result
|
|
in splats, the RCU implementation is obligated to use
|
|
algorithms that can tolerate extremely long delays, but where
|
|
“extremely long” is not long enough to allow
|
|
wrap-around when incrementing a 64-bit counter.
|
|
<li> Both the compiler and the CPU can reorder memory accesses.
|
|
Where it matters, RCU must use compiler directives and
|
|
memory-barrier instructions to preserve ordering.
|
|
<li> Conflicting writes to memory locations in any given cache line
|
|
will result in expensive cache misses.
|
|
Greater numbers of concurrent writes and more-frequent
|
|
concurrent writes will result in more dramatic slowdowns.
|
|
RCU is therefore obligated to use algorithms that have
|
|
sufficient locality to avoid significant performance and
|
|
scalability problems.
|
|
<li> As a rough rule of thumb, only one CPU's worth of processing
|
|
may be carried out under the protection of any given exclusive
|
|
lock.
|
|
RCU must therefore use scalable locking designs.
|
|
<li> Counters are finite, especially on 32-bit systems.
|
|
RCU's use of counters must therefore tolerate counter wrap,
|
|
or be designed such that counter wrap would take way more
|
|
time than a single system is likely to run.
|
|
An uptime of ten years is quite possible, a runtime
|
|
of a century much less so.
|
|
As an example of the latter, RCU's dyntick-idle nesting counter
|
|
allows 54 bits for interrupt nesting level (this counter
|
|
is 64 bits even on a 32-bit system).
|
|
Overflowing this counter requires 2<sup>54</sup>
|
|
half-interrupts on a given CPU without that CPU ever going idle.
|
|
If a half-interrupt happened every microsecond, it would take
|
|
570 years of runtime to overflow this counter, which is currently
|
|
believed to be an acceptably long time.
|
|
<li> Linux systems can have thousands of CPUs running a single
|
|
Linux kernel in a single shared-memory environment.
|
|
RCU must therefore pay close attention to high-end scalability.
|
|
</ol>
|
|
|
|
<p>
|
|
This last parallelism fact of life means that RCU must pay special
|
|
attention to the preceding facts of life.
|
|
The idea that Linux might scale to systems with thousands of CPUs would
|
|
have been met with some skepticism in the 1990s, but these requirements
|
|
would have otherwise have been unsurprising, even in the early 1990s.
|
|
|
|
<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>
|
|
|
|
<p>
|
|
These sections list quality-of-implementation requirements.
|
|
Although an RCU implementation that ignores these requirements could
|
|
still be used, it would likely be subject to limitations that would
|
|
make it inappropriate for industrial-strength production use.
|
|
Classes of quality-of-implementation requirements are as follows:
|
|
|
|
<ol>
|
|
<li> <a href="#Specialization">Specialization</a>
|
|
<li> <a href="#Performance and Scalability">Performance and Scalability</a>
|
|
<li> <a href="#Composability">Composability</a>
|
|
<li> <a href="#Corner Cases">Corner Cases</a>
|
|
</ol>
|
|
|
|
<p>
|
|
These classes is covered in the following sections.
|
|
|
|
<h3><a name="Specialization">Specialization</a></h3>
|
|
|
|
<p>
|
|
RCU is and always has been intended primarily for read-mostly situations, as
|
|
illustrated by the following figure.
|
|
This means that RCU's read-side primitives are optimized, often at the
|
|
expense of its update-side primitives.
|
|
|
|
<p><img src="RCUApplicability.svg" alt="RCUApplicability.svg" width="70%"></p>
|
|
|
|
<p>
|
|
This focus on read-mostly situations means that RCU must interoperate
|
|
with other synchronization primitives.
|
|
For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
|
|
examples discussed earlier use RCU to protect readers and locking to
|
|
coordinate updaters.
|
|
However, the need extends much farther, requiring that a variety of
|
|
synchronization primitives be legal within RCU read-side critical sections,
|
|
including spinlocks, sequence locks, atomic operations, reference
|
|
counters, and memory barriers.
|
|
|
|
<p>@@QQ@@
|
|
What about sleeping locks?
|
|
<p>@@QQA@@
|
|
These are forbidden within Linux-kernel RCU read-side critical sections
|
|
because it is not legal to place a quiescent state (in this case,
|
|
voluntary context switch) within an RCU read-side critical section.
|
|
However, sleeping locks may be used within userspace RCU read-side critical
|
|
sections, and also within Linux-kernel sleepable RCU
|
|
<a href="#Sleepable RCU">(SRCU)</a>
|
|
read-side critical sections.
|
|
In addition, the -rt patchset turns spinlocks into a sleeping locks so
|
|
that the corresponding critical sections can be preempted, which
|
|
also means that these sleeplockified spinlocks (but not other sleeping locks!)
|
|
may be acquire within -rt-Linux-kernel RCU read-side critical sections.
|
|
|
|
<p>
|
|
Note that it <i>is</i> legal for a normal RCU read-side critical section
|
|
to conditionally acquire a sleeping locks (as in <tt>mutex_trylock()</tt>),
|
|
but only as long as it does not loop indefinitely attempting to
|
|
conditionally acquire that sleeping locks.
|
|
The key point is that things like <tt>mutex_trylock()</tt>
|
|
either return with the mutex held, or return an error indication if
|
|
the mutex was not immediately available.
|
|
Either way, <tt>mutex_trylock()</tt> returns immediately without sleeping.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
It often comes as a surprise that many algorithms do not require a
|
|
consistent view of data, but many can function in that mode,
|
|
with network routing being the poster child.
|
|
Internet routing algorithms take significant time to propagate
|
|
updates, so that by the time an update arrives at a given system,
|
|
that system has been sending network traffic the wrong way for
|
|
a considerable length of time.
|
|
Having a few threads continue to send traffic the wrong way for a
|
|
few more milliseconds is clearly not a problem: In the worst case,
|
|
TCP retransmissions will eventually get the data where it needs to go.
|
|
In general, when tracking the state of the universe outside of the
|
|
computer, some level of inconsistency must be tolerated due to
|
|
speed-of-light delays if nothing else.
|
|
|
|
<p>
|
|
Furthermore, uncertainty about external state is inherent in many cases.
|
|
For example, a pair of veternarians might use heartbeat to determine
|
|
whether or not a given cat was alive.
|
|
But how long should they wait after the last heartbeat to decide that
|
|
the cat is in fact dead?
|
|
Waiting less than 400 milliseconds makes no sense because this would
|
|
mean that a relaxed cat would be considered to cycle between death
|
|
and life more than 100 times per minute.
|
|
Moreover, just as with human beings, a cat's heart might stop for
|
|
some period of time, so the exact wait period is a judgment call.
|
|
One of our pair of veternarians might wait 30 seconds before pronouncing
|
|
the cat dead, while the other might insist on waiting a full minute.
|
|
The two veternarians would then disagree on the state of the cat during
|
|
the final 30 seconds of the minute following the last heartbeat, as
|
|
fancifully illustrated below:
|
|
|
|
<p><img src="2013-08-is-it-dead.png" alt="2013-08-is-it-dead.png" width="431"></p>
|
|
|
|
<p>
|
|
Interestingly enough, this same situation applies to hardware.
|
|
When push comes to shove, how do we tell whether or not some
|
|
external server has failed?
|
|
We send messages to it periodically, and declare it failed if we
|
|
don't receive a response within a given period of time.
|
|
Policy decisions can usually tolerate short
|
|
periods of inconsistency.
|
|
The policy was decided some time ago, and is only now being put into
|
|
effect, so a few milliseconds of delay is normally inconsequential.
|
|
|
|
<p>
|
|
However, there are algorithms that absolutely must see consistent data.
|
|
For example, the translation between a user-level SystemV semaphore
|
|
ID to the corresponding in-kernel data structure is protected by RCU,
|
|
but it is absolutely forbidden to update a semaphore that has just been
|
|
removed.
|
|
In the Linux kernel, this need for consistency is accommodated by acquiring
|
|
spinlocks located in the in-kernel data structure from within
|
|
the RCU read-side critical section, and this is indicated by the
|
|
green box in the figure above.
|
|
Many other techniques may be used, and are in fact used within the
|
|
Linux kernel.
|
|
|
|
<p>
|
|
In short, RCU is not required to maintain consistency, and other
|
|
mechanisms may be used in concert with RCU when consistency is required.
|
|
RCU's specialization allows it to do its job extremely well, and its
|
|
ability to interoperate with other synchronization mechanisms allows
|
|
the right mix of synchronization tools to be used for a given job.
|
|
|
|
<h3><a name="Performance and Scalability">Performance and Scalability</a></h3>
|
|
|
|
<p>
|
|
Energy efficiency is a critical component of performance today,
|
|
and Linux-kernel RCU implementations must therefore avoid unnecessarily
|
|
awakening idle CPUs.
|
|
I cannot claim that this requirement was premeditated.
|
|
In fact, I learned of it during a telephone conversation in which I
|
|
was given “frank and open” feedback on the importance
|
|
of energy efficiency in battery-powered systems and on specific
|
|
energy-efficiency shortcomings of the Linux-kernel RCU implementation.
|
|
In my experience, the battery-powered embedded community will consider
|
|
any unnecessary wakeups to be extremely unfriendly acts.
|
|
So much so that mere Linux-kernel-mailing-list posts are
|
|
insufficient to vent their ire.
|
|
|
|
<p>
|
|
Memory consumption is not particularly important for in most
|
|
situations, and has become decreasingly
|
|
so as memory sizes have expanded and memory
|
|
costs have plummeted.
|
|
However, as I learned from Matt Mackall's
|
|
<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a>
|
|
efforts, memory footprint is critically important on single-CPU systems with
|
|
non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus
|
|
<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a>
|
|
was born.
|
|
Josh Triplett has since taken over the small-memory banner with his
|
|
<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a>
|
|
project, which resulted in
|
|
<a href="#Sleepable RCU">SRCU</a>
|
|
becoming optional for those kernels not needing it.
|
|
|
|
<p>
|
|
The remaining performance requirements are, for the most part,
|
|
unsurprising.
|
|
For example, in keeping with RCU's read-side specialization,
|
|
<tt>rcu_dereference()</tt> should have negligible overhead (for
|
|
example, suppression of a few minor compiler optimizations).
|
|
Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> should have exactly zero overhead.
|
|
|
|
<p>
|
|
In preemptible environments, in the case where the RCU read-side
|
|
critical section was not preempted (as will be the case for the
|
|
highest-priority real-time process), <tt>rcu_read_lock()</tt> and
|
|
<tt>rcu_read_unlock()</tt> should have minimal overhead.
|
|
In particular, they should not contain atomic read-modify-write
|
|
operations, memory-barrier instructions, preemption disabling,
|
|
interrupt disabling, or backwards branches.
|
|
However, in the case where the RCU read-side critical section was preempted,
|
|
<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts.
|
|
This is why it is better to nest an RCU read-side critical section
|
|
within a preempt-disable region than vice versa, at least in cases
|
|
where that critical section is short enough to avoid unduly degrading
|
|
real-time latencies.
|
|
|
|
<p>
|
|
The <tt>synchronize_rcu()</tt> grace-period-wait primitive is
|
|
optimized for throughput.
|
|
It may therefore incur several milliseconds of latency in addition to
|
|
the duration of the longest RCU read-side critical section.
|
|
On the other hand, multiple concurrent invocations of
|
|
<tt>synchronize_rcu()</tt> are required to use batching optimizations
|
|
so that they can be satisfied by a single underlying grace-period-wait
|
|
operation.
|
|
For example, in the Linux kernel, it is not unusual for a single
|
|
grace-period-wait operation to serve more than
|
|
<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a>
|
|
of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation
|
|
overhead down to nearly zero.
|
|
However, the grace-period optimization is also required to avoid
|
|
measurable degradation of real-time scheduling and interrupt latencies.
|
|
|
|
<p>
|
|
In some cases, the multi-millisecond <tt>synchronize_rcu()</tt>
|
|
latencies are unacceptable.
|
|
In these cases, <tt>synchronize_rcu_expedited()</tt> may be used
|
|
instead, reducing the grace-period latency down to a few tens of
|
|
microseconds on small systems, at least in cases where the RCU read-side
|
|
critical sections are short.
|
|
There are currently no special latency requirements for
|
|
<tt>synchronize_rcu_expedited()</tt> on large systems, but,
|
|
consistent with the empirical nature of the RCU specification,
|
|
that is subject to change.
|
|
However, there most definitely are scalability requirements:
|
|
A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096
|
|
CPUs should at least make reasonable forward progress.
|
|
In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
|
|
is permitted to impose modest degradation of real-time latency
|
|
on non-idle online CPUs.
|
|
That said, it will likely be necessary to take further steps to reduce this
|
|
degradation, hopefully to roughly that of a scheduling-clock interrupt.
|
|
|
|
<p>
|
|
There are a number of situations where even
|
|
<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period
|
|
latency is unacceptable.
|
|
In these situations, the asynchronous <tt>call_rcu()</tt> can be
|
|
used in place of <tt>synchronize_rcu()</tt> as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 struct foo {
|
|
2 int a;
|
|
3 int b;
|
|
4 struct rcu_head rh;
|
|
5 };
|
|
6
|
|
7 static void remove_gp_cb(struct rcu_head *rhp)
|
|
8 {
|
|
9 struct foo *p = container_of(rhp, struct foo, rh);
|
|
10
|
|
11 kfree(p);
|
|
12 }
|
|
13
|
|
14 bool remove_gp_asynchronous(void)
|
|
15 {
|
|
16 struct foo *p;
|
|
17
|
|
18 spin_lock(&gp_lock);
|
|
19 p = rcu_dereference(gp);
|
|
20 if (!p) {
|
|
21 spin_unlock(&gp_lock);
|
|
22 return false;
|
|
23 }
|
|
24 rcu_assign_pointer(gp, NULL);
|
|
25 call_rcu(&p->rh, remove_gp_cb);
|
|
26 spin_unlock(&gp_lock);
|
|
27 return true;
|
|
28 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
A definition of <tt>struct foo</tt> is finally needed, and appears
|
|
on lines 1-5.
|
|
The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt>
|
|
on line 25, and will be invoked after the end of a subsequent
|
|
grace period.
|
|
This gets the same effect as <tt>remove_gp_synchronous()</tt>,
|
|
but without forcing the updater to wait for a grace period to elapse.
|
|
The <tt>call_rcu()</tt> function may be used in a number of
|
|
situations where neither <tt>synchronize_rcu()</tt> nor
|
|
<tt>synchronize_rcu_expedited()</tt> would be legal,
|
|
including within preempt-disable code, <tt>local_bh_disable()</tt> code,
|
|
interrupt-disable code, and interrupt handlers.
|
|
However, even <tt>call_rcu()</tt> is illegal within NMI handlers.
|
|
The callback function (<tt>remove_gp_cb()</tt> in this case) will be
|
|
executed within softirq (software interrupt) environment within the
|
|
Linux kernel,
|
|
either within a real softirq handler or under the protection
|
|
of <tt>local_bh_disable()</tt>.
|
|
In both the Linux kernel and in userspace, it is bad practice to
|
|
write an RCU callback function that takes too long.
|
|
Long-running operations should be relegated to separate threads or
|
|
(in the Linux kernel) workqueues.
|
|
|
|
<p>@@QQ@@
|
|
Why does line 19 use <tt>rcu_access_pointer()</tt>?
|
|
After all, <tt>call_rcu()</tt> on line 25 stores into the
|
|
structure, which would interact badly with concurrent insertions.
|
|
Doesn't this mean that <tt>rcu_dereference()</tt> is required?
|
|
<p>@@QQA@@
|
|
Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes
|
|
any changes, including any insertions that <tt>rcu_dereference()</tt>
|
|
would protect against.
|
|
Therefore, any insertions will be delayed until after <tt>->gp_lock</tt>
|
|
is released on line 25, which in turn means that
|
|
<tt>rcu_access_pointer()</tt> suffices.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
However, all that <tt>remove_gp_cb()</tt> is doing is
|
|
invoking <tt>kfree()</tt> on the data element.
|
|
This is a common idiom, and is supported by <tt>kfree_rcu()</tt>,
|
|
which allows “fire and forget” operation as shown below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 struct foo {
|
|
2 int a;
|
|
3 int b;
|
|
4 struct rcu_head rh;
|
|
5 };
|
|
6
|
|
7 bool remove_gp_faf(void)
|
|
8 {
|
|
9 struct foo *p;
|
|
10
|
|
11 spin_lock(&gp_lock);
|
|
12 p = rcu_dereference(gp);
|
|
13 if (!p) {
|
|
14 spin_unlock(&gp_lock);
|
|
15 return false;
|
|
16 }
|
|
17 rcu_assign_pointer(gp, NULL);
|
|
18 kfree_rcu(p, rh);
|
|
19 spin_unlock(&gp_lock);
|
|
20 return true;
|
|
21 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
Note that <tt>remove_gp_faf()</tt> simply invokes
|
|
<tt>kfree_rcu()</tt> and proceeds, without any need to pay any
|
|
further attention to the subsequent grace period and <tt>kfree()</tt>.
|
|
It is permissible to invoke <tt>kfree_rcu()</tt> from the same
|
|
environments as for <tt>call_rcu()</tt>.
|
|
Interestingly enough, DYNIX/ptx had the equivalents of
|
|
<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not
|
|
<tt>synchronize_rcu()</tt>.
|
|
This was due to the fact that RCU was not heavily used within DYNIX/ptx,
|
|
so the very few places that needed something like
|
|
<tt>synchronize_rcu()</tt> simply open-coded it.
|
|
|
|
<p>@@QQ@@
|
|
Earlier it was claimed that <tt>call_rcu()</tt> and
|
|
<tt>kfree_rcu()</tt> allowed updaters to avoid being blocked
|
|
by readers.
|
|
But how can that be correct, given that the invocation of the callback
|
|
and the freeing of the memory (respectively) must still wait for
|
|
a grace period to elapse?
|
|
<p>@@QQA@@
|
|
We could define things this way, but keep in mind that this sort of
|
|
definition would say that updates in garbage-collected languages
|
|
cannot complete until the next time the garbage collector runs,
|
|
which does not seem at all reasonable.
|
|
The key point is that in most cases, an updater using either
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the
|
|
next update as soon as it has invoked <tt>call_rcu()</tt> or
|
|
<tt>kfree_rcu()</tt>, without having to wait for a subsequent
|
|
grace period.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
But what if the updater must wait for the completion of code to be
|
|
executed after the end of the grace period, but has other tasks
|
|
that can be carried out in the meantime?
|
|
The polling-style <tt>get_state_synchronize_rcu()</tt> and
|
|
<tt>cond_synchronize_rcu()</tt> functions may be used for this
|
|
purpose, as shown below:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 bool remove_gp_poll(void)
|
|
2 {
|
|
3 struct foo *p;
|
|
4 unsigned long s;
|
|
5
|
|
6 spin_lock(&gp_lock);
|
|
7 p = rcu_access_pointer(gp);
|
|
8 if (!p) {
|
|
9 spin_unlock(&gp_lock);
|
|
10 return false;
|
|
11 }
|
|
12 rcu_assign_pointer(gp, NULL);
|
|
13 spin_unlock(&gp_lock);
|
|
14 s = get_state_synchronize_rcu();
|
|
15 do_something_while_waiting();
|
|
16 cond_synchronize_rcu(s);
|
|
17 kfree(p);
|
|
18 return true;
|
|
19 }
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a
|
|
“cookie” from RCU,
|
|
then line 15 carries out other tasks,
|
|
and finally, line 16 returns immediately if a grace period has
|
|
elapsed in the meantime, but otherwise waits as required.
|
|
The need for <tt>get_state_synchronize_rcu</tt> and
|
|
<tt>cond_synchronize_rcu()</tt> has appeared quite recently,
|
|
so it is too early to tell whether they will stand the test of time.
|
|
|
|
<p>
|
|
RCU thus provides a range of tools to allow updaters to strike the
|
|
required tradeoff between latency, flexibility and CPU overhead.
|
|
|
|
<h3><a name="Composability">Composability</a></h3>
|
|
|
|
<p>
|
|
Composability has received much attention in recent years, perhaps in part
|
|
due to the collision of multicore hardware with object-oriented techniques
|
|
designed in single-threaded environments for single-threaded use.
|
|
And in theory, RCU read-side critical sections may be composed, and in
|
|
fact may be nested arbitrarily deeply.
|
|
In practice, as with all real-world implementations of composable
|
|
constructs, there are limitations.
|
|
|
|
<p>
|
|
Implementations of RCU for which <tt>rcu_read_lock()</tt>
|
|
and <tt>rcu_read_unlock()</tt> generate no code, such as
|
|
Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be
|
|
nested arbitrarily deeply.
|
|
After all, there is no overhead.
|
|
Except that if all these instances of <tt>rcu_read_lock()</tt>
|
|
and <tt>rcu_read_unlock()</tt> are visible to the compiler,
|
|
compilation will eventually fail due to exhausting memory,
|
|
mass storage, or user patience, whichever comes first.
|
|
If the nesting is not visible to the compiler, as is the case with
|
|
mutually recursive functions each in its own translation unit,
|
|
stack overflow will result.
|
|
If the nesting takes the form of loops, either the control variable
|
|
will overflow or (in the Linux kernel) you will get an RCU CPU stall warning.
|
|
Nevertheless, this class of RCU implementations is one
|
|
of the most composable constructs in existence.
|
|
|
|
<p>
|
|
RCU implementations that explicitly track nesting depth
|
|
are limited by the nesting-depth counter.
|
|
For example, the Linux kernel's preemptible RCU limits nesting to
|
|
<tt>INT_MAX</tt>.
|
|
This should suffice for almost all practical purposes.
|
|
That said, a consecutive pair of RCU read-side critical sections
|
|
between which there is an operation that waits for a grace period
|
|
cannot be enclosed in another RCU read-side critical section.
|
|
This is because it is not legal to wait for a grace period within
|
|
an RCU read-side critical section: To do so would result either
|
|
in deadlock or
|
|
in RCU implicitly splitting the enclosing RCU read-side critical
|
|
section, neither of which is conducive to a long-lived and prosperous
|
|
kernel.
|
|
|
|
<p>
|
|
It is worth noting that RCU is not alone in limiting composability.
|
|
For example, many transactional-memory implementations prohibit
|
|
composing a pair of transactions separated by an irrevocable
|
|
operation (for example, a network receive operation).
|
|
For another example, lock-based critical sections can be composed
|
|
surprisingly freely, but only if deadlock is avoided.
|
|
|
|
<p>
|
|
In short, although RCU read-side critical sections are highly composable,
|
|
care is required in some situations, just as is the case for any other
|
|
composable synchronization mechanism.
|
|
|
|
<h3><a name="Corner Cases">Corner Cases</a></h3>
|
|
|
|
<p>
|
|
A given RCU workload might have an endless and intense stream of
|
|
RCU read-side critical sections, perhaps even so intense that there
|
|
was never a point in time during which there was not at least one
|
|
RCU read-side critical section in flight.
|
|
RCU cannot allow this situation to block grace periods: As long as
|
|
all the RCU read-side critical sections are finite, grace periods
|
|
must also be finite.
|
|
|
|
<p>
|
|
That said, preemptible RCU implementations could potentially result
|
|
in RCU read-side critical sections being preempted for long durations,
|
|
which has the effect of creating a long-duration RCU read-side
|
|
critical section.
|
|
This situation can arise only in heavily loaded systems, but systems using
|
|
real-time priorities are of course more vulnerable.
|
|
Therefore, RCU priority boosting is provided to help deal with this
|
|
case.
|
|
That said, the exact requirements on RCU priority boosting will likely
|
|
evolve as more experience accumulates.
|
|
|
|
<p>
|
|
Other workloads might have very high update rates.
|
|
Although one can argue that such workloads should instead use
|
|
something other than RCU, the fact remains that RCU must
|
|
handle such workloads gracefully.
|
|
This requirement is another factor driving batching of grace periods,
|
|
but it is also the driving force behind the checks for large numbers
|
|
of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
|
|
Finally, high update rates should not delay RCU read-side critical
|
|
sections, although some read-side delays can occur when using
|
|
<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
|
|
of <tt>try_stop_cpus()</tt>.
|
|
(In the future, <tt>synchronize_rcu_expedited()</tt> will be
|
|
converted to use lighter-weight inter-processor interrupts (IPIs),
|
|
but this will still disturb readers, though to a much smaller degree.)
|
|
|
|
<p>
|
|
Although all three of these corner cases were understood in the early
|
|
1990s, a simple user-level test consisting of <tt>close(open(path))</tt>
|
|
in a tight loop
|
|
in the early 2000s suddenly provided a much deeper appreciation of the
|
|
high-update-rate corner case.
|
|
This test also motivated addition of some RCU code to react to high update
|
|
rates, for example, if a given CPU finds itself with more than 10,000
|
|
RCU callbacks queued, it will cause RCU to take evasive action by
|
|
more aggressively starting grace periods and more aggressively forcing
|
|
completion of grace-period processing.
|
|
This evasive action causes the grace period to complete more quickly,
|
|
but at the cost of restricting RCU's batching optimizations, thus
|
|
increasing the CPU overhead incurred by that grace period.
|
|
|
|
<h2><a name="Software-Engineering Requirements">
|
|
Software-Engineering Requirements</a></h2>
|
|
|
|
<p>
|
|
Between Murphy's Law and “To err is human”, it is necessary to
|
|
guard against mishaps and misuse:
|
|
|
|
<ol>
|
|
<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt>
|
|
everywhere that it is needed, so kernels built with
|
|
<tt>CONFIG_PROVE_RCU=y</tt> will spat if
|
|
<tt>rcu_dereference()</tt> is used outside of an
|
|
RCU read-side critical section.
|
|
Update-side code can use <tt>rcu_dereference_protected()</tt>,
|
|
which takes a
|
|
<a href="https://lwn.net/Articles/371986/">lockdep expression</a>
|
|
to indicate what is providing the protection.
|
|
If the indicated protection is not provided, a lockdep splat
|
|
is emitted.
|
|
|
|
<p>
|
|
Code shared between readers and updaters can use
|
|
<tt>rcu_dereference_check()</tt>, which also takes a
|
|
lockdep expression, and emits a lockdep splat if neither
|
|
<tt>rcu_read_lock()</tt> nor the indicated protection
|
|
is in place.
|
|
In addition, <tt>rcu_dereference_raw()</tt> is used in those
|
|
(hopefully rare) cases where the required protection cannot
|
|
be easily described.
|
|
Finally, <tt>rcu_read_lock_held()</tt> is provided to
|
|
allow a function to verify that it has been invoked within
|
|
an RCU read-side critical section.
|
|
I was made aware of this set of requirements shortly after Thomas
|
|
Gleixner audited a number of RCU uses.
|
|
<li> A given function might wish to check for RCU-related preconditions
|
|
upon entry, before using any other RCU API.
|
|
The <tt>rcu_lockdep_assert()</tt> does this job,
|
|
asserting the expression in kernels having lockdep enabled
|
|
and doing nothing otherwise.
|
|
<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt>
|
|
and <tt>rcu_dereference()</tt>, perhaps (incorrectly)
|
|
substituting a simple assignment.
|
|
To catch this sort of error, a given RCU-protected pointer may be
|
|
tagged with <tt>__rcu</tt>, after which running sparse
|
|
with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain
|
|
about simple-assignment accesses to that pointer.
|
|
Arnd Bergmann made me aware of this requirement, and also
|
|
supplied the needed
|
|
<a href="https://lwn.net/Articles/376011/">patch series</a>.
|
|
<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt>
|
|
will splat if a data element is passed to <tt>call_rcu()</tt>
|
|
twice in a row, without a grace period in between.
|
|
(This error is similar to a double free.)
|
|
The corresponding <tt>rcu_head</tt> structures that are
|
|
dynamically allocated are automatically tracked, but
|
|
<tt>rcu_head</tt> structures allocated on the stack
|
|
must be initialized with <tt>init_rcu_head_on_stack()</tt>
|
|
and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>.
|
|
Similarly, statically allocated non-stack <tt>rcu_head</tt>
|
|
structures must be initialized with <tt>init_rcu_head()</tt>
|
|
and cleaned up with <tt>destroy_rcu_head()</tt>.
|
|
Mathieu Desnoyers made me aware of this requirement, and also
|
|
supplied the needed
|
|
<a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>.
|
|
<li> An infinite loop in an RCU read-side critical section will
|
|
eventually trigger an RCU CPU stall warning splat, with
|
|
the duration of “eventually” being controlled by the
|
|
<tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or,
|
|
alternatively, by the
|
|
<tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs
|
|
parameter.
|
|
However, RCU is not obligated to produce this splat
|
|
unless there is a grace period waiting on that particular
|
|
RCU read-side critical section.
|
|
<p>
|
|
Some extreme workloads might intentionally delay
|
|
RCU grace periods, and systems running those workloads can
|
|
be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt>
|
|
to suppress the splats.
|
|
This kernel parameter may also be set via <tt>sysfs</tt>.
|
|
Furthermore, RCU CPU stall warnings are counter-productive
|
|
during sysrq dumps and during panics.
|
|
RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and
|
|
<tt>rcu_sysrq_end()</tt> API members to be called before
|
|
and after long sysrq dumps.
|
|
RCU also supplies the <tt>rcu_panic()</tt> notifier that is
|
|
automatically invoked at the beginning of a panic to suppress
|
|
further RCU CPU stall warnings.
|
|
|
|
<p>
|
|
This requirement made itself known in the early 1990s, pretty
|
|
much the first time that it was necessary to debug a CPU stall.
|
|
That said, the initial implementation in DYNIX/ptx was quite
|
|
generic in comparison with that of Linux.
|
|
<li> Although it would be very good to detect pointers leaking out
|
|
of RCU read-side critical sections, there is currently no
|
|
good way of doing this.
|
|
One complication is the need to distinguish between pointers
|
|
leaking and pointers that have been handed off from RCU to
|
|
some other synchronization mechanism, for example, reference
|
|
counting.
|
|
<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related
|
|
information is provided via both debugfs and event tracing.
|
|
<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and
|
|
<tt>rcu_dereference()</tt> to create typical linked
|
|
data structures can be surprisingly error-prone.
|
|
Therefore, RCU-protected
|
|
<a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a>
|
|
and, more recently, RCU-protected
|
|
<a href="https://lwn.net/Articles/612100/">hash tables</a>
|
|
are available.
|
|
Many other special-purpose RCU-protected data structures are
|
|
available in the Linux kernel and the userspace RCU library.
|
|
<li> Some linked structures are created at compile time, but still
|
|
require <tt>__rcu</tt> checking.
|
|
The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this
|
|
purpose.
|
|
<li> It is not necessary to use <tt>rcu_assign_pointer()</tt>
|
|
when creating linked structures that are to be published via
|
|
a single external pointer.
|
|
The <tt>RCU_INIT_POINTER()</tt> macro is provided for
|
|
this task and also for assigning <tt>NULL</tt> pointers
|
|
at runtime.
|
|
</ol>
|
|
|
|
<p>
|
|
This not a hard-and-fast list: RCU's diagnostic capabilities will
|
|
continue to be guided by the number and type of usage bugs found
|
|
in real-world RCU usage.
|
|
|
|
<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2>
|
|
|
|
<p>
|
|
The Linux kernel provides an interesting environment for all kinds of
|
|
software, including RCU.
|
|
Some of the relevant points of interest are as follows:
|
|
|
|
<ol>
|
|
<li> <a href="#Configuration">Configuration</a>.
|
|
<li> <a href="#Firmware Interface">Firmware Interface</a>.
|
|
<li> <a href="#Early Boot">Early Boot</a>.
|
|
<li> <a href="#Interrupts and NMIs">
|
|
Interrupts and non-maskable interrupts (NMIs)</a>.
|
|
<li> <a href="#Loadable Modules">Loadable Modules</a>.
|
|
<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
|
|
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
|
|
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
|
|
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
|
|
<li> <a href="#Memory Efficiency">Memory Efficiency</a>.
|
|
<li> <a href="#Performance, Scalability, Response Time, and Reliability">
|
|
Performance, Scalability, Response Time, and Reliability</a>.
|
|
</ol>
|
|
|
|
<p>
|
|
This list is probably incomplete, but it does give a feel for the
|
|
most notable Linux-kernel complications.
|
|
Each of the following sections covers one of the above topics.
|
|
|
|
<h3><a name="Configuration">Configuration</a></h3>
|
|
|
|
<p>
|
|
RCU's goal is automatic configuration, so that almost nobody
|
|
needs to worry about RCU's <tt>Kconfig</tt> options.
|
|
And for almost all users, RCU does in fact work well
|
|
“out of the box.”
|
|
|
|
<p>
|
|
However, there are specialized use cases that are handled by
|
|
kernel boot parameters and <tt>Kconfig</tt> options.
|
|
Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users
|
|
about new <tt>Kconfig</tt> options, which requires almost all of them
|
|
be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option.
|
|
|
|
<p>
|
|
This all should be quite obvious, but the fact remains that
|
|
Linus Torvalds recently had to
|
|
<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a>
|
|
me of this requirement.
|
|
|
|
<h3><a name="Firmware Interface">Firmware Interface</a></h3>
|
|
|
|
<p>
|
|
In many cases, kernel obtains information about the system from the
|
|
firmware, and sometimes things are lost in translation.
|
|
Or the translation is accurate, but the original message is bogus.
|
|
|
|
<p>
|
|
For example, some systems' firmware overreports the number of CPUs,
|
|
sometimes by a large factor.
|
|
If RCU naively believed the firmware, as it used to do,
|
|
it would create too many per-CPU kthreads.
|
|
Although the resulting system will still run correctly, the extra
|
|
kthreads needlessly consume memory and can cause confusion
|
|
when they show up in <tt>ps</tt> listings.
|
|
|
|
<p>
|
|
RCU must therefore wait for a given CPU to actually come online before
|
|
it can allow itself to believe that the CPU actually exists.
|
|
The resulting “ghost CPUs” (which are never going to
|
|
come online) cause a number of
|
|
<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>.
|
|
|
|
<h3><a name="Early Boot">Early Boot</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel's boot sequence is an interesting process,
|
|
and RCU is used early, even before <tt>rcu_init()</tt>
|
|
is invoked.
|
|
In fact, a number of RCU's primitives can be used as soon as the
|
|
initial task's <tt>task_struct</tt> is available and the
|
|
boot CPU's per-CPU variables are set up.
|
|
The read-side primitives (<tt>rcu_read_lock()</tt>,
|
|
<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>,
|
|
and <tt>rcu_access_pointer()</tt>) will operate normally very early on,
|
|
as will <tt>rcu_assign_pointer()</tt>.
|
|
|
|
<p>
|
|
Although <tt>call_rcu()</tt> may be invoked at any
|
|
time during boot, callbacks are not guaranteed to be invoked until after
|
|
the scheduler is fully up and running.
|
|
This delay in callback invocation is due to the fact that RCU does not
|
|
invoke callbacks until it is fully initialized, and this full initialization
|
|
cannot occur until after the scheduler has initialized itself to the
|
|
point where RCU can spawn and run its kthreads.
|
|
In theory, it would be possible to invoke callbacks earlier,
|
|
however, this is not a panacea because there would be severe restrictions
|
|
on what operations those callbacks could invoke.
|
|
|
|
<p>
|
|
Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
|
|
<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
|
|
(<a href="#Bottom-Half Flavor">discussed below</a>),
|
|
and
|
|
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>
|
|
will all operate normally
|
|
during very early boot, the reason being that there is only one CPU
|
|
and preemption is disabled.
|
|
This means that the call <tt>synchronize_rcu()</tt> (or friends)
|
|
itself is a quiescent
|
|
state and thus a grace period, so the early-boot implementation can
|
|
be a no-op.
|
|
|
|
<p>
|
|
Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt>
|
|
continue to operate normally through the remainder of boot, courtesy
|
|
of the fact that preemption is disabled across their RCU read-side
|
|
critical sections and also courtesy of the fact that there is still
|
|
only one CPU.
|
|
However, once the scheduler starts initializing, preemption is enabled.
|
|
There is still only a single CPU, but the fact that preemption is enabled
|
|
means that the no-op implementation of <tt>synchronize_rcu()</tt> no
|
|
longer works in <tt>CONFIG_PREEMPT=y</tt> kernels.
|
|
Therefore, as soon as the scheduler starts initializing, the early-boot
|
|
fastpath is disabled.
|
|
This means that <tt>synchronize_rcu()</tt> switches to its runtime
|
|
mode of operation where it posts callbacks, which in turn means that
|
|
any call to <tt>synchronize_rcu()</tt> will block until the corresponding
|
|
callback is invoked.
|
|
Unfortunately, the callback cannot be invoked until RCU's runtime
|
|
grace-period machinery is up and running, which cannot happen until
|
|
the scheduler has initialized itself sufficiently to allow RCU's
|
|
kthreads to be spawned.
|
|
Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler
|
|
initialization can result in deadlock.
|
|
|
|
<p>@@QQ@@
|
|
So what happens with <tt>synchronize_rcu()</tt> during
|
|
scheduler initialization for <tt>CONFIG_PREEMPT=n</tt>
|
|
kernels?
|
|
<p>@@QQA@@
|
|
In <tt>CONFIG_PREEMPT=n</tt> kernel, <tt>synchronize_rcu()</tt>
|
|
maps directly to <tt>synchronize_sched()</tt>.
|
|
Therefore, <tt>synchronize_rcu()</tt> works normally throughout
|
|
boot in <tt>CONFIG_PREEMPT=n</tt> kernels.
|
|
However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels,
|
|
so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt>
|
|
during scheduler initialization.
|
|
<p>@@QQE@@
|
|
|
|
<p>
|
|
I learned of these boot-time requirements as a result of a series of
|
|
system hangs.
|
|
|
|
<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel has interrupts, and RCU read-side critical sections are
|
|
legal within interrupt handlers and within interrupt-disabled regions
|
|
of code, as are invocations of <tt>call_rcu()</tt>.
|
|
|
|
<p>
|
|
Some Linux-kernel architectures can enter an interrupt handler from
|
|
non-idle process context, and then just never leave it, instead stealthily
|
|
transitioning back to process context.
|
|
This trick is sometimes used to invoke system calls from inside the kernel.
|
|
These “half-interrupts” mean that RCU has to be very careful
|
|
about how it counts interrupt nesting levels.
|
|
I learned of this requirement the hard way during a rewrite
|
|
of RCU's dyntick-idle code.
|
|
|
|
<p>
|
|
The Linux kernel has non-maskable interrupts (NMIs), and
|
|
RCU read-side critical sections are legal within NMI handlers.
|
|
Thankfully, RCU update-side primitives, including
|
|
<tt>call_rcu()</tt>, are prohibited within NMI handlers.
|
|
|
|
<p>
|
|
The name notwithstanding, some Linux-kernel architectures
|
|
can have nested NMIs, which RCU must handle correctly.
|
|
Andy Lutomirski
|
|
<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
|
|
with this requirement;
|
|
he also kindly surprised me with
|
|
<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
|
|
that meets this requirement.
|
|
|
|
<h3><a name="Loadable Modules">Loadable Modules</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel has loadable modules, and these modules can
|
|
also be unloaded.
|
|
After a given module has been unloaded, any attempt to call
|
|
one of its functions results in a segmentation fault.
|
|
The module-unload functions must therefore cancel any
|
|
delayed calls to loadable-module functions, for example,
|
|
any outstanding <tt>mod_timer()</tt> must be dealt with
|
|
via <tt>del_timer_sync()</tt> or similar.
|
|
|
|
<p>
|
|
Unfortunately, there is no way to cancel an RCU callback;
|
|
once you invoke <tt>call_rcu()</tt>, the callback function is
|
|
going to eventually be invoked, unless the system goes down first.
|
|
Because it is normally considered socially irresponsible to crash the system
|
|
in response to a module unload request, we need some other way
|
|
to deal with in-flight RCU callbacks.
|
|
|
|
<p>
|
|
RCU therefore provides
|
|
<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>,
|
|
which waits until all in-flight RCU callbacks have been invoked.
|
|
If a module uses <tt>call_rcu()</tt>, its exit function should therefore
|
|
prevent any future invocation of <tt>call_rcu()</tt>, then invoke
|
|
<tt>rcu_barrier()</tt>.
|
|
In theory, the underlying module-unload code could invoke
|
|
<tt>rcu_barrier()</tt> unconditionally, but in practice this would
|
|
incur unacceptable latencies.
|
|
|
|
<p>
|
|
Nikita Danilov noted this requirement for an analogous filesystem-unmount
|
|
situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
|
|
The need for <tt>rcu_barrier()</tt> for module unloading became
|
|
apparent later.
|
|
|
|
<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
|
|
|
|
<p>
|
|
The Linux kernel supports CPU hotplug, which means that CPUs
|
|
can come and go.
|
|
It is of course illegal to use any RCU API member from an offline CPU.
|
|
This requirement was present from day one in DYNIX/ptx, but
|
|
on the other hand, the Linux kernel's CPU-hotplug implementation
|
|
is “interesting.”
|
|
|
|
<p>
|
|
The Linux-kernel CPU-hotplug implementation has notifiers that
|
|
are used to allow the various kernel subsystems (including RCU)
|
|
to respond appropriately to a given CPU-hotplug operation.
|
|
Most RCU operations may be invoked from CPU-hotplug notifiers,
|
|
including even normal synchronous grace-period operations
|
|
such as <tt>synchronize_rcu()</tt>.
|
|
However, expedited grace-period operations such as
|
|
<tt>synchronize_rcu_expedited()</tt> are not supported,
|
|
due to the fact that current implementations block CPU-hotplug
|
|
operations, which could result in deadlock.
|
|
|
|
<p>
|
|
In addition, all-callback-wait operations such as
|
|
<tt>rcu_barrier()</tt> are also not supported, due to the
|
|
fact that there are phases of CPU-hotplug operations where
|
|
the outgoing CPU's callbacks will not be invoked until after
|
|
the CPU-hotplug operation ends, which could also result in deadlock.
|
|
|
|
<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
|
|
|
|
<p>
|
|
RCU depends on the scheduler, and the scheduler uses RCU to
|
|
protect some of its data structures.
|
|
This means the scheduler is forbidden from acquiring
|
|
the runqueue locks and the priority-inheritance locks
|
|
in the middle of an outermost RCU read-side critical section unless either
|
|
(1) it releases them before exiting that same
|
|
RCU read-side critical section, or
|
|
(2) interrupts are disabled across
|
|
that entire RCU read-side critical section.
|
|
This same prohibition also applies (recursively!) to any lock that is acquired
|
|
while holding any lock to which this prohibition applies.
|
|
Adhering to this rule prevents preemptible RCU from invoking
|
|
<tt>rcu_read_unlock_special()</tt> while either runqueue or
|
|
priority-inheritance locks are held, thus avoiding deadlock.
|
|
|
|
<p>
|
|
Prior to v4.4, it was only necessary to disable preemption across
|
|
RCU read-side critical sections that acquired scheduler locks.
|
|
In v4.4, expedited grace periods started using IPIs, and these
|
|
IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath.
|
|
Therefore, this expedited-grace-period change required disabling of
|
|
interrupts, not just preemption.
|
|
|
|
<p>
|
|
For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
|
|
implementation must be written carefully to avoid similar deadlocks.
|
|
In particular, <tt>rcu_read_unlock()</tt> must tolerate an
|
|
interrupt where the interrupt handler invokes both
|
|
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
|
|
This possibility requires <tt>rcu_read_unlock()</tt> to use
|
|
negative nesting levels to avoid destructive recursion via
|
|
interrupt handler's use of RCU.
|
|
|
|
<p>
|
|
This pair of mutual scheduler-RCU requirements came as a
|
|
<a href="https://lwn.net/Articles/453002/">complete surprise</a>.
|
|
|
|
<p>
|
|
As noted above, RCU makes use of kthreads, and it is necessary to
|
|
avoid excessive CPU-time accumulation by these kthreads.
|
|
This requirement was no surprise, but RCU's violation of it
|
|
when running context-switch-heavy workloads when built with
|
|
<tt>CONFIG_NO_HZ_FULL=y</tt>
|
|
<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>.
|
|
RCU has made good progress towards meeting this requirement, even
|
|
for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
|
|
but there is room for further improvement.
|
|
|
|
<h3><a name="Tracing and RCU">Tracing and RCU</a></h3>
|
|
|
|
<p>
|
|
It is possible to use tracing on RCU code, but tracing itself
|
|
uses RCU.
|
|
For this reason, <tt>rcu_dereference_raw_notrace()</tt>
|
|
is provided for use by tracing, which avoids the destructive
|
|
recursion that could otherwise ensue.
|
|
This API is also used by virtualization in some architectures,
|
|
where RCU readers execute in environments in which tracing
|
|
cannot be used.
|
|
The tracing folks both located the requirement and provided the
|
|
needed fix, so this surprise requirement was relatively painless.
|
|
|
|
<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>
|
|
|
|
<p>
|
|
Interrupting idle CPUs is considered socially unacceptable,
|
|
especially by people with battery-powered embedded systems.
|
|
RCU therefore conserves energy by detecting which CPUs are
|
|
idle, including tracking CPUs that have been interrupted from idle.
|
|
This is a large part of the energy-efficiency requirement,
|
|
so I learned of this via an irate phone call.
|
|
|
|
<p>
|
|
Because RCU avoids interrupting idle CPUs, it is illegal to
|
|
execute an RCU read-side critical section on an idle CPU.
|
|
(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat
|
|
if you try it.)
|
|
The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
|
|
event tracing is provided to work around this restriction.
|
|
In addition, <tt>rcu_is_watching()</tt> may be used to
|
|
test whether or not it is currently legal to run RCU read-side
|
|
critical sections on this CPU.
|
|
I learned of the need for diagnostics on the one hand
|
|
and <tt>RCU_NONIDLE()</tt> on the other while inspecting
|
|
idle-loop code.
|
|
Steven Rostedt supplied <tt>_rcuidle</tt> event tracing,
|
|
which is used quite heavily in the idle loop.
|
|
|
|
<p>
|
|
It is similarly socially unacceptable to interrupt an
|
|
<tt>nohz_full</tt> CPU running in userspace.
|
|
RCU must therefore track <tt>nohz_full</tt> userspace
|
|
execution.
|
|
And in
|
|
<a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a>
|
|
kernels, RCU must separately track idle CPUs on the one hand and
|
|
CPUs that are either idle or executing in userspace on the other.
|
|
In both cases, RCU must be able to sample state at two points in
|
|
time, and be able to determine whether or not some other CPU spent
|
|
any time idle and/or executing in userspace.
|
|
|
|
<p>
|
|
These energy-efficiency requirements have proven quite difficult to
|
|
understand and to meet, for example, there have been more than five
|
|
clean-sheet rewrites of RCU's energy-efficiency code, the last of
|
|
which was finally able to demonstrate
|
|
<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>.
|
|
As noted earlier,
|
|
I learned of many of these requirements via angry phone calls:
|
|
Flaming me on the Linux-kernel mailing list was apparently not
|
|
sufficient to fully vent their ire at RCU's energy-efficiency bugs!
|
|
|
|
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
|
|
|
|
<p>
|
|
Although small-memory non-realtime systems can simply use Tiny RCU,
|
|
code size is only one aspect of memory efficiency.
|
|
Another aspect is the size of the <tt>rcu_head</tt> structure
|
|
used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>.
|
|
Although this structure contains nothing more than a pair of pointers,
|
|
it does appear in many RCU-protected data structures, including
|
|
some that are size critical.
|
|
The <tt>page</tt> structure is a case in point, as evidenced by
|
|
the many occurrences of the <tt>union</tt> keyword within that structure.
|
|
|
|
<p>
|
|
This need for memory efficiency is one reason that RCU uses hand-crafted
|
|
singly linked lists to track the <tt>rcu_head</tt> structures that
|
|
are waiting for a grace period to elapse.
|
|
It is also the reason why <tt>rcu_head</tt> structures do not contain
|
|
debug information, such as fields tracking the file and line of the
|
|
<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them.
|
|
Although this information might appear in debug-only kernel builds at some
|
|
point, in the meantime, the <tt>->func</tt> field will often provide
|
|
the needed debug information.
|
|
|
|
<p>
|
|
However, in some cases, the need for memory efficiency leads to even
|
|
more extreme measures.
|
|
Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field
|
|
shares storage with a great many other structures that are used at
|
|
various points in the corresponding page's lifetime.
|
|
In order to correctly resolve certain
|
|
<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>,
|
|
the Linux kernel's memory-management subsystem needs a particular bit
|
|
to remain zero during all phases of grace-period processing,
|
|
and that bit happens to map to the bottom bit of the
|
|
<tt>rcu_head</tt> structure's <tt>->next</tt> field.
|
|
RCU makes this guarantee as long as <tt>call_rcu()</tt>
|
|
is used to post the callback, as opposed to <tt>kfree_rcu()</tt>
|
|
or some future “lazy”
|
|
variant of <tt>call_rcu()</tt> that might one day be created for
|
|
energy-efficiency purposes.
|
|
|
|
<h3><a name="Performance, Scalability, Response Time, and Reliability">
|
|
Performance, Scalability, Response Time, and Reliability</a></h3>
|
|
|
|
<p>
|
|
Expanding on the
|
|
<a href="#Performance and Scalability">earlier discussion</a>,
|
|
RCU is used heavily by hot code paths in performance-critical
|
|
portions of the Linux kernel's networking, security, virtualization,
|
|
and scheduling code paths.
|
|
RCU must therefore use efficient implementations, especially in its
|
|
read-side primitives.
|
|
To that end, it would be good if preemptible RCU's implementation
|
|
of <tt>rcu_read_lock()</tt> could be inlined, however, doing
|
|
this requires resolving <tt>#include</tt> issues with the
|
|
<tt>task_struct</tt> structure.
|
|
|
|
<p>
|
|
The Linux kernel supports hardware configurations with up to
|
|
4096 CPUs, which means that RCU must be extremely scalable.
|
|
Algorithms that involve frequent acquisitions of global locks or
|
|
frequent atomic operations on global variables simply cannot be
|
|
tolerated within the RCU implementation.
|
|
RCU therefore makes heavy use of a combining tree based on the
|
|
<tt>rcu_node</tt> structure.
|
|
RCU is required to tolerate all CPUs continuously invoking any
|
|
combination of RCU's runtime primitives with minimal per-operation
|
|
overhead.
|
|
In fact, in many cases, increasing load must <i>decrease</i> the
|
|
per-operation overhead, witness the batching optimizations for
|
|
<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>,
|
|
<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>.
|
|
As a general rule, RCU must cheerfully accept whatever the
|
|
rest of the Linux kernel decides to throw at it.
|
|
|
|
<p>
|
|
The Linux kernel is used for real-time workloads, especially
|
|
in conjunction with the
|
|
<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>.
|
|
The real-time-latency response requirements are such that the
|
|
traditional approach of disabling preemption across RCU
|
|
read-side critical sections is inappropriate.
|
|
Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore
|
|
use an RCU implementation that allows RCU read-side critical
|
|
sections to be preempted.
|
|
This requirement made its presence known after users made it
|
|
clear that an earlier
|
|
<a href="https://lwn.net/Articles/107930/">real-time patch</a>
|
|
did not meet their needs, in conjunction with some
|
|
<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a>
|
|
encountered by a very early version of the -rt patchset.
|
|
|
|
<p>
|
|
In addition, RCU must make do with a sub-100-microsecond real-time latency
|
|
budget.
|
|
In fact, on smaller systems with the -rt patchset, the Linux kernel
|
|
provides sub-20-microsecond real-time latencies for the whole kernel,
|
|
including RCU.
|
|
RCU's scalability and latency must therefore be sufficient for
|
|
these sorts of configurations.
|
|
To my surprise, the sub-100-microsecond real-time latency budget
|
|
<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf">
|
|
applies to even the largest systems [PDF]</a>,
|
|
up to and including systems with 4096 CPUs.
|
|
This real-time requirement motivated the grace-period kthread, which
|
|
also simplified handling of a number of race conditions.
|
|
|
|
<p>
|
|
Finally, RCU's status as a synchronization primitive means that
|
|
any RCU failure can result in arbitrary memory corruption that can be
|
|
extremely difficult to debug.
|
|
This means that RCU must be extremely reliable, which in
|
|
practice also means that RCU must have an aggressive stress-test
|
|
suite.
|
|
This stress-test suite is called <tt>rcutorture</tt>.
|
|
|
|
<p>
|
|
Although the need for <tt>rcutorture</tt> was no surprise,
|
|
the current immense popularity of the Linux kernel is posing
|
|
interesting—and perhaps unprecedented—validation
|
|
challenges.
|
|
To see this, keep in mind that there are well over one billion
|
|
instances of the Linux kernel running today, given Android
|
|
smartphones, Linux-powered televisions, and servers.
|
|
This number can be expected to increase sharply with the advent of
|
|
the celebrated Internet of Things.
|
|
|
|
<p>
|
|
Suppose that RCU contains a race condition that manifests on average
|
|
once per million years of runtime.
|
|
This bug will be occurring about three times per <i>day</i> across
|
|
the installed base.
|
|
RCU could simply hide behind hardware error rates, given that no one
|
|
should really expect their smartphone to last for a million years.
|
|
However, anyone taking too much comfort from this thought should
|
|
consider the fact that in most jurisdictions, a successful multi-year
|
|
test of a given mechanism, which might include a Linux kernel,
|
|
suffices for a number of types of safety-critical certifications.
|
|
In fact, rumor has it that the Linux kernel is already being used
|
|
in production for safety-critical applications.
|
|
I don't know about you, but I would feel quite bad if a bug in RCU
|
|
killed someone.
|
|
Which might explain my recent focus on validation and verification.
|
|
|
|
<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2>
|
|
|
|
<p>
|
|
One of the more surprising things about RCU is that there are now
|
|
no fewer than five <i>flavors</i>, or API families.
|
|
In addition, the primary flavor that has been the sole focus up to
|
|
this point has two different implementations, non-preemptible and
|
|
preemptible.
|
|
The other four flavors are listed below, with requirements for each
|
|
described in a separate section.
|
|
|
|
<ol>
|
|
<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a>
|
|
<li> <a href="#Sched Flavor">Sched Flavor</a>
|
|
<li> <a href="#Sleepable RCU">Sleepable RCU</a>
|
|
<li> <a href="#Tasks RCU">Tasks RCU</a>
|
|
</ol>
|
|
|
|
<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3>
|
|
|
|
<p>
|
|
The softirq-disable (AKA “bottom-half”,
|
|
hence the “_bh” abbreviations)
|
|
flavor of RCU, or <i>RCU-bh</i>, was developed by
|
|
Dipankar Sarma to provide a flavor of RCU that could withstand the
|
|
network-based denial-of-service attacks researched by Robert
|
|
Olsson.
|
|
These attacks placed so much networking load on the system
|
|
that some of the CPUs never exited softirq execution,
|
|
which in turn prevented those CPUs from ever executing a context switch,
|
|
which, in the RCU implementation of that time, prevented grace periods
|
|
from ever ending.
|
|
The result was an out-of-memory condition and a system hang.
|
|
|
|
<p>
|
|
The solution was the creation of RCU-bh, which does
|
|
<tt>local_bh_disable()</tt>
|
|
across its read-side critical sections, and which uses the transition
|
|
from one type of softirq processing to another as a quiescent state
|
|
in addition to context switch, idle, user mode, and offline.
|
|
This means that RCU-bh grace periods can complete even when some of
|
|
the CPUs execute in softirq indefinitely, thus allowing algorithms
|
|
based on RCU-bh to withstand network-based denial-of-service attacks.
|
|
|
|
<p>
|
|
Because
|
|
<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt>
|
|
disable and re-enable softirq handlers, any attempt to start a softirq
|
|
handlers during the
|
|
RCU-bh read-side critical section will be deferred.
|
|
In this case, <tt>rcu_read_unlock_bh()</tt>
|
|
will invoke softirq processing, which can take considerable time.
|
|
One can of course argue that this softirq overhead should be associated
|
|
with the code following the RCU-bh read-side critical section rather
|
|
than <tt>rcu_read_unlock_bh()</tt>, but the fact
|
|
is that most profiling tools cannot be expected to make this sort
|
|
of fine distinction.
|
|
For example, suppose that a three-millisecond-long RCU-bh read-side
|
|
critical section executes during a time of heavy networking load.
|
|
There will very likely be an attempt to invoke at least one softirq
|
|
handler during that three milliseconds, but any such invocation will
|
|
be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>.
|
|
This can of course make it appear at first glance as if
|
|
<tt>rcu_read_unlock_bh()</tt> was executing very slowly.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a>
|
|
includes
|
|
<tt>rcu_read_lock_bh()</tt>,
|
|
<tt>rcu_read_unlock_bh()</tt>,
|
|
<tt>rcu_dereference_bh()</tt>,
|
|
<tt>rcu_dereference_bh_check()</tt>,
|
|
<tt>synchronize_rcu_bh()</tt>,
|
|
<tt>synchronize_rcu_bh_expedited()</tt>,
|
|
<tt>call_rcu_bh()</tt>,
|
|
<tt>rcu_barrier_bh()</tt>, and
|
|
<tt>rcu_read_lock_bh_held()</tt>.
|
|
|
|
<h3><a name="Sched Flavor">Sched Flavor</a></h3>
|
|
|
|
<p>
|
|
Before preemptible RCU, waiting for an RCU grace period had the
|
|
side effect of also waiting for all pre-existing interrupt
|
|
and NMI handlers.
|
|
However, there are legitimate preemptible-RCU implementations that
|
|
do not have this property, given that any point in the code outside
|
|
of an RCU read-side critical section can be a quiescent state.
|
|
Therefore, <i>RCU-sched</i> was created, which follows “classic”
|
|
RCU in that an RCU-sched grace period waits for for pre-existing
|
|
interrupt and NMI handlers.
|
|
In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched
|
|
APIs have identical implementations, while kernels built with
|
|
<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each.
|
|
|
|
<p>
|
|
Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels,
|
|
<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
|
|
disable and re-enable preemption, respectively.
|
|
This means that if there was a preemption attempt during the
|
|
RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt>
|
|
will enter the scheduler, with all the latency and overhead entailed.
|
|
Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look
|
|
as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly.
|
|
However, the highest-priority task won't be preempted, so that task
|
|
will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a>
|
|
includes
|
|
<tt>rcu_read_lock_sched()</tt>,
|
|
<tt>rcu_read_unlock_sched()</tt>,
|
|
<tt>rcu_read_lock_sched_notrace()</tt>,
|
|
<tt>rcu_read_unlock_sched_notrace()</tt>,
|
|
<tt>rcu_dereference_sched()</tt>,
|
|
<tt>rcu_dereference_sched_check()</tt>,
|
|
<tt>synchronize_sched()</tt>,
|
|
<tt>synchronize_rcu_sched_expedited()</tt>,
|
|
<tt>call_rcu_sched()</tt>,
|
|
<tt>rcu_barrier_sched()</tt>, and
|
|
<tt>rcu_read_lock_sched_held()</tt>.
|
|
However, anything that disables preemption also marks an RCU-sched
|
|
read-side critical section, including
|
|
<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>,
|
|
<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>,
|
|
and so on.
|
|
|
|
<h3><a name="Sleepable RCU">Sleepable RCU</a></h3>
|
|
|
|
<p>
|
|
For well over a decade, someone saying “I need to block within
|
|
an RCU read-side critical section” was a reliable indication
|
|
that this someone did not understand RCU.
|
|
After all, if you are always blocking in an RCU read-side critical
|
|
section, you can probably afford to use a higher-overhead synchronization
|
|
mechanism.
|
|
However, that changed with the advent of the Linux kernel's notifiers,
|
|
whose RCU read-side critical
|
|
sections almost never sleep, but sometimes need to.
|
|
This resulted in the introduction of
|
|
<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>,
|
|
or <i>SRCU</i>.
|
|
|
|
<p>
|
|
SRCU allows different domains to be defined, with each such domain
|
|
defined by an instance of an <tt>srcu_struct</tt> structure.
|
|
A pointer to this structure must be passed in to each SRCU function,
|
|
for example, <tt>synchronize_srcu(&ss)</tt>, where
|
|
<tt>ss</tt> is the <tt>srcu_struct</tt> structure.
|
|
The key benefit of these domains is that a slow SRCU reader in one
|
|
domain does not delay an SRCU grace period in some other domain.
|
|
That said, one consequence of these domains is that read-side code
|
|
must pass a “cookie” from <tt>srcu_read_lock()</tt>
|
|
to <tt>srcu_read_unlock()</tt>, for example, as follows:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int idx;
|
|
2
|
|
3 idx = srcu_read_lock(&ss);
|
|
4 do_something();
|
|
5 srcu_read_unlock(&ss, idx);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
As noted above, it is legal to block within SRCU read-side critical sections,
|
|
however, with great power comes great responsibility.
|
|
If you block forever in one of a given domain's SRCU read-side critical
|
|
sections, then that domain's grace periods will also be blocked forever.
|
|
Of course, one good way to block forever is to deadlock, which can
|
|
happen if any operation in a given domain's SRCU read-side critical
|
|
section can block waiting, either directly or indirectly, for that domain's
|
|
grace period to elapse.
|
|
For example, this results in a self-deadlock:
|
|
|
|
<blockquote>
|
|
<pre>
|
|
1 int idx;
|
|
2
|
|
3 idx = srcu_read_lock(&ss);
|
|
4 do_something();
|
|
5 synchronize_srcu(&ss);
|
|
6 srcu_read_unlock(&ss, idx);
|
|
</pre>
|
|
</blockquote>
|
|
|
|
<p>
|
|
However, if line 5 acquired a mutex that was held across
|
|
a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>,
|
|
deadlock would still be possible.
|
|
Furthermore, if line 5 acquired a mutex that was held across
|
|
a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>,
|
|
and if an <tt>ss1</tt>-domain SRCU read-side critical section
|
|
acquired another mutex that was held across as <tt>ss</tt>-domain
|
|
<tt>synchronize_srcu()</tt>,
|
|
deadlock would again be possible.
|
|
Such a deadlock cycle could extend across an arbitrarily large number
|
|
of different SRCU domains.
|
|
Again, with great power comes great responsibility.
|
|
|
|
<p>
|
|
Unlike the other RCU flavors, SRCU read-side critical sections can
|
|
run on idle and even offline CPUs.
|
|
This ability requires that <tt>srcu_read_lock()</tt> and
|
|
<tt>srcu_read_unlock()</tt> contain memory barriers, which means
|
|
that SRCU readers will run a bit slower than would RCU readers.
|
|
It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
|
|
API, which, in combination with <tt>srcu_read_unlock()</tt>,
|
|
guarantees a full memory barrier.
|
|
|
|
<p>
|
|
The
|
|
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
|
|
includes
|
|
<tt>srcu_read_lock()</tt>,
|
|
<tt>srcu_read_unlock()</tt>,
|
|
<tt>srcu_dereference()</tt>,
|
|
<tt>srcu_dereference_check()</tt>,
|
|
<tt>synchronize_srcu()</tt>,
|
|
<tt>synchronize_srcu_expedited()</tt>,
|
|
<tt>call_srcu()</tt>,
|
|
<tt>srcu_barrier()</tt>, and
|
|
<tt>srcu_read_lock_held()</tt>.
|
|
It also includes
|
|
<tt>DEFINE_SRCU()</tt>,
|
|
<tt>DEFINE_STATIC_SRCU()</tt>, and
|
|
<tt>init_srcu_struct()</tt>
|
|
APIs for defining and initializing <tt>srcu_struct</tt> structures.
|
|
|
|
<h3><a name="Tasks RCU">Tasks RCU</a></h3>
|
|
|
|
<p>
|
|
Some forms of tracing use “tramopolines” to handle the
|
|
binary rewriting required to install different types of probes.
|
|
It would be good to be able to free old trampolines, which sounds
|
|
like a job for some form of RCU.
|
|
However, because it is necessary to be able to install a trace
|
|
anywhere in the code, it is not possible to use read-side markers
|
|
such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
|
|
In addition, it does not work to have these markers in the trampoline
|
|
itself, because there would need to be instructions following
|
|
<tt>rcu_read_unlock()</tt>.
|
|
Although <tt>synchronize_rcu()</tt> would guarantee that execution
|
|
reached the <tt>rcu_read_unlock()</tt>, it would not be able to
|
|
guarantee that execution had completely left the trampoline.
|
|
|
|
<p>
|
|
The solution, in the form of
|
|
<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>,
|
|
is to have implicit
|
|
read-side critical sections that are delimited by voluntary context
|
|
switches, that is, calls to <tt>schedule()</tt>,
|
|
<tt>cond_resched_rcu_qs()</tt>, and
|
|
<tt>synchronize_rcu_tasks()</tt>.
|
|
In addition, transitions to and from userspace execution also delimit
|
|
tasks-RCU read-side critical sections.
|
|
|
|
<p>
|
|
The tasks-RCU API is quite compact, consisting only of
|
|
<tt>call_rcu_tasks()</tt>,
|
|
<tt>synchronize_rcu_tasks()</tt>, and
|
|
<tt>rcu_barrier_tasks()</tt>.
|
|
|
|
<h2><a name="Possible Future Changes">Possible Future Changes</a></h2>
|
|
|
|
<p>
|
|
One of the tricks that RCU uses to attain update-side scalability is
|
|
to increase grace-period latency with increasing numbers of CPUs.
|
|
If this becomes a serious problem, it will be necessary to rework the
|
|
grace-period state machine so as to avoid the need for the additional
|
|
latency.
|
|
|
|
<p>
|
|
Expedited grace periods scan the CPUs, so their latency and overhead
|
|
increases with increasing numbers of CPUs.
|
|
If this becomes a serious problem on large systems, it will be necessary
|
|
to do some redesign to avoid this scalability problem.
|
|
|
|
<p>
|
|
RCU disables CPU hotplug in a few places, perhaps most notably in the
|
|
expedited grace-period and <tt>rcu_barrier()</tt> operations.
|
|
If there is a strong reason to use expedited grace periods in CPU-hotplug
|
|
notifiers, it will be necessary to avoid disabling CPU hotplug.
|
|
This would introduce some complexity, so there had better be a <i>very</i>
|
|
good reason.
|
|
|
|
<p>
|
|
The tradeoff between grace-period latency on the one hand and interruptions
|
|
of other CPUs on the other hand may need to be re-examined.
|
|
The desire is of course for zero grace-period latency as well as zero
|
|
interprocessor interrupts undertaken during an expedited grace period
|
|
operation.
|
|
While this ideal is unlikely to be achievable, it is quite possible that
|
|
further improvements can be made.
|
|
|
|
<p>
|
|
The multiprocessor implementations of RCU use a combining tree that
|
|
groups CPUs so as to reduce lock contention and increase cache locality.
|
|
However, this combining tree does not spread its memory across NUMA
|
|
nodes nor does it align the CPU groups with hardware features such
|
|
as sockets or cores.
|
|
Such spreading and alignment is currently believed to be unnecessary
|
|
because the hotpath read-side primitives do not access the combining
|
|
tree, nor does <tt>call_rcu()</tt> in the common case.
|
|
If you believe that your architecture needs such spreading and alignment,
|
|
then your architecture should also benefit from the
|
|
<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set
|
|
to the number of CPUs in a socket, NUMA node, or whatever.
|
|
If the number of CPUs is too large, use a fraction of the number of
|
|
CPUs.
|
|
If the number of CPUs is a large prime number, well, that certainly
|
|
is an “interesting” architectural choice!
|
|
More flexible arrangements might be considered, but only if
|
|
<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only
|
|
if the inadequacy has been demonstrated by a carefully run and
|
|
realistic system-level workload.
|
|
|
|
<p>
|
|
Please note that arrangements that require RCU to remap CPU numbers will
|
|
require extremely good demonstration of need and full exploration of
|
|
alternatives.
|
|
|
|
<p>
|
|
There is an embarrassingly large number of flavors of RCU, and this
|
|
number has been increasing over time.
|
|
Perhaps it will be possible to combine some at some future date.
|
|
|
|
<p>
|
|
RCU's various kthreads are reasonably recent additions.
|
|
It is quite likely that adjustments will be required to more gracefully
|
|
handle extreme loads.
|
|
It might also be necessary to be able to relate CPU utilization by
|
|
RCU's kthreads and softirq handlers to the code that instigated this
|
|
CPU utilization.
|
|
For example, RCU callback overhead might be charged back to the
|
|
originating <tt>call_rcu()</tt> instance, though probably not
|
|
in production kernels.
|
|
|
|
<h2><a name="Summary">Summary</a></h2>
|
|
|
|
<p>
|
|
This document has presented more than two decade's worth of RCU
|
|
requirements.
|
|
Given that the requirements keep changing, this will not be the last
|
|
word on this subject, but at least it serves to get an important
|
|
subset of the requirements set forth.
|
|
|
|
<h2><a name="Acknowledgments">Acknowledgments</a></h2>
|
|
|
|
I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar,
|
|
Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and
|
|
Andy Lutomirski for their help in rendering
|
|
this article human readable, and to Michelle Rankin for her support
|
|
of this effort.
|
|
Other contributions are acknowledged in the Linux kernel's git archive.
|
|
The cartoon is copyright (c) 2013 by Melissa Broussard,
|
|
and is provided
|
|
under the terms of the Creative Commons Attribution-Share Alike 3.0
|
|
United States license.
|
|
|
|
<p>@@QQAL@@
|
|
|
|
</body></html>
|