diff --git a/.gitignore b/.gitignore
index c2ed4ecb0acd2..0c39aa20b6ba8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -33,6 +33,7 @@
*.lzo
*.patch
*.gcno
+*.ll
modules.builtin
Module.symvers
*.dwo
diff --git a/.mailmap b/.mailmap
index d2aeb146efed7..5273cfd70ad62 100644
--- a/.mailmap
+++ b/.mailmap
@@ -146,6 +146,8 @@ Santosh Shilimkar
+The segments are as follows:
+
+1.	RCU_DONE_TAIL: Callbacks handled by a prior grace period
+	and therefore ready to invoke.
+2.	RCU_WAIT_TAIL: Callbacks waiting on the current grace period.
+3.	RCU_NEXT_READY_TAIL: Callbacks that will wait on the next
+	grace period.
+4.	RCU_NEXT_TAIL: Callbacks not yet associated with a specific
+	grace period.
+
+The ->head pointer references the first callback or
+is NULL if the list contains no callbacks (which is
+not the same as being empty).
+Each element of the ->tails[] array references the
+->next pointer of the last callback in the corresponding
+segment of the list, or the list's ->head pointer if
+that segment and all previous segments are empty.
+If the corresponding segment is empty but some previous segment is
+not empty, then the array element is identical to its predecessor.
+Older callbacks are closer to the head of the list, and new callbacks
+are added at the tail.
+This relationship between the ->head pointer, the
+->tails[] array, and the callbacks is shown in this
+diagram:
+
+ In this figure, the ->head pointer references the
+first
+RCU callback in the list.
+The ->tails[RCU_DONE_TAIL] array element references
+the ->head pointer itself, indicating that none
+of the callbacks is ready to invoke.
+The ->tails[RCU_WAIT_TAIL] array element references callback
+CB 2's ->next pointer, which indicates that
+CB 1 and CB 2 are both waiting on the current grace period,
+give or take possible disagreements about exactly which grace period
+is the current one.
+The ->tails[RCU_NEXT_READY_TAIL] array element
+references the same RCU callback that ->tails[RCU_WAIT_TAIL]
+does, which indicates that there are no callbacks waiting on the next
+RCU grace period.
+The ->tails[RCU_NEXT_TAIL] array element references
+CB 4's ->next pointer, indicating that all the
+remaining RCU callbacks have not yet been assigned to an RCU grace
+period.
+Note that the ->tails[RCU_NEXT_TAIL] array element
+always references the last RCU callback's ->next pointer
+unless the callback list is empty, in which case it references
+the ->head pointer.
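As an illustration of these invariants, here is a minimal sketch, with an illustrative function name rather than the kernel's actual helper, of how emptiness of an individual segment follows from the ->tails[] layout described above:

    /*
     * Sketch only: a segment is empty when its tail pointer equals
     * the previous segment's tail pointer (or the address of ->head
     * for the first segment), since nothing then lies between the
     * two boundaries.
     */
    static bool segcblist_segempty(struct rcu_segcblist *rsclp, int seg)
    {
            if (seg == RCU_DONE_TAIL)
                    return &rsclp->head == rsclp->tails[RCU_DONE_TAIL];
            return rsclp->tails[seg - 1] == rsclp->tails[seg];
    }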
+
+
+There is one additional important special case for the
+->tails[RCU_NEXT_TAIL] array element: It can be NULL
+when this list is disabled.
+Lists are disabled when the corresponding CPU is offline or when
+the corresponding CPU's callbacks are offloaded to a kthread,
+both of which are described elsewhere.
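A test for this disabled state might look like the following sketch (illustrative name, assuming the struct shown later in this document):

    /* Sketch: a segmented callback list is disabled when callbacks
     * may not be posted to it, marked by a NULL RCU_NEXT_TAIL. */
    static bool segcblist_is_enabled(struct rcu_segcblist *rsclp)
    {
            return rsclp->tails[RCU_NEXT_TAIL] != NULL;
    }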
+
+ CPUs advance their callbacks from the
+RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the
+RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments
+as grace periods advance.
+
+ The ->gp_seq[] array records grace-period
+numbers corresponding to the list segments.
+This is what allows different CPUs to have different ideas as to
+which is the current grace period while still avoiding premature
+invocation of their callbacks.
+In particular, this allows CPUs that go idle for extended periods
+to determine which of their callbacks are ready to be invoked after
+reawakening.
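A greatly simplified sketch of this advancement follows; the kernel's real code must also assign grace-period numbers to newly arrived callbacks and handle several other details, so treat this only as an illustration of the ->gp_seq[] comparison:

    /*
     * Greatly simplified sketch: merge every segment whose recorded
     * grace period has completed into the RCU_DONE_TAIL segment,
     * making its callbacks ready to invoke.  The cast-based
     * comparison is safe against counter wraparound.
     */
    static void segcblist_advance(struct rcu_segcblist *rsclp,
                                  unsigned long completed_gp_seq)
    {
            int i, j;

            for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++) {
                    if ((long)(completed_gp_seq - rsclp->gp_seq[i]) < 0)
                            break;  /* This segment is still waiting. */
                    rsclp->tails[RCU_DONE_TAIL] = rsclp->tails[i];
            }

            /* Emptied segments now coincide with RCU_DONE_TAIL. */
            for (j = RCU_WAIT_TAIL; j < i; j++)
                    rsclp->tails[j] = rsclp->tails[RCU_DONE_TAIL];
    }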
+
+ The ->len counter contains the number of
+callbacks in ->head, and the
+->len_lazy contains the number of those callbacks that
+are known to only free memory, and whose invocation can therefore
+be safely deferred.
+
+ Important note: It is the ->len field that
+determines whether or not there are callbacks associated with
+this rcu_segcblist structure, not the ->head
+pointer.
+The reason for this is that all the ready-to-invoke callbacks
+(that is, those in the RCU_DONE_TAIL segment) are extracted
+all at once at callback-invocation time.
+If callback invocation must be postponed, for example, because a
+high-priority process just woke up on this CPU, then the remaining
+callbacks are placed back on the RCU_DONE_TAIL segment.
+Either way, the ->len and ->len_lazy counts
+are adjusted after the corresponding callbacks have been invoked, and so
+again it is the ->len count that accurately reflects whether
+or not there are callbacks associated with this rcu_segcblist
+structure.
+Of course, off-CPU sampling of the ->len count requires
+the use of appropriate synchronization, for example, memory barriers.
+This synchronization can be a bit subtle, particularly in the case
+of rcu_barrier().
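That rule can be captured in a one-line helper, sketched here with READ_ONCE() standing in for the fuller synchronization that off-CPU sampling requires (illustrative name):

    /*
     * Sketch: whether callbacks are associated with the structure is
     * judged by ->len, never by ->head, because ready-to-invoke
     * callbacks may be temporarily extracted while ->len still
     * accounts for them.
     */
    static bool segcblist_has_callbacks(struct rcu_segcblist *rsclp)
    {
            return READ_ONCE(rsclp->len) != 0;
    }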
+
-The ->nxtlist pointer and the
-->nxttail[] array form a four-segment list with
-older callbacks near the head and newer ones near the tail.
-Each segment contains callbacks with the corresponding relationship
-to the current grace period.
-The pointer out of the end of each of the four segments is referenced
-by the element of the ->nxttail[] array indexed by
-RCU_DONE_TAIL (for callbacks handled by a prior grace period),
-RCU_WAIT_TAIL (for callbacks waiting on the current grace period),
-RCU_NEXT_READY_TAIL (for callbacks that will wait on the next
-grace period), and
-RCU_NEXT_TAIL (for callbacks that are not yet associated
-with a specific grace period)
-respectively, as shown in the following figure.
-
- In this figure, the ->nxtlist pointer references the
-first
-RCU callback in the list.
-The ->nxttail[RCU_DONE_TAIL] array element references
-the ->nxtlist pointer itself, indicating that none
-of the callbacks is ready to invoke.
-The ->nxttail[RCU_WAIT_TAIL] array element references callback
-CB 2's ->next pointer, which indicates that
-CB 1 and CB 2 are both waiting on the current grace period.
-The ->nxttail[RCU_NEXT_READY_TAIL] array element
-references the same RCU callback that ->nxttail[RCU_WAIT_TAIL]
-does, which indicates that there are no callbacks waiting on the next
-RCU grace period.
-The ->nxttail[RCU_NEXT_TAIL] array element references
-CB 4's ->next pointer, indicating that all the
-remaining RCU callbacks have not yet been assigned to an RCU grace
-period.
-Note that the ->nxttail[RCU_NEXT_TAIL] array element
-always references the last RCU callback's ->next pointer
-unless the callback list is empty, in which case it references
-the ->nxtlist pointer.
-
- CPUs advance their callbacks from the
-RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the
-RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments
-as grace periods advance.
+ The ->cblist structure is the segmented callback list
+described earlier.
The CPU advances the callbacks in its rcu_data structure
whenever it notices that another RCU grace period has completed.
The CPU detects the completion of an RCU grace period by noticing
@@ -1049,16 +1135,7 @@ Introduction
The rcu_state Structure
Sizing the rcu_node Array
Finally, lines 64-66 produce an error if the maximum number of
CPUs is too large for the specified fanout.
+
+The rcu_segcblist Structure
+
+The rcu_segcblist structure maintains a segmented list of
+callbacks as follows:
+
+
+ 1 #define RCU_DONE_TAIL 0
+ 2 #define RCU_WAIT_TAIL 1
+ 3 #define RCU_NEXT_READY_TAIL 2
+ 4 #define RCU_NEXT_TAIL 3
+ 5 #define RCU_CBLIST_NSEGS 4
+ 6
+ 7 struct rcu_segcblist {
+ 8 struct rcu_head *head;
+ 9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
+10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
+11 long len;
+12 long len_lazy;
+13 };
+
+
The rcu_data Structure
@@ -983,62 +1113,18 @@ RCU Callback Handling
as follows:
- 1 struct rcu_head *nxtlist;
- 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
- 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
- 4 long qlen_lazy;
- 5 long qlen;
- 6 long qlen_last_fqs_check;
+ 1 struct rcu_segcblist cblist;
+ 2 long qlen_last_fqs_check;
+ 3 unsigned long n_cbs_invoked;
+ 4 unsigned long n_nocbs_invoked;
+ 5 unsigned long n_cbs_orphaned;
+ 6 unsigned long n_cbs_adopted;
7 unsigned long n_force_qs_snap;
- 8 unsigned long n_cbs_invoked;
- 9 unsigned long n_cbs_orphaned;
-10 unsigned long n_cbs_adopted;
-11 long blimit;
+ 8 long blimit;
-
-
-
RCU Callback Handling
->completed field is updated at the end of each
grace period.
-
-The ->nxtcompleted[] array records grace-period
-numbers corresponding to the list segments.
-This allows CPUs that go idle for extended periods to determine
-which of their callbacks are ready to be invoked after reawakening.
-
-
-The ->qlen counter contains the number of
-callbacks in ->nxtlist, and the
-->qlen_lazy contains the number of those callbacks that
-are known to only free memory, and whose invocation can therefore
-be safely deferred.
+
The ->qlen_last_fqs_check and
->n_force_qs_snap coordinate the forcing of
quiescent states from call_rcu() and friends when callback
@@ -1069,6 +1146,10 @@
Finally, the ->blimit counter is the maximum number of
RCU callbacks that may be invoked at a given time.
@@ -1104,6 +1185,9 @@
The ->dynticks_nesting field counts the
@@ -1117,11 +1201,32 @@
-Finally, the ->dynticks field counts the corresponding
+The ->dynticks field counts the corresponding
CPU's transitions to and from dyntick-idle mode, so that this counter
has an even value when the CPU is in dyntick-idle mode and an odd
value otherwise.
+
+The ->rcu_need_heavy_qs field is used
+to record the fact that the RCU core code would really like to
+see a quiescent state from the corresponding CPU, so much so that
+it is willing to call for heavy-weight dyntick-counter operations.
+This flag is checked by RCU's context-switch and cond_resched()
+code, which provide a momentary idle sojourn in response.
+
+The ->rcu_qs_ctr field is used to record
+quiescent states from cond_resched().
+Because cond_resched() can execute quite frequently, this
+must be quite lightweight, as in a non-atomic increment of this
+per-CPU field.
+
+Finally, the ->rcu_urgent_qs field is used to record
+the fact that the RCU core code would really like to see a quiescent
+state from the corresponding CPU, with the various other fields indicating
+just how badly RCU wants this quiescent state.
+This flag is checked by RCU's context-switch and cond_resched()
+code, which, if nothing else, non-atomically increment ->rcu_qs_ctr
+in response.
+
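The following sketch shows the general pattern described above; the per-CPU variable and function names are illustrative, not the kernel's actual ones:

    #include <linux/percpu.h>

    /* Illustrative per-CPU state, not the kernel's actual variables. */
    DEFINE_PER_CPU(unsigned long, example_qs_ctr);
    DEFINE_PER_CPU(bool, example_urgent_qs);

    /*
     * Sketch: RCU's context-switch and cond_resched() hooks check the
     * urgent-quiescent-state flag and, if it is set, report a
     * lightweight quiescent state via a non-atomic increment of a
     * per-CPU counter.
     */
    static void example_note_qs(void)
    {
            if (raw_cpu_read(example_urgent_qs)) {
                    raw_cpu_write(example_urgent_qs, false);
                    raw_cpu_inc(example_qs_ctr);  /* Cheap non-atomic increment. */
            }
    }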
Quick Quiz:
- So what happens with synchronize_rcu() during
- scheduler initialization for CONFIG_PREEMPT=n
- kernels?
+ How can RCU possibly handle grace periods before all of its
+ kthreads have been spawned???
Answer:
- In CONFIG_PREEMPT=n kernel, synchronize_rcu()
- maps directly to synchronize_sched().
- Therefore, synchronize_rcu() works normally throughout
- boot in CONFIG_PREEMPT=n kernels.
- However, your code must also work in CONFIG_PREEMPT=y kernels,
- so it is still necessary to avoid invoking synchronize_rcu()
- during scheduler initialization.
+ Very carefully!
+
+
+
+ During the “dead zone” between the time that the
+ scheduler spawns the first task and the time that all of RCU's
+ kthreads have been spawned, all synchronous grace periods are
+ handled by the expedited grace-period mechanism.
+ At runtime, this expedited mechanism relies on workqueues, but
+ during the dead zone the requesting task itself drives the
+ desired expedited grace period.
+ Because dead-zone execution takes place within task context,
+ everything works.
+ Once the dead zone ends, expedited grace periods go back to
+ using workqueues, as is required to avoid problems that would
+ otherwise occur when a user task received a POSIX signal while
+ driving an expedited grace period.
+
+ And yes, this does mean that it is unhelpful to send POSIX
+ signals to random tasks between the time that the scheduler
+ spawns its first kthread and the time that RCU's kthreads
+ have all been spawned.
+ If there ever turns out to be a good reason for sending POSIX
+ signals during that time, appropriate adjustments will be made.
+ (If it turns out that POSIX signals are sent during this time for
+ no good reason, other adjustments will be made, appropriate
+ or otherwise.)
+Important note: The rcu_barrier() function is not,
+repeat, not, obligated to wait for a grace period.
+It is instead only required to wait for RCU callbacks that have
+already been posted.
+Therefore, if there are no RCU callbacks posted anywhere in the system,
+rcu_barrier() is within its rights to return immediately.
+Even if there are callbacks posted, rcu_barrier() does not
+necessarily need to wait for a grace period.
+
Quick Quiz:
+ Wait a minute!
+ Each RCU callback must wait for a grace period to complete,
+ and rcu_barrier() must wait for each pre-existing
+ callback to be invoked.
+ Doesn't rcu_barrier() therefore need to wait for
+ a full grace period if there is even one callback posted anywhere
+ in the system?
Answer:
+ Absolutely not!!!
+
+
+
+ Yes, each RCU callback must wait for a grace period to complete,
+ but it might well be partly (or even completely) finished waiting
+ by the time rcu_barrier() is invoked.
+ In that case, rcu_barrier() need only wait for the
+ remaining portion of the grace period to elapse.
+ So even if there are quite a few callbacks posted,
+ rcu_barrier() might well return quite quickly.
+
+ So if you need to wait for a grace period as well as for all
+ pre-existing callbacks, you will need to invoke both
+ synchronize_rcu() and rcu_barrier().
+ If latency is a concern, you can always use workqueues
+ to invoke them concurrently.
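A sketch of the pattern recommended in this answer, as it might appear in a module-exit path (the function name is illustrative):

    /*
     * Sketch: wait for both a full grace period and all pre-existing
     * callbacks before tearing down shared state.  Neither call alone
     * provides both guarantees.
     */
    static void example_teardown(void)
    {
            synchronize_rcu();      /* All pre-existing readers are done. */
            rcu_barrier();          /* All pre-existing callbacks have run. */
    }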
The Linux kernel supports CPU hotplug, which means that CPUs
can come and go.
-It is of course illegal to use any RCU API member from an offline CPU.
+It is of course illegal to use any RCU API member from an offline CPU,
+with the exception of SRCU read-side
+critical sections.
This requirement was present from day one in DYNIX/ptx, but
on the other hand, the Linux kernel's CPU-hotplug implementation
is “interesting.”
@@ -2310,19 +2375,18 @@
-In addition, all-callback-wait operations such as
+However, all-callback-wait operations such as
rcu_barrier() are also not supported, due to the
fact that there are phases of CPU-hotplug operations where the outgoing
CPU's callbacks will not be invoked until after the CPU-hotplug operation
ends, which could also result in deadlock.
+Furthermore, rcu_barrier() blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock
+when invoked from a CPU-hotplug notifier.
+Also unlike other RCU flavors, SRCU's callbacks-wait function
+srcu_barrier() may be invoked from CPU-hotplug notifiers,
+though this is not necessarily a good idea.
+The reason that this is possible is that SRCU is insensitive
+to whether or not a CPU is online, which means that srcu_barrier()
+need not exclude CPU-hotplug operations.
+
+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
+a locking bottleneck present in prior kernel versions.
+Although this will allow users to put much heavier stress on
+call_srcu(), it is important to note that SRCU does not
+yet take any special steps to deal with callback flooding.
+So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
+you are probably totally OK, but if you intend to post (say) 1,000,000
+SRCU callbacks per second per CPU, please run some tests first.
+SRCU just might need a few adjustments to deal with that sort of load.
+Of course, your mileage may vary based on the speed of your CPUs and
+the size of your memory.
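For reference, here is a minimal sketch of posting an SRCU callback and later waiting for all previously posted callbacks; the srcu_struct, structure, and function names are illustrative:

    #include <linux/slab.h>
    #include <linux/srcu.h>

    struct example_data {
            struct rcu_head rh;     /* For call_srcu(). */
            int value;
    };

    DEFINE_SRCU(example_srcu);      /* Statically allocated srcu_struct. */

    static void example_free_cb(struct rcu_head *rhp)
    {
            kfree(container_of(rhp, struct example_data, rh));
    }

    static void example_retire(struct example_data *p)
    {
            /* Invoked after a subsequent SRCU grace period. */
            call_srcu(&example_srcu, &p->rh, example_free_cb);
    }

    static void example_teardown(void)
    {
            /* Waits only for SRCU callbacks already posted. */
            srcu_barrier(&example_srcu);
    }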
The SRCU API
@@ -3021,8 +3106,8 @@
RCU disables CPU hotplug in a few places, perhaps most notably in the
-expedited grace-period and rcu_barrier() operations.
-If there is a strong reason to use expedited grace periods in CPU-hotplug
+rcu_barrier() operations.
+If there is a strong reason to use rcu_barrier() in CPU-hotplug
notifiers, it will be necessary to avoid disabling CPU hotplug.
This would introduce some complexity, so there had better be a
very good reason.
@@ -3096,9 +3181,5 @@