The rcu_request_urgent_qs_task() function is used only within RCU,
so there is no point in exporting it to the rest of the kernel from
nclude/linux/rcutiny.h and include/linux/rcutree.h. This commit therefore
moves this function to kernel/rcu/rcu.h.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The various functions similar to rcu_batches_started(), the
function show_rcu_gp_kthreads(), the various functions similar to
rcu_force_quiescent_state(), and the variables rcutorture_testseq and
rcutorture_vernum are used only within RCU. There is therefore no point
in exporting them to the kernel at large from include/linux/rcutiny.h
and include/linux/rcutree.h. This commit therefore moves all of these
to kernel/rcu/rcu.h.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The include/linux/rcupdate.h file contains a number of definitions that
are used only to communicate between rcutorture, rcuperf, and the RCU code
itself. There is no point in having these definitions exposed globally
throughout the kernel, so this commit moves them to kernel/rcu/rcu.h.
This change has the added benefit of shrinking rcupdate.h.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, a call to schedule() acts as a Tasks RCU quiescent state
only if a context switch actually takes place. However, just the
call to schedule() guarantees that the calling task has moved off of
whatever tracing trampoline that it might have been one previously.
This commit therefore plumbs schedule()'s "preempt" parameter into
rcu_note_context_switch(), which then records the Tasks RCU quiescent
state, but only if this call to schedule() was -not- due to a preemption.
To avoid adding overhead to the common-case context-switch path,
this commit hides the rcu_note_context_switch() check under an existing
non-common-case check.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Straight forward conversion to the state machine. Though the question arises
whether this needs really all these state transitions to work.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.982013161@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit provides rcu_exp_batches_completed() and
rcu_exp_batches_completed_sched() functions to allow torture-test modules
to check how many expedited grace period batches have completed.
These are analogous to the existing rcu_batches_completed(),
rcu_batches_completed_bh(), and rcu_batches_completed_sched() functions.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces a local_irq_save()/local_irq_restore() pair with
a lockdep assertion that interrupts are already disabled. This should
remove the corresponding overhead from the interrupt entry/exit fastpaths.
This change was inspired by the fact that Iftekhar Ahmed's mutation
testing showed that removing rcu_irq_enter()'s call to local_ird_restore()
had no effect, which might indicate that interrupts were always enabled
anyway.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal. Especially given that there are regions of the scheduler that
already have interrupts disabled.
This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
by Peter Zijlstra. ]
As we now have rcu_callback_t typedefs as the type of rcu callbacks, we
should use it in call_rcu*() and friends as the type of parameters. This
could save us a few lines of code and make it clear which function
requires an rcu callbacks rather than other callbacks as its argument.
Besides, this can also help cscope to generate a better database for
code reading.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The get_state_synchronize_rcu() and cond_synchronize_rcu() functions
allow polling for grace-period completion, with an actual wait for a
grace period occurring only when cond_synchronize_rcu() is called too
soon after the corresponding get_state_synchronize_rcu(). However,
these functions work only for vanilla RCU. This commit adds the
get_state_synchronize_sched() and cond_synchronize_sched(), which provide
the same capability for RCU-sched.
Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:
- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.
- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter
- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf
- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure
- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...
The Tiny RCU counterparts to rcu_idle_enter(), rcu_idle_exit(),
rcu_irq_enter(), and rcu_irq_exit() are empty functions, but each has
EXPORT_SYMBOL_GPL(), which needlessly consumes extra memory, especially
in kernels built with module support. This commit therefore moves these
functions to static inlines in rcutiny.h, removing the need for exports.
This won't affect the size of the tiniest kernels, which are likely
built without module support, but might help semi-tiny kernels that
might include module support.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit converts several CONFIG_RCU_NOCB_CPU_ALL #ifdefs to
instead use IS_ENABLED(). This change should help avoid hiding
code from compiler diagnostics.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The evaluation of the next timer in the nohz code is based on jiffies
while all the tick internals are nano seconds based. We have also to
convert hrtimer nanoseconds to jiffies in the !highres case. That's
just wrong and introduces interesting corner cases.
Turn it around and convert the next timer wheel timer expiry and the
rcu event to clock monotonic and base all calculations on
nanoseconds. That identifies the case where no timer is pending
clearly with an absolute expiry value of KTIME_MAX.
Makes the code more readable and gets rid of the jiffies magic in the
nohz code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Although cond_resched_rcu_qs() only applies to TASKS_RCU, it is used
in places where it would be useful for it to apply to the normal RCU
flavors, rcu_preempt, rcu_sched, and rcu_bh. This is especially the
case for workloads that aggressively overload the system, particularly
those that generate large numbers of RCU updates on systems running
NO_HZ_FULL CPUs. This commit therefore communicates quiescent states
from cond_resched_rcu_qs() to the normal RCU flavors.
Note that it is unfortunately necessary to leave the old ->passed_quiesce
mechanism in place to allow quiescent states that apply to only one
flavor to be recorded. (Yes, we could decrement ->rcu_qs_ctr_snap in
that case, but that is not so good for debugging of RCU internals.)
In addition, if one of the RCU flavor's grace period has stalled, this
will invoke rcu_momentary_dyntick_idle(), resulting in a heavy-weight
quiescent state visible from other CPUs.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Merge commit from Sasha Levin fixing a bug where __this_cpu()
was used in preemptible code. ]
Currently, rcutorture's Reader Batch checks measure from the end of
the previous grace period to the end of the current one. This commit
tightens up these checks by measuring from the start and end of the same
grace period. This involves adding rcu_batches_started() and friends
corresponding to the existing rcu_batches_completed() and friends.
We leave SRCU alone for the moment, as it does not yet have a way of
tracking both ends of its grace periods.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Long ago, the various ->completed fields were of type long, but now are
unsigned long due to signed-integer-overflow concerns. However, the
various _batches_completed() functions remained of type long, even though
their only purpose in life is to return the corresponding ->completed
field. This patch cleans this up by changing these functions' return
types to unsigned long.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it. This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr(). Again, the anticipated cross-CPU uses
of these functions has been replaced by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
The "cpu" argument to rcu_note_context_switch() is always the current
CPU, so drop it. This in turn allows the "cpu" argument to
rcu_preempt_note_context_switch() to be removed, which allows the sole
use of "cpu" in both functions to be replaced with a this_cpu_ptr().
Again, the anticipated cross-CPU uses of these functions has been
replaced by NO_HZ_FULL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
This commit allows rcutorture to print additional state for the
RCU grace-period kthreads in cases where RCU seems reluctant to
start a new grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The following pattern is currently not well supported by RCU:
1. Make data element inaccessible to RCU readers.
2. Do work that probably lasts for more than one grace period.
3. Do something to make sure RCU readers in flight before #1 above
have completed.
Here are some things that could currently be done:
a. Do a synchronize_rcu() unconditionally at either #1 or #3 above.
This works, but imposes needless work and latency.
b. Post an RCU callback at #1 above that does a wakeup, then
wait for the wakeup at #3. This works well, but likely results
in an extra unneeded grace period. Open-coding this is also
a bit more semi-tricky code than would be good.
This commit therefore adds get_state_synchronize_rcu() and
cond_synchronize_rcu() APIs. Call get_state_synchronize_rcu() at #1
above and pass its return value to cond_synchronize_rcu() at #3 above.
This results in a call to synchronize_rcu() if no grace period has
elapsed between #1 and #3, but requires only a load, comparison, and
memory barrier if a full grace period did elapse.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
If CONFIG_RCU_NOCB_CPU_ALL=y, then rcu_needs_cpu() will always
return false, however, the current version nevertheless checks
for RCU callbacks. This commit therefore creates a static inline
implementation of rcu_needs_cpu() that unconditionally returns false
when CONFIG_RCU_NOCB_CPU_ALL=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
All of the RCU source files have the usual GPL header, which contains a
long-obsolete postal address for FSF. To avoid the need to track the
FSF office's movements, this commit substitutes the URL where GPL may
be found.
Reported-by: Greg KH <gregkh@linuxfoundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Function prototypes don't need to have the "extern" keyword since this
is the default behavior. Its explicit use is redundant. This commit
therefore removes them.
Signed-off-by: Teodora Baluta <teobaluta@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The old rcu_is_cpu_idle() function is just __rcu_is_watching() with
preemption disabled. This commit therefore renames rcu_is_cpu_idle()
to rcu_is_watching.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
There is currently no way for kernel code to determine whether it
is safe to enter an RCU read-side critical section, in other words,
whether or not RCU is paying attention to the currently running CPU.
Given the large and increasing quantity of code shared by the idle loop
and non-idle code, the this shortcoming is becoming increasingly painful.
This commit therefore adds __rcu_is_watching(), which returns true if
it is safe to enter an RCU read-side critical section on the currently
running CPU. This function is quite fast, using only a __this_cpu_read().
However, the caller must disable preemption.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Now that TINY_PREEMPT_RCU is no more, exit_rcu() is always an empty
function. But if TINY_RCU is going to have an empty function, it should
be in include/linux/rcutiny.h, where it does not bloat the kernel.
This commit therefore moves exit_rcu() out of kernel/rcupdate.c to
kernel/rcutree_plugin.h, and places a static inline empty function in
include/linux/rcutiny.h in order to shrink TINY_RCU a bit.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
TINY_PREEMPT_RCU could use a kthread to handle RCU callback invocation,
which required an API to abstract kthread vs. softirq invocation.
Now that TINY_PREEMPT_RCU is no longer with us, this commit retires
this API in favor of direct use of the relevant softirq primitives.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
barrier: Reduce the amount of disturbance by rcu_barrier() to the rest of
the system. This branch also includes improvements to
RCU_FAST_NO_HZ, which are included here due to conflicts.
fixes: Miscellaneous fixes.
inline: Remaining changes from an abortive attempt to inline
preemptible RCU's __rcu_read_lock(). These are (1) making
exit_rcu() avoid unnecessary work and (2) avoiding having
preemptible RCU record a blocked thread when the scheduler
declines to do a context switch.
srcu: Lai Jiangshan's algorithmic implementation of SRCU, including
call_srcu().
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing). The exit() path invokes exit_rcu() to
do this cleanup.
However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space. Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.
While we are at it, consolidate the two identical exit_rcu() functions
into a single function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
kernel/rcupdate.c
The rcu_blocking_is_gp() function tests to see if there is only one
online CPU, and if so, synchronize_sched() and friends become no-ops.
However, for larger systems, num_online_cpus() scans a large vector,
and might be preempted while doing so. While preempted, any number
of CPUs might come online and go offline, potentially resulting in
num_online_cpus() returning 1 when there never had only been one
CPU online. This could result in a too-short RCU grace period, which
could in turn result in total failure, except that the only way that
the grace period is too short is if there is an RCU read-side critical
section spanning it. For RCU-sched and RCU-bh (which are the only
cases using rcu_blocking_is_gp()), RCU read-side critical sections
have either preemption or bh disabled, which prevents CPUs from going
offline. This in turn prevents actual failures from occurring.
This commit therefore adds a large block comment to rcu_blocking_is_gp()
documenting why it is safe. This commit also moves rcu_blocking_is_gp()
into kernel/rcutree.c, which should help prevent unwary developers from
mistaking it for a generally useful function.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The expedited RCU primitives can be quite useful, but they have some
high costs as well. This commit updates and creates docbook comments
calling out the costs, and updates the RCU documentation as well.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
enter dyntick-idle mode even if it still has RCU callbacks queued.
RCU avoids system hangs in this case by scheduling a timer for several
jiffies in the future. However, if all of the callbacks on that CPU
are from kfree_rcu(), there is no reason to wake the CPU up, as it is
not a problem to defer freeing of memory.
This commit therefore tracks the number of callbacks on a given CPU
that are from kfree_rcu(), and avoids scheduling the timer if all of
a given CPU's callbacks are from kfree_rcu().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although TREE_PREEMPT_RCU indirectly uses might_sleep() to detect illegal
use of synchronize_sched() and synchronize_rcu_bh() from within an RCU
read-side critical section, this might_sleep() check is bypassed when
there is only a single CPU (for example, when running an SMP kernel on
a single-CPU system). This patch therefore adds a might_sleep() call
to the rcu_blocking_is_gp() check that is unconditionally invoked from
both synchronize_sched() and synchronize_rcu_bh().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Pull the code that waits for an RCU grace period into a single function,
which is then called by synchronize_rcu() and friends in the case of
TREE_RCU and TREE_PREEMPT_RCU, and from rcu_barrier() and friends in
the case of TINY_RCU and TINY_PREEMPT_RCU.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Provide rcu_virt_note_context_switch() for vitalization use to note
quiescent state during guest entry.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is not possible to accurately correlate rcutorture output with that
of debugfs. This patch therefore adds a debugfs file that prints out
the rcutorture version number, permitting easy correlation.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The first version of synchronize_sched_expedited() used the migration
code in the scheduler, and was therefore implemented in kernel/sched.c.
However, the more recent version of this code no longer uses the
migration code, so this commit moves it to the main RCU source files.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If RCU priority boosting is to be meaningful, callback invocation must
be boosted in addition to preempted RCU readers. Otherwise, in presence
of CPU real-time threads, the grace period ends, but the callbacks don't
get invoked. If the callbacks don't get invoked, the associated memory
doesn't get freed, so the system is still subject to OOM.
But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit
moves the callback invocations to a kthread, which can be boosted easily.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The CONFIG_PREEMPT_RCU kernel configuration parameter was recently
re-introduced, but as an indication of the type of RCU (preemptible
vs. non-preemptible) instead of as selecting a given implementation.
This commit uses CONFIG_PREEMPT_RCU to combine duplicate code
from include/linux/rcutiny.h and include/linux/rcutree.h into
include/linux/rcupdate.h. This commit also combines a few other pieces
of duplicate code that have accumulated.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Combine the duplicate definitions of ULONG_CMP_GE(), ULONG_CMP_LT(),
and rcu_preempt_depth() into include/linux/rcupdate.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When using a kernel debugger, a long sojourn in the debugger can get
you lots of RCU CPU stall warnings once you resume. This might not be
helpful, especially if you are using the system console. This patch
therefore allows RCU CPU stall warnings to be suppressed, but only for
the duration of the current set of grace periods.
This differs from Jason's original patch in that it adds support for
tiny RCU and preemptible RCU, and uses a slightly different method for
suppressing the RCU CPU stall warning messages.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Jason Wessel <jason.wessel@windriver.com>
Implement a small-memory-footprint uniprocessor-only implementation of
preemptible RCU. This implementation uses but a single blocked-tasks
list rather than the combinatorial number used per leaf rcu_node by
TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
processing. This version also takes advantage of uniprocessor execution
to accelerate grace periods in the case where there are no readers.
The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.
This implementation is a step towards having RCU implementation driven
off of the SMP and PREEMPT kernel configuration variables, which can
happen once this implementation has accumulated sufficient experience.
Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
suggested by Steve Rostedt in order to avoid the compiler-reordering
issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).
As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte
savings compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time
workloads, CONFIG_TINY_RCU is even better.
CONFIG_TREE_PREEMPT_RCU
text data bss dec filename
13 0 0 13 kernel/rcupdate.o
6170 825 28 7023 kernel/rcutree.o
----
7026 Total
CONFIG_TINY_PREEMPT_RCU
text data bss dec filename
13 0 0 13 kernel/rcupdate.o
2081 81 8 2170 kernel/rcutiny.o
----
2183 Total
CONFIG_TINY_RCU (non-preemptible)
text data bss dec filename
13 0 0 13 kernel/rcupdate.o
719 25 0 744 kernel/rcutiny.o
---
757 Total
Requested-by: Loïc Minier <loic.minier@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
sched, wait: Use wrapper functions
sched: Remove a stale comment
ondemand: Make the iowait-is-busy time a sysfs tunable
ondemand: Solve a big performance issue by counting IOWAIT time as busy
sched: Intoduce get_cpu_iowait_time_us()
sched: Eliminate the ts->idle_lastupdate field
sched: Fold updating of the last_update_time_info into update_ts_time_stats()
sched: Update the idle statistics in get_cpu_idle_time_us()
sched: Introduce a function to update the idle statistics
sched: Add a comment to get_cpu_idle_time_us()
cpu_stop: add dummy implementation for UP
sched: Remove rq argument to the tracepoints
rcu: need barrier() in UP synchronize_sched_expedited()
sched: correctly place paranioa memory barriers in synchronize_sched_expedited()
sched: kill paranoia check in synchronize_sched_expedited()
sched: replace migration_thread with cpu_stop
stop_machine: reimplement using cpu_stop
cpu_stop: implement stop_cpu[s]()
sched: Fix select_idle_sibling() logic in select_task_rq_fair()
...
TINY_RCU does not need rcu_scheduler_active unless CONFIG_DEBUG_LOCK_ALLOC.
So conditionally compile rcu_scheduler_active in order to slim down
rcutiny a bit more. Also gets rid of an EXPORT_SYMBOL_GPL, which is
responsible for most of the slimming.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The addition of preemptible RCU to treercu resulted in a bit of
confusion and inefficiency surrounding the handling of context switches
for RCU-sched and for RCU-preempt. For RCU-sched, a context switch
is a quiescent state, pure and simple, just like it always has been.
For RCU-preempt, a context switch is in no way a quiescent state, but
special handling is required when a task blocks in an RCU read-side
critical section.
However, the callout from the scheduler and the outer loop in ksoftirqd
still calls something named rcu_sched_qs(), whose name is no longer
accurate. Furthermore, when rcu_check_callbacks() notes an RCU-sched
quiescent state, it ends up unnecessarily (though harmlessly, aside
from the performance hit) enqueuing the current task if it happens to
be running in an RCU-preempt read-side critical section. This not only
increases the maximum latency of scheduler_tick(), it also needlessly
increases the overhead of the next outermost rcu_read_unlock() invocation.
This patch addresses this situation by separating the notion of RCU's
context-switch handling from that of RCU-sched's quiescent states.
The context-switch handling is covered by rcu_note_context_switch() in
general and by rcu_preempt_note_context_switch() for preemptible RCU.
This permits rcu_sched_qs() to handle quiescent states and only quiescent
states. It also reduces the maximum latency of scheduler_tick(), though
probably by much less than a microsecond. Finally, it means that tasks
within preemptible-RCU read-side critical sections avoid incurring the
overhead of queuing unless there really is a context switch.
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Because synchronize_rcu_bh() is identical to synchronize_sched(),
make the former a static inline invoking the latter, saving the
overhead of an EXPORT_SYMBOL_GPL() and the duplicate code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched. All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.
This patch kills migration_thread and replaces all three uses with
cpu_stop. The three different roles of migration_thread() are
splitted into three separate cpu_stop callbacks -
migration_cpu_stop(), active_load_balance_cpu_stop() and
synchronize_sched_expedited_cpu_stop() - and each use case now simply
asks cpu_stop to execute the callback as necessary.
synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpu_stop.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.
synchronize_sched_expedited() also implemented a check to detect cases
where not all the callback got executed on their assigned cpus and
fall back to synchronize_sched(). If called with cpu hotplug blocked,
cpu_stop already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break. However, this patch preserves
the paranoid check using a cpumask to record on which cpus the stopper
ran so that it can serve as a bisection point if something actually
goes wrong theree.
Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.
This patch also renames cpu_stop threads to from "stopper/%d" to
"migration/%d". The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.
With this patch applied, stop_machine() and sched now share the same
resources. stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
In practice, it is harmless to voluntarily sleep in a
rcu_read_lock() section if we are running under preempt rcu, but
it is illegal if we build a kernel running non-preemptable rcu.
Currently, might_sleep() doesn't notice sleepable operations
under rcu_read_lock() sections if we are running under
preemptable rcu because preempt_count() is left untouched after
rcu_read_lock() in this case. But we want developers who test
their changes under such config to notice the "sleeping while
atomic" issues.
So we add rcu_read_lock_nesting to prempt_count() in
might_sleep() checks.
[ v2: Handle rcu-tiny ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These issues identified during an old-fashioned face-to-face code
review extending over many hours.
o Add comments for tricky parts of code, and correct comments
that have passed their sell-by date.
o Get rid of the vestiges of rcu_init_sched(), which is no
longer needed now that PREEMPT_RCU is gone.
o Move the #include of rcutree_plugin.h to the end of
rcutree.c, which means that, rather than having a random
collection of forward declarations, the new set of forward
declarations document the set of plugins. The new home for
this #include also allows __rcu_init_preempt() to move into
rcutree_plugin.h.
o Fix rcu_preempt_check_callbacks() to be static.
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12537246443924-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra <peterz@infradead.org>
These issues identified during an old-fashioned face-to-face code
review extended over many hours.
o Bury various forms of the "rsp->completed == rsp->gpnum"
comparison into an rcu_gp_in_progress() function, which has
the beneficial side-effect of forcing consistent use of
ACCESS_ONCE().
o Replace hand-coded arithmetic with DIV_ROUND_UP().
o Bury several "!list_empty(&rnp->blocked_tasks[rnp->gpnum & 0x01])"
instances into an rcu_preempted_readers() function, as this
expression indicates that there are no readers blocked
within RCU read-side critical sections blocking the current
grace period. (Though there might well be similar readers
blocking the next grace period.)
o Remove a dangling rcu_restart_cpu() declaration that has
been dangling for almost 20 minor releases of the kernel.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12537246442687-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Create a kernel/rcutree_plugin.h file that contains definitions
for preemptable RCU (or, under the #else branch of the #ifdef,
empty definitions for the classic non-preemptable semantics).
These definitions fit into plugins defined in kernel/rcutree.c
for this purpose.
This variant of preemptable RCU uses a new algorithm whose
read-side expense is roughly that of classic hierarchical RCU
under CONFIG_PREEMPT. This new algorithm's update-side expense
is similar to that of classic hierarchical RCU, and, in absence
of read-side preemption or blocking, is exactly that of classic
hierarchical RCU. Perhaps more important, this new algorithm
has a much simpler implementation, saving well over 1,000 lines
of code compared to mainline's implementation of preemptable
RCU, which will hopefully be retired in favor of this new
algorithm.
The simplifications are obtained by maintaining per-task
nesting state for running tasks, and using a simple
lock-protected algorithm to handle accounting when tasks block
within RCU read-side critical sections, making use of lessons
learned while creating numerous user-level RCU implementations
over the past 18 months.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: akpm@linux-foundation.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: josht@linux.vnet.ibm.com
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
LKML-Reference: <12509746134003-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This adds the synchronize_sched_expedited() primitive that
implements the "big hammer" expedited RCU grace periods.
This primitive is placed in kernel/sched.c rather than
kernel/rcupdate.c due to its need to interact closely with the
migration_thread() kthread.
The idea is to wake up this kthread with req->task set to NULL,
in response to which the kthread reports the quiescent state
resulting from the kthread having been scheduled.
Because this patch needs to fallback to the slow versions of
the primitives in response to some races with CPU onlining and
offlining, a new synchronize_rcu_bh() primitive is added as
well.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Cc: davem@davemloft.net
Cc: dada1@cosmosbay.com
Cc: zbr@ioremap.net
Cc: jeff.chua.linux@gmail.com
Cc: paulus@samba.org
Cc: laijs@cn.fujitsu.com
Cc: jengelh@medozas.de
Cc: r000n@r000n.net
Cc: benh@kernel.crashing.org
Cc: mathieu.desnoyers@polymtl.ca
LKML-Reference: <12459460982947-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes a hierarchical-RCU performance bug located by Anton
Blanchard. The problem stems from a misguided attempt to provide a
work-around for jiffies-counter failure. This work-around uses a per-CPU
n_rcu_pending counter, which is incremented on each call to rcu_pending(),
which in turn is called from each scheduling-clock interrupt. Each CPU
then treats this counter as a surrogate for the jiffies counter, so
that if the jiffies counter fails to advance, the per-CPU n_rcu_pending
counter will cause RCU to invoke force_quiescent_state(), which in turn
will (among other things) send resched IPIs to CPUs that have thus far
failed to pass through an RCU quiescent state.
Unfortunately, each CPU resets only its own counter after sending a
batch of IPIs. This means that the other CPUs will also (needlessly)
send -another- round of IPIs, for a full N-squared set of IPIs in the
worst case every three scheduler-clock ticks until the grace period
finally ends. It is not reasonable for a given CPU to reset each and
every n_rcu_pending for all the other CPUs, so this patch instead simply
disables the jiffies-counter "training wheels", thus eliminating the
excessive IPIs.
Note that the jiffies-counter IPIs do not have this problem due to
the fact that the jiffies counter is global, so that the CPU sending
the IPIs can easily reset things, thus preventing the other CPUs from
sending redundant IPIs.
Note also that the n_rcu_pending counter remains, as it will continue to
be used for tracing. It may also see use to update the jiffies counter,
should an appropriate kick-the-jiffies-counter API appear.
Located-by: Anton Blanchard <anton@au1.ibm.com>
Tested-by: Anton Blanchard <anton@au1.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: anton@samba.org
Cc: akpm@linux-foundation.org
Cc: dipankar@in.ibm.com
Cc: manfred@colorfullife.com
Cc: cl@linux-foundation.org
Cc: josht@linux.vnet.ibm.com
Cc: schamp@sgi.com
Cc: niv@us.ibm.com
Cc: dvhltc@us.ibm.com
Cc: ego@in.ibm.com
Cc: laijs@cn.fujitsu.com
Cc: rostedt@goodmis.org
Cc: peterz@infradead.org
Cc: penberg@cs.helsinki.fi
Cc: andi@firstfloor.org
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <12396834793575-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
linux/percpu.h includes linux/slab.h, which generates circular inclusion
dependencies when trying to switch kmemtrace to use tracepoints instead
of markers.
This patch allows tracing within slab headers' inline functions.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix for all non-x86 architectures
We want to remove percpu.h from rcuclassic.h/rcutree.h (for upcoming
kmemtrace changes) but that would break the DECLARE_PER_CPU based
declarations in these files.
Move the quiescent counter management functions to their respective
RCU implementation .c files - they were slightly above the inlining
limit anyway.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: paulmck@linux.vnet.ibm.com
LKML-Reference: <1237898630.25315.83.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton. And cleans up the variable-name
and function-name language. ;-)
The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state. This in turn causes
RCU to prematurely end grace periods during this time.
This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition. RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().
Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1. This allows boot-time code that
calls synchronize_rcu() to proceed normally. Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence. Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preeempted.
In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.
Changes since v4:
o Changes the name of the introduced function and variable to
be less emotional. ;-)
Changes since v3:
o WARN_ON(nr_context_switches() > 0) to verify that RCU
switches out of boot-time mode before the first context
switch, as suggested by Nick Piggin.
Changes since v2:
o Created rcu_blocking_is_gp() internal-to-RCU API that
determines whether a call to synchronize_rcu() is itself
a grace period.
o The definition of rcu_blocking_is_gp() for rcuclassic and
rcutree checks to see if but a single CPU is online.
o The definition of rcu_blocking_is_gp() for rcupreempt
checks to see both if but a single CPU is online and if
the system is still in early boot.
This allows rcupreempt to again work correctly if running
on a single CPU after booting is complete.
o Added check to rcupreempt's synchronize_sched() for there
being but one online CPU.
Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes a long-standing performance bug in classic RCU that
results in massive internal-to-RCU lock contention on systems with
more than a few hundred CPUs. Although this patch creates a separate
flavor of RCU for ease of review and patch maintenance, it is intended
to replace classic RCU.
This patch still handles stress better than does mainline, so I am still
calling it ready for inclusion. This patch is against the -tip tree.
Nevertheless, experience on an actual 1000+ CPU machine would still be
most welcome.
Most of the changes noted below were found while creating an rcutiny
(which should permit ejecting the current rcuclassic) and while doing
detailed line-by-line documentation.
Updates from v9 (http://lkml.org/lkml/2008/12/2/334):
o Fixes from remainder of line-by-line code walkthrough,
including comment spelling, initialization, undesirable
narrowing due to type conversion, removing redundant memory
barriers, removing redundant local-variable initialization,
and removing redundant local variables.
I do not believe that any of these fixes address the CPU-hotplug
issues that Andi Kleen was seeing, but please do give it a whirl
in case the machine is smarter than I am.
A writeup from the walkthrough may be found at the following
URL, in case you are suffering from terminal insomnia or
masochism:
http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf
o Made rcutree tracing use seq_file, as suggested some time
ago by Lai Jiangshan.
o Added a .csv variant of the rcudata debugfs trace file, to allow
people having thousands of CPUs to drop the data into
a spreadsheet. Tested with oocalc and gnumeric. Updated
documentation to suit.
Updates from v8 (http://lkml.org/lkml/2008/11/15/139):
o Fix a theoretical race between grace-period initialization and
force_quiescent_state() that could occur if more than three
jiffies were required to carry out the grace-period
initialization. Which it might, if you had enough CPUs.
o Apply Ingo's printk-standardization patch.
o Substitute local variables for repeated accesses to global
variables.
o Fix comment misspellings and redundant (but harmless) increments
of ->n_rcu_pending (this latter after having explicitly added it).
o Apply checkpatch fixes.
Updates from v7 (http://lkml.org/lkml/2008/10/10/291):
o Fixed a number of problems noted by Gautham Shenoy, including
the cpu-stall-detection bug that he was having difficulty
convincing me was real. ;-)
o Changed cpu-stall detection to wait for ten seconds rather than
three in order to reduce false positive, as suggested by Ingo
Molnar.
o Produced a design document (http://lwn.net/Articles/305782/).
The act of writing this document uncovered a number of both
theoretical and "here and now" bugs as noted below.
o Fix dynticks_nesting accounting confusion, simplify WARN_ON()
condition, fix kerneldoc comments, and add memory barriers
in dynticks interface functions.
o Add more data to tracing.
o Remove unused "rcu_barrier" field from rcu_data structure.
o Count calls to rcu_pending() from scheduling-clock interrupt
to use as a surrogate timebase should jiffies stop counting.
o Fix a theoretical race between force_quiescent_state() and
grace-period initialization. Yes, initialization does have to
go on for some jiffies for this race to occur, but given enough
CPUs...
Updates from v6 (http://lkml.org/lkml/2008/9/23/448):
o Fix a number of checkpatch.pl complaints.
o Apply review comments from Ingo Molnar and Lai Jiangshan
on the stall-detection code.
o Fix several bugs in !CONFIG_SMP builds.
o Fix a misspelled config-parameter name so that RCU now announces
at boot time if stall detection is configured.
o Run tests on numerous combinations of configurations parameters,
which after the fixes above, now build and run correctly.
Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):
o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
changeset some time ago, and finally got around to retesting
this option).
o Fix some tracing bugs in rcupreempt that caused incorrect
totals to be printed.
o I now test with a more brutal random-selection online/offline
script (attached). Probably more brutal than it needs to be
on the people reading it as well, but so it goes.
o A number of optimizations and usability improvements:
o Make rcu_pending() ignore the grace-period timeout when
there is no grace period in progress.
o Make force_quiescent_state() avoid going for a global
lock in the case where there is no grace period in
progress.
o Rearrange struct fields to improve struct layout.
o Make call_rcu() initiate a grace period if RCU was
idle, rather than waiting for the next scheduling
clock interrupt.
o Invoke rcu_irq_enter() and rcu_irq_exit() only when
idle, as suggested by Andi Kleen. I still don't
completely trust this change, and might back it out.
o Make CONFIG_RCU_TRACE be the single config variable
manipulated for all forms of RCU, instead of the prior
confusion.
o Document tracing files and formats for both rcupreempt
and rcutree.
Updates from v4 for those missing v5 given its bad subject line:
o Separated dynticks interface so that NMIs and irqs call separate
functions, greatly simplifying it. In particular, this code
no longer requires a proof of correctness. ;-)
o Separated dynticks state out into its own per-CPU structure,
avoiding the duplicated accounting.
o The case where a dynticks-idle CPU runs an irq handler that
invokes call_rcu() is now correctly handled, forcing that CPU
out of dynticks-idle mode.
o Review comments have been applied (thank you all!!!).
For but one example, fixed the dynticks-ordering issue that
Manfred pointed out, saving me much debugging. ;-)
o Adjusted rcuclassic and rcupreempt to handle dynticks changes.
Attached is an updated patch to Classic RCU that applies a hierarchy,
greatly reducing the contention on the top-level lock for large machines.
This passes 10-hour concurrent rcutorture and online-offline testing on
128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
bugs in presence of dynticks (exciting working on a system where
"sleep 1" hangs until interrupted...), which were fixed in the
2.6.27 kernel. It is getting more reliable than mainline by some
measures, so the next version will be against -tip for inclusion.
See also Manfred Spraul's recent patches (or his earlier work from
2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
We will converge onto a common patch in the fullness of time, but are
currently exploring different regions of the design space. That said,
I have already gratefully stolen quite a few of Manfred's ideas.
This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on
64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
there is no hierarchy. By default, the RCU initialization code will
adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
this balancing, allowing the hierarchy to be exactly aligned to the
underlying hardware. Up to two levels of hierarchy are permitted
(in addition to the root node), allowing up to 16,384 CPUs on 32-bit
systems and up to 262,144 CPUs on 64-bit systems. I just know that I
am going to regret saying this, but this seems more than sufficient
for the foreseeable future. (Some architectures might wish to set
CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
If this becomes a real problem, additional levels can be added, but I
doubt that it will make a significant difference on real hardware.)
In the common case, a given CPU will manipulate its private rcu_data
structure and the rcu_node structure that it shares with its immediate
neighbors. This can reduce both lock and memory contention by multiple
orders of magnitude, which should eliminate the need for the strange
manipulations that are reported to be required when running Linux on
very large systems.
Some shortcomings:
o More bugs will probably surface as a result of an ongoing
line-by-line code inspection.
Patches will be provided as required.
o There are probably hangs, rcutorture failures, &c. Seems
quite stable on a 128-CPU machine, but that is kind of small
compared to 4096 CPUs. However, seems to do better than
mainline.
Patches will be provided as required.
o The memory footprint of this version is several KB larger
than rcuclassic.
A separate UP-only rcutiny patch will be provided, which will
reduce the memory footprint significantly, even compared
to the old rcuclassic. One such patch passes light testing,
and has a memory footprint smaller even than rcuclassic.
Initial reaction from various embedded guys was "it is not
worth it", so am putting it aside.
Credits:
o Manfred Spraul for ideas, review comments, and bugs spotted,
as well as some good friendly competition. ;-)
o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
for reviews and comments.
o Thomas Gleixner for much-needed help with some timer issues
(see patches below).
o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
alive despite my heavy abuse^Wtesting.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>