worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.
Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().
This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assinging work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.
The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueue execution locality improvement patchset is about to
applied which will cause merge conflicts with changes in for-6.5-fixes.
Let's avoid future merge conflict by pulling in for-6.5-fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Use LIST_HEAD() to initialize cull_list instead of open-coding it.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.
The default threshold is 10ms which is long enough to do fair bit of work on
modern CPUs while short enough to be usually not noticeable. This
unfortunately leads to a lot of, arguable spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
apart by multiple levels of magnitude doesn't make whole lot of sense.
This patch scales up wq_cpu_intensive_thresh_us upto 1 second when BogoMIPS
is below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there are sufficient number of
fast machines reporting, we don't lose much.
Some (or is it all?) ARM CPUs systemtically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Motivation of doing this is to better improve boot times for devices when
we want to prevent our workqueue works from running on some specific CPUs,
e,g, some CPUs are busy with interrupts.
Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Based on commit c4f135d643 ("workqueue: Wrap flush_workqueue() using
a macro"), all in-tree users stopped flushing system-wide workqueues.
Therefore, start emitting runtime message so that all out-of-tree users
will understand that they need to update their code.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
* Concurrency-managed per-cpu work items that hog CPUs and delay the
execution of other work items are now automatically detected and excluded
from concurrency management. Reporting on such work items can also be
enabled through a config option.
* Added tools/workqueue/wq_monitor.py which improves visibility into
workqueue usages and behaviors.
* Includes Arnd's minimal fix for gcc-13 enum warning on 32bit compiles.
This conflicts with afa4bb778e ("workqueue: clean up WORK_* constant
types, clarify masking") in master. Can be resolved by picking the master
version.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJoGvw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGZu0AP9IGK2opAzO9i3i1/Ys81b3sHi9PwrYWH3g252T
Oe3O6QD/Wh0wYBVl0o7IdW6BGdd5iNwIEs420G53UmmPrATqsgQ=
=TffY
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
- Concurrency-managed per-cpu work items that hog CPUs and delay the
execution of other work items are now automatically detected and
excluded from concurrency management. Reporting on such work items
can also be enabled through a config option.
- Added tools/workqueue/wq_monitor.py which improves visibility into
workqueue usages and behaviors.
- Arnd's minimal fix for gcc-13 enum warning on 32bit compiles,
superseded by commit afa4bb778e in mainline.
* tag 'wq-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us is 0
workqueue: Fix WARN_ON_ONCE() triggers in worker_enter_idle()
workqueue: fix enum type for gcc-13
workqueue: Track and monitor per-workqueue CPU time usage
workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism
workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE
workqueue: Improve locking rule description for worker fields
workqueue: Move worker_set/clr_flags() upwards
workqueue: Re-order struct worker fields
workqueue: Add pwq->stats[] and a monitoring script
Further upgrade queue_work_on() comment
Dave Airlie reports that gcc-13.1.1 has started complaining about some
of the workqueue code in 32-bit arm builds:
kernel/workqueue.c: In function ‘get_work_pwq’:
kernel/workqueue.c:713:24: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
713 | return (void *)(data & WORK_STRUCT_WQ_DATA_MASK);
| ^
[ ... a couple of other cases ... ]
and while it's not immediately clear exactly why gcc started complaining
about it now, I suspect it's some C23-induced enum type handlign fixup in
gcc-13 is the cause.
Whatever the reason for starting to complain, the code and data types
are indeed disgusting enough that the complaint is warranted.
The wq code ends up creating various "helper constants" (like that
WORK_STRUCT_WQ_DATA_MASK) using an enum type, which is all kinds of
confused. The mask needs to be 'unsigned long', not some unspecified
enum type.
To make matters worse, the actual "mask and cast to a pointer" is
repeated a couple of times, and the cast isn't even always done to the
right pointer, but - as the error case above - to a 'void *' with then
the compiler finishing the job.
That's now how we roll in the kernel.
So create the masks using the proper types rather than some ambiguous
enumeration, and use a nice helper that actually does the type
conversion in one well-defined place.
Incidentally, this magically makes clang generate better code. That,
admittedly, is really just a sign of clang having been seriously
confused before, and cleaning up the typing unconfuses the compiler too.
Reported-by: Dave Airlie <airlied@gmail.com>
Link: https://lore.kernel.org/lkml/CAPM=9twNnV4zMCvrPkw3H-ajZOH-01JVh_kDrxdPYQErz8ZTdA@mail.gmail.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If workqueue.cpu_intensive_thresh_us is set to 0, the detection mechanism
for CPU-hogging per-cpu work item will keep triggering spuriously:
workqueue: process_srcu hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: gc_worker hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: wait_rcu_exp_gp hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
workqueue: kfree_rcu_monitor hogged CPU for >0us 8 times, consider switching to WQ_UNBOUND
workqueue: reg_todo hogged CPU for >0us 4 times, consider switching to WQ_UNBOUND
This commit therefore disables the CPU-hog detection mechanism when
workqueue.cpu_intensive_thresh_us is set to 0.
tj: Patch description updated and the condition check on
cpu_intensive_thresh_us separated into a separate if statement for
readability.
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that wq_worker_tick() is there, we can easily track the rough CPU time
consumption of each workqueue by charging the whole tick whenever a tick
hits an active workqueue. While not super accurate, it provides reasonable
visibility into the workqueues that consume a lot of CPU cycles.
wq_monitor.py is updated to report the per-workqueue CPU times.
v2: wq_monitor.py was using "cputime" as the key when outputting in json
format. Use "cpu_time" instead for consistency with other fields.
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue now automatically marks per-cpu work items that hog CPU for too
long as CPU_INTENSIVE, which excludes them from concurrency management and
prevents stalling other concurrency-managed work items. If a work function
keeps running over the thershold, it likely needs to be switched to use an
unbound workqueue.
This patch adds a debug mechanism which tracks the work functions which
trigger the automatic CPU_INTENSIVE mechanism and report them using
pr_warn() with exponential backoff.
v3: Documentation update.
v2: Drop bouncing to kthread_worker for printing messages. It was to avoid
introducing circular locking dependency through printk but not effective
as it still had pool lock -> wci_lock -> printk -> pool lock loop. Let's
just print directly using printk_deferred().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
If a per-cpu work item hogs the CPU, it can prevent other work items from
starting through concurrency management. A per-cpu workqueue which intends
to host such CPU-hogging work items can choose to not participate in
concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be
error-prone and difficult to debug when missed.
This patch adds an automatic CPU usage based detection. If a
concurrency-managed work item consumes more CPU time than the threshold
(10ms by default) continuously without intervening sleeps, wq_worker_tick()
which is called from scheduler_tick() will detect the condition and
automatically mark it CPU_INTENSIVE.
The mechanism isn't foolproof:
* Detection depends on tick hitting the work item. Getting preempted at the
right timings may allow a violating work item to evade detection at least
temporarily.
* nohz_full CPUs may not be running ticks and thus can fail detection.
* Even when detection is working, the 10ms detection delays can add up if
many CPU-hogging work items are queued at the same time.
However, in vast majority of cases, this should be able to detect violations
reliably and provide reasonable protection with a small increase in code
complexity.
If some work items trigger this condition repeatedly, the bigger problem
likely is the CPU being saturated with such per-cpu work items and the
solution would be making them UNBOUND. The next patch will add a debug
mechanism to help spot such cases.
v4: Documentation for workqueue.cpu_intensive_thresh_us added to
kernel-parameters.txt.
v3: Switch to use wq_worker_tick() instead of hooking into preemptions as
suggested by Peter.
v2: Lai pointed out that wq_worker_stopping() also needs to be called from
preemption and rtlock paths and an earlier patch was updated
accordingly. This patch adds a comment describing the risk of infinte
recursions and how they're avoided.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
* Some worker fields are modified only by the worker itself while holding
pool->lock thus making them safe to read from self, IRQ context if the CPU
is running the worker or while holding pool->lock. Add 'K' locking rule
for them.
* worker->sleeping is currently marked "None" which isn't very descriptive.
It's used only by the worker itself. Add 'S' locking rule for it.
A future patch will depend on the 'K' rule to access worker->current_* from
the scheduler ticks.
Signed-off-by: Tejun Heo <tj@kernel.org>
They are going to be used in wq_worker_stopping(). Move them upwards.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Currently, the only way to peer into workqueue operations is through
tracing. While possible, it isn't easy or convenient to monitor
per-workqueue behaviors over time this way. Let's add pwq->stats[] that
track relevant events and a drgn monitoring script -
tools/workqueue/wq_monitor.py.
It's arguable whether this needs to be configurable. However, it currently
only has several counters and the runtime overhead shouldn't be noticeable
given that they're on pwq's which are per-cpu on per-cpu workqueues and
per-numa-node on unbound ones. Let's keep it simple for the time being.
v2: Patch reordered to earlier with fewer fields. Field will be added back
gradually. Help message improved.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, and further says that the
penalty for failing to nail down the specified CPU is that the workqueue
handler might find itself executing on some other CPU. This is true
as far as it goes, but fails to note what happens if the specified CPU
never was online. Therefore, further expand this comment to say that
specifying a CPU that was never online will result in a splat.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Mostly changes from Petr to improve warning and error reporting. Workqueue
now reports more of the relevant failures with better context which should
help debugging.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZErbOg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfipAP9BXrUBI99uZL7XjSEe6rLWa0STODyWh67ikR+0
nUO22QEA+ttt5ecLZyopYB18Le06VBBW7J5cs0Cg83bCBTGb4wE=
=gHbc
-----END PGP SIGNATURE-----
Merge tag 'wq-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"Mostly changes from Petr to improve warning and error reporting.
Workqueue now reports more of the relevant failures with better
context which should help debugging"
* tag 'wq-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Introduce show_freezable_workqueues
workqueue: Print backtraces from CPUs with hung CPU bound workqueues
workqueue: Warn when a rescuer could not be created
workqueue: Interrupted create_worker() is not a repeated event
workqueue: Warn when a new worker could not be created
workqueue: Fix hung time report of worker pools
workqueue: Simplify a pr_warn() call in wq_select_unbound_cpu()
MAINTAINERS: Add workqueue_internal.h to the WORKQUEUE entry
Currently show_all_workqueue is called if freeze fails at the time of
freeze the workqueues, which shows the status of all workqueues and of
all worker pools. In this cases we may only need to dump state of only
workqueues that are freezable and busy.
This patch defines show_freezable_workqueues, which uses
show_one_workqueue, a granular function that shows the state of individual
workqueues, so that dump only the state of freezable workqueues
at that time.
tj: Minor message adjustment.
Signed-off-by: Jungseung Lee <js07.lee@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.
Worker pools for unbound workqueues always wake up an idle worker and
try to process the work immediately. The last idle worker has to create
new worker first. The stall might happen only when a new worker could
not be created in which case an error should get printed. Another problem
might be too high load. In this case, workers are victims of a global
system problem.
Worker pools for CPU bound workqueues are designed for lightweight
work items that do not need much CPU time. They are proceed one by
one on a single worker. New worker is used only when a work is sleeping.
It creates one additional scenario. The stall might happen when
the CPU-bound workqueue is used for CPU-intensive work.
More precisely, the stall is detected when a CPU-bound worker is in
the TASK_RUNNING state for too long. In this case, it might be useful
to see the backtrace from the problematic worker.
The information how long a worker is in the running state is not available.
But the CPU-bound worker pools do not have many workers in the running
state by definition. And only few pools are typically blocked.
It should be acceptable to print backtraces from all workers in
TASK_RUNNING state in the stalled worker pools. The number of false
positives should be very low.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
It typically happens during the system boot.
systemd switches the root filesystem from initrd to the booted system
during boot. It kills processes that block the switch for too long.
One of the process might be modprobe that tries to create a workqueue.
These problems are hard to reproduce. Also alloc_workqueue() does not
pass the error code. Make the debugging easier by printing an error,
similar to create_worker().
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kthread_create_on_node() might get interrupted(). It is rare but realistic.
For example, when an unbound workqueue is allocated in module_init()
callback. It is done in the context of the "modprobe" process. And,
for example, systemd might kill pending processes when switching root
from initrd to the booted system.
The interrupt is a one-off event and the race might be hard to reproduce.
It is always worth printing.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.
The progress is guaranteed by using idle workers or creating new workers
for pending work items.
There are several reasons why a new worker could not be created:
+ there is not enough memory
+ there is no free pool ID (IDR API)
+ the system reached PID limit
+ the process creating the new worker was interrupted
+ the last idle worker (manager) has not been scheduled for a long
time. It was not able to even start creating the kthread.
None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager. It is the last
idle worker that is responsible for creating a new one. But it is not
clear if create_worker() is failing and why.
Make the debugging easier by printing errors in create_worker().
The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons. For example, reaching
memory limit (-ENOMEM), other system limits (-EAGAIN), or process
interrupted (-EINTR).
Use pr_once() to avoid repeating the same error every CREATE_COOLDOWN
for each stuck worker pool.
Ratelimited printk() might be better. It would help to know if the problem
remains. It would be more clear if the create_worker() errors and workqueue
stalls are related. Also old messages might get lost when the internal log
buffer is full. The problem is that printk() might touch the watchdog.
For example, see touch_nmi_watchdog() in serial8250_console_write().
It would require synchronization of the begin and length of the ratelimit
interval with the workqueue watchdog. Otherwise, the error messages
might break the watchdog. This does not look worth the complexity.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueue watchdog prints a warning when there is no progress in
a worker pool. Where the progress means that the pool started processing
a pending work item.
Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.
show_one_worker_pool() prints state of non-idle worker pool. It shows
a delay since the last pool->watchdog_ts.
The timestamp is updated when a first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().
The delay is misleading when there is no pending work item. In this
case it shows how long the last work item is being proceed. Show
zero instead. There is no stall if there is no pending work.
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use pr_warn_once() to achieve the same thing. It's simpler.
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230313182918.1312597-8-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
!CONFIG_SMP builds complain about rebind_worker() being unused. Its only
user, rebind_workers() is indeed only defined for CONFIG_SMP, so just fold
the two lines back up there.
Link: http://lore.kernel.org/r/20230113143102.2e94d74f@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It has been reported that isolated CPUs can suffer from interference due to
per-CPU kworkers waking up just to die.
A surge of workqueue activity during initial setup of a latency-sensitive
application (refresh_vm_stats() being one of the culprits) can cause extra
per-CPU kworkers to be spawned. Then, said latency-sensitive task can be
running merrily on an isolated CPU only to be interrupted sometime later by
a kworker marked for death (cf. IDLE_WORKER_TIMEOUT, 5 minutes after last
kworker activity).
Prevent this by affining kworkers to the wq_unbound_cpumask (which doesn't
contain isolated CPUs, cf. HK_TYPE_WQ) before waking them up after marking
them with WORKER_DIE.
Changing the affinity does require a sleepable context, leverage the newly
introduced pool->idle_cull_work to get that.
Remove dying workers from pool->workers and keep track of them in a
separate list. This intentionally prevents for_each_loop_worker() from
iterating over workers that are marked for death.
Rename destroy_worker() to set_working_dying() to better reflect its
effects and relationship with wake_dying_workers().
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
put_unbound_pool() currently passes wq_manager_inactive() as exit condition
to rcuwait_wait_event(), which grabs pool->lock to check for
pool->flags & POOL_MANAGER_ACTIVE
A later patch will require destroy_worker() to be invoked with
wq_pool_attach_mutex held, which needs to be acquired before
pool->lock. A mutex cannot be acquired within rcuwait_wait_event(), as
it could clobber the task state set by rcuwait_wait_event()
Instead, restructure the waiting logic to acquire any necessary lock
outside of rcuwait_wait_event().
Since further work cannot be inserted into unbound pwqs that have reached
->refcnt==0, this is bound to make forward progress as eventually the
worklist will be drained and need_more_worker(pool) will remain false,
preventing any worker from stealing the manager position from us.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A later patch will require a sleepable context in the idle worker timeout
function. Converting worker_pool.idle_timer to a delayed_work gives us just
that, however this would imply turning all idle_timer expiries into
scheduler events (waking up a worker to handle the dwork).
Instead, implement a "custom dwork" where the timer callback does some
extra checks before queuing the associated work.
No change in functionality intended.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Later patches will reuse this code, move it into reusable functions.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When unbind_workers() reads wq_unbound_cpumask to set the affinity of
freshly-unbound kworkers, it only holds wq_pool_attach_mutex. This isn't
sufficient as wq_unbound_cpumask is only protected by wq_pool_mutex.
Make wq_unbound_cpumask protected with wq_pool_attach_mutex and also
remove the need of temporary saved_cpumask.
Fixes: 10a5a651e3 ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Reported-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The show_pwq() function dumps out a pool_workqueue structure's activity,
including the pending work-queue handlers:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: test_work_func, test_work_func, test_work_func1, test_work_func1, test_work_func1, test_work_func1, test_work_func1
When large systems are facing certain types of hang conditions, it is not
unusual for this "pending" list to contain runs of hundreds of identical
function names. This "wall of text" is difficult to read, and worse yet,
it can be interleaved with other output such as stack traces.
Therefore, make show_pwq() use run-length encoding so that the above
printout instead looks like this:
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x1 nice=0 active=10/256 refcnt=11
in-flight: 7:test_work_func, 64:test_work_func, 249:test_work_func
pending: 2*test_work_func, 5*test_work_func1
When no comma would be printed, including the WORK_STRUCT_LINKED case,
a new run is started unconditionally.
This output is more readable, places less stress on the hardware,
firmware, and software on the console-log path, and reduces interference
with other output.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently if the user queues a new work item unintentionally
into a wq after the destroy_workqueue(wq), the work still can
be queued and scheduled without any noticeable kernel message
before the end of a RCU grace period.
As a debug-aid facility, this commit adds a new flag
__WQ_DESTROYING to spot that issue by triggering a kernel WARN
message.
Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Earlier commits in this series allow battery-powered systems to build
their kernels with the default-disabled CONFIG_RCU_LAZY=y Kconfig option.
This Kconfig option causes call_rcu() to delay its callbacks in order
to batch them. This means that a given RCU grace period covers more
callbacks, thus reducing the number of grace periods, in turn reducing
the amount of energy consumed, which increases battery lifetime which
can be a very good thing. This is not a subtle effect: In some important
use cases, the battery lifetime is increased by more than 10%.
This CONFIG_RCU_LAZY=y option is available only for CPUs that offload
callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot
parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y.
Delaying callbacks is normally not a problem because most callbacks do
nothing but free memory. If the system is short on memory, a shrinker
will kick all currently queued lazy callbacks out of their laziness,
thus freeing their memory in short order. Similarly, the rcu_barrier()
function, which blocks until all currently queued callbacks are invoked,
will also kick lazy callbacks, thus enabling rcu_barrier() to complete
in a timely manner.
However, there are some cases where laziness is not a good option.
For example, synchronize_rcu() invokes call_rcu(), and blocks until
the newly queued callback is invoked. It would not be a good for
synchronize_rcu() to block for ten seconds, even on an idle system.
Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of
call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a
given CPU kicks any lazy callbacks that might be already queued on that
CPU. After all, if there is going to be a grace period, all callbacks
might as well get full benefit from it.
Yes, this could be done the other way around by creating a
call_rcu_lazy(), but earlier experience with this approach and
feedback at the 2022 Linux Plumbers Conference shifted the approach
to call_rcu() being lazy with call_rcu_hurry() for the few places
where laziness is inappropriate.
And another call_rcu() instance that cannot be lazy is the one
in queue_rcu_work(), given that callers to queue_rcu_work() are
not necessarily OK with long delays.
Therefore, make queue_rcu_work() use call_rcu_hurry() in order to revert
to the old behavior.
[ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ]
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This replaces the prior support for Clang's standard Control Flow
Integrity (CFI) instrumentation, which has required a lot of special
conditions (e.g. LTO) and work-arounds. The current implementation
("Kernel CFI") is specific to C, directly designed for the Linux kernel,
and takes advantage of architectural features like x86's IBT. This
series retains arm64 support and adds x86 support. Additional "generic"
architectural support is expected soon:
https://github.com/samitolvanen/llvm-project/commits/kcfi_generic
- treewide: Remove old CFI support details
- arm64: Replace Clang CFI support with Clang KCFI support
- x86: Introduce Clang KCFI support
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM4aAUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkgWD/4mUgb7xewNIG/+fuipGd620Iao
K0T8q4BNxLNRltOxNc3Q0WMDCggX0qJGCeds7EdFQJQOGxWcbifM8MAS4idAGM0G
fc3Gxl1imC/oF6goCAbQgndA6jYFIWXGsv8LsRjAXRidWLFr3GFAqVqYJyokSySr
8zMQsEDuF4I1gQnOhEWdtPZbV3MQ4ZjfFzpv+33agbq6Gb72vKvDh3G6g2VXlxjt
1qnMtS+eEpbBU65cJkOi4MSLgymWbnIAeTMb0dbsV4kJ08YoTl8uz1B+weeH6GgT
WP73ZJ4nqh1kkkT9EqS9oKozNB9fObhvCokEuAjuQ7i1eCEZsbShvRc0iL7OKTGG
UfuTJa5qQ4h7Z0JS35FCSJETa+fcG0lTyEd133nLXLMZP9K2antf+A6O//fd0J1V
Jg4VN7DQmZ+UNGOzRkL6dTtQUy4PkxhniIloaClfSYXxhNirA+v//sHTnTK3z2Bl
6qceYqmFmns2Laual7+lvnZgt6egMBcmAL/MOdbU74+KIR9Xw76wxQjifktHX+WF
FEUQkUJDB5XcUyKlbvHoqobRMxvEZ8RIlC5DIkgFiPRE3TI0MqfzNSFnQ/6+lFNg
Y0AS9HYJmcj8sVzAJ7ji24WPFCXzsbFn6baJa9usDNbWyQZokYeiv7ZPNPHPDVrv
YEBP6aYko0lVSUS9qw==
=Li4D
-----END PGP SIGNATURE-----
Merge tag 'kcfi-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kcfi updates from Kees Cook:
"This replaces the prior support for Clang's standard Control Flow
Integrity (CFI) instrumentation, which has required a lot of special
conditions (e.g. LTO) and work-arounds.
The new implementation ("Kernel CFI") is specific to C, directly
designed for the Linux kernel, and takes advantage of architectural
features like x86's IBT. This series retains arm64 support and adds
x86 support.
GCC support is expected in the future[1], and additional "generic"
architectural support is expected soon[2].
Summary:
- treewide: Remove old CFI support details
- arm64: Replace Clang CFI support with Clang KCFI support
- x86: Introduce Clang KCFI support"
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107048 [1]
Link: https://github.com/samitolvanen/llvm-project/commits/kcfi_generic [2]
* tag 'kcfi-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
x86: Add support for CONFIG_CFI_CLANG
x86/purgatory: Disable CFI
x86: Add types to indirectly called assembly functions
x86/tools/relocs: Ignore __kcfi_typeid_ relocations
kallsyms: Drop CONFIG_CFI_CLANG workarounds
objtool: Disable CFI warnings
objtool: Preserve special st_shndx indexes in elf_update_symbol
treewide: Drop __cficanonical
treewide: Drop WARN_ON_FUNCTION_MISMATCH
treewide: Drop function_nocfi
init: Drop __nocfi from __init
arm64: Drop unneeded __nocfi attributes
arm64: Add CFI error handling
arm64: Add types to indirect called assembly functions
psci: Fix the function type for psci_initcall_t
lkdtm: Emit an indirect call for CFI tests
cfi: Add type helper macros
cfi: Switch to -fsanitize=kcfi
cfi: Drop __CFI_ADDRESSABLE
cfi: Remove CONFIG_CFI_CLANG_SHADOW
...
CONFIG_CFI_CLANG no longer breaks cross-module function address
equality, which makes WARN_ON_FUNCTION_MISMATCH unnecessary. Remove
the definition and switch back to WARN_ON_ONCE.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220908215504.3686827-15-samitolvanen@google.com
Like Hillf Danton mentioned
syzbot should have been able to catch cancel_work_sync() in work context
by checking lockdep_map in __flush_work() for both flush and cancel.
in [1], being unable to report an obvious deadlock scenario shown below is
broken. From locking dependency perspective, sync version of cancel request
should behave as if flush request, for it waits for completion of work if
that work has already started execution.
----------
#include <linux/module.h>
#include <linux/sched.h>
static DEFINE_MUTEX(mutex);
static void work_fn(struct work_struct *work)
{
schedule_timeout_uninterruptible(HZ / 5);
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
static DECLARE_WORK(work, work_fn);
static int __init test_init(void)
{
schedule_work(&work);
schedule_timeout_uninterruptible(HZ / 10);
mutex_lock(&mutex);
cancel_work_sync(&work);
mutex_unlock(&mutex);
return -EINVAL;
}
module_init(test_init);
MODULE_LICENSE("GPL");
----------
The check this patch restores was added by commit 0976dfc1d0
("workqueue: Catch more locking problems with flush_work()").
Then, lockdep's crossrelease feature was added by commit b09be676e0
("locking/lockdep: Implement the 'crossrelease' feature"). As a result,
this check was once removed by commit fd1a5b04df ("workqueue: Remove
now redundant lock acquisitions wrt. workqueue flushes").
But lockdep's crossrelease feature was removed by commit e966eaeeb6
("locking/lockdep: Remove the cross-release locking checks"). At this
point, this check should have been restored.
Then, commit d6e89786be ("workqueue: skip lockdep wq dependency in
cancel_work_sync()") introduced a boolean flag in order to distinguish
flush_work() and cancel_work_sync(), for checking "struct workqueue_struct"
dependency when called from cancel_work_sync() was causing false positives.
Then, commit 87915adc3f ("workqueue: re-add lockdep dependencies for
flushing") tried to restore "struct work_struct" dependency check, but by
error checked this boolean flag. Like an example shown above indicates,
"struct work_struct" dependency needs to be checked for both flush_work()
and cancel_work_sync().
Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
Reported-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
New driver:
- logicvc
vfio:
- use aperture API
core:
- of: Add data-lane helpers and convert drivers
- connector: Remove deprecated ida_simple_get()
media:
- Add various RGB666 and RGB888 format constants
panel:
- Add HannStar HSD101PWW
- Add ETML0700Y5DHA
dma-buf:
- add sync-file API
- set dma mask for udmabuf devices
fbcon:
- Improve scrolling performance
- Sanitize input
fbdev:
- device unregistering fixes
- vesa: Support COMPILE_TEST
- Disable firmware-device registration when first native driver loads
aperture:
- fix segfault during hot-unplug
- export for use with other subsystems
client:
- use driver validated modes
dp:
- aux: make probing more reliable
- mst: Read extended DPCD capabilities during system resume
- Support waiting for HDP signal
- Port-validation fixes
edid:
- CEA data-block iterators
- struct drm_edid introduction
- implement HF-EEODB extension
gem:
- don't use fb format non-existing planes
probe-helper:
- use 640x480 as displayport fallback
scheduler:
- don't kill jobs in interrupt context
bridge:
- Add support for i.MX8qxp and i.MX8qm
- lots of fixes/cleanups
- Add TI-DLPC3433
- fy07024di26a30d: Optional GPIO reset
- ldb: Add reg and reg-name properties to bindings, Kconfig fixes
- lt9611: Fix display sensing;
- tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling
- tc358775: Fix clock settings
- ti-sn65dsi83: Allow GPIO to sleep
- adv7511: I2C fixes
- anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback
- fsl-ldb: Drop DE flip
- ti-sn65dsi86: Convert to atomic modesetting
amdgpu:
- use atomic fence helpers in DM
- fix VRAM address calculations
- export CRTC bpc via debugfs
- Initial devcoredump support
- Enable high priority gfx queue on asics which support it
- Adjust GART size on newer APUs for S/G display
- Soft reset for GFX 11 / SDMA 6
- Add gfxoff status query for vangogh
- Fix timestamps for cursor only commits
- Adjust GART size on newer APUs for S/G display
- fix buddy memory corruption
amdkfd:
- MMU notifier fixes
- P2P DMA support using dma-buf
- Add available memory IOCTL
- HMM profiler support
- Simplify GPUVM validation
- Unified memory for CWSR save/restore area
i915:
- General driver clean-up
- DG2 enabling (still under force probe)
- DG2 small BAR memory support
- HuC loading support
- DG2 workarounds
- DG2/ATS-M device IDs added
- Ponte Vecchio prep work and new blitter engines
- add Meteorlake support
- Fix sparse warnings
- DMC MMIO range checks
- Audio related fixes
- Runtime PM fixes
- PSR fixes
- Media freq factor and per-gt enhancements
- DSI fixes for ICL+
- Disable DMC flip queue handlers
- ADL_P voltage swing updates
- Use more the VBT for panel information
- Fix on Type-C ports with TBT mode
- Improve fastset and allow seamless M/N changes
- Accept more fixed modes with VRR/DMRRS panels
- Disable connector polling for a headless SKU
- ADL-S display PLL w/a
- Enable THP on Icelake and beyond
- Fix i915_gem_object_ggtt_pin_ww regression on old platforms
- Expose per tile media freq factor in sysfs
- Fix dma_resv fence handling in multi-batch execbuf
- Improve on suspend / resume time with VT-d enabled
- export CRTC bpc settings via debugfs
msm:
- gpu: a619 support
- gpu: Fix for unclocked GMU register access
- gpu: Devcore dump enhancements
- client utilization via fdinfo support
- fix fence rollover issue
- gem: Lockdep false-positive warning fix
- gem: Switch to pfn mappings
- WB support on sc7180
- dp: dropped custom bulk clock implementation
- fix link retraining on resolution change
- hdmi: dropped obsolete GPIO support
tegra:
- context isolation for host1x engines
- tegra234 soc support
mediatek:
- add vdosys0/1 for mt8195
- add MT8195 dp_intf driver
exynos:
- Fix resume function issue of exynos decon driver by calling
clk_disable_unprepare() properly if clk_prepare_enable() failed.
nouveau:
- set of misc fixes/cleanups
- display cleanups
gma500:
- Cleanup connector I2C handling
hyperv:
- Unify VRAM allocation of Gen1 and Gen2
meson:
- Support YUV422 output; Refcount fixes
mgag200:
- Support damage clipping
- Support gamma handling
- Protect concurrent HW access
- Fixes to connector
- Store model-specific limits in device-info structure
- fix PCI register init
panfrost:
- Valhall support
r128:
- Fix bit-shift overflow
rockchip:
- Locking fixes in error path
ssd130x:
- Fix built-in linkage
udl:
- Always advertize VGA connector
ast:
- Support multiple outputs
- fix black screen on resume
sun4i:
- HDMI PHY cleanups
vc4:
- Add support for BCM2711
vkms:
- Allocate output buffer with vmalloc()
mcde:
- Fix ref-count leak
mxsfb/lcdif:
- Support i.MX8MP LCD controller
stm/ltdc:
- Support dynamic Z order
- Support mirroring
ingenic:
- Fix display at maximum resolution
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmLp/7YACgkQDHTzWXnE
hr7NjhAAnefa+72EG42OAqajbwTQMENOtFfqyL3k6ueK2ciYbsj/wklw/xc4Ok3o
DM5kG54t+nA9L1M7UyE7eaO36/XcuvS8Ea0uKKkamWt+3Ux4g1Vo1J37nP5sK5jI
GT/wceKA5sk3nuYly2lBby6mVTGuhAX+3edTAFeOwmd0WvQzzpy4vV+nCAgfshUs
ql4gfQPdQdP+wiovUzCIEu6exCSCAI/Oc944fd3AJi5bZbOPFXRS4rMMOLSrdoXV
9P44EZExPbYrDuVUCx/UaZtN8D9myyyBfZe62CtdgNyTYUHXnHCBYue+7D/s5O+y
GaLWcP128MsqZNmJNhmcWFIlgqowO24YkKUH68JH0UtBLSWich8rfdEsrxIidYED
0ma1jodRapjyZOjrHEJ3N5deKpoflMmqvCMpvIk1Ev6pT8KX9a6u34kLgsOVCV41
2bDEYD+DbRW2FexGR79yB2huXHGSnguco6069ca1oy9RF4q8cX6Pb1w2u42oS7zX
lIgLIashilVR2AYg/qi6IPHavmOQ9ItSXPC+4YasYiMGp/mwePqpmL63b/wkhg0D
nXn6/F8Bm6wle2FFbkLGwo1fF1Hn7RzTHSlqRWDKSEaMLhCus6M09VsobFCB19i0
lO4FNVTL8ZtryR94bgVmgi616w9hOhDhM9A+C0kJ9KBkDnDYUJU=
=HQ9U
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- New driver for logicvc - which is a display IP core.
- EDID parser rework to add new extensions
- fbcon scrolling improvements
- i915 has some more DG2 work but not enabled by default, but should
have enough features for userspace to work now.
Otherwise it's lots of work all over the place. Detailed summary:
New driver:
- logicvc
vfio:
- use aperture API
core:
- of: Add data-lane helpers and convert drivers
- connector: Remove deprecated ida_simple_get()
media:
- Add various RGB666 and RGB888 format constants
panel:
- Add HannStar HSD101PWW
- Add ETML0700Y5DHA
dma-buf:
- add sync-file API
- set dma mask for udmabuf devices
fbcon:
- Improve scrolling performance
- Sanitize input
fbdev:
- device unregistering fixes
- vesa: Support COMPILE_TEST
- Disable firmware-device registration when first native driver loads
aperture:
- fix segfault during hot-unplug
- export for use with other subsystems
client:
- use driver validated modes
dp:
- aux: make probing more reliable
- mst: Read extended DPCD capabilities during system resume
- Support waiting for HDP signal
- Port-validation fixes
edid:
- CEA data-block iterators
- struct drm_edid introduction
- implement HF-EEODB extension
gem:
- don't use fb format non-existing planes
probe-helper:
- use 640x480 as displayport fallback
scheduler:
- don't kill jobs in interrupt context
bridge:
- Add support for i.MX8qxp and i.MX8qm
- lots of fixes/cleanups
- Add TI-DLPC3433
- fy07024di26a30d: Optional GPIO reset
- ldb: Add reg and reg-name properties to bindings, Kconfig fixes
- lt9611: Fix display sensing;
- tc358767: DSI/DPI refactoring and DSI-to-eDP support, DSI lane handling
- tc358775: Fix clock settings
- ti-sn65dsi83: Allow GPIO to sleep
- adv7511: I2C fixes
- anx7625: Fix error handling; DPI fixes; Implement HDP timeout via callback
- fsl-ldb: Drop DE flip
- ti-sn65dsi86: Convert to atomic modesetting
amdgpu:
- use atomic fence helpers in DM
- fix VRAM address calculations
- export CRTC bpc via debugfs
- Initial devcoredump support
- Enable high priority gfx queue on asics which support it
- Adjust GART size on newer APUs for S/G display
- Soft reset for GFX 11 / SDMA 6
- Add gfxoff status query for vangogh
- Fix timestamps for cursor only commits
- Adjust GART size on newer APUs for S/G display
- fix buddy memory corruption
amdkfd:
- MMU notifier fixes
- P2P DMA support using dma-buf
- Add available memory IOCTL
- HMM profiler support
- Simplify GPUVM validation
- Unified memory for CWSR save/restore area
i915:
- General driver clean-up
- DG2 enabling (still under force probe)
- DG2 small BAR memory support
- HuC loading support
- DG2 workarounds
- DG2/ATS-M device IDs added
- Ponte Vecchio prep work and new blitter engines
- add Meteorlake support
- Fix sparse warnings
- DMC MMIO range checks
- Audio related fixes
- Runtime PM fixes
- PSR fixes
- Media freq factor and per-gt enhancements
- DSI fixes for ICL+
- Disable DMC flip queue handlers
- ADL_P voltage swing updates
- Use more the VBT for panel information
- Fix on Type-C ports with TBT mode
- Improve fastset and allow seamless M/N changes
- Accept more fixed modes with VRR/DMRRS panels
- Disable connector polling for a headless SKU
- ADL-S display PLL w/a
- Enable THP on Icelake and beyond
- Fix i915_gem_object_ggtt_pin_ww regression on old platforms
- Expose per tile media freq factor in sysfs
- Fix dma_resv fence handling in multi-batch execbuf
- Improve on suspend / resume time with VT-d enabled
- export CRTC bpc settings via debugfs
msm:
- gpu: a619 support
- gpu: Fix for unclocked GMU register access
- gpu: Devcore dump enhancements
- client utilization via fdinfo support
- fix fence rollover issue
- gem: Lockdep false-positive warning fix
- gem: Switch to pfn mappings
- WB support on sc7180
- dp: dropped custom bulk clock implementation
- fix link retraining on resolution change
- hdmi: dropped obsolete GPIO support
tegra:
- context isolation for host1x engines
- tegra234 soc support
mediatek:
- add vdosys0/1 for mt8195
- add MT8195 dp_intf driver
exynos:
- Fix resume function issue of exynos decon driver by calling
clk_disable_unprepare() properly if clk_prepare_enable() failed.
nouveau:
- set of misc fixes/cleanups
- display cleanups
gma500:
- Cleanup connector I2C handling
hyperv:
- Unify VRAM allocation of Gen1 and Gen2
meson:
- Support YUV422 output; Refcount fixes
mgag200:
- Support damage clipping
- Support gamma handling
- Protect concurrent HW access
- Fixes to connector
- Store model-specific limits in device-info structure
- fix PCI register init
panfrost:
- Valhall support
r128:
- Fix bit-shift overflow
rockchip:
- Locking fixes in error path
ssd130x:
- Fix built-in linkage
udl:
- Always advertize VGA connector
ast:
- Support multiple outputs
- fix black screen on resume
sun4i:
- HDMI PHY cleanups
vc4:
- Add support for BCM2711
vkms:
- Allocate output buffer with vmalloc()
mcde:
- Fix ref-count leak
mxsfb/lcdif:
- Support i.MX8MP LCD controller
stm/ltdc:
- Support dynamic Z order
- Support mirroring
ingenic:
- Fix display at maximum resolution"
* tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm: (1480 commits)
drm/amd/display: Fix a compilation failure on PowerPC caused by FPU code
drm/amdgpu: enable support for psp 13.0.4 block
drm/amdgpu: add files for PSP 13.0.4
drm/amdgpu: add header files for MP 13.0.4
drm/amdgpu: correct RLC_RLCS_BOOTLOAD_STATUS offset and index
drm/amdgpu: send msg to IMU for the front-door loading
drm/amdkfd: use time_is_before_jiffies(a + b) to replace "jiffies - a > b"
drm/amdgpu: fix hive reference leak when reflecting psp topology info
drm/amd/pm: enable GFX ULV feature support for SMU13.0.0
drm/amd/pm: update driver if header for SMU 13.0.0
drm/amdgpu: move mes self test after drm sched re-started
drm/amdgpu: drop non-necessary call trace dump
drm/amdgpu: enable VCN cg and JPEG cg/pg
drm/amdgpu: vcn_4_0_2 video codec query
drm/amdgpu: add VCN_4_0_2 firmware support
drm/amdgpu: add VCN function in NBIO v7.7
drm/amdgpu: fix a vcn4 boot poll bug in emulation mode
drm/amd/amdgpu: add memory training support for PSP_V13
drm/amdkfd: remove an unnecessary amdgpu_bo_ref
drm/amd/pm: Add get_gfx_off_status interface for yellow carp
...
Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can be possible
fails and trigger the false warning.
Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.
It is very easy to trigger the warning:
Set wq_unbound_cpumask to a small set of CPUs.
Offline all the CPUs of wq_unbound_cpumask.
Offline an extra CPU and trigger the warning.
Fixes: 10a5a651e3 ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 6417250d3f.
amdpgu need this function in order to prematurly stop pending
reset works when another reset work already in progress.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Lai Jiangshan<jiangshanlai@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Since flush operation synchronously waits for completion, flushing
system-wide WQs (e.g. system_wq) might introduce possibility of deadlock
due to unexpected locking dependency. Tejun Heo commented at [1] that it
makes no sense at all to call flush_workqueue() on the shared WQs as the
caller has no idea what it's gonna end up waiting for.
Although there is flush_scheduled_work() which flushes system_wq WQ with
"Think twice before calling this function! It's very easy to get into
trouble if you don't take great care." warning message, syzbot found a
circular locking dependency caused by flushing system_wq WQ [2].
Therefore, let's change the direction to that developers had better use
their local WQs if flush_scheduled_work()/flush_workqueue(system_*_wq) is
inevitable.
Steps for converting system-wide WQs into local WQs are explained at [3],
and a conversion to stop flushing system-wide WQs is in progress. Now we
want some mechanism for preventing developers who are not aware of this
conversion from again start flushing system-wide WQs.
Since I found that WARN_ON() is complete but awkward approach for teaching
developers about this problem, let's use __compiletime_warning() for
incomplete but handy approach. For completeness, we will also insert
WARN_ON() into __flush_workqueue() after all in-tree users stopped calling
flush_scheduled_work().
Link: https://lore.kernel.org/all/YgnQGZWT%2Fn3VAITX@slm.duckdns.org/ [1]
Link: https://syzkaller.appspot.com/bug?extid=bde0f89deacca7c765b8 [2]
Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp [3]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a CPU is going offline, all workers on the CPU's pool will have their
cpus_allowed cleared to cpu_possible_mask and can run on any CPUs including
the isolated ones. Instead, set cpus_allowed to wq_unbound_cpumask so that
the can avoid isolated CPUs.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull workqueue updates from Tejun Heo:
"Nothing major. Just follow-up cleanups from Lai after the earlier
synchronization simplification"
* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Convert the type of pool->nr_running to int
workqueue: Use wake_up_worker() in wq_worker_sleeping() instead of open code
workqueue: Change the comments of the synchronization about the idle_list
workqueue: Remove the mb() pair between wq_worker_sleeping() and insert_work()
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
To prepare for supporting each feature of the housekeeping cpumask
toward cpuset, prepare each of the HK_FLAG_* entries to move to their
own cpumask with enforcing to fetch them individually. The new
constraint is that multiple HK_FLAG_* entries can't be mixed together
anymore in a single call to housekeeping cpumask().
This will later allow, for example, to runtime modify the cpulist passed
through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot
parameters.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220207155910.527133-3-frederic@kernel.org
It is only modified in associated CPU, so it doesn't need to be atomic.
tj: Comment updated.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The wakeup code in wq_worker_sleeping() is the same as wake_up_worker().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The access to idle_list in wq_worker_sleeping() is changed to be
protected by pool->lock, so the comments above idle_list can be changed
to "L:" which is the meaning of "access with pool->lock held".
And the outdated comments in wq_worker_sleeping() is removed since
the function is not called with rq lock held any more, idle_list is
dereferenced with pool lock now.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In wq_worker_sleeping(), the access to worklist is protected by the
pool->lock, so the memory barrier is unneeded.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
for-5.16-fixes contains two subtle race conditions which were introduced by
scheduler side code cleanups. The branch didn't get pushed out, so merge
into for-5.17.
nr_running is never modified remotely after the schedule callback in
wakeup path is removed.
Rather nr_running is often accessed with other fields in the pool
together, so the cacheline_aligned for nr_running isn't needed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In unbind_workers(), there are two pool->lock held sections separated
by the code of zapping nr_running. wake_up_worker() needs to be in
pool->lock held section and after zapping nr_running. And zapping
nr_running had to be after schedule() when the local wake up
functionality was in use. Now, the call to schedule() has been removed
along with the local wake up functionality, so the code can be merged
into the same pool->lock held section.
The diffstat shows that it is other code moved down because the diff
tools can not know the meaning of merging lock sections by swapping
two code blocks.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and moved the schedule callback from the wakeup code to at end of
schedule() in the worker's process context.
It means that the callback wq_worker_running() is guaranteed that
it sees the %WORKER_UNBOUND flag after scheduled since unbind_workers()
is running on the same CPU that all the pool's workers bound to.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Long time before, workers are not ALL bound after CPU_ONLINE, they can
still be running in other CPUs before self rebinding.
But the commit a9ab775bca ("workqueue: directly restore CPU affinity
of workers from CPU_ONLINE") makes rebind_workers() bind them all.
So all workers are on the CPU before the CPU is down.
And the comment in unbind_workers() refers to the workers "which are
still executing works from before the last CPU down" is outdated.
Just removed it.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The commit 6d25be5782 ("sched/core, workqueues: Distangle worker
accounting from rq lock") changed the schedule callbacks for workqueue
and removed the local-wake-up functionality.
Now the wakingup of workers is done by normal fashion and workers not
yet migrated to the specific CPU in concurrency managed pool can also
be woken up by workers that already bound to the specific cpu now.
So this advanced kicking of the idle workers to migrate them to the
associated CPU is unneeded now.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At CPU-hotplug time, unbind_workers() may preempt a worker while it is
going to sleep. In that case the following scenario can happen:
unbind_workers() wq_worker_sleeping()
-------------- -------------------
if (worker->flags & WORKER_NOT_RUNNING)
return;
//PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
//resume to worker
atomic_dec_and_test(&pool->nr_running);
After unbind_worker() resets pool->nr_running, the value is expected to
remain 0 until the pool ever gets rebound in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of -1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also due to this incorrect "nr_running == -1", all sorts of hazards can
happen, starting with queued works being ignored because no workers are
awaken at insert_work() time.
Fix this with checking again the worker flags while pool->lock is locked.
Fixes: b945efcdd0 ("sched: Remove pointless preemption disable in sched_submit_work()")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
At CPU-hotplug time, unbind_worker() may preempt a worker while it is
waking up. In that case the following scenario can happen:
unbind_workers() wq_worker_running()
-------------- -------------------
if (!(worker->flags & WORKER_NOT_RUNNING))
//PREEMPTED by unbind_workers
worker->flags |= WORKER_UNBOUND;
[...]
atomic_set(&pool->nr_running, 0);
//resume to worker
atomic_inc(&worker->pool->nr_running);
After unbind_worker() resets pool->nr_running, the value is expected to
remain 0 until the pool ever gets rebound in case cpu_up() is called on
the target CPU in the future. But here the race leaves pool->nr_running
with a value of 1, triggering the following warning when the worker goes
idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
worker_thread+0x89/0x3d0
? process_one_work+0x400/0x400
kthread+0x162/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x22/0x30
</TASK>
Also due to this incorrect "nr_running == 1", further queued work may
end up not being served, because no worker is awaken at work insert time.
This raises rcutorture writer stalls for example.
Fix this with disabling preemption in the right place in
wq_worker_running().
It's worth noting that if the worker migrates and runs concurrently with
unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update
due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The current queue_work_on() docbook comment says that the caller must
ensure that the specified CPU can't go away, but does not spell out the
consequences, which turn out to be quite mild. Therefore expand this
comment to explicitly say that the penalty for failing to nail down the
specified CPU is that the workqueue handler might find itself executing
on some other CPU.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
| kblockd_mod_delayed_work_on()
| mod_delayed_work_on()
| __queue_delayed_work()
| __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
| insert_work()
| kasan_record_aux_stack()
| kasan_save_stack()
| stack_depot_save()
| alloc_pages()
| __alloc_pages()
| get_page_from_freelist()
| rm_queue()
| rm_queue_pcplist()
| local_lock_irqsave(&pagesets.lock, flags);
| [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
nor GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and commmit ab00db216c).
Fix it by instructing stackdepot to not expand stack storage via
alloc_pages() in case it runs out by using
kasan_record_aux_stack_noalloc().
While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes
extremely unlikely that kasan_record_aux_stack_noalloc() fails.
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently show_workqueue_state shows the state of all workqueues and of
all worker pools. In certain cases we may need to dump state of only a
specific workqueue or worker pool. For example in destroy_workqueue we
only need to show state of the workqueue which is getting destroyed.
So rename show_workqueue_state to show_all_workqueues(to signify it
dumps state of all busy workqueues) and divide it into more granular
functions (show_one_workqueue and show_one_worker_pool), that would show
states of individual workqueues and worker pools and can be used in
cases such as the one mentioned above.
Also, as mentioned earlier, make destroy_workqueue dump data pertaining
to only the workqueue that is being destroyed and make user(s) of
earlier interface(show_workqueue_state), use new interface
(show_all_workqueues).
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some unfriendly component, such as dpdk, write the same mask to
unbound kworker cpumask again and again. Every time it write to
this interface some work is queue to cpu, even though the mask
is same with the original mask.
So, fix it by return success and do nothing if the cpumask is
equal with the old one.
Signed-off-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Console drivers often queue work while holding locks also taken in their
console write paths, something which can lead to deadlocks on SMP when
dumping workqueue state (e.g. sysrq-t or on suspend failures).
For serial console drivers this could look like:
CPU0 CPU1
---- ----
show_workqueue_state();
lock(&pool->lock); <IRQ>
lock(&port->lock);
schedule_work();
lock(&pool->lock);
printk();
lock(console_owner);
lock(&port->lock);
where workqueues are, for example, used to push data to the line
discipline, process break signals and handle modem-status changes. Line
disciplines and serdev drivers can also queue work on write-wakeup
notifications, etc.
Reworking every console driver to avoid queuing work while holding locks
also taken in their write paths would complicate drivers and is neither
desirable or feasible.
Instead use the deferred-printk mechanism to avoid printing while
holding pool locks when dumping workqueue state.
Note that there are a few WARN_ON() assertions in the workqueue code
which could potentially also trigger a deadlock. Hopefully the ongoing
printk rework will provide a general solution for this eventually.
This was originally reported after a lockdep splat when executing
sysrq-t with the imx serial driver.
Fixes: 3494fc3084 ("workqueue: dump workqueues on sysrq-t")
Cc: stable@vger.kernel.org # 4.0
Reported-by: Fabio Estevam <festevam@denx.de>
Tested-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
There was no strong reason to or not to flush barrier work items in
flush_workqueue(). And we have to make barrier work items not participate
in nr_active so we had been using WORK_NO_COLOR for them which also makes
them can't be flushed by flush_workqueue().
And the users of flush_workqueue() often do not intend to wait barrier work
items issued by flush_work(). That made the choice sound perfect.
But barrier work items have reference to internal structure (pool_workqueue)
and the worker thread[s] is/are still busy for the workqueue user when the
barrrier work items are not done. So it is reasonable to make flush_workqueue()
also watch for flush_work() to make it more robust.
And a problem[1] reported by Li Zhe shows that we need such robustness.
The warning logs are listed below:
WARNING: CPU: 0 PID: 19336 at kernel/workqueue.c:4430 destroy_workqueue+0x11a/0x2f0
*****
destroy_workqueue: test_workqueue9 has the following busy pwq
pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=0/1 refcnt=2
in-flight: 5658:wq_barrier_func
Showing busy workqueues and worker pools:
*****
It shows that even after drain_workqueue() returns, the barrier work item
is still in flight and the pwq (and a worker) is still busy on it.
The problem is caused by flush_workqueue() not watching flush_work():
Thread A Worker
/* normal work item with linked */
process_scheduled_works()
destroy_workqueue() process_one_work()
drain_workqueue() /* run normal work item */
/-- pwq_dec_nr_in_flight()
flush_workqueue() <---/
/* the last normal work item is done */
sanity_check process_one_work()
/-- raw_spin_unlock_irq(&pool->lock)
raw_spin_lock_irq(&pool->lock) <-/ /* maybe preempt */
*WARNING* wq_barrier_func()
/* maybe preempt by cond_resched() */
Thread A can get the pool lock after the Worker unlocks the pool lock before
running wq_barrier_func(). And if there is any preemption happen around
wq_barrier_func(), destroy_workqueue()'s sanity check is more likely to
get the lock and catch it. (Note: preemption is not necessary to cause the bug,
the unlocking is enough to possibly trigger the WARNING.)
A simple solution might be just executing all linked barrier work items
once without releasing pool lock after the head work item's
pwq_dec_nr_in_flight(). But this solution has two problems:
1) the head work item might also be barrier work item when the user-queued
work item is cancelled. For example:
thread 1: thread 2:
queue_work(wq, &my_work)
flush_work(&my_work)
cancel_work_sync(&my_work);
/* Neiter my_work nor the barrier work is scheduled. */
destroy_workqueue(wq);
/* This is an easier way to catch the WARNING. */
2) there might be too much linked barrier work items and running them
all once without releasing pool lock just causes trouble.
The only solution is to make flush_workqueue() aslo watch barrier work
items. So we have to assign a color to these barrier work items which
is the color of the head (user-queued) work item.
Assigning a color doesn't cause any problem in ative management, because
the prvious patch made barrier work items not participate in nr_active
via WORK_STRUCT_INACTIVE rather than reliance on the (old) WORK_NO_COLOR.
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Reported-by: Li Zhe <lizhe.67@bytedance.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, WORK_NO_COLOR has two meanings:
Not participate in flushing
Not participate in nr_active
And only non-barrier work items are marked with WORK_STRUCT_INACTIVE
when they are in inactive_works list. The barrier work items are not
marked INACTIVE even linked in inactive_works list since these tail
items are always moved together with the head work item.
These definitions are simple, clean and practical. (Except a small
blemish that only the first meaning of WORK_NO_COLOR is documented in
include/linux/workqueue.h while both meanings are in workqueue.c)
But dual-purpose WORK_NO_COLOR used for barrier work items has proven to
be problematical[1]. Only the second purpose is obligatory. So we plan
to make barrier work items participate in flushing but keep them still
not participating in nr_active.
So the plan is to mark barrier work items inactive without using
WORK_NO_COLOR in this patch so that we can assign a flushing color to
them in next patch.
The reasonable way is to add or reuse a bit in work data of the work
item. But adding a bit will double the size of pool_workqueue.
Currently, WORK_STRUCT_INACTIVE is only used in try_to_grab_pending()
for user-queued work items and try_to_grab_pending() can't work for
barrier work items. So we extend WORK_STRUCT_INACTIVE to also mark
barrier work items no matter which list they are in because we don't
need to determind which list a barrier work item is in.
So the meaning of WORK_STRUCT_INACTIVE becomes just "the work items don't
participate in nr_active" (no matter whether it is a barrier work item or
a user-queued work item). And WORK_STRUCT_INACTIVE for user-queued work
items means they are in inactive_works list.
This patch does it by setting WORK_STRUCT_INACTIVE for barrier work items
in insert_wq_barrier() and checking WORK_STRUCT_INACTIVE first in
pwq_dec_nr_in_flight(). And the meaning of WORK_NO_COLOR is reduced to
only "not participating in flushing".
There is no functionality change intended in this patch. Because
WORK_NO_COLOR+WORK_STRUCT_INACTIVE represents the previous WORK_NO_COLOR
in meaning and try_to_grab_pending() doesn't use for barrier work items
and avoids being confused by this extended WORK_STRUCT_INACTIVE.
A bunch of comment for nr_active & WORK_STRUCT_INACTIVE is also added for
documenting how WORK_STRUCT_INACTIVE works in nr_active management.
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a local var @work_flags to calculate work_flags step by step, so that
we don't need to squeeze several flags in only the last line of code.
Parepare for next patch to add a bit to barrier work item's flag. Not
squshing this to next patch makes it clear that what it will have changed.
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make pwq_dec_nr_in_flight() use work_data rather just work_color.
Prepare for later patch to get WORK_STRUCT_INACTIVE bit from work_data
in pwq_dec_nr_in_flight().
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are two kinds of "delayed" work items in workqueue subsystem.
One is for timer-delayed work items which are visible to workqueue users.
The other kind is for work items delayed by active management which can
not be directly visible to workqueue users. We mixed the word "delayed"
for both kinds and caused somewhat ambiguity.
This patch renames the later one (delayed by active management) to
"inactive", because it is used for workqueue active management and
most of its related symbols are named with "active" or "activate".
All "delayed" and "DELAYED" are carefully checked and renamed one by
one to avoid accidentally changing the name of the other kind for
timer-delayed.
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace ida_simple_get() with ida_alloc() and ida_simple_remove() with
ida_free(), the latter is more concise and intuitive.
In addition, if ida_alloc() fails, NULL is returned directly. This
eliminates unnecessary initialization of two local variables and an 'if'
judgment.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In error handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the
previously allocated memories are not released. Doing this before
allocating memory eliminates memory leaks.
tj: Note that the condition only occurs when the arch code is pretty broken
and the WARN_ON might as well be BUG_ON().
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If VCPU is suspended (VM suspend) in wq_watchdog_timer_fn() then
once this VCPU resumes it will see the new jiffies value, while it
may take a while before IRQ detects PVCLOCK_GUEST_STOPPED on this
VCPU and updates all the watchdogs via pvclock_touch_watchdogs().
There is a small chance of misreported WQ stalls in the meantime,
because new jiffies is time_after() old 'ts + thresh'.
wq_watchdog_timer_fn()
{
for_each_pool(pool, pi) {
if (time_after(jiffies, ts + thresh)) {
pr_emerg("BUG: workqueue lockup - pool");
}
}
}
Save jiffies at the beginning of this function and use that value
for stall detection. If VM gets suspended then we continue using
"old" jiffies value and old WQ touch timestamps. If IRQ at some
point restarts the stall detection cycle (pvclock_touch_watchdogs())
then old jiffies will always be before new 'ts + thresh'.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Clean up list_sort prototypes (Sami Tolvanen)
- Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHCR8ACgkQiXL039xt
wCZyFQ//fnUZaXR2K354zDyW6CJljMf+d94RF6rH+J6eMTH2/HXa5v0iJokwABLf
ussP6qF4k5wtmI22Gm9A5Zc3e4iiry5pC0jOdk0mk4gzWwFN9MdgNxJZIGA3xqhS
bsBK4AGrVKjtZl48G1/ZxJuNDeJhVp6GNK2n6/Gl4rZF6R7D/Upz0XelyJRdDpcM
HIGma7jZl6xfGU0mdWCzpOGK1zdMca1WVs7A4YuurSbLn5PZJrcNVWLouDqt/Si2
AduSri1gyPClicgvqWjMOzhUpuw/nJtBLRl1x1EsWk/KSZ1/uNVjlewfzdN4fZrr
zbtFr2gLubYLK6JOX7/LqoHlOTgE3tYLL+WIVN75DsOGZBKgHhmebTmWLyqzV0SL
oqcyM5d3ucC6msdtAK5Fv4MSp8rpjqlK1Ha4SGRT6kC2wut7AhZ3KD7eyRIz8mV9
Sa9mhignGFJnTEUp+LSbYdrAudgSKxB40WyXPmswAXX4VJFRD4ONrrcAON/SzkUT
Hw/JdFRCKkJjgwNQjIQoZcUNMTbFz2PlNIEnjJWm38YImQKQlCb2mXaZKCwBkf45
aheCZk17eKoxTCXFMd+KxlyNEtS2yBfq/PpZgvw7GW/pfFbWUg1+2O41LnihIe5v
zu0hN1wNCQqgfxiMZqX1OTb9C/2vybzGsXILt+9nppjZ8EBU7iU=
=wU6U
-----END PGP SIGNATURE-----
Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull CFI on arm64 support from Kees Cook:
"This builds on last cycle's LTO work, and allows the arm64 kernels to
be built with Clang's Control Flow Integrity feature. This feature has
happily lived in Android kernels for almost 3 years[1], so I'm excited
to have it ready for upstream.
The wide diffstat is mainly due to the treewide fixing of mismatched
list_sort prototypes. Other things in core kernel are to address
various CFI corner cases. The largest code portion is the CFI runtime
implementation itself (which will be shared by all architectures
implementing support for CFI). The arm64 pieces are Acked by arm64
maintainers rather than coming through the arm64 tree since carrying
this tree over there was going to be awkward.
CFI support for x86 is still under development, but is pretty close.
There are a handful of corner cases on x86 that need some improvements
to Clang and objtool, but otherwise works well.
Summary:
- Clean up list_sort prototypes (Sami Tolvanen)
- Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)"
* tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
arm64: allow CONFIG_CFI_CLANG to be selected
KVM: arm64: Disable CFI for nVHE
arm64: ftrace: use function_nocfi for ftrace_call
arm64: add __nocfi to __apply_alternatives
arm64: add __nocfi to functions that jump to a physical address
arm64: use function_nocfi with __pa_symbol
arm64: implement function_nocfi
psci: use function_nocfi for cpu_resume
lkdtm: use function_nocfi
treewide: Change list_sort to use const pointers
bpf: disable CFI in dispatcher functions
kallsyms: strip ThinLTO hashes from static functions
kthread: use WARN_ON_FUNCTION_MISMATCH
workqueue: use WARN_ON_FUNCTION_MISMATCH
module: ensure __cfi_check alignment
mm: add generic function_nocfi macro
cfi: add __cficanonical
add support for Clang CFI
With CONFIG_CFI_CLANG, a callback function passed to
__queue_delayed_work from a module points to a jump table entry
defined in the module instead of the one used in the core kernel,
which breaks function address equality in this check:
WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning
when CFI and modules are both enabled.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-6-samitolvanen@google.com
84;0;0c84;0;0c
There are two workqueue-specific watchdog timestamps:
+ @wq_watchdog_touched_cpu (per-CPU) updated by
touch_softlockup_watchdog()
+ @wq_watchdog_touched (global) updated by
touch_all_softlockup_watchdogs()
watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.
Solution:
touch_softlockup_watchdog() must touch also the global @wq_watchdog_touched
timestamp.
The global timestamp can no longer be used for bound workqueues because
it is now updated from all CPUs. Instead, bound workqueues have to check
only @wq_watchdog_touched_cpu and these timestamps have to be updated for
all CPUs in touch_all_softlockup_watchdogs().
Beware:
The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.
It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
V3:
- Modify the commit message clearly according to Petr's suggestion.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The debug_work_activate() is called on the premise that
the work can be inserted, because if wq be in WQ_DRAINING
status, insert work may be failed.
Fixes: e41e704bc4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull qorkqueue updates from Tejun Heo:
"Tracepoint and comment updates only"
* 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Use %s instead of function name
workqueue: tracing the name of the workqueue instead of it's address
workqueue: fix annotation for WQ_SYSFS
It is better to replace the function name with %s, in case the function
name changes.
Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
create_worker() will already set the right affinity using
kthread_bind_mask(), this means only the rescuer will need to change
it's affinity.
Howveer, while in cpu-hot-unplug a regular task is not allowed to run
on online&&!active as it would be pushed away quite agressively. We
need KTHREAD_IS_PER_CPU to survive in that environment.
Therefore set the affinity after getting that magic flag.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU.
Workqueues have unfortunate semantics in that per-cpu workers are not
default flushed and parked during hotplug, however a subset does
manual flush on hotplug and hard relies on them for correctness.
Therefore play silly games..
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org
The scheduler won't break affinity for us any more, and we should
"emulate" the same behavior when the scheduler breaks affinity for
us. The behavior is "changing the cpumask to cpu_possible_mask".
And there might be some other CPUs online later while the worker is
still running with the pending work items. The worker should be allowed
to use the later online CPUs as before and process the work items ASAP.
If we use cpu_active_mask here, we can't achieve this goal but
using cpu_possible_mask can.
Fixes: 06249738a4 ("workqueue: Manually break affinity on hotplug")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210111152638.2417-4-jiangshanlai@gmail.com
Pull workqueue update from Tejun Heo:
"The same as the cgroup tree - one commit which was scheduled for the
5.11 merge window.
All the commit does is avoding spurious worker wakeups from workqueue
allocation / config change path to help cpuisol use cases"
* 'for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Kick a worker based on the actual activation of delayed works
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
Patch series "kasan: add workqueue stack for generic KASAN", v5.
Syzbot reports many UAF issues for workqueue, see [1].
In some of these access/allocation happened in process_one_work(), we
see the free stack is useless in KASAN report, it doesn't help
programmers to solve UAF for workqueue issue.
This patchset improves KASAN reports by making them to have workqueue
queueing stack. It is useful for programmers to solve use-after-free or
double-free memory issue.
Generic KASAN also records the last two workqueue stacks and prints them
in KASAN report. It is only suitable for generic KASAN.
[1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
[2] https://bugzilla.kernel.org/show_bug.cgi?id=198437
This patch (of 4):
When analyzing use-after-free or double-free issue, recording the
enqueuing work stacks is helpful to preserve usage history which
potentially gives a hint about the affected code.
For workqueue it has turned out to be useful to record the enqueuing work
call stacks. Because user can see KASAN report to determine whether it is
root cause. They don't need to enable debugobjects, but they have a
chance to find out the root cause.
Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.com
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In realtime scenario, We do not want to have interference on the
isolated cpu cores. but when invoking alloc_workqueue() for percpu wq
on the housekeeping cpu, it kick a kworker on the isolated cpu.
alloc_workqueue
pwq_adjust_max_active
wake_up_worker
The comment in pwq_adjust_max_active() said:
"Need to kick a worker after thawed or an unbound wq's
max_active is bumped"
So it is unnecessary to kick a kworker for percpu's wq when invoking
alloc_workqueue(). this patch only kick a worker based on the actual
activation of delayed works.
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Don't rely on the scheduler to force break affinity for us -- it will
stop doing that for per-cpu-kthreads.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.464718669@infradead.org
As warned by Sphinx:
./Documentation/core-api/workqueue:400: ./kernel/workqueue.c:1218: WARNING: Unexpected indentation.
the return code table is currently not recognized, as it lacks
markups.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
This should make it harder for the kernel to corrupt the debug object
descriptor, used to call functions to fixup state and track debug objects,
by moving the structure to read-only memory.
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200815004027.2046113-3-swboyd@chromium.org
Any runtime WARN_ON() has to be fixed, and BUILD_BUG_ON() can
help you nitice it earlier.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is no point to unlock() and then lock() the same mutex
back to back.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration. Unfortunately, it checks only whether the pool needs
help from rescuers, but it doesn't check whether the pwq has work items
in the pool (the real reason that this rescuer can help for the pool).
The patch adds the check and void unneeded requeuing.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueue code has it's internal spinlocks (pool::lock), which
are acquired on most workqueue operations. These spinlocks are
converted to 'sleeping' spinlocks on a RT-kernel.
Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.
The pool::lock hold times are bound and the code sections are
relatively short, which allows to convert pool::lock and as a
consequence wq_mayday_lock to raw spinlocks which are truly spinning
locks even on a PREEMPT_RT kernel.
With the previous conversion of the manager waitqueue to a simple
waitqueue workqueues are now fully RT compliant.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The workqueue code has it's internal spinlock (pool::lock) and also
implicit spinlock usage in the wq_manager waitqueue. These spinlocks
are converted to 'sleeping' spinlocks on a RT-kernel.
Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.
pool::lock can be converted to a raw spinlock as the lock held times
are short. But the workqueue manager waitqueue is handled inside of
pool::lock held regions which again violates the lock nesting rules
of raw and regular spinlocks.
The manager waitqueue has no special requirements like custom wakeup
callbacks or mass wakeups. While it does not use exclusive wait mode
explicitly there is no strict requirement to queue the waiters in a
particular order as there is only one waiter at a time.
This allows to replace the waitqueue with rcuwait which solves the
locking problem because rcuwait relies on existing locking.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The data structure member "wq->rescuer" was reset to a null pointer
in one if branch. It was passed to a call of the function "kfree"
in the callback function "rcu_free_wq" (which was eventually executed).
The function "kfree" does not perform more meaningful data processing
for a passed null pointer (besides immediately returning from such a call).
Thus delete this function call which became unnecessary with the referenced
software update.
Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")
Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We need to preserve error code before freeing "rescuer".
Fixes: f187b6974f ("workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace inline function PTR_ERR_OR_ZERO with IS_ERR and PTR_ERR to
remove redundant parameter definitions and checks.
Reduce code size.
Before:
text data bss dec hex filename
47510 5979 840 54329 d439 kernel/workqueue.o
After:
text data bss dec hex filename
47474 5979 840 54293 d415 kernel/workqueue.o
Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The kernel test robot triggered a warning with the following race:
task-ctx A interrupt-ctx B
worker
-> process_one_work()
-> work_item()
-> schedule();
-> sched_submit_work()
-> wq_worker_sleeping()
-> ->sleeping = 1
atomic_dec_and_test(nr_running)
__schedule(); *interrupt*
async_page_fault()
-> local_irq_enable();
-> schedule();
-> sched_submit_work()
-> wq_worker_sleeping()
-> if (WARN_ON(->sleeping)) return
-> __schedule()
-> sched_update_worker()
-> wq_worker_running()
-> atomic_inc(nr_running);
-> ->sleeping = 0;
-> sched_update_worker()
-> wq_worker_running()
if (!->sleeping) return
In this context the warning is pointless everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 > 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) and incremented once (only by B, A
will skip it). This is the case until the ->sleeping is zeroed again in
wq_worker_running().
Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.
Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
Pull workqueue updates from Tejun Heo:
"Nothing too interesting. Just two trivial patches"
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Mark up unlocked access to wq->first_flusher
workqueue: Make workqueue_init*() return void
[ 7329.671518] BUG: KCSAN: data-race in flush_workqueue / flush_workqueue
[ 7329.671549]
[ 7329.671572] write to 0xffff8881f65fb250 of 8 bytes by task 37173 on cpu 2:
[ 7329.671607] flush_workqueue+0x3bc/0x9b0 (kernel/workqueue.c:2844)
[ 7329.672527]
[ 7329.672540] read to 0xffff8881f65fb250 of 8 bytes by task 37175 on cpu 0:
[ 7329.672571] flush_workqueue+0x28d/0x9b0 (kernel/workqueue.c:2835)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.
Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.
Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement. So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.
Fixes: ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton <hdanton@sina.com>
[dj: massage changelog]
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
The return values of workqueue_init() and workqueue_early_int() are
always 0, and there is no usage of their return value. So just make
them return void.
Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.
You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.
Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Update the comment to use PREEMPTION because it is true for both
preemption models.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20191015191821.11479-35-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- Dynamic tick (nohz) updates, perhaps most notably changes to force
the tick on when needed due to lengthy in-kernel execution on CPUs
on which RCU is waiting.
- Linux-kernel memory consistency model updates.
- Replace rcu_swap_protected() with rcu_prepace_pointer().
- Torture-test updates.
- Documentation updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
rcu: Suppress levelspread uninitialized messages
rcu: Fix uninitialized variable in nocb_gp_wait()
rcu: Update descriptions for rcu_future_grace_period tracepoint
rcu: Update descriptions for rcu_nocb_wake tracepoint
rcu: Remove obsolete descriptions for rcu_barrier tracepoint
rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
workqueue: Convert for_each_wq to use built-in list check
rcu: Several rcu_segcblist functions can be static
rcu: Remove unused function hlist_bl_del_init_rcu()
Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
...
An additional check has been recently added to ensure that a RCU related lock
is held while the RCU list is iterated.
The `pwqs' are sometimes iterated without a RCU lock but with the &wq->mutex
acquired leading to a warning.
Teach list_for_each_entry_rcu() that the RCU usage is okay if &wq->mutex
is acquired during the list traversal.
Fixes: 28875945ba ("rcu: Add support for consolidated-RCU reader checking")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Because list_for_each_entry_rcu() can now check for holding a
lock as well as for being in an RCU read-side critical section,
this commit replaces the workqueue_sysfs_unregister() function's
use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with
list_for_each_entry_rcu() augmented with a lockdep_is_held() optional
argument.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration. Unfortunately, it doesn't check whether the pwq is
already on the mayday list and unconditionally gets the ref and moves
it onto the list. This doesn't corrupt the list but creates an
additional reference to the pwq. It got queued twice but will only be
removed once.
This leak later can trigger pwq refcnt warning on workqueue
destruction and prevent freeing of the workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # v3.19+
destroy_workqueue() warnings still, at a lower frequency, trigger
spuriously. The problem seems to be in-flight operations which
haven't reached put_pwq() yet.
* Make sanity check grab all the related locks so that it's
synchronized against operations which puts pwq at the end.
* Always print out the offending pwq.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
* Now that wq->rescuer may be cleared while rescuer is still there,
switch show_pwq() debug printout to test worker->rescue_wq to
identify rescuers intead of testing wq->rescuer.
* Update comment on ->rescuer locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Before actually destrying a workqueue, destroy_workqueue() checks
whether it's actually idle. If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling. It unfortunately
has a couple issues.
* Mayday list queueing increments pwq's refcnts which gets detected as
busy and fails the sanity checks. However, because mayday list
queueing is asynchronous, this condition can happen without any
actual work items left in the workqueue.
* Sanity check failure leaves the sysfs interface behind too which can
lead to init failure of newer instances of the workqueue.
This patch fixes the above two by
* If a workqueue has a rescuer, disable and kill the rescuer before
sanity checks. Disabling and killing is guaranteed to flush the
existing mayday list.
* Remove sysfs interface before sanity checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.
Avoids lockdep complaints about nested calls to get_online_cpus in a
future patch where padata calls apply_workqueue_attrs when changing
other CPU-hotplug-sensitive data structures with the CPU read lock
already held.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
padata will use these these interfaces in a later patch, so unconfine them.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
All callers use GFP_KERNEL. No point in having that argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
None of those functions have any users outside of workqueue.c. Confine
them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull workqueue updates from Tejun Heo:
"Only three commits, of which two are trivial.
The non-trivial chagne is Thomas's patch to switch workqueue from
sched RCU to regular one. The use of sched RCU is mostly historic and
doesn't really buy us anything noticeable"
* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Use normal rcu
kernel/workqueue: Document wq_worker_last_func() argument
kernel/workqueue: Use __printf markup to silence compiler in function 'alloc_workqueue'
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
=WZfC
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Allow state reset of printk_once() calls.
- Prevent crashes when dereferencing invalid pointers in vsprintf().
Only the first byte is checked for simplicity.
- Make vsprintf warnings consistent and inlined.
- Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
modifiers.
- Some clean up of vsprintf and test_printf code.
* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
lib/vsprintf: Make function pointer_string static
vsprintf: Limit the length of inlined error messages
vsprintf: Avoid confusion between invalid address and value
vsprintf: Prevent crash when dereferencing invalid pointers
vsprintf: Consolidate handling of unknown pointer specifiers
vsprintf: Factor out %pO handler as kobject_string()
vsprintf: Factor out %pV handler as va_format()
vsprintf: Factor out %p[iI] handler as ip_addr_string()
vsprintf: Do not check address of well-known strings
vsprintf: Consistent %pK handling for kptr_restrict == 0
vsprintf: Shuffle restricted_pointer()
printk: Tie printk_once / printk_deferred_once into .data.once for reset
treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
lib/test_printf: Switch to bitmap_zalloc()
The worker accounting for CPU bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.
Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. When nr_running is updated when
the task is retunrning from schedule() then it is later compared when it
is done from ttwu().
[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need for sched_rcu. The undocumented reason why sched_rcu
is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by
the fact that sched_rcu reader side critical sections are also protected
by preempt or irq disabled regions.
Replace rcu_read_lock_sched with rcu_read_lock and acquire the RCU lock
where it is not yet explicit acquired. Replace local_irq_disable() with
rcu_read_lock(). Update asserts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bigeasy: mangle changelog a little]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The recent change to prevent use after free and a memory leak introduced an
unconditional call to wq_unregister_lockdep() in the error handling
path. If the lockdep key had not been registered yet, then the lockdep core
emits a warning.
Only call wq_unregister_lockdep() if wq_register_lockdep() has been
called first.
Fixes: 009bb421b6 ("workqueue, lockdep: Fix an alloc_workqueue() error path")
Reported-by: syzbot+be0c198232f86389c3dd@syzkaller.appspotmail.com
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20190311230255.176081-1-bvanassche@acm.org
This patch avoids that the following warning is reported when building
with W=1:
kernel/workqueue.c:938: warning: Function parameter or member 'task' not described in 'wq_worker_last_func'
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Silence warnings (triggered at W=1) by adding relevant __printf attributes.
kernel/workqueue.c:4249:2: warning: function 'alloc_workqueue' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull locking fixes from Thomas Gleixner:
"A few fixes for lockdep:
- initialize lockdep internal RCU head after initializing RCU
- prevent use after free in a alloc_workqueue() error handling path
- plug a memory leak in the workqueue core which fails to free a
dynamically allocated lock name.
- make Clang happy"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
workqueue, lockdep: Fix a memory leak in wq->lock_name
workqueue, lockdep: Fix an alloc_workqueue() error path
locking/lockdep: Only call init_rcu_head() after RCU has been initialized
locking/lockdep: Avoid a Clang warning
Merge more updates from Andrew Morton:
- some of the rest of MM
- various misc things
- dynamic-debug updates
- checkpatch
- some epoll speedups
- autofs
- rapidio
- lib/, lib/lzo/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...
This function can only be called safely from very specific scheduler
contexts. Document those.
Link: http://lkml.kernel.org/r/20190206150528.31198-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull workqueue updates from Tejun Heo:
"All trivial. Two comment updates and one more initialization sanity
check in flush_work()"
* 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix spelling in source code comments
workqueue: fix typo in comment
workqueue: Try to catch flush_work() without INIT_WORK().
Here is the big driver core patchset for 5.1-rc1
More patches than "normal" here this merge window, due to some work in
the driver core by Alexander Duyck to rework the async probe
functionality to work better for a number of devices, and independant
work from Rafael for the device link functionality to make it work
"correctly".
Also in here is:
- lots of BUS_ATTR() removals, the macro is about to go away
- firmware test fixups
- ihex fixups and simplification
- component additions (also includes i915 patches)
- lots of minor coding style fixups and cleanups.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXH+euQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynyTgCfbV8CLums843sBnT8NnWrTMTdTCcAn1K4re0m
ep8g+6oRLxJy414hogxQ
=bLs2
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big driver core patchset for 5.1-rc1
More patches than "normal" here this merge window, due to some work in
the driver core by Alexander Duyck to rework the async probe
functionality to work better for a number of devices, and independant
work from Rafael for the device link functionality to make it work
"correctly".
Also in here is:
- lots of BUS_ATTR() removals, the macro is about to go away
- firmware test fixups
- ihex fixups and simplification
- component additions (also includes i915 patches)
- lots of minor coding style fixups and cleanups.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (65 commits)
driver core: platform: remove misleading err_alloc label
platform: set of_node in platform_device_register_full()
firmware: hardcode the debug message for -ENOENT
driver core: Add missing description of new struct device_link field
driver core: Fix PM-runtime for links added during consumer probe
drivers/component: kerneldoc polish
async: Add cmdline option to specify drivers to be async probed
driver core: Fix possible supplier PM-usage counter imbalance
PM-runtime: Fix __pm_runtime_set_status() race with runtime resume
driver: platform: Support parsing GpioInt 0 in platform_get_irq()
selftests: firmware: fix verify_reqs() return value
Revert "selftests: firmware: remove use of non-standard diff -Z option"
Revert "selftests: firmware: add CONFIG_FW_LOADER_USER_HELPER_FALLBACK to config"
device: Fix comment for driver_data in struct device
kernfs: Allocating memory for kernfs_iattrs with kmem_cache.
sysfs: remove unused include of kernfs-internal.h
driver core: Postpone DMA tear-down until after devres release
driver core: Document limitation related to DL_FLAG_RPM_ACTIVE
PM-runtime: Take suppliers into account in __pm_runtime_set_status()
device.h: Add __cold to dev_<level> logging functions
...
Change "execuing" into "executing" and "guarnateed" into "guaranteed".
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The following commit:
87915adc3f ("workqueue: re-add lockdep dependencies for flushing")
improved deadlock checking in the workqueue implementation. Unfortunately
that patch also introduced a few false positive lockdep complaints.
This patch suppresses these false positives by allocating the workqueue mutex
lockdep key dynamically.
An example of a false positive lockdep complaint suppressed by this patch
can be found below. The root cause of the lockdep complaint shown below
is that the direct I/O code can call alloc_workqueue() from inside a work
item created by another alloc_workqueue() call and that both workqueues
share the same lockdep key. This patch avoids that that lockdep complaint
is triggered by allocating the work queue lockdep keys dynamically.
In other words, this patch guarantees that a unique lockdep key is
associated with each work queue mutex.
======================================================
WARNING: possible circular locking dependency detected
4.19.0-dbg+ #1 Not tainted
fio/4129 is trying to acquire lock:
00000000a01cfe1a ((wq_completion)"dio/%s"sb->s_id){+.+.}, at: flush_workqueue+0xd0/0x970
but task is already holding lock:
00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&sb->s_type->i_mutex_key#14){+.+.}:
down_write+0x3d/0x80
__generic_file_fsync+0x77/0xf0
ext4_sync_file+0x3c9/0x780
vfs_fsync_range+0x66/0x100
dio_complete+0x2f5/0x360
dio_aio_complete_work+0x1c/0x20
process_one_work+0x481/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #1 ((work_completion)(&dio->complete_work)){+.+.}:
process_one_work+0x447/0x9f0
worker_thread+0x63/0x5a0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
-> #0 ((wq_completion)"dio/%s"sb->s_id){+.+.}:
lock_acquire+0xc5/0x200
flush_workqueue+0xf3/0x970
drain_workqueue+0xec/0x220
destroy_workqueue+0x23/0x350
sb_init_dio_done_wq+0x6a/0x80
do_blockdev_direct_IO+0x1f33/0x4be0
__blockdev_direct_IO+0x79/0x86
ext4_direct_IO+0x5df/0xbb0
generic_file_direct_write+0x119/0x220
__generic_file_write_iter+0x131/0x2d0
ext4_file_write_iter+0x3fa/0x710
aio_write+0x235/0x330
io_submit_one+0x510/0xeb0
__x64_sys_io_submit+0x122/0x340
do_syscall_64+0x71/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Chain exists of:
(wq_completion)"dio/%s"sb->s_id --> (work_completion)(&dio->complete_work) --> &sb->s_type->i_mutex_key#14
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sb->s_type->i_mutex_key#14);
lock((work_completion)(&dio->complete_work));
lock(&sb->s_type->i_mutex_key#14);
lock((wq_completion)"dio/%s"sb->s_id);
*** DEADLOCK ***
1 lock held by fio/4129:
#0: 00000000a0acecf9 (&sb->s_type->i_mutex_key#14){+.+.}, at: ext4_file_write_iter+0x154/0x710
stack backtrace:
CPU: 3 PID: 4129 Comm: fio Not tainted 4.19.0-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x86/0xc5
print_circular_bug.isra.32+0x20a/0x218
__lock_acquire+0x1c68/0x1cf0
lock_acquire+0xc5/0x200
flush_workqueue+0xf3/0x970
drain_workqueue+0xec/0x220
destroy_workqueue+0x23/0x350
sb_init_dio_done_wq+0x6a/0x80
do_blockdev_direct_IO+0x1f33/0x4be0
__blockdev_direct_IO+0x79/0x86
ext4_direct_IO+0x5df/0xbb0
generic_file_direct_write+0x119/0x220
__generic_file_write_iter+0x131/0x2d0
ext4_file_write_iter+0x3fa/0x710
aio_write+0x235/0x330
io_submit_one+0x510/0xeb0
__x64_sys_io_submit+0x122/0x340
do_syscall_64+0x71/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190214230058.196511-20-bvanassche@acm.org
[ Reworked the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a new function, queue_work_node, which is meant to schedule work on
a "random" CPU of the requested NUMA node. The main motivation for this is
to help assist asynchronous init to better improve boot times for devices
that are local to a specific node.
For now we just default to the first CPU that is in the intersection of the
cpumask of the node and the online cpumask. The only exception is if the
CPU is local to the node we will just use the current CPU. This should work
for our purposes as we are currently only using this for unbound work so
the CPU will be translated to a node anyway instead of being directly used.
As we are only using the first CPU to represent the NUMA node for now I am
limiting the scope of the function so that it can only be used with unbound
workqueues.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzbot found a flush_work() caller who forgot to call INIT_WORK()
because that work_struct was allocated by kzalloc() [1]. But the message
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
by lock_map_acquire() is failing to tell that INIT_WORK() is missing.
Since flush_work() without INIT_WORK() is a bug, and INIT_WORK() should
set ->func field to non-zero, let's warn if ->func field is zero.
[1] https://syzkaller.appspot.com/bug?id=a5954455fcfa51c29ca2ab55b203076337e1c770
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched(). This commit therefore makes that change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Some architectures need to use stop_machine() to patch functions for
ftrace, and the assumption is that the stopped CPUs do not make function
calls to traceable functions when they are in the stopped state.
Commit ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after
MULTI_STOP_PREPARE") added calls to the watchdog touch functions from
the stopped CPUs and those functions lack notrace annotations. This
leads to crashes when enabling/disabling ftrace on ARM kernels built
with the Thumb-2 instruction set.
Fix it by adding the necessary notrace annotations.
Fixes: ce4f06dcbb ("stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: oleg@redhat.com
Cc: tj@kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180821152507.18313-1-vincent.whitchurch@axis.com
Pull workqueue updates from Tejun Heo:
"Over the lockdep cross-release churn, workqueue lost some of the
existing annotations. Johannes Berg restored it and also improved
them"
* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: re-add lockdep dependencies for flushing
workqueue: skip lockdep wq dependency in cancel_work_sync()
In flush_work(), we need to create a lockdep dependency so that
the following scenario is appropriately tagged as a problem:
work_function()
{
mutex_lock(&mutex);
...
}
other_function()
{
mutex_lock(&mutex);
flush_work(&work); // or cancel_work_sync(&work);
}
This is a problem since the work might be running and be blocked
on trying to acquire the mutex.
Similarly, in flush_workqueue().
These were removed after cross-release partially caught these
problems, but now cross-release was reverted anyway. IMHO the
removal was erroneous anyway though, since lockdep should be
able to catch potential problems, not just actual ones, and
cross-release would only have caught the problem when actually
invoking wait_for_completion().
Fixes: fd1a5b04df ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In cancel_work_sync(), we can only have one of two cases, even
with an ordered workqueue:
* the work isn't running, just cancelled before it started
* the work is running, but then nothing else can be on the
workqueue before it
Thus, we need to skip the lockdep workqueue dependency handling,
otherwise we get false positive reports from lockdep saying that
we have a potential deadlock when the workqueue also has other
work items with locking, e.g.
work1_function() { mutex_lock(&mutex); ... }
work2_function() { /* nothing */ }
other_function() {
queue_work(ordered_wq, &work1);
queue_work(ordered_wq, &work2);
mutex_lock(&mutex);
cancel_work_sync(&work2);
}
As described above, this isn't a problem, but lockdep will
currently flag it as if cancel_work_sync() was flush_work(),
which *is* a problem.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
xfcp, hisi_sas, cxlflash, qla2xxx. In the absence of Nic, we're also
taking target updates which are mostly minor except for the tcmu
refactor. The only real core change to worry about is the removal of
high page bouncing (in sas, storvsc and iscsi). This has been well
tested and no problems have shown up so far.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWx1pbCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishUucAP42pccS
ziKyiOizuxv9fZ4Q+nXd1A9zhI5tqqpkHjcQegEA40qiZSi3EKGKR8W0UpX7Ntmo
tqrZJGojx9lnrAM2RbQ=
=NMXg
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
xfcp, hisi_sas, cxlflash, qla2xxx.
In the absence of Nic, we're also taking target updates which are
mostly minor except for the tcmu refactor.
The only real core change to worry about is the removal of high page
bouncing (in sas, storvsc and iscsi). This has been well tested and no
problems have shown up so far"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (268 commits)
scsi: lpfc: update driver version to 12.0.0.4
scsi: lpfc: Fix port initialization failure.
scsi: lpfc: Fix 16gb hbas failing cq create.
scsi: lpfc: Fix crash in blk_mq layer when executing modprobe -r lpfc
scsi: lpfc: correct oversubscription of nvme io requests for an adapter
scsi: lpfc: Fix MDS diagnostics failure (Rx < Tx)
scsi: hisi_sas: Mark PHY as in reset for nexus reset
scsi: hisi_sas: Fix return value when get_free_slot() failed
scsi: hisi_sas: Terminate STP reject quickly for v2 hw
scsi: hisi_sas: Add v2 hw force PHY function for internal ATA command
scsi: hisi_sas: Include TMF elements in struct hisi_sas_slot
scsi: hisi_sas: Try wait commands before before controller reset
scsi: hisi_sas: Init disks after controller reset
scsi: hisi_sas: Create a scsi_host_template per HW module
scsi: hisi_sas: Reset disks when discovered
scsi: hisi_sas: Add LED feature for v3 hw
scsi: hisi_sas: Change common allocation mode of device id
scsi: hisi_sas: change slot index allocation mode
scsi: hisi_sas: Introduce hisi_sas_phy_set_linkrate()
scsi: hisi_sas: fix a typo in hisi_sas_task_prep()
...
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
KCDVLYPxwQQqK1Mqig==
=/3L8
-----END PGP SIGNATURE-----
Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull overflow updates from Kees Cook:
"This adds the new overflow checking helpers and adds them to the
2-factor argument allocators. And this adds the saturating size
helpers and does a treewide replacement for the struct_size() usage.
Additionally this adds the overflow testing modules to make sure
everything works.
I'm still working on the treewide replacements for allocators with
"simple" multiplied arguments:
*alloc(a * b, ...) -> *alloc_array(a, b, ...)
and
*zalloc(a * b, ...) -> *calloc(a, b, ...)
as well as the more complex cases, but that's separable from this
portion of the series. I expect to have the rest sent before -rc1
closes; there are a lot of messy cases to clean up.
Summary:
- Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)"
* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use struct_size() for devm_kmalloc() and friends
treewide: Use struct_size() for vmalloc()-family
treewide: Use struct_size() for kmalloc()-family
device: Use overflow helpers for devm_kmalloc()
mm: Use overflow helpers in kvmalloc()
mm: Use overflow helpers in kmalloc_array*()
test_overflow: Add memory allocation overflow tests
overflow.h: Add allocation size calculation helpers
test_overflow: Report test failures
test_overflow: macrofy some more, do more tests for free
lib: add runtime test of check_*_overflow functions
compiler.h: enable builtin overflow checkers and add fallback code
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:
// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@
- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
Signed-off-by: Kees Cook <keescook@chromium.org>
In commit 7ee681b252 ("workqueue: Convert to state machine callbacks"),
three new function definitions were added: ‘workqueue_prepare_cpu’,
‘workqueue_online_cpu’ and ‘workqueue_offline_cpu’.
Move these function definitions within a CONFIG_SMP block since they are
not used outside of it. This will match function declarations in header
<include/linux/workqueue.h>, and silence the following gcc warning (W=1):
kernel/workqueue.c:4743:5: warning: no previous prototype for ‘workqueue_prepare_cpu’ [-Wmissing-prototypes]
kernel/workqueue.c:4756:5: warning: no previous prototype for ‘workqueue_online_cpu’ [-Wmissing-prototypes]
kernel/workqueue.c:4783:5: warning: no previous prototype for ‘workqueue_offline_cpu’ [-Wmissing-prototypes]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The worker struct could already be freed when wq_worker_comm() tries
to access it for reporting. This patch protects PF_WQ_WORKER
modifications with wq_pool_attach_mutex and makes wq_worker_comm()
test the flag before dereferencing worker from kthread_data(), which
ensures that it only dereferences when the worker struct is valid.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 6b59808bfe ("workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}")
There can be a lot of workqueue workers and they all show up with the
cryptic kworker/* names making it difficult to understand which is
doing what and how they came to be.
# ps -ef | grep kworker
root 4 2 0 Feb25 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Feb25 ? 00:00:00 [kworker/u112:0]
root 19 2 0 Feb25 ? 00:00:00 [kworker/1:0H]
root 25 2 0 Feb25 ? 00:00:00 [kworker/2:0H]
root 31 2 0 Feb25 ? 00:00:00 [kworker/3:0H]
...
This patch makes workqueue workers report the latest workqueue it was
executing for through /proc/PID/{comm,stat,status}. The extra
information is appended to the kthread name with intervening '+' if
currently executing, otherwise '-'.
# cat /proc/25/comm
kworker/2:0-events_power_efficient
# cat /proc/25/stat
25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0...
# grep Name /proc/25/status
Name: kworker/2:0-events_power_efficient
Unfortunately, ps(1) truncates comm to 15 characters,
# ps 25
PID TTY STAT TIME COMMAND
25 ? I 0:00 [kworker/2:0-eve]
making it a lot less useful; however, this should be an easy fix from
ps(1) side.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Craig Small <csmall@enc.com.au>
Work functions can use set_worker_desc() to improve the visibility of
what the worker task is doing. Currently, the desc field is unset at
the beginning of each execution and there is a separate field to track
the field is set during the current execution.
Instead of leaving empty till desc is set, worker->desc can be used to
remember the last workqueue the worker worked on by default and users
that use set_worker_desc() can override it to something more
informative as necessary.
This simplifies desc handling and helps tracking the last workqueue
that the worker exected on to improve visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
For historical reasons, the worker attach/detach functions don't
currently manage worker->pool and the callers are manually and
inconsistently updating it.
This patch moves worker->pool updates into the worker attach/detach
functions. This makes worker->pool consistent and clearly defines how
worker->pool updates are synchronized.
This will help later workqueue visibility improvements by allowing
safe access to workqueue information from worker->task.
Signed-off-by: Tejun Heo <tj@kernel.org>
To improve workqueue visibility, we want to be able to access
workqueue information from worker tasks. The per-pool attach mutex
makes that difficult because there's no way of stabilizing task ->
worker pool association without knowing the pool first.
Worker attach/detach is a slow path and there's no need for different
pools to be able to perform them concurrently. This patch replaces
the per-pool attach_mutex with global wq_pool_attach_mutex to prepare
for visibility improvement changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
As a prerequisite, complement commit 3d1cb2059d ("workqueue: include
workqueue info when printing debug dump of a worker task") to be usable with
kernel modules by exporting the symbol set_worker_desc(). Current built-in
user was introduced with commit ef3b101925 ("writeback: set worker desc to
identify writeback workers in task dumps").
Can help distinguishing work items which do not have adapter scope.
Description is printed out with task dump for debugging on WARN, BUG, panic,
or magic-sysrq [show-task-states(t)].
Example:
$ echo 0 >| /sys/bus/ccw/drivers/zfcp/0.0.1880/0x50050763031bd327/failed &
$ echo 't' >| /proc/sysrq-trigger
$ dmesg
sysrq: SysRq : Show State
task PC stack pid father
...
zfcp_q_0.0.1880 S14640 2165 2 0x02000000
Call Trace:
([<00000000009df464>] __schedule+0xbf4/0xc78)
[<00000000009df57c>] schedule+0x94/0xc0
[<0000000000168654>] rescuer_thread+0x33c/0x3a0
[<000000000016f8be>] kthread+0x166/0x178
[<00000000009e71f2>] kernel_thread_starter+0x6/0xc
[<00000000009e71ec>] kernel_thread_starter+0x0/0xc
no locks held by zfcp_q_0.0.1880/2165.
...
kworker/u512:2 D11280 2193 2 0x02000000
Workqueue: zfcp_q_0.0.1880 zfcp_scsi_rport_work [zfcp] (zrpd-50050763031bd327)
^^^^^^^^^^^^^^^^^^^^^
Call Trace:
([<00000000009df464>] __schedule+0xbf4/0xc78)
[<00000000009df57c>] schedule+0x94/0xc0
[<00000000009e50c0>] schedule_timeout+0x488/0x4d0
[<00000000001e425c>] msleep+0x5c/0x78 >>test code only<<
[<000003ff8008a21e>] zfcp_scsi_rport_work+0xbe/0x100 [zfcp]
[<0000000000167154>] process_one_work+0x3b4/0x718
[<000000000016771c>] worker_thread+0x264/0x408
[<000000000016f8be>] kthread+0x166/0x178
[<00000000009e71f2>] kernel_thread_starter+0x6/0xc
[<00000000009e71ec>] kernel_thread_starter+0x0/0xc
2 locks held by kworker/u512:2/2193:
#0: (name){++++.+}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
#1: ((&(&port->rport_work)->work)){+.+.+.}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
...
=============================================
Showing busy workqueues and worker pools:
workqueue zfcp_q_0.0.1880: flags=0x2000a
pwq 512: cpus=0-255 flags=0x4 nice=0 active=1/1
in-flight: 2193:zfcp_scsi_rport_work [zfcp]
pool 512: cpus=0-255 flags=0x4 nice=0 hung=0s workers=4 idle: 5 2354 2311
Work items with adapter scope are already identified by the workqueue name
"zfcp_q_<devbusid>" and the work item function name.
Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull workqueue updates from Tejun Heo:
"rcu_work addition and a couple trivial changes"
* 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove the comment about the old manager_arb mutex
workqueue: fix the comments of nr_idle
fs/aio: Use rcu_work instead of explicit rcu and work item
cgroup: Use rcu_work instead of explicit rcu and work item
RCU, workqueue: Implement rcu_work
Since the worker rebinding behavior was refactored, there is
no idle worker off the idle_list now. The comment is outdated
and can be just removed.
It also groups nr_workers and nr_idle together.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are cases where RCU callback needs to be bounced to a sleepable
context. This is currently done by the RCU callback queueing a work
item, which can be cumbersome to write and confusing to read.
This patch introduces rcu_work, a workqueue work variant which gets
executed after a RCU grace period, and converts the open coded
bouncing in fs/aio and kernel/cgroup.
v3: Dropped queue_rcu_work_on(). Documented rcu grace period behavior
after queue_rcu_work().
v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
completion of previously queued rcu callback as per Paul.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Found this by accident.
There are no usages of bare cancel_work() in current kernel source.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Never directly free @dev after calling device_register(), even
if it returned an error! Always use put_device() to give up the
reference initialized in this function instead.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
drivers, a memory leak on non-blocking commits, a crash on color-eviction.
The is also meson and edid fixes, plus a fix for a doc warning.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJajY3SAAoJEEN0HIUfOBk0/5IP/jTa0VKe7UurEzj9Vzgt4USu
tVre4MGN42peY2PbVSsBmvHAOeyII7la1/NkiFi8wZKQ2MXw43NenKOcRLDW0r9b
6U8Tlq3sU//NdUDAiLLx9hKb+i31ag+wodvULt0PKtEWDsxWDSRZUo792as2YUkC
VxHuIQywNABohn2Ya8Og1dON25GD7zRzNzH7O+g+fds/Qvav0504u2v10jBKJC0D
IB2oc3ZtJR8n0dFpzhnEB7YkxyvkrsWZQ1LtutGFgrr54F0KVHvAm/VMZ5qzyCRi
kvJN81OFo0xpdE7ZMSQ5YAvcPsEC5ifSNaaxpawsM904H7fS6FNhHMg7cGGi1f7R
B8YbLrdy+mBnQPNNbPcDPQA+YN/tRv4rRmmdLdkDbdY1GM/JJ4C7PTuLL6mX1iWU
DuHiaFS0KZGoS0XCVbvhLkPt5fsmvp+QxBpeNAtxgOdn2pRquDmGZ1jTVEG2mw5U
rqoPURa3urqdSwj8ba0jbJo6WBAmb1uWeyJ7xpyUVhR9SR30+URYVWwJEPDOgTnQ
PaEzjobntgDLaq5NbhpEvmYmylv1SPkucGtkCtwPxIrrh5Z84pZTJ1th2ogfn3Ti
VL25dTlzFpsjEMgC72wCi0eiP7qLVTX9vHYZBzkeIjIWDH0rCnCFxvjwmD/aVUbz
Ex1/fGNEVkFupcYLu7m4
=555h
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Fixes for 4.16. I contains fixes for deadlock on runtime suspend on few
drivers, a memory leak on non-blocking commits, a crash on color-eviction.
The is also meson and edid fixes, plus a fix for a doc warning.
* tag 'drm-misc-fixes-2018-02-21' of git://anongit.freedesktop.org/drm/drm-misc:
drm/tve200: fix kernel-doc documentation comment include
drm/meson: fix vsync buffer update
drm: Handle unexpected holes in color-eviction
drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA
drm/amdgpu: Fix deadlock on runtime suspend
drm/radeon: Fix deadlock on runtime suspend
drm/nouveau: Fix deadlock on runtime suspend
drm: Allow determining if current task is output poll worker
workqueue: Allow retrieval of current task's work struct
drm/atomic: Fix memleak on ERESTARTSYS during non-blocking commits
As we prepare for offloading the residual 1hz scheduler ticks to
workqueue, let's affine those to housekeepers so that they don't
interrupt the CPUs that don't want to be disturbed.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a helper to retrieve the current task's work struct if it is
a workqueue worker.
This allows us to fix a long-standing deadlock in several DRM drivers
wherein the ->runtime_suspend callback waits for a specific worker to
finish and that worker in turn calls a function which waits for runtime
suspend to finish. That function is invoked from multiple call sites
and waiting for runtime suspend to finish is the correct thing to do
except if it's executing in the context of the worker.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging tree
to the "real" part of the kernel, and the most and fsl-mc codebases are
almost ready to move out, that will probably happen for 4.17-rc1 if all
goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLxoA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk4vgCgjeMlwhtar65DIticIRj626EFxiQAnjGmH8Kd
d9Xz2Piq8X47uSsC/6AE
=xxMT
-----END PGP SIGNATURE-----
Merge tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO updates from Greg KH:
"Here is the big Staging and IIO driver patches for 4.16-rc1.
There is the normal amount of new IIO drivers added, like all
releases.
The networking IPX and the ncpfs filesystem are moved into the staging
tree, as they are on their way out of the kernel due to lack of use
anymore.
The visorbus subsystem finall has started moving out of the staging
tree to the "real" part of the kernel, and the most and fsl-mc
codebases are almost ready to move out, that will probably happen for
4.17-rc1 if all goes well.
Other than that, there is a bunch of license header cleanups in the
tree, along with the normal amount of coding style churn that we all
know and love for this codebase. I also got frustrated at the
Meltdown/Spectre mess and took it out on the dgnc tty driver, deleting
huge chunks of it that were never even being used.
Full details of everything is in the shortlog.
All of these patches have been in linux-next for a while with no
reported issues"
* tag 'staging-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (627 commits)
staging: rtlwifi: remove redundant initialization of 'cfg_cmd'
staging: rtl8723bs: remove a couple of redundant initializations
staging: comedi: reformat lines to 80 chars or less
staging: lustre: separate a connection destroy from free struct kib_conn
Staging: rtl8723bs: Use !x instead of NULL comparison
Staging: rtl8723bs: Remove dead code
Staging: rtl8723bs: Change names to conform to the kernel code
staging: ccree: Fix missing blank line after declaration
staging: rtl8188eu: remove redundant initialization of 'pwrcfgcmd'
staging: rtlwifi: remove unused RTLHALMAC_ST and RTLPHYDM_ST
staging: fbtft: remove unused FB_TFT_SSD1325 kconfig
staging: comedi: dt2811: remove redundant initialization of 'ns'
staging: wilc1000: fix alignments to match open parenthesis
staging: wilc1000: removed unnecessary defined enums typedef
staging: wilc1000: remove unnecessary use of parentheses
staging: rtl8192u: remove redundant initialization of 'timeout'
staging: sm750fb: fix CamelCase for dispSet var
staging: lustre: lnet/selftest: fix compile error on UP build
staging: rtl8723bs: hal_com_phycfg: Remove unneeded semicolons
staging: rts5208: Fix "seg_no" calculation in reset_ms_card()
...
Pull workqueue updates from Tejun Heo:
"Workqueue has an early init trick where workqueues can be created and
work items queued on them before the workqueue subsystem is online.
This helps simplifying early init and operation of low level
subsystems which use workqueues for managerial things which aren't
depended upon early during boot.
Out of laziness, the early init didn't cover workqueues with
WQ_MEM_RECLAIM, which is inconsistent and confusing because adding the
flag simply makes the system fail to boot. Cover WQ_MEM_RECLAIM too.
This was originally brought up for RCU but RCU didn't actually need
this. I still think it's a good idea to cover it"
* 'for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: allow WQ_MEM_RECLAIM on early init workqueues
workqueue: separate out init_rescuer()
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle were:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and in
kernel/torture.c). Also a couple of fixes to avoid sending IPIs to
offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends() and
read_barrier_depends().
- Torture-test updates.
- Miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
torture: Save a line in stutter_wait(): while -> for
torture: Eliminate torture_runnable and perf_runnable
torture: Make stutter less vulnerable to compilers and races
locking/locktorture: Fix num reader/writer corner cases
locking/locktorture: Fix rwsem reader_delay
torture: Place all torture-test modules in one MAINTAINERS group
rcutorture/kvm-build.sh: Skip build directory check
rcutorture: Simplify functions.sh include path
rcutorture: Simplify logging
rcutorture/kvm-recheck-*: Improve result directory readability check
rcutorture/kvm.sh: Support execution from any directory
rcutorture/kvm.sh: Use consistent help text for --qemu-args
rcutorture/kvm.sh: Remove unused variable, `alldone`
rcutorture: Remove unused script, config2frag.sh
rcutorture/configinit: Fix build directory error message
rcutorture: Preempt RCU-preempt readers more vigorously
torture: Reduce #ifdefs for preempt_schedule()
rcu: Remove have_rcu_nocb_mask from tree_plugin.h
rcu: Add comment giving debug strategy for double call_rcu()
tracing, rcu: Hide trace event rcu_nocb_wake when not used
...
Instead of the cfs workitem library, use workqueues.
As lnet wants to provide a cpu mask of allowed cpus, it
needs to be a WQ_UNBOUND work queue so that tasks can
run on cpus other than where they were submitted.
This patch also exported apply_workqueue_attrs() which is
a documented part of the workqueue API, that isn't currently
exported. lustre needs it to allow workqueue thread to be limited
to a subset of CPUs.
Acked-by: Tejun Heo <tj@kernel.org> (for export of apply_workqueue_attrs)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show_workqueue_state() can print out a lot of messages while being in
atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
device is slow it may end up triggering NMI hard lockup watchdog.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
Workqueues can be created early during boot before workqueue subsystem
in fully online - work items are queued waiting for later full
initialization. However, early init wasn't supported for
WQ_MEM_RECLAIM workqueues causing unnecessary annoyances for a subset
of users. Expand early init support to include WQ_MEM_RECLAIM
workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Separate out init_rescuer() from __alloc_workqueue_key() to prepare
for early init support for WQ_MEM_RECLAIM. This patch doesn't
introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Pull RCU updates from Paul E. McKenney:
- Updates to use cond_resched() instead of cond_resched_rcu_qs()
where feasible (currently everywhere except in kernel/rcu and
in kernel/torture.c). Also a couple of fixes to avoid sending
IPIs to offline CPUs.
- Updates to simplify RCU's dyntick-idle handling.
- Updates to remove almost all uses of smp_read_barrier_depends()
and read_barrier_depends().
- Miscellaneous fixes.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The filw was converted from print_symbol() to %pf some time
ago (044c782ce3 "workqueue: fix checkpatch issues").
kallsyms does not seem to be needed anymore.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the cpu/hotplug refactoring, DOWN_FAILED is never called without
preceding DOWN_PREPARE making the workaround unnecessary. Remove it.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the recent cpu/hotplug refactoring, workqueue_offline_cpu() is
guaranteed to run on the local cpu which is going offline.
This also fixes the following deadlock by removing work item
scheduling and flushing from CPU hotplug path.
http://lkml.kernel.org/r/1504764252-29091-1-git-send-email-prsood@codeaurora.org
tj: Description update.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs(). This
commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Initialize wq_unbound_cpumask to exclude cpus that were isolated by
the cmdline's isolcpus parameter.
Signed-off-by: Tal Shorer <tal.shorer@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With all callbacks converted, and the timer callback prototype
switched over, the TIMER_FUNC_TYPE cast is no longer needed,
so remove it. Conversion was done with the following scripts:
perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
$(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)
perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
$(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)
The now unused macros are also dropped from include/linux/timer.h.
Signed-off-by: Kees Cook <keescook@chromium.org>
Pull workqueue updates from Tejun Heo:
"There was a commit to make unbound kworkers respect cpu isolation but
it conflicted with the restructuring of cpu isolation and got
reverted, so the only thing left is the trivial comment fix.
Will retry the cpu isolation change after this merge window"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix comment for unbound workqueue's attrbutes
Revert "workqueue: respect isolated cpus when queueing an unbound work"
workqueue: respect isolated cpus when queueing an unbound work
Pull timer updates from Thomas Gleixner:
"Yet another big pile of changes:
- More year 2038 work from Arnd slowly reaching the point where we
need to think about the syscalls themself.
- A new timer function which allows to conditionally (re)arm a timer
only when it's either not running or the new expiry time is sooner
than the armed expiry time. This allows to use a single timer for
multiple timeout requirements w/o caring about the first expiry
time at the call site.
- A new NMI safe accessor to clock real time for the printk timestamp
work. Can be used by tracing, perf as well if required.
- A large number of timer setup conversions from Kees which got
collected here because either maintainers requested so or they
simply got ignored. As Kees pointed out already there are a few
trivial merge conflicts and some redundant commits which was
unavoidable due to the size of this conversion effort.
- Avoid a redundant iteration in the timer wheel softirq processing.
- Provide a mechanism to treat RTC implementations depending on their
hardware properties, i.e. don't inflict the write at the 0.5
seconds boundary which originates from the PC CMOS RTC to all RTCs.
No functional change as drivers need to be updated separately.
- The usual small updates to core code clocksource drivers. Nothing
really exciting"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (111 commits)
timers: Add a function to start/reduce a timer
pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()
timer: Prepare to change all DEFINE_TIMER() callbacks
netfilter: ipvs: Convert timers to use timer_setup()
scsi: qla2xxx: Convert timers to use timer_setup()
block/aoe: discover_timer: Convert timers to use timer_setup()
ide: Convert timers to use timer_setup()
drbd: Convert timers to use timer_setup()
mailbox: Convert timers to use timer_setup()
crypto: Convert timers to use timer_setup()
drivers/pcmcia: omap1: Fix error in automated timer conversion
ARM: footbridge: Fix typo in timer conversion
drivers/sgi-xp: Convert timers to use timer_setup()
drivers/pcmcia: Convert timers to use timer_setup()
drivers/memstick: Convert timers to use timer_setup()
drivers/macintosh: Convert timers to use timer_setup()
hwrng/xgene-rng: Convert timers to use timer_setup()
auxdisplay: Convert timers to use timer_setup()
sparc/led: Convert timers to use timer_setup()
mips: ip22/32: Convert timers to use timer_setup()
...
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1509980490-4285-4-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit b5149873a0.
It conflicts with the following isolcpus change from the sched branch.
edb9382175 ("sched/isolation: Move isolcpus= handling to the housekeeping code")
Let's revert for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the workqueue code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-12-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initialize wq_unbound_cpumask to exclude cpus that were isolated by
the cmdline's isolcpus parameter.
Signed-off-by: Tal Shorer <tal.shorer@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly. (The prior workqueue
patch missed a few timers.)
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://lkml.kernel.org/r/20171016225825.GA99101@beast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:
[ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
[ 1270.473240] -----------------------------------------------------
[ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
[ 1270.474994]
[ 1270.474994] and this task is already holding:
[ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
[ 1270.476046] which would create a new lock dependency:
[ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
[ 1270.476949]
[ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 1270.477553] (&pool->lock/1){-.-.}
...
[ 1270.488900] to a HARDIRQ-irq-unsafe lock:
[ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
...
[ 1270.494735] Possible interrupt unsafe locking scenario:
[ 1270.494735]
[ 1270.495250] CPU0 CPU1
[ 1270.495600] ---- ----
[ 1270.495947] lock(&(&lock->wait_lock)->rlock);
[ 1270.496295] local_irq_disable();
[ 1270.496753] lock(&pool->lock/1);
[ 1270.497205] lock(&(&lock->wait_lock)->rlock);
[ 1270.497744] <Interrupt>
[ 1270.497948] lock(&pool->lock/1);
, which will cause a irq inversion deadlock if the above lock scenario
happens.
The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.
Unlocking mutex while holding an irq spinlock was never safe and this
problem has been around forever but it never got noticed because the
only time the mutex is usually trylocked while holding irqlock making
actual failures very unlikely and lockdep annotation missed the
condition until the recent b9c16a0e1f ("locking/mutex: Fix
lockdep_assert_held() fail").
Using mutex for pool->manager_arb has always been a bit of stretch.
It primarily is an mechanism to arbitrate managership between workers
which can easily be done with a pool flag. The only reason it became
a mutex is that pool destruction path wants to exclude parallel
managing operations.
This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and make the destruction path wait for the current manager on a wait
queue.
v2: Drop unnecessary flag clearing before pool destruction as
suggested by Boqun.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: stable@vger.kernel.org