linux-yocto/kernel
Andrea Righi 8ae0972677 Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
commit 0b47b6c3543efd65f2e620e359b05f4938314fbd upstream.

scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1, -e yes > /dev/null
 $ sudo schedtool -F -p 99 -a 0, -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25 11:16:46 +02:00
..
bpf bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init() 2025-09-19 16:37:29 +02:00
cgroup cgroup: split cgroup_destroy_wq into 3 workqueues 2025-09-25 11:16:41 +02:00
configs Kbuild updates for v6.16 2025-06-07 10:05:35 -07:00
debug TTY/Serial driver updates for 6.15-rc1 2025-04-02 18:17:33 -07:00
dma dma-debug: don't enforce dma mapping check on noncoherent allocations 2025-09-19 16:37:26 +02:00
entry entry: Inline syscall_exit_to_user_mode() 2025-04-29 08:27:10 +02:00
events perf: Fix the POLL_HUP delivery breakage 2025-09-19 16:37:26 +02:00
futex futex: Use user_write_access_begin/_end() in futex_put_value() 2025-08-20 18:41:35 +02:00
gcov kbuild: require gcc-8 and binutils-2.30 2025-04-30 21:53:35 +02:00
irq genirq/irq_sim: Initialize work context pointers properly 2025-06-13 15:36:35 +02:00
kcsan kcsan: test: Initialize dummy variable 2025-08-15 16:38:50 +02:00
livepatch sched,livepatch: Untangle cond_resched() and live-patching 2025-05-14 13:16:24 +02:00
locking Generic: 2025-06-02 12:24:58 -07:00
module module: Prevent silent truncation of module name in delete_module(2) 2025-08-20 18:41:28 +02:00
power PM: hibernate: Restrict GFP mask in hibernation_snapshot() 2025-09-19 16:37:30 +02:00
printk printk: nbcon: Allow reacquire during panic 2025-08-20 18:41:30 +02:00
rcu rcu: Fix racy re-initialization of irq_work causing hangs 2025-08-20 18:41:43 +02:00
sched Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" 2025-09-25 11:16:46 +02:00
time hrtimers: Unconditionally update target CPU base after offline timer migration 2025-09-19 16:37:34 +02:00
trace tracing: Silence warning when chunk allocation fails in trace_pid_write 2025-09-19 16:37:28 +02:00
.gitignore kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-08-20 18:41:31 +02:00
acct.c acct: block access to kernel internal filesystems 2025-02-12 12:24:16 +01:00
async.c
audit_fsnotify.c
audit_tree.c replace collect_mounts()/drop_collected_mounts() with a safer variant 2025-06-23 14:01:49 -04:00
audit_watch.c fs: add kern_path_locked_negative() 2025-04-15 11:32:34 +02:00
audit.c audit: record AUDIT_ANOM_* events regardless of presence of rules 2025-04-11 14:14:41 -04:00
audit.h audit,module: restore audit logging in load failure case 2025-08-15 16:38:20 +02:00
auditfilter.c audit: fix out-of-bounds read in audit_compare_dname_path() 2025-09-09 19:02:34 +02:00
auditsc.c audit,module: restore audit logging in load failure case 2025-08-15 16:38:20 +02:00
backtracetest.c
bounds.c
capability.c capability: Remove unused has_capability 2025-03-07 22:03:09 -06:00
cfi.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
compat.c
configs.c
context_tracking.c context_tracking: Make RCU watch ct_kernel_exit_state() warning 2025-03-04 18:44:29 -08:00
cpu_pm.c
cpu.c perf: Remove too early and redundant CPU hotplug handling 2025-05-08 21:50:19 +02:00
crash_core.c crash: Use note name macros 2025-02-10 16:56:58 -08:00
crash_dump_dm_crypt.c crash_dump: retrieve dm crypt keys in kdump kernel 2025-05-21 10:48:21 -07:00
crash_reserve.c crash: fix spelling mistake "crahskernel" -> "crashkernel" 2025-05-11 17:54:10 -07:00
cred.c
delayacct.c delayacct: remove redundant code and adjust indentation 2025-05-27 19:40:33 -07:00
dma.c
elfcorehdr.c
exec_domain.c
exit.c - Avoid a crash on a heterogeneous machine where not all cores support the 2025-06-22 10:11:45 -07:00
exit.h
extable.c
fail_function.c
fork.c - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
freezer.c sched,freezer: Remove unnecessary warning in __thaw_task 2025-07-17 07:56:50 -10:00
gen_kheaders.sh kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-08-20 18:41:31 +02:00
groups.c
hung_task.c hung_task: show the blocker task if the task is hung on semaphore 2025-05-11 17:54:08 -07:00
iomem.c mm/memremap: Pass down MEMREMAP_* flags to arch_memremap_wb() 2025-02-21 15:05:38 +01:00
irq_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
jump_label.c jump_label: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
kallsyms_internal.h
kallsyms_selftest.c
kallsyms_selftest.h
kallsyms.c kallsyms: Remove KALLSYMS_ABSOLUTE_PERCPU 2025-02-18 10:16:04 +01:00
kcmp.c kcmp: improve performance adding an unlikely hint to task comparisons 2025-02-21 10:25:33 +01:00
Kconfig.freezer
Kconfig.hz kernel: Fix "select" wording on HZ_250 description 2025-02-21 09:20:30 +01:00
Kconfig.kexec kho: mm: don't allow deferred struct page with KHO 2025-08-28 16:34:34 +02:00
Kconfig.locks
Kconfig.preempt
kcov.c
kexec_core.c kexec_core: Fix error code path in the KEXEC_JUMP flow 2025-08-15 16:38:34 +02:00
kexec_elf.c kexec: initialize ELF lowest address to ULONG_MAX 2025-03-16 22:30:47 -07:00
kexec_file.c - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
kexec_handover.c kho: warn if KHO is disabled due to an error 2025-08-28 16:34:35 +02:00
kexec_internal.h kexec: add KHO support to kexec file loads 2025-05-12 23:50:40 -07:00
kexec.c
kheaders.c
kprobes.c kprobes: Use RCU in all users of __module_text_address(). 2025-03-10 11:54:46 +01:00
ksyms_common.c
ksysfs.c kernel/ksysfs.c: simplify bin_attribute definition 2025-01-07 16:59:15 +01:00
kthread.c ipvs: Fix estimator kthreads preferred affinity 2025-08-20 18:40:52 +02:00
latencytop.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
Makefile kheaders: rebuild kheaders_data.tar.xz when a file is modified within a minute 2025-08-20 18:41:31 +02:00
module_signature.c
notifier.c
nsproxy.c kernel/nsproxy: remove unnecessary guards 2025-05-09 13:13:54 +02:00
padata.c padata: Fix pd UAF once and for all 2025-08-15 16:38:58 +02:00
panic.c - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
params.c module: ensure that kobject_put() is safe for module type kobjects 2025-05-07 20:24:59 +02:00
pid_namespace.c pid: Do not set pid_max in new pid namespaces 2025-03-06 10:18:36 +01:00
pid_sysctl.h treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
pid.c pidfs: detect refcount bugs 2025-05-06 13:59:00 +02:00
profile.c
ptrace.c ptrace: introduce PTRACE_SET_SYSCALL_INFO request 2025-05-11 17:48:15 -07:00
range.c
reboot.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
regset.c
relay.c relay: remove unused relay_late_setup_files 2025-05-11 17:54:09 -07:00
resource_kunit.c
resource.c resource: fix false warning in __request_region() 2025-07-24 17:57:59 -07:00
rseq.c rseq: Fix segfault on registration when rseq_cs is non-zero 2025-03-06 22:26:49 +01:00
scftorture.c
scs.c
seccomp.c seccomp: avoid the lock trip seccomp_filter_release in common case 2025-02-24 11:17:10 -08:00
signal.c signal: Fix memory leak for PIDFD_SELF* sentinels 2025-08-28 16:34:38 +02:00
smp.c CSD-lock pull request for v6.14 2025-01-28 11:34:03 -08:00
smpboot.c
smpboot.h
softirq.c lockdep: Fix wait context check on softirq for PREEMPT_RT 2025-03-25 10:46:44 +01:00
stackleak.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
stacktrace.c
static_call_inline.c Modules changes for 6.15-rc1 2025-03-30 15:44:36 -07:00
static_call.c
stop_machine.c sched/core: Fix migrate_swap() vs. hotplug 2025-07-01 15:02:03 +02:00
sys_ni.c
sys.c futex: Add basic infrastructure for local task local hash 2025-05-03 12:02:07 +02:00
sysctl-test.c sysctl: move u8 register test to lib/test_sysctl.c 2025-04-14 14:13:41 +02:00
sysctl.c sparc: mv sparc sysctls into their own file under arch/sparc/kernel 2025-04-09 13:32:16 +02:00
task_work.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
taskstats.c
torture.c torture: Add get_torture_init_jiffies() for test-start time 2025-02-05 07:14:24 -08:00
tracepoint.c tracepoint: Print the function symbol when tracepoint_debug is set 2025-03-21 15:30:10 -04:00
tsacct.c
ucount.c ucount: fix atomic_long_inc_below() argument type 2025-08-15 16:39:17 +02:00
uid16.c
uid16.h
umh.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
up.c
user_namespace.c uidgid: add map_id_range_up() 2025-02-12 12:12:27 +01:00
user-return-notifier.c
user.c
usermode_driver.c
utsname_sysctl.c treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
utsname.c
vhost_task.c vhost_task: fix vhost_task_create() documentation 2025-04-18 10:08:11 -04:00
vmcore_info.c crash: export PAGE_UNACCEPTED_MAPCOUNT_VALUE to vmcoreinfo 2025-05-11 17:54:04 -07:00
watch_queue.c vfs-6.15-rc1.pipe 2025-03-24 09:52:37 -07:00
watchdog_buddy.c
watchdog_perf.c - The 7 patch series "powerpc/crash: use generic crashkernel 2025-04-01 10:06:52 -07:00
watchdog.c kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count 2025-05-21 10:48:22 -07:00
workqueue_internal.h
workqueue.c workqueue: Initialize wq_isolated_cpumask in workqueue_init_early() 2025-06-17 08:58:29 -10:00