Scheduler updates for v6.18:

Core scheduler changes:
 
  - Make migrate_{en,dis}able() inline, to improve performance
    (Menglong Dong)
 
  - Move STDL_INIT() functions out-of-line (Peter Zijlstra)
 
  - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)
 
 Fair scheduling:
 
  - Defer throttling to when tasks exit to user-space, to reduce the
    chance & impact of throttle-preemption with held locks and
    other resources. (Aaron Lu, Valentin Schneider)
 
  - Get rid of sched_domains_curr_level hack for tl->cpumask(),
    as the warning was getting triggered on certain topologies.
    (Peter Zijlstra)
 
 Misc cleanups & fixes:
 
  - Header cleanups (Menglong Dong)
 
  - Fix race in push_dl_task() (Harshit Agarwal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmjWn0IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i9ng/+KJYEsNilPA6nGzZLurU3HXm6S5NrNBPc
 xJU43eo3QdFUymKqYaCoXCwaMah3m8fFW+DC8ZoEvWC5wHnlIw6+u/ZEtsWQ4R8N
 8+sjQddQeb9Gez+nkC7pXX8h1nqlhHyGWU88+9hOyEEMk21KgH7tJ5gOuGQdPhiA
 v24D8dkS1sT6CAeqZAycYIkos+kfNNV+wAEQCXAxfvSSslpzJVZcj9kdQInxF4O4
 9+WrvFZFVYJWdJgmgfbpBMw7N8a4Gpv+Lr6U0ZVXHNeLjNdn60I+5Me+9BW9GRcO
 pPL50RfGgGgVIqyEHvBhhz8p0tlKUJo9NkRJ9hmNB5SKT3P442SU789ppTYgI1il
 5P4g3TiuomqPVuo99/mYHYOpT6NkYC1fkwMP9Hfk4NRUX0EA6lKXelVnMXaC3Z7R
 U+0xVYtK4BBLRFBo4HIEM1HChSIcOnutuufFX4lPPt+ewnAmJwSt619w+lBF0v+n
 cpiLIlQ74LrYlrULw3bznnwzh5dV8XHm4CT3XAShm6qB97/SaLr0rI5W7njdSvte
 ym9z4yFWidawowvLWjFX1CykSnX/T5MV+uzuIH64MEIA39B4VKJWCAC3l3AMJn/5
 Ckhf93B5uG0q+wUtE66x/JrN7wtbz9PMpwh01fXNWs/eWc1VKkVrvq+PmHODngmz
 drSUoI1P570=
 =8ZTZ
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core scheduler changes:

   - Make migrate_{en,dis}able() inline, to improve performance
     (Menglong Dong)

   - Move STDL_INIT() functions out-of-line (Peter Zijlstra)

   - Unify the SCHED_{SMT,CLUSTER,MC} Kconfig (Peter Zijlstra)

  Fair scheduling:

   - Defer throttling to when tasks exit to user-space, to reduce the
     chance & impact of throttle-preemption with held locks and other
     resources (Aaron Lu, Valentin Schneider)

   - Get rid of sched_domains_curr_level hack for tl->cpumask(), as the
     warning was getting triggered on certain topologies (Peter
     Zijlstra)

  Misc cleanups & fixes:

   - Header cleanups (Menglong Dong)

   - Fix race in push_dl_task() (Harshit Agarwal)"

* tag 'sched-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix some typos in include/linux/preempt.h
  sched: Make migrate_{en,dis}able() inline
  rcu: Replace preempt.h with sched.h in include/linux/rcupdate.h
  arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c
  sched/fair: Do not balance task to a throttled cfs_rq
  sched/fair: Do not special case tasks in throttled hierarchy
  sched/fair: update_cfs_group() for throttled cfs_rqs
  sched/fair: Propagate load for throttled cfs_rq
  sched/fair: Get rid of throttled_lb_pair()
  sched/fair: Task based throttle time accounting
  sched/fair: Switch to task based throttle model
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Add related data structure for task based throttle
  sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig
  sched: Move STDL_INIT() functions out-of-line
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
  sched/deadline: Fix race in push_dl_task()
Linus Torvalds, 2025-09-30 10:35:11 -07:00
commit 6c7340a7a8
49 changed files with 696 additions and 506 deletions

Kbuild (13 changed lines)

@ -34,13 +34,24 @@ arch/$(SRCARCH)/kernel/asm-offsets.s: $(timeconst-file) $(bounds-file)
$(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s FORCE $(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s FORCE
$(call filechk,offsets,__ASM_OFFSETS_H__) $(call filechk,offsets,__ASM_OFFSETS_H__)
# Generate rq-offsets.h
rq-offsets-file := include/generated/rq-offsets.h
targets += kernel/sched/rq-offsets.s
kernel/sched/rq-offsets.s: $(offsets-file)
$(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
$(call filechk,offsets,__RQ_OFFSETS_H__)
# Check for missing system calls # Check for missing system calls
quiet_cmd_syscalls = CALL $< quiet_cmd_syscalls = CALL $<
cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags) cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags)
PHONY += missing-syscalls PHONY += missing-syscalls
missing-syscalls: scripts/checksyscalls.sh $(offsets-file) missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
$(call cmd,syscalls) $(call cmd,syscalls)
# Check the manual modification of atomic headers # Check the manual modification of atomic headers
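
For reference, the rq-offsets.h rule added above follows the usual asm-offsets pattern: a small C file is compiled to assembly and its DEFINE() constants are extracted by filechk into a generated header. The generator itself, kernel/sched/rq-offsets.c, is not shown in this excerpt; a minimal sketch of what such a file presumably boils down to, assuming it only needs the RQ_nr_pinned offset consumed later by this_rq_pinned() in include/linux/sched.h, is:

// SPDX-License-Identifier: GPL-2.0
/*
 * Hypothetical sketch of kernel/sched/rq-offsets.c (the real file is not
 * part of this excerpt). Built like asm-offsets.c: compiled to assembly,
 * then filechk extracts the DEFINE()s into include/generated/rq-offsets.h.
 */
#define COMPILE_OFFSETS

#include <linux/stddef.h>	/* offsetof() */
#include <linux/kbuild.h>	/* DEFINE() */
#include "sched.h"		/* kernel/sched/sched.h, for struct rq */

int main(void)
{
	/* Lets linux/sched.h touch rq->nr_pinned without seeing struct rq. */
	DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));

	return 0;
}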

View File

@ -41,6 +41,44 @@ config HOTPLUG_SMT
config SMT_NUM_THREADS_DYNAMIC config SMT_NUM_THREADS_DYNAMIC
bool bool
config ARCH_SUPPORTS_SCHED_SMT
bool
config ARCH_SUPPORTS_SCHED_CLUSTER
bool
config ARCH_SUPPORTS_SCHED_MC
bool
config SCHED_SMT
bool "SMT (Hyperthreading) scheduler support"
depends on ARCH_SUPPORTS_SCHED_SMT
default y
help
Improves the CPU scheduler's decision making when dealing with
MultiThreading at a cost of slightly increased overhead in some
places. If unsure say N here.
config SCHED_CLUSTER
bool "Cluster scheduler support"
depends on ARCH_SUPPORTS_SCHED_CLUSTER
default y
help
Cluster scheduler support improves the CPU scheduler's decision
making when dealing with machines that have clusters of CPUs.
Cluster usually means a couple of CPUs which are placed closely
by sharing mid-level caches, last-level cache tags or internal
busses.
config SCHED_MC
bool "Multi-Core Cache (MC) scheduler support"
depends on ARCH_SUPPORTS_SCHED_MC
default y
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
# Selected by HOTPLUG_CORE_SYNC_DEAD or HOTPLUG_CORE_SYNC_FULL # Selected by HOTPLUG_CORE_SYNC_DEAD or HOTPLUG_CORE_SYNC_FULL
config HOTPLUG_CORE_SYNC config HOTPLUG_CORE_SYNC
bool bool

View File

@ -4,6 +4,7 @@
* This code generates raw asm output which is post-processed to extract * This code generates raw asm output which is post-processed to extract
* and format the required data. * and format the required data.
*/ */
#define COMPILE_OFFSETS
#include <linux/types.h> #include <linux/types.h>
#include <linux/stddef.h> #include <linux/stddef.h>

View File

@ -2,6 +2,7 @@
/* /*
* Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com) * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
*/ */
#define COMPILE_OFFSETS
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/mm.h> #include <linux/mm.h>

View File

@ -941,28 +941,14 @@ config IRQSTACKS
config ARM_CPU_TOPOLOGY config ARM_CPU_TOPOLOGY
bool "Support cpu topology definition" bool "Support cpu topology definition"
depends on SMP && CPU_V7 depends on SMP && CPU_V7
select ARCH_SUPPORTS_SCHED_MC
select ARCH_SUPPORTS_SCHED_SMT
default y default y
help help
Support ARM cpu topology definition. The MPIDR register defines Support ARM cpu topology definition. The MPIDR register defines
affinity between processors which is then used to describe the cpu affinity between processors which is then used to describe the cpu
topology of an ARM System. topology of an ARM System.
config SCHED_MC
bool "Multi-core scheduler support"
depends on ARM_CPU_TOPOLOGY
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config SCHED_SMT
bool "SMT scheduler support"
depends on ARM_CPU_TOPOLOGY
help
Improves the CPU scheduler's decision making when dealing with
MultiThreading at a cost of slightly increased overhead in some
places. If unsure say N here.
config HAVE_ARM_SCU config HAVE_ARM_SCU
bool bool
help help

View File

@ -7,6 +7,8 @@
* This code generates raw asm output which is post-processed to extract * This code generates raw asm output which is post-processed to extract
* and format the required data. * and format the required data.
*/ */
#define COMPILE_OFFSETS
#include <linux/compiler.h> #include <linux/compiler.h>
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/mm.h> #include <linux/mm.h>

View File

@ -108,6 +108,9 @@ config ARM64
select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_RT select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SCHED_SMT
select ARCH_SUPPORTS_SCHED_CLUSTER
select ARCH_SUPPORTS_SCHED_MC
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_DEFAULT_BPF_JIT
@ -1507,29 +1510,6 @@ config CPU_LITTLE_ENDIAN
endchoice endchoice
config SCHED_MC
bool "Multi-core scheduler support"
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config SCHED_CLUSTER
bool "Cluster scheduler support"
help
Cluster scheduler support improves the CPU scheduler's decision
making when dealing with machines that have clusters of CPUs.
Cluster usually means a couple of CPUs which are placed closely
by sharing mid-level caches, last-level cache tags or internal
busses.
config SCHED_SMT
bool "SMT scheduler support"
help
Improves the CPU scheduler's decision making when dealing with
MultiThreading at a cost of slightly increased overhead in some
places. If unsure say N here.
config NR_CPUS config NR_CPUS
int "Maximum number of CPUs (2-4096)" int "Maximum number of CPUs (2-4096)"
range 2 4096 range 2 4096

View File

@ -6,6 +6,7 @@
* 2001-2002 Keith Owens * 2001-2002 Keith Owens
* Copyright (C) 2012 ARM Ltd. * Copyright (C) 2012 ARM Ltd.
*/ */
#define COMPILE_OFFSETS
#include <linux/arm_sdei.h> #include <linux/arm_sdei.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0 // SPDX-License-Identifier: GPL-2.0
// Copyright (C) 2018 Hangzhou C-SKY Microsystems co.,ltd. // Copyright (C) 2018 Hangzhou C-SKY Microsystems co.,ltd.
#define COMPILE_OFFSETS
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/kernel_stat.h> #include <linux/kernel_stat.h>

View File

@ -8,6 +8,7 @@
* *
* Copyright (c) 2010-2012, The Linux Foundation. All rights reserved. * Copyright (c) 2010-2012, The Linux Foundation. All rights reserved.
*/ */
#define COMPILE_OFFSETS
#include <linux/compat.h> #include <linux/compat.h>
#include <linux/types.h> #include <linux/types.h>

View File

@ -70,6 +70,8 @@ config LOONGARCH
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_RT select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_USE_MEMTEST select ARCH_USE_MEMTEST
@ -452,23 +454,6 @@ config EFI_STUB
This kernel feature allows the kernel to be loaded directly by This kernel feature allows the kernel to be loaded directly by
EFI firmware without the use of a bootloader. EFI firmware without the use of a bootloader.
config SCHED_SMT
bool "SMT scheduler support"
depends on SMP
default y
help
Improves scheduler's performance when there are multiple
threads in one physical core.
config SCHED_MC
bool "Multi-core scheduler support"
depends on SMP
default y
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places.
config SMP config SMP
bool "Multi-Processing support" bool "Multi-Processing support"
help help

View File

@ -4,6 +4,8 @@
* *
* Copyright (C) 2020-2022 Loongson Technology Corporation Limited * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
*/ */
#define COMPILE_OFFSETS
#include <linux/types.h> #include <linux/types.h>
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/mm.h> #include <linux/mm.h>

View File

@ -9,6 +9,7 @@
* #defines from the assembly-language output. * #defines from the assembly-language output.
*/ */
#define COMPILE_OFFSETS
#define ASM_OFFSETS_C #define ASM_OFFSETS_C
#include <linux/stddef.h> #include <linux/stddef.h>

View File

@ -7,6 +7,7 @@
* License. See the file "COPYING" in the main directory of this archive * License. See the file "COPYING" in the main directory of this archive
* for more details. * for more details.
*/ */
#define COMPILE_OFFSETS
#include <linux/init.h> #include <linux/init.h>
#include <linux/stddef.h> #include <linux/stddef.h>

View File

@ -2223,7 +2223,7 @@ config MIPS_MT_SMP
select SMP select SMP
select SMP_UP select SMP_UP
select SYS_SUPPORTS_SMP select SYS_SUPPORTS_SMP
select SYS_SUPPORTS_SCHED_SMT select ARCH_SUPPORTS_SCHED_SMT
select MIPS_PERF_SHARED_TC_COUNTERS select MIPS_PERF_SHARED_TC_COUNTERS
help help
This is a kernel model which is known as SMVP. This is supported This is a kernel model which is known as SMVP. This is supported
@ -2235,18 +2235,6 @@ config MIPS_MT_SMP
config MIPS_MT config MIPS_MT
bool bool
config SCHED_SMT
bool "SMT (multithreading) scheduler support"
depends on SYS_SUPPORTS_SCHED_SMT
default n
help
SMT scheduler support improves the CPU scheduler's decision making
when dealing with MIPS MT enabled cores at a cost of slightly
increased overhead in some places. If unsure say N here.
config SYS_SUPPORTS_SCHED_SMT
bool
config SYS_SUPPORTS_MULTITHREADING config SYS_SUPPORTS_MULTITHREADING
bool bool
@ -2318,7 +2306,7 @@ config MIPS_CPS
select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
select SYNC_R4K if (CEVT_R4K || CSRC_R4K) select SYNC_R4K if (CEVT_R4K || CSRC_R4K)
select SYS_SUPPORTS_HOTPLUG_CPU select SYS_SUPPORTS_HOTPLUG_CPU
select SYS_SUPPORTS_SCHED_SMT if CPU_MIPSR6 select ARCH_SUPPORTS_SCHED_SMT if CPU_MIPSR6
select SYS_SUPPORTS_SMP select SYS_SUPPORTS_SMP
select WEAK_ORDERING select WEAK_ORDERING
select GENERIC_IRQ_MIGRATION if HOTPLUG_CPU select GENERIC_IRQ_MIGRATION if HOTPLUG_CPU

View File

@ -9,6 +9,8 @@
* Kevin Kissell, kevink@mips.com and Carsten Langgaard, carstenl@mips.com * Kevin Kissell, kevink@mips.com and Carsten Langgaard, carstenl@mips.com
* Copyright (C) 2000 MIPS Technologies, Inc. * Copyright (C) 2000 MIPS Technologies, Inc.
*/ */
#define COMPILE_OFFSETS
#include <linux/compat.h> #include <linux/compat.h>
#include <linux/types.h> #include <linux/types.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -2,6 +2,7 @@
/* /*
* Copyright (C) 2011 Tobias Klauser <tklauser@distanz.ch> * Copyright (C) 2011 Tobias Klauser <tklauser@distanz.ch>
*/ */
#define COMPILE_OFFSETS
#include <linux/stddef.h> #include <linux/stddef.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -18,6 +18,7 @@
* compile this file to assembler, and then extract the * compile this file to assembler, and then extract the
* #defines from the assembly-language output. * #defines from the assembly-language output.
*/ */
#define COMPILE_OFFSETS
#include <linux/signal.h> #include <linux/signal.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -44,6 +44,7 @@ config PARISC
select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_HAVE_NMI_SAFE_CMPXCHG
select GENERIC_SMP_IDLE_THREAD select GENERIC_SMP_IDLE_THREAD
select GENERIC_ARCH_TOPOLOGY if SMP select GENERIC_ARCH_TOPOLOGY if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP && PA8X00
select GENERIC_CPU_DEVICES if !SMP select GENERIC_CPU_DEVICES if !SMP
select GENERIC_LIB_DEVMEM_IS_ALLOWED select GENERIC_LIB_DEVMEM_IS_ALLOWED
select SYSCTL_ARCH_UNALIGN_ALLOW select SYSCTL_ARCH_UNALIGN_ALLOW
@ -319,14 +320,6 @@ config SMP
If you don't know what to do here, say N. If you don't know what to do here, say N.
config SCHED_MC
bool "Multi-core scheduler support"
depends on GENERIC_ARCH_TOPOLOGY && PA8X00
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config IRQSTACKS config IRQSTACKS
bool "Use separate kernel stacks when processing interrupts" bool "Use separate kernel stacks when processing interrupts"
default y default y

View File

@ -13,6 +13,7 @@
* Copyright (C) 2002 Randolph Chung <tausq with parisc-linux.org> * Copyright (C) 2002 Randolph Chung <tausq with parisc-linux.org>
* Copyright (C) 2003 James Bottomley <jejb at parisc-linux.org> * Copyright (C) 2003 James Bottomley <jejb at parisc-linux.org>
*/ */
#define COMPILE_OFFSETS
#include <linux/types.h> #include <linux/types.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -170,6 +170,9 @@ config PPC
select ARCH_STACKWALK select ARCH_STACKWALK
select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC_BOOK3S || PPC_8xx select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC_BOOK3S || PPC_8xx
select ARCH_SUPPORTS_SCHED_MC if SMP
select ARCH_SUPPORTS_SCHED_SMT if PPC64 && SMP
select SCHED_MC if ARCH_SUPPORTS_SCHED_MC
select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64 select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_MEMTEST select ARCH_USE_MEMTEST
@ -965,14 +968,6 @@ config PPC_PROT_SAO_LPAR
config PPC_COPRO_BASE config PPC_COPRO_BASE
bool bool
config SCHED_SMT
bool "SMT (Hyperthreading) scheduler support"
depends on PPC64 && SMP
help
SMT scheduler support improves the CPU scheduler's decision making
when dealing with POWER5 cpus at a cost of slightly increased
overhead in some places. If unsure say N here.
config PPC_DENORMALISATION config PPC_DENORMALISATION
bool "PowerPC denormalisation exception handling" bool "PowerPC denormalisation exception handling"
depends on PPC_BOOK3S_64 depends on PPC_BOOK3S_64

View File

@ -131,6 +131,8 @@ static inline int cpu_to_coregroup_id(int cpu)
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
#include <asm/cputable.h> #include <asm/cputable.h>
struct cpumask *cpu_coregroup_mask(int cpu);
#ifdef CONFIG_PPC64 #ifdef CONFIG_PPC64
#include <asm/smp.h> #include <asm/smp.h>

View File

@ -8,6 +8,7 @@
* compile this file to assembler, and then extract the * compile this file to assembler, and then extract the
* #defines from the assembly-language output. * #defines from the assembly-language output.
*/ */
#define COMPILE_OFFSETS
#include <linux/compat.h> #include <linux/compat.h>
#include <linux/signal.h> #include <linux/signal.h>

View File

@ -1028,19 +1028,19 @@ static int powerpc_shared_proc_flags(void)
* We can't just pass cpu_l2_cache_mask() directly because * We can't just pass cpu_l2_cache_mask() directly because
* returns a non-const pointer and the compiler barfs on that. * returns a non-const pointer and the compiler barfs on that.
*/ */
static const struct cpumask *shared_cache_mask(int cpu) static const struct cpumask *tl_cache_mask(struct sched_domain_topology_level *tl, int cpu)
{ {
return per_cpu(cpu_l2_cache_map, cpu); return per_cpu(cpu_l2_cache_map, cpu);
} }
#ifdef CONFIG_SCHED_SMT #ifdef CONFIG_SCHED_SMT
static const struct cpumask *smallcore_smt_mask(int cpu) static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_level *tl, int cpu)
{ {
return cpu_smallcore_mask(cpu); return cpu_smallcore_mask(cpu);
} }
#endif #endif
static struct cpumask *cpu_coregroup_mask(int cpu) struct cpumask *cpu_coregroup_mask(int cpu)
{ {
return per_cpu(cpu_coregroup_map, cpu); return per_cpu(cpu_coregroup_map, cpu);
} }
@ -1054,11 +1054,6 @@ static bool has_coregroup_support(void)
return coregroup_enabled; return coregroup_enabled;
} }
static const struct cpumask *cpu_mc_mask(int cpu)
{
return cpu_coregroup_mask(cpu);
}
static int __init init_big_cores(void) static int __init init_big_cores(void)
{ {
int cpu; int cpu;
@ -1448,7 +1443,7 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
return false; return false;
} }
cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu)); cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
/* Update l2-cache mask with all the CPUs that are part of submask */ /* Update l2-cache mask with all the CPUs that are part of submask */
or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
@ -1538,7 +1533,7 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask)
return; return;
} }
cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu)); cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
/* Update coregroup mask with all the CPUs that are part of submask */ /* Update coregroup mask with all the CPUs that are part of submask */
or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask); or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
@ -1601,7 +1596,7 @@ static void add_cpu_to_masks(int cpu)
/* If chip_id is -1; limit the cpu_core_mask to within PKG */ /* If chip_id is -1; limit the cpu_core_mask to within PKG */
if (chip_id == -1) if (chip_id == -1)
cpumask_and(mask, mask, cpu_cpu_mask(cpu)); cpumask_and(mask, mask, cpu_node_mask(cpu));
for_each_cpu(i, mask) { for_each_cpu(i, mask) {
if (chip_id == cpu_to_chip_id(i)) { if (chip_id == cpu_to_chip_id(i)) {
@ -1701,22 +1696,22 @@ static void __init build_sched_topology(void)
if (has_big_cores) { if (has_big_cores) {
pr_info("Big cores detected but using small core scheduling\n"); pr_info("Big cores detected but using small core scheduling\n");
powerpc_topology[i++] = powerpc_topology[i++] =
SDTL_INIT(smallcore_smt_mask, powerpc_smt_flags, SMT); SDTL_INIT(tl_smallcore_smt_mask, powerpc_smt_flags, SMT);
} else { } else {
powerpc_topology[i++] = SDTL_INIT(cpu_smt_mask, powerpc_smt_flags, SMT); powerpc_topology[i++] = SDTL_INIT(tl_smt_mask, powerpc_smt_flags, SMT);
} }
#endif #endif
if (shared_caches) { if (shared_caches) {
powerpc_topology[i++] = powerpc_topology[i++] =
SDTL_INIT(shared_cache_mask, powerpc_shared_cache_flags, CACHE); SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
} }
if (has_coregroup_support()) { if (has_coregroup_support()) {
powerpc_topology[i++] = powerpc_topology[i++] =
SDTL_INIT(cpu_mc_mask, powerpc_shared_proc_flags, MC); SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
} }
powerpc_topology[i++] = SDTL_INIT(cpu_cpu_mask, powerpc_shared_proc_flags, PKG); powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);
/* There must be one trailing NULL entry left. */ /* There must be one trailing NULL entry left. */
BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1); BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);

View File

@ -74,6 +74,7 @@ config RISCV
select ARCH_SUPPORTS_PER_VMA_LOCK if MMU select ARCH_SUPPORTS_PER_VMA_LOCK if MMU
select ARCH_SUPPORTS_RT select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_SCHED_MC if SMP
select ARCH_USE_CMPXCHG_LOCKREF if 64BIT select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
select ARCH_USE_MEMTEST select ARCH_USE_MEMTEST
select ARCH_USE_QUEUED_RWLOCKS select ARCH_USE_QUEUED_RWLOCKS
@ -455,14 +456,6 @@ config SMP
If you don't know what to do here, say N. If you don't know what to do here, say N.
config SCHED_MC
bool "Multi-core scheduler support"
depends on SMP
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config NR_CPUS config NR_CPUS
int "Maximum number of CPUs (2-512)" int "Maximum number of CPUs (2-512)"
depends on SMP depends on SMP

View File

@ -3,6 +3,7 @@
* Copyright (C) 2012 Regents of the University of California * Copyright (C) 2012 Regents of the University of California
* Copyright (C) 2017 SiFive * Copyright (C) 2017 SiFive
*/ */
#define COMPILE_OFFSETS
#include <linux/kbuild.h> #include <linux/kbuild.h>
#include <linux/mm.h> #include <linux/mm.h>

View File

@ -554,15 +554,11 @@ config NODES_SHIFT
depends on NUMA depends on NUMA
default "1" default "1"
config SCHED_SMT
def_bool n
config SCHED_MC
def_bool n
config SCHED_TOPOLOGY config SCHED_TOPOLOGY
def_bool y def_bool y
prompt "Topology scheduler support" prompt "Topology scheduler support"
select ARCH_SUPPORTS_SCHED_SMT
select ARCH_SUPPORTS_SCHED_MC
select SCHED_SMT select SCHED_SMT
select SCHED_MC select SCHED_MC
help help

View File

@ -4,6 +4,7 @@
* This code generates raw asm output which is post-processed to extract * This code generates raw asm output which is post-processed to extract
* and format the required data. * and format the required data.
*/ */
#define COMPILE_OFFSETS
#include <linux/kbuild.h> #include <linux/kbuild.h>
#include <linux/sched.h> #include <linux/sched.h>

View File

@ -509,33 +509,27 @@ int topology_cpu_init(struct cpu *cpu)
return rc; return rc;
} }
static const struct cpumask *cpu_thread_mask(int cpu)
{
return &cpu_topology[cpu].thread_mask;
}
const struct cpumask *cpu_coregroup_mask(int cpu) const struct cpumask *cpu_coregroup_mask(int cpu)
{ {
return &cpu_topology[cpu].core_mask; return &cpu_topology[cpu].core_mask;
} }
static const struct cpumask *cpu_book_mask(int cpu) static const struct cpumask *tl_book_mask(struct sched_domain_topology_level *tl, int cpu)
{ {
return &cpu_topology[cpu].book_mask; return &cpu_topology[cpu].book_mask;
} }
static const struct cpumask *cpu_drawer_mask(int cpu) static const struct cpumask *tl_drawer_mask(struct sched_domain_topology_level *tl, int cpu)
{ {
return &cpu_topology[cpu].drawer_mask; return &cpu_topology[cpu].drawer_mask;
} }
static struct sched_domain_topology_level s390_topology[] = { static struct sched_domain_topology_level s390_topology[] = {
SDTL_INIT(cpu_thread_mask, cpu_smt_flags, SMT), SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
SDTL_INIT(cpu_coregroup_mask, cpu_core_flags, MC), SDTL_INIT(tl_mc_mask, cpu_core_flags, MC),
SDTL_INIT(cpu_book_mask, NULL, BOOK), SDTL_INIT(tl_book_mask, NULL, BOOK),
SDTL_INIT(cpu_drawer_mask, NULL, DRAWER), SDTL_INIT(tl_drawer_mask, NULL, DRAWER),
SDTL_INIT(cpu_cpu_mask, NULL, PKG), SDTL_INIT(tl_pkg_mask, NULL, PKG),
{ NULL, }, { NULL, },
}; };

View File

@ -8,6 +8,7 @@
* compile this file to assembler, and then extract the * compile this file to assembler, and then extract the
* #defines from the assembly-language output. * #defines from the assembly-language output.
*/ */
#define COMPILE_OFFSETS
#include <linux/stddef.h> #include <linux/stddef.h>
#include <linux/types.h> #include <linux/types.h>

View File

@ -110,6 +110,8 @@ config SPARC64
select HAVE_SETUP_PER_CPU_AREA select HAVE_SETUP_PER_CPU_AREA
select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_EMBED_FIRST_CHUNK
select NEED_PER_CPU_PAGE_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
config ARCH_PROC_KCORE_TEXT config ARCH_PROC_KCORE_TEXT
def_bool y def_bool y
@ -288,24 +290,6 @@ if SPARC64 || COMPILE_TEST
source "kernel/power/Kconfig" source "kernel/power/Kconfig"
endif endif
config SCHED_SMT
bool "SMT (Hyperthreading) scheduler support"
depends on SPARC64 && SMP
default y
help
SMT scheduler support improves the CPU scheduler's decision making
when dealing with SPARC cpus at a cost of slightly increased overhead
in some places. If unsure say N here.
config SCHED_MC
bool "Multi-core scheduler support"
depends on SPARC64 && SMP
default y
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config CMDLINE_BOOL config CMDLINE_BOOL
bool "Default bootloader kernel arguments" bool "Default bootloader kernel arguments"
depends on SPARC64 depends on SPARC64

View File

@ -10,6 +10,7 @@
* *
* On sparc, thread_info data is static and TI_XXX offsets are computed by hand. * On sparc, thread_info data is static and TI_XXX offsets are computed by hand.
*/ */
#define COMPILE_OFFSETS
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/mm_types.h> #include <linux/mm_types.h>

View File

@ -1 +1,3 @@
#define COMPILE_OFFSETS
#include <sysdep/kernel-offsets.h> #include <sysdep/kernel-offsets.h>

View File

@ -330,6 +330,10 @@ config X86
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
select ARCH_SUPPORTS_PT_RECLAIM if X86_64 select ARCH_SUPPORTS_PT_RECLAIM if X86_64
select ARCH_SUPPORTS_SCHED_SMT if SMP
select SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_CLUSTER if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
config INSTRUCTION_DECODER config INSTRUCTION_DECODER
def_bool y def_bool y
@ -1031,29 +1035,6 @@ config NR_CPUS
This is purely to save memory: each supported CPU adds about 8KB This is purely to save memory: each supported CPU adds about 8KB
to the kernel image. to the kernel image.
config SCHED_CLUSTER
bool "Cluster scheduler support"
depends on SMP
default y
help
Cluster scheduler support improves the CPU scheduler's decision
making when dealing with machines that have clusters of CPUs.
Cluster usually means a couple of CPUs which are placed closely
by sharing mid-level caches, last-level cache tags or internal
busses.
config SCHED_SMT
def_bool y if SMP
config SCHED_MC
def_bool y
prompt "Multi-core scheduler support"
depends on SMP
help
Multi-core scheduler support improves the CPU scheduler's decision
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config SCHED_MC_PRIO config SCHED_MC_PRIO
bool "CPU core priorities scheduler support" bool "CPU core priorities scheduler support"
depends on SCHED_MC depends on SCHED_MC

View File

@ -479,14 +479,14 @@ static int x86_cluster_flags(void)
static bool x86_has_numa_in_package; static bool x86_has_numa_in_package;
static struct sched_domain_topology_level x86_topology[] = { static struct sched_domain_topology_level x86_topology[] = {
SDTL_INIT(cpu_smt_mask, cpu_smt_flags, SMT), SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
#ifdef CONFIG_SCHED_CLUSTER #ifdef CONFIG_SCHED_CLUSTER
SDTL_INIT(cpu_clustergroup_mask, x86_cluster_flags, CLS), SDTL_INIT(tl_cls_mask, x86_cluster_flags, CLS),
#endif #endif
#ifdef CONFIG_SCHED_MC #ifdef CONFIG_SCHED_MC
SDTL_INIT(cpu_coregroup_mask, x86_core_flags, MC), SDTL_INIT(tl_mc_mask, x86_core_flags, MC),
#endif #endif
SDTL_INIT(cpu_cpu_mask, x86_sched_itmt_flags, PKG), SDTL_INIT(tl_pkg_mask, x86_sched_itmt_flags, PKG),
{ NULL }, { NULL },
}; };

View File

@ -11,6 +11,7 @@
* *
* Chris Zankel <chris@zankel.net> * Chris Zankel <chris@zankel.net>
*/ */
#define COMPILE_OFFSETS
#include <asm/processor.h> #include <asm/processor.h>
#include <asm/coprocessor.h> #include <asm/coprocessor.h>

View File

@ -372,7 +372,7 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
/* /*
* Migrate-Disable and why it is undesired. * Migrate-Disable and why it is undesired.
* *
* When a preempted task becomes elegible to run under the ideal model (IOW it * When a preempted task becomes eligible to run under the ideal model (IOW it
* becomes one of the M highest priority tasks), it might still have to wait * becomes one of the M highest priority tasks), it might still have to wait
* for the preemptee's migrate_disable() section to complete. Thereby suffering * for the preemptee's migrate_disable() section to complete. Thereby suffering
* a reduction in bandwidth in the exact duration of the migrate_disable() * a reduction in bandwidth in the exact duration of the migrate_disable()
@ -387,7 +387,7 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
* - a lower priority tasks; which under preempt_disable() could've instantly * - a lower priority tasks; which under preempt_disable() could've instantly
* migrated away when another CPU becomes available, is now constrained * migrated away when another CPU becomes available, is now constrained
* by the ability to push the higher priority task away, which might itself be * by the ability to push the higher priority task away, which might itself be
* in a migrate_disable() section, reducing it's available bandwidth. * in a migrate_disable() section, reducing its available bandwidth.
* *
* IOW it trades latency / moves the interference term, but it stays in the * IOW it trades latency / moves the interference term, but it stays in the
* system, and as long as it remains unbounded, the system is not fully * system, and as long as it remains unbounded, the system is not fully
@ -399,7 +399,7 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
* PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
* number of primitives into becoming preemptible, they would also allow * number of primitives into becoming preemptible, they would also allow
* migration. This turns out to break a bunch of per-cpu usage. To this end, * migration. This turns out to break a bunch of per-cpu usage. To this end,
* all these primitives employ migirate_disable() to restore this implicit * all these primitives employ migrate_disable() to restore this implicit
* assumption. * assumption.
* *
* This is a 'temporary' work-around at best. The correct solution is getting * This is a 'temporary' work-around at best. The correct solution is getting
@ -407,7 +407,7 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
* per-cpu locking or short preempt-disable regions. * per-cpu locking or short preempt-disable regions.
* *
* The end goal must be to get rid of migrate_disable(), alternatively we need * The end goal must be to get rid of migrate_disable(), alternatively we need
* a schedulability theory that does not depend on abritrary migration. * a schedulability theory that does not depend on arbitrary migration.
* *
* *
* Notes on the implementation. * Notes on the implementation.
@ -424,8 +424,6 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
* work-conserving schedulers. * work-conserving schedulers.
* *
*/ */
extern void migrate_disable(void);
extern void migrate_enable(void);
/** /**
* preempt_disable_nested - Disable preemption inside a normally preempt disabled section * preempt_disable_nested - Disable preemption inside a normally preempt disabled section
@ -471,7 +469,6 @@ static __always_inline void preempt_enable_nested(void)
DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable()) DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace()) DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
#ifdef CONFIG_PREEMPT_DYNAMIC #ifdef CONFIG_PREEMPT_DYNAMIC

View File

@ -24,7 +24,7 @@
#include <linux/compiler.h> #include <linux/compiler.h>
#include <linux/atomic.h> #include <linux/atomic.h>
#include <linux/irqflags.h> #include <linux/irqflags.h>
#include <linux/preempt.h> #include <linux/sched.h>
#include <linux/bottom_half.h> #include <linux/bottom_half.h>
#include <linux/lockdep.h> #include <linux/lockdep.h>
#include <linux/cleanup.h> #include <linux/cleanup.h>

View File

@ -49,6 +49,9 @@
#include <linux/tracepoint-defs.h> #include <linux/tracepoint-defs.h>
#include <linux/unwind_deferred_types.h> #include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h> #include <asm/kmap_size.h>
#ifndef COMPILE_OFFSETS
#include <generated/rq-offsets.h>
#endif
/* task_struct member predeclarations (sorted alphabetically): */ /* task_struct member predeclarations (sorted alphabetically): */
struct audit_context; struct audit_context;
@ -881,6 +884,11 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED #ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group; struct task_group *sched_task_group;
#ifdef CONFIG_CFS_BANDWIDTH
struct callback_head sched_throttle_work;
struct list_head throttle_node;
bool throttled;
#endif
#endif #endif
@ -2310,4 +2318,114 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0) #define alloc_tag_restore(_tag, _old) do {} while (0)
#endif #endif
#ifndef MODULE
#ifndef COMPILE_OFFSETS
extern void ___migrate_enable(void);
struct rq;
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
/*
* The "struct rq" is not available here, so we can't access the
* "runqueues" with this_cpu_ptr(), as the compilation will fail in
* this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
* typeof((ptr) + 0)
*
* So use arch_raw_cpu_ptr()/PERCPU_PTR() directly here.
*/
#ifdef CONFIG_SMP
#define this_rq_raw() arch_raw_cpu_ptr(&runqueues)
#else
#define this_rq_raw() PERCPU_PTR(&runqueues)
#endif
#define this_rq_pinned() (*(unsigned int *)((void *)this_rq_raw() + RQ_nr_pinned))
static inline void __migrate_enable(void)
{
struct task_struct *p = current;
#ifdef CONFIG_DEBUG_PREEMPT
/*
* Check both overflow from migrate_disable() and superfluous
* migrate_enable().
*/
if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
return;
#endif
if (p->migration_disabled > 1) {
p->migration_disabled--;
return;
}
/*
* Ensure stop_task runs either before or after this, and that
* __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
*/
guard(preempt)();
if (unlikely(p->cpus_ptr != &p->cpus_mask))
___migrate_enable();
/*
* Mustn't clear migration_disabled() until cpus_ptr points back at the
* regular cpus_mask, otherwise things that race (eg.
* select_fallback_rq) get confused.
*/
barrier();
p->migration_disabled = 0;
this_rq_pinned()--;
}
static inline void __migrate_disable(void)
{
struct task_struct *p = current;
if (p->migration_disabled) {
#ifdef CONFIG_DEBUG_PREEMPT
/*
*Warn about overflow half-way through the range.
*/
WARN_ON_ONCE((s16)p->migration_disabled < 0);
#endif
p->migration_disabled++;
return;
}
guard(preempt)();
this_rq_pinned()++;
p->migration_disabled = 1;
}
#else /* !COMPILE_OFFSETS */
static inline void __migrate_disable(void) { }
static inline void __migrate_enable(void) { }
#endif /* !COMPILE_OFFSETS */
/*
* So that it is possible to not export the runqueues variable, define and
* export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
* them for the modules. The macro "INSTANTIATE_EXPORTED_MIGRATE_DISABLE" will
* be defined in kernel/sched/core.c.
*/
#ifndef INSTANTIATE_EXPORTED_MIGRATE_DISABLE
static inline void migrate_disable(void)
{
__migrate_disable();
}
static inline void migrate_enable(void)
{
__migrate_enable();
}
#else /* INSTANTIATE_EXPORTED_MIGRATE_DISABLE */
extern void migrate_disable(void);
extern void migrate_enable(void);
#endif /* INSTANTIATE_EXPORTED_MIGRATE_DISABLE */
#else /* MODULE */
extern void migrate_disable(void);
extern void migrate_enable(void);
#endif /* MODULE */
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
#endif #endif
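
With the block above, built-in (non-module, non-offsets) kernel code gets migrate_disable()/migrate_enable() as inlines, while modules keep calling the exported out-of-line versions from kernel/sched/core.c. A minimal, hypothetical caller (names invented for illustration) looks the same in both cases:

/* Hypothetical built-in example; sketch only. */
#include <linux/sched.h>
#include <linux/percpu.h>
#include <linux/cleanup.h>

static DEFINE_PER_CPU(unsigned long, demo_counter);

static void demo_bump_this_cpu(void)
{
	/*
	 * For !MODULE builds this expands to the inline fast path above
	 * (current->migration_disabled plus the per-CPU rq pinned count);
	 * from a module it still resolves to the EXPORT_SYMBOL_GPL'd
	 * functions in kernel/sched/core.c.
	 */
	guard(migrate)();
	this_cpu_inc(demo_counter);	/* task cannot migrate in this scope */
}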

View File

@ -30,33 +30,24 @@ struct sd_flag_debug {
}; };
extern const struct sd_flag_debug sd_flag_debug[]; extern const struct sd_flag_debug sd_flag_debug[];
struct sched_domain_topology_level;
#ifdef CONFIG_SCHED_SMT #ifdef CONFIG_SCHED_SMT
static inline int cpu_smt_flags(void) extern int cpu_smt_flags(void);
{ extern const struct cpumask *tl_smt_mask(struct sched_domain_topology_level *tl, int cpu);
return SD_SHARE_CPUCAPACITY | SD_SHARE_LLC;
}
#endif #endif
#ifdef CONFIG_SCHED_CLUSTER #ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void) extern int cpu_cluster_flags(void);
{ extern const struct cpumask *tl_cls_mask(struct sched_domain_topology_level *tl, int cpu);
return SD_CLUSTER | SD_SHARE_LLC;
}
#endif #endif
#ifdef CONFIG_SCHED_MC #ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void) extern int cpu_core_flags(void);
{ extern const struct cpumask *tl_mc_mask(struct sched_domain_topology_level *tl, int cpu);
return SD_SHARE_LLC;
}
#endif #endif
#ifdef CONFIG_NUMA extern const struct cpumask *tl_pkg_mask(struct sched_domain_topology_level *tl, int cpu);
static inline int cpu_numa_flags(void)
{
return SD_NUMA;
}
#endif
extern int arch_asym_cpu_priority(int cpu); extern int arch_asym_cpu_priority(int cpu);
@ -172,7 +163,7 @@ bool cpus_equal_capacity(int this_cpu, int that_cpu);
bool cpus_share_cache(int this_cpu, int that_cpu); bool cpus_share_cache(int this_cpu, int that_cpu);
bool cpus_share_resources(int this_cpu, int that_cpu); bool cpus_share_resources(int this_cpu, int that_cpu);
typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef const struct cpumask *(*sched_domain_mask_f)(struct sched_domain_topology_level *tl, int cpu);
typedef int (*sched_domain_flags_f)(void); typedef int (*sched_domain_flags_f)(void);
struct sd_data { struct sd_data {
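
To illustrate the signature change: topology-level mask callbacks now receive the sched_domain_topology_level pointer in addition to the CPU number. A hypothetical architecture table written against the new interface (tl_demo_pkg_mask and demo_topology are invented names; tl_smt_mask/cpu_smt_flags are the generic helpers declared above) might look like:

/* Hypothetical arch-side topology table; sketch only. */
#include <linux/sched/topology.h>
#include <linux/topology.h>

static const struct cpumask *
tl_demo_pkg_mask(struct sched_domain_topology_level *tl, int cpu)
{
	return cpumask_of_node(cpu_to_node(cpu));
}

static struct sched_domain_topology_level demo_topology[] = {
#ifdef CONFIG_SCHED_SMT
	SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
#endif
	SDTL_INIT(tl_demo_pkg_mask, NULL, PKG),
	{ NULL, },
};

/* An architecture would then register this with set_sched_topology(). */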

View File

@ -260,7 +260,7 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
#endif #endif
static inline const struct cpumask *cpu_cpu_mask(int cpu) static inline const struct cpumask *cpu_node_mask(int cpu)
{ {
return cpumask_of_node(cpu_to_node(cpu)); return cpumask_of_node(cpu_to_node(cpu));
} }

View File

@ -23859,6 +23859,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
BTF_SET_START(btf_id_deny) BTF_SET_START(btf_id_deny)
BTF_ID_UNUSED BTF_ID_UNUSED
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
BTF_ID(func, ___migrate_enable)
BTF_ID(func, migrate_disable) BTF_ID(func, migrate_disable)
BTF_ID(func, migrate_enable) BTF_ID(func, migrate_enable)
#endif #endif

View File

@ -7,6 +7,8 @@
* Copyright (C) 1991-2002 Linus Torvalds * Copyright (C) 1991-2002 Linus Torvalds
* Copyright (C) 1998-2024 Ingo Molnar, Red Hat * Copyright (C) 1998-2024 Ingo Molnar, Red Hat
*/ */
#define INSTANTIATE_EXPORTED_MIGRATE_DISABLE
#include <linux/sched.h>
#include <linux/highmem.h> #include <linux/highmem.h>
#include <linux/hrtimer_api.h> #include <linux/hrtimer_api.h>
#include <linux/ktime_api.h> #include <linux/ktime_api.h>
@ -2381,28 +2383,7 @@ static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
__do_set_cpus_allowed(p, &ac); __do_set_cpus_allowed(p, &ac);
} }
void migrate_disable(void) void ___migrate_enable(void)
{
struct task_struct *p = current;
if (p->migration_disabled) {
#ifdef CONFIG_DEBUG_PREEMPT
/*
*Warn about overflow half-way through the range.
*/
WARN_ON_ONCE((s16)p->migration_disabled < 0);
#endif
p->migration_disabled++;
return;
}
guard(preempt)();
this_rq()->nr_pinned++;
p->migration_disabled = 1;
}
EXPORT_SYMBOL_GPL(migrate_disable);
void migrate_enable(void)
{ {
struct task_struct *p = current; struct task_struct *p = current;
struct affinity_context ac = { struct affinity_context ac = {
@ -2410,35 +2391,19 @@ void migrate_enable(void)
.flags = SCA_MIGRATE_ENABLE, .flags = SCA_MIGRATE_ENABLE,
}; };
#ifdef CONFIG_DEBUG_PREEMPT __set_cpus_allowed_ptr(p, &ac);
/* }
* Check both overflow from migrate_disable() and superfluous EXPORT_SYMBOL_GPL(___migrate_enable);
* migrate_enable().
*/
if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
return;
#endif
if (p->migration_disabled > 1) { void migrate_disable(void)
p->migration_disabled--; {
return; __migrate_disable();
} }
EXPORT_SYMBOL_GPL(migrate_disable);
/* void migrate_enable(void)
* Ensure stop_task runs either before or after this, and that {
* __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule(). __migrate_enable();
*/
guard(preempt)();
if (p->cpus_ptr != &p->cpus_mask)
__set_cpus_allowed_ptr(p, &ac);
/*
* Mustn't clear migration_disabled() until cpus_ptr points back at the
* regular cpus_mask, otherwise things that race (eg.
* select_fallback_rq) get confused.
*/
barrier();
p->migration_disabled = 0;
this_rq()->nr_pinned--;
} }
EXPORT_SYMBOL_GPL(migrate_enable); EXPORT_SYMBOL_GPL(migrate_enable);
@ -4490,6 +4455,9 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED #ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL; p->se.cfs_rq = NULL;
#ifdef CONFIG_CFS_BANDWIDTH
init_cfs_throttle_work(p);
#endif
#endif #endif
#ifdef CONFIG_SCHEDSTATS #ifdef CONFIG_SCHEDSTATS

View File

@ -2551,6 +2551,25 @@ static int find_later_rq(struct task_struct *task)
return -1; return -1;
} }
static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
{
struct task_struct *p;
if (!has_pushable_dl_tasks(rq))
return NULL;
p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
WARN_ON_ONCE(rq->cpu != task_cpu(p));
WARN_ON_ONCE(task_current(rq, p));
WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
WARN_ON_ONCE(!task_on_rq_queued(p));
WARN_ON_ONCE(!dl_task(p));
return p;
}
/* Locks the rq it finds */ /* Locks the rq it finds */
static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq) static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
{ {
@ -2578,12 +2597,37 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
/* Retry if something changed. */ /* Retry if something changed. */
if (double_lock_balance(rq, later_rq)) { if (double_lock_balance(rq, later_rq)) {
if (unlikely(task_rq(task) != rq || /*
* double_lock_balance had to release rq->lock, in the
* meantime, task may no longer be fit to be migrated.
* Check the following to ensure that the task is
* still suitable for migration:
* 1. It is possible the task was scheduled,
* migrate_disabled was set and then got preempted,
* so we must check the task migration disable
* flag.
* 2. The CPU picked is in the task's affinity.
* 3. For throttled task (dl_task_offline_migration),
* check the following:
* - the task is not on the rq anymore (it was
* migrated)
* - the task is not on CPU anymore
* - the task is still a dl task
* - the task is not queued on the rq anymore
* 4. For the non-throttled task (push_dl_task), the
* check to ensure that this task is still at the
* head of the pushable tasks list is enough.
*/
if (unlikely(is_migration_disabled(task) ||
!cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) || !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) ||
task_on_cpu(rq, task) || (task->dl.dl_throttled &&
!dl_task(task) || (task_rq(task) != rq ||
is_migration_disabled(task) || task_on_cpu(rq, task) ||
!task_on_rq_queued(task))) { !dl_task(task) ||
!task_on_rq_queued(task))) ||
(!task->dl.dl_throttled &&
task != pick_next_pushable_dl_task(rq)))) {
double_unlock_balance(rq, later_rq); double_unlock_balance(rq, later_rq);
later_rq = NULL; later_rq = NULL;
break; break;
@ -2606,25 +2650,6 @@ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *rq)
return later_rq; return later_rq;
} }
static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
{
struct task_struct *p;
if (!has_pushable_dl_tasks(rq))
return NULL;
p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
WARN_ON_ONCE(rq->cpu != task_cpu(p));
WARN_ON_ONCE(task_current(rq, p));
WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
WARN_ON_ONCE(!task_on_rq_queued(p));
WARN_ON_ONCE(!dl_task(p));
return p;
}
/* /*
* See if the non running -deadline tasks on this rq * See if the non running -deadline tasks on this rq
* can be sent to some other CPU where they can preempt * can be sent to some other CPU where they can preempt

View File

@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
if (!gcfs_rq || !gcfs_rq->load.weight) if (!gcfs_rq || !gcfs_rq->load.weight)
return; return;
if (throttled_hierarchy(gcfs_rq))
return;
shares = calc_group_shares(gcfs_rq); shares = calc_group_shares(gcfs_rq);
if (unlikely(se->load.weight != shares)) if (unlikely(se->load.weight != shares))
reweight_entity(cfs_rq_of(se), se, shares); reweight_entity(cfs_rq_of(se), se, shares);
@ -5291,18 +5288,16 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 1) { if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq); check_enqueue_throttle(cfs_rq);
if (!throttled_hierarchy(cfs_rq)) { list_add_leaf_cfs_rq(cfs_rq);
list_add_leaf_cfs_rq(cfs_rq);
} else {
#ifdef CONFIG_CFS_BANDWIDTH #ifdef CONFIG_CFS_BANDWIDTH
if (cfs_rq->pelt_clock_throttled) {
struct rq *rq = rq_of(cfs_rq); struct rq *rq = rq_of(cfs_rq);
if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock) cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
cfs_rq->throttled_clock = rq_clock(rq); cfs_rq->throttled_clock_pelt;
if (!cfs_rq->throttled_clock_self) cfs_rq->pelt_clock_throttled = 0;
cfs_rq->throttled_clock_self = rq_clock(rq);
#endif
} }
#endif
} }
} }
@ -5341,8 +5336,6 @@ static void set_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se); struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable--; cfs_rq->h_nr_runnable--;
if (cfs_rq_throttled(cfs_rq))
break;
} }
} }
@ -5363,8 +5356,6 @@ static void clear_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se); struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable++; cfs_rq->h_nr_runnable++;
if (cfs_rq_throttled(cfs_rq))
break;
} }
} }
@ -5392,7 +5383,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* DELAY_DEQUEUE relies on spurious wakeups, special task * DELAY_DEQUEUE relies on spurious wakeups, special task
* states must not suffer spurious wakeups, excempt them. * states must not suffer spurious wakeups, excempt them.
*/ */
if (flags & DEQUEUE_SPECIAL) if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
delay = false; delay = false;
WARN_ON_ONCE(delay && se->sched_delayed); WARN_ON_ONCE(delay && se->sched_delayed);
@ -5450,8 +5441,18 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & DEQUEUE_DELAYED) if (flags & DEQUEUE_DELAYED)
finish_delayed_dequeue_entity(se); finish_delayed_dequeue_entity(se);
if (cfs_rq->nr_queued == 0) if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq); update_idle_cfs_rq_clock_pelt(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
if (throttled_hierarchy(cfs_rq)) {
struct rq *rq = rq_of(cfs_rq);
list_del_leaf_cfs_rq(cfs_rq);
cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
cfs_rq->pelt_clock_throttled = 1;
}
#endif
}
return true; return true;
} }
@ -5725,74 +5726,253 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttled; return cfs_bandwidth_used() && cfs_rq->throttled;
} }
static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
{
return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
}
/* check whether cfs_rq, or any parent, is throttled */ /* check whether cfs_rq, or any parent, is throttled */
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{ {
return cfs_bandwidth_used() && cfs_rq->throttle_count; return cfs_bandwidth_used() && cfs_rq->throttle_count;
} }
/* static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
* Ensure that neither of the group entities corresponding to src_cpu or
* dest_cpu are members of a throttled hierarchy when performing group
* load-balance operations.
*/
static inline int throttled_lb_pair(struct task_group *tg,
int src_cpu, int dest_cpu)
{ {
struct cfs_rq *src_cfs_rq, *dest_cfs_rq; return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
src_cfs_rq = tg->cfs_rq[src_cpu];
dest_cfs_rq = tg->cfs_rq[dest_cpu];
return throttled_hierarchy(src_cfs_rq) ||
throttled_hierarchy(dest_cfs_rq);
} }
static inline bool task_is_throttled(struct task_struct *p)
{
return cfs_bandwidth_used() && p->throttled;
}
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static void throttle_cfs_rq_work(struct callback_head *work)
{
struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
struct sched_entity *se;
struct cfs_rq *cfs_rq;
struct rq *rq;
WARN_ON_ONCE(p != current);
p->sched_throttle_work.next = &p->sched_throttle_work;
/*
* If task is exiting, then there won't be a return to userspace, so we
* don't have to bother with any of this.
*/
if ((p->flags & PF_EXITING))
return;
scoped_guard(task_rq_lock, p) {
se = &p->se;
cfs_rq = cfs_rq_of(se);
/* Raced, forget */
if (p->sched_class != &fair_sched_class)
return;
/*
* If not in limbo, then either replenish has happened or this
* task got migrated out of the throttled cfs_rq, move along.
*/
if (!cfs_rq->throttle_count)
return;
rq = scope.rq;
update_rq_clock(rq);
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
/*
* Must not set throttled before dequeue or dequeue will
* mistakenly regard this task as an already throttled one.
*/
p->throttled = true;
resched_curr(rq);
}
}
void init_cfs_throttle_work(struct task_struct *p)
{
init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
p->sched_throttle_work.next = &p->sched_throttle_work;
INIT_LIST_HEAD(&p->throttle_node);
}
/*
* Task is throttled and someone wants to dequeue it again:
* it could be sched/core when core needs to do things like
* task affinity change, task group change, task sched class
* change etc. and in these cases, DEQUEUE_SLEEP is not set;
* or the task is blocked after throttled due to freezer etc.
* and in these cases, DEQUEUE_SLEEP is set.
*/
static void detach_task_cfs_rq(struct task_struct *p);
static void dequeue_throttled_task(struct task_struct *p, int flags)
{
WARN_ON_ONCE(p->se.on_rq);
list_del_init(&p->throttle_node);
/* task blocked after throttled */
if (flags & DEQUEUE_SLEEP) {
p->throttled = false;
return;
}
/*
* task is migrating off its old cfs_rq, detach
* the task's load from its old cfs_rq.
*/
if (task_on_rq_migrating(p))
detach_task_cfs_rq(p);
}
static bool enqueue_throttled_task(struct task_struct *p)
{
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
/* @p should have gone through dequeue_throttled_task() first */
WARN_ON_ONCE(!list_empty(&p->throttle_node));
/*
* If the throttled task @p is enqueued to a throttled cfs_rq,
* take the fast path by directly putting the task on the
* target cfs_rq's limbo list.
*
* Do not do that when @p is current because the following race can
* cause @p's group_node to be incorectly re-insterted in its rq's
* cfs_tasks list, despite being throttled:
*
* cpuX cpuY
* p ret2user
* throttle_cfs_rq_work() sched_move_task(p)
* LOCK task_rq_lock
* dequeue_task_fair(p)
* UNLOCK task_rq_lock
* LOCK task_rq_lock
* task_current_donor(p) == true
* task_on_rq_queued(p) == true
* dequeue_task(p)
* put_prev_task(p)
* sched_change_group()
* enqueue_task(p) -> p's new cfs_rq
* is throttled, go
* fast path and skip
* actual enqueue
* set_next_task(p)
* list_move(&se->group_node, &rq->cfs_tasks); // bug
* schedule()
*
* In the above race case, @p current cfs_rq is in the same rq as
* its previous cfs_rq because sched_move_task() only moves a task
* to a different group from the same rq, so we can use its current
* cfs_rq to derive rq and test if the task is current.
*/
if (throttled_hierarchy(cfs_rq) &&
!task_current_donor(rq_of(cfs_rq), p)) {
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
return true;
}
/* we can't take the fast path, do an actual enqueue*/
p->throttled = false;
return false;
}
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
struct task_struct *p, *tmp;

if (--cfs_rq->throttle_count)
return 0;

if (cfs_rq->pelt_clock_throttled) {
cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
cfs_rq->throttled_clock_pelt;
cfs_rq->pelt_clock_throttled = 0;
}

if (cfs_rq->throttled_clock_self) {
u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;

cfs_rq->throttled_clock_self = 0;

if (WARN_ON_ONCE((s64)delta < 0))
delta = 0;

cfs_rq->throttled_clock_self_time += delta;
}

/* Re-enqueue the tasks that have been throttled at this level. */
list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
list_del_init(&p->throttle_node);
p->throttled = false;
enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
}

/* Add cfs_rq with load or one or more already running entities to the list */
if (!cfs_rq_is_decayed(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);

return 0;
}
static inline bool task_has_throttle_work(struct task_struct *p)
{
return p->sched_throttle_work.next != &p->sched_throttle_work;
}
static inline void task_throttle_setup_work(struct task_struct *p)
{
if (task_has_throttle_work(p))
return;
/*
* Kthreads and exiting tasks don't return to userspace, so adding the
* work is pointless
*/
if ((p->flags & (PF_EXITING | PF_KTHREAD)))
return;
task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}
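task_work_add() with TWA_RESUME queues the throttle callback to run when the task next returns to user space, i.e. after it has released whatever kernel resources it holds, which is the point of deferring the actual dequeue. A loose user-space analogue of "flag now, act at the next safe boundary"; everything below is invented for illustration:

#include <stdbool.h>
#include <stdio.h>

struct toy_thread {
        bool work_pending;                   /* set from anywhere in the "kernel" */
        void (*work_fn)(struct toy_thread *);
};

/* Cheap to call from a context that holds resources: just sets a flag. */
static void queue_resume_work(struct toy_thread *t, void (*fn)(struct toy_thread *))
{
        if (t->work_pending)                 /* protect against double add */
                return;
        t->work_fn = fn;
        t->work_pending = true;
}

/* Called at the simulated return-to-user boundary, with nothing held. */
static void exit_to_user(struct toy_thread *t)
{
        if (t->work_pending) {
                t->work_pending = false;
                t->work_fn(t);
        }
}

static void do_throttle(struct toy_thread *t)
{
        (void)t;
        printf("throttled at a safe point\n");
}

int main(void)
{
        struct toy_thread t = { false, NULL };

        queue_resume_work(&t, do_throttle);  /* e.g. decided at pick time */
        /* ... task keeps running in the kernel, locks held ... */
        exit_to_user(&t);                    /* work runs here */
        return 0;
}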
static void record_throttle_clock(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);
}
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

if (cfs_rq->throttle_count++)
return 0;

/*
 * For cfs_rqs that still have entities enqueued, PELT clock
 * stop happens at dequeue time when all entities are dequeued.
 */
if (!cfs_rq->nr_queued) {
list_del_leaf_cfs_rq(cfs_rq);
cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
cfs_rq->pelt_clock_throttled = 1;
}

WARN_ON_ONCE(cfs_rq->throttled_clock_self);
if (cfs_rq->nr_queued)
cfs_rq->throttled_clock_self = rq_clock(rq);

WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
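tg_throttle_down() and tg_unthrottle_up() only do real work on the 0 -> 1 and 1 -> 0 transitions of throttle_count, since a cfs_rq can be throttled by several ancestors at once; the post-increment and pre-decrement tests above implement a plain nesting counter. A tiny sketch of the pattern with toy names:

#include <stdio.h>

struct toy_rq {
        int throttle_count;
        int clock_frozen;           /* side effect applied only on the edges */
};

static void throttle_down(struct toy_rq *rq)
{
        if (rq->throttle_count++)   /* already throttled by an ancestor */
                return;
        rq->clock_frozen = 1;       /* first throttler freezes the clock */
}

static void unthrottle_up(struct toy_rq *rq)
{
        if (--rq->throttle_count)   /* some ancestor is still throttled */
                return;
        rq->clock_frozen = 0;       /* last unthrottler thaws the clock */
}

int main(void)
{
        struct toy_rq rq = { 0, 0 };

        throttle_down(&rq);         /* outer group throttles: freezes */
        throttle_down(&rq);         /* inner group throttles: no-op */
        unthrottle_up(&rq);         /* inner unthrottles: still frozen */
        printf("frozen=%d\n", rq.clock_frozen);  /* 1 */
        unthrottle_up(&rq);         /* outer unthrottles: thaws */
        printf("frozen=%d\n", rq.clock_frozen);  /* 0 */
        return 0;
}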
@@ -5800,8 +5980,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
int dequeue = 1;

raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5824,76 +6003,17 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */

/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();

/*
 * Note: distribution will already see us throttled via the
 * throttled-list. rq->lock protects completion.
 */
cfs_rq->throttled = 1;
WARN_ON_ONCE(cfs_rq->throttled_clock);
if (cfs_rq->nr_queued)
cfs_rq->throttled_clock = rq_clock(rq);
return true;
}
@@ -5901,9 +6021,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];

/*
 * It's possible we were called with !runtime_remaining because the
 * user changed the quota setting (see tg_set_cfs_bandwidth()), or an
 * async unthrottle gave us a positive runtime_remaining that other
 * still-running entities consumed before we got here.
 *
 * Either way, we can't unthrottle this cfs_rq without any runtime
 * remaining, because any enqueue in tg_unthrottle_up() would
 * immediately trigger a throttle, which is not supposed to happen
 * on the unthrottle path.
 */
if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
return;

se = cfs_rq->tg->se[cpu_of(rq)];
@@ -5933,62 +6064,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
break;
}
}

assert_list_leaf_cfs_rq(rq);

/* Determine whether we need to wake up potentially idle CPU: */
@@ -6472,6 +6549,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
}

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -6639,19 +6717,28 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
static void dequeue_throttled_task(struct task_struct *p, int flags) {}
static bool enqueue_throttled_task(struct task_struct *p) { return false; }
static void record_throttle_clock(struct cfs_rq *cfs_rq) {}

static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
return 0;
}

static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
{
return false;
}

static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
return 0;
}

static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
{
return 0;
}
@@ -6831,6 +6918,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;

if (task_is_throttled(p) && enqueue_throttled_task(p))
return;

/*
 * The code below (indirectly) updates schedutil which looks at
 * the cfs_rq utilization to select a frequency.

@@ -6883,10 +6973,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;

flags = ENQUEUE_WAKEUP;
}

@@ -6908,10 +6994,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
}

if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {

@@ -6941,7 +7023,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
check_update_overutilized_status(rq);

assert_list_leaf_cfs_rq(rq);

hrtick_update(rq);
@@ -6963,6 +7044,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
bool was_sched_idle = sched_idle_rq(rq);
bool task_sleep = flags & DEQUEUE_SLEEP;
bool task_delayed = flags & DEQUEUE_DELAYED;
bool task_throttled = flags & DEQUEUE_THROTTLE;
struct task_struct *p = NULL;
int h_nr_idle = 0;
int h_nr_queued = 0;

@@ -6996,9 +7078,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;

if (throttled_hierarchy(cfs_rq) && task_throttled)
record_throttle_clock(cfs_rq);

/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {

@@ -7010,7 +7091,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 * Bias pick_next to pick a task from this cfs_rq, as
 * p is sleeping when it is within its sched_slice.
 */
if (task_sleep && se)
set_next_buddy(se);
break;
}

@@ -7037,9 +7118,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;

if (throttled_hierarchy(cfs_rq) && task_throttled)
record_throttle_clock(cfs_rq);
}

sub_nr_running(rq, h_nr_queued);
@@ -7073,6 +7153,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 */
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
if (task_is_throttled(p)) {
dequeue_throttled_task(p, flags);
return true;
}

if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
@@ -8660,7 +8745,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 * lead to a throttle). This both saves work and prevents false
 * next-buddy nomination below.
 */
if (task_is_throttled(p))
return;

if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
@@ -8741,19 +8826,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
struct task_struct *p;
bool throttled;

again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;

throttled = false;

do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);

throttled |= check_cfs_rq_runtime(cfs_rq);

se = pick_next_entity(rq, cfs_rq);
if (!se)

@@ -8761,7 +8849,10 @@ again:
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

p = task_of(se);
if (unlikely(throttled))
task_throttle_setup_work(p);
return p;
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);

@@ -8923,8 +9014,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;

/* !se->on_rq also covers throttled tasks */
if (!se->on_rq)
return false;

/* Tell the scheduler that we'd really like se to run next. */
@@ -9283,7 +9374,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
 * We do not migrate tasks that are:
 * 1) delayed dequeued unless we migrate load, or
 * 2) target cfs_rq is in throttled hierarchy, or
 * 3) cannot be migrated to this CPU due to cpus_ptr, or
 * 4) running (obviously), or
 * 5) are cache-hot on their current CPU, or

@@ -9292,7 +9383,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;

if (lb_throttled_hierarchy(p, env->dst_cpu))
return 0;

/*
@@ -13076,10 +13167,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);

/*
 * If a task gets attached to this cfs_rq and, before being queued,
 * gets migrated to another CPU for reasons like an affinity change,
 * make sure this cfs_rq stays on the leaf cfs_rq list so that the
 * removed load is decayed; otherwise it can cause a fairness problem.
 */
if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);

/* Start to propagate at parent */

@@ -13090,10 +13184,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);

if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
}


@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
u64 throttled;

if (unlikely(cfs_rq->pelt_clock_throttled))
throttled = U64_MAX;
else
throttled = cfs_rq->throttled_clock_pelt_time;

@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
if (unlikely(cfs_rq->pelt_clock_throttled))
return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;

return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;

kernel/sched/rq-offsets.c (new file)

@@ -0,0 +1,12 @@
// SPDX-License-Identifier: GPL-2.0
#define COMPILE_OFFSETS
#include <linux/kbuild.h>
#include <linux/types.h>
#include "sched.h"
int main(void)
{
DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
return 0;
}
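rq-offsets.c is an asm-offsets style generator: kbuild compiles it and scrapes the DEFINE() output into a generated header, so the numeric offset of rq->nr_pinned becomes available to code that cannot pull in the private sched.h (presumably the now-inline migrate_enable()/migrate_disable() fast paths). A stand-alone sketch of the same offsetof() trick, with a made-up struct and a printf stand-in for the real DEFINE():

#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for struct rq; only the field we care about matters. */
struct toy_rq {
        long nr_running;
        unsigned int nr_pinned;
};

/*
 * The kernel's DEFINE() emits the constant into generated assembly that a
 * script turns into a header; here we just print it in the same shape.
 */
#define DEFINE(sym, val) printf("#define %s %zu\n", #sym, (size_t)(val))

int main(void)
{
        /* Would land in an auto-generated header for code that can't see struct toy_rq. */
        DEFINE(TOY_RQ_nr_pinned, offsetof(struct toy_rq, nr_pinned));
        return 0;
}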


@@ -760,10 +760,12 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
bool throttled:1;
bool pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
struct list_head throttled_limbo_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
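Shrinking throttled to a one-bit bool and adding pelt_clock_throttled:1 beside it lets the two flags share storage rather than growing struct cfs_rq by another int. A quick stand-alone illustration of how adjacent bool bitfields pack; exact sizes are compiler and ABI dependent, so treat the numbers as indicative only:

#include <stdbool.h>
#include <stdio.h>

struct flags_wide {
        int throttled;
        int pelt_clock_throttled;
};

struct flags_packed {
        bool throttled:1;
        bool pelt_clock_throttled:1;
};

int main(void)
{
        /* Typically 8 vs 1 byte(s) before any surrounding padding. */
        printf("wide: %zu, packed: %zu\n",
               sizeof(struct flags_wide), sizeof(struct flags_packed));
        return 0;
}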
@@ -2367,6 +2369,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SPECIAL 0x10
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
#define DEQUEUE_THROTTLE 0x800

#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02

@@ -2683,6 +2686,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
extern void init_dl_entity(struct sched_dl_entity *dl_se);

extern void init_cfs_throttle_work(struct task_struct *p);

#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
#define RATIO_SHIFT 8


@@ -1591,7 +1591,6 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
enum numa_topology_type sched_numa_topology_type;

static int sched_domains_numa_levels;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;

@@ -1632,14 +1631,7 @@ sd_init(struct sched_domain_topology_level *tl,
int sd_id, sd_weight, sd_flags = 0;
struct cpumask *sd_span;

sd_weight = cpumask_weight(tl->mask(tl, cpu));

if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();

@@ -1677,7 +1669,7 @@ sd_init(struct sched_domain_topology_level *tl,
};

sd_span = sched_domain_span(sd);
cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
sd_id = cpumask_first(sd_span);

sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);

@@ -1732,22 +1724,63 @@ sd_init(struct sched_domain_topology_level *tl,
return sd;
}
#ifdef CONFIG_SCHED_SMT
int cpu_smt_flags(void)
{
return SD_SHARE_CPUCAPACITY | SD_SHARE_LLC;
}
const struct cpumask *tl_smt_mask(struct sched_domain_topology_level *tl, int cpu)
{
return cpu_smt_mask(cpu);
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
int cpu_cluster_flags(void)
{
return SD_CLUSTER | SD_SHARE_LLC;
}
const struct cpumask *tl_cls_mask(struct sched_domain_topology_level *tl, int cpu)
{
return cpu_clustergroup_mask(cpu);
}
#endif
#ifdef CONFIG_SCHED_MC
int cpu_core_flags(void)
{
return SD_SHARE_LLC;
}
const struct cpumask *tl_mc_mask(struct sched_domain_topology_level *tl, int cpu)
{
return cpu_coregroup_mask(cpu);
}
#endif
const struct cpumask *tl_pkg_mask(struct sched_domain_topology_level *tl, int cpu)
{
return cpu_node_mask(cpu);
}
/*
 * Topology list, bottom-up.
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
#endif
#ifdef CONFIG_SCHED_CLUSTER
SDTL_INIT(tl_cls_mask, cpu_cluster_flags, CLS),
#endif
#ifdef CONFIG_SCHED_MC
SDTL_INIT(tl_mc_mask, cpu_core_flags, MC),
#endif
SDTL_INIT(tl_pkg_mask, NULL, PKG),
{ NULL, },
};
@@ -1768,10 +1801,14 @@ void __init set_sched_topology(struct sched_domain_topology_level *tl)
}

#ifdef CONFIG_NUMA

static int cpu_numa_flags(void)
{
return SD_NUMA;
}

static const struct cpumask *sd_numa_mask(struct sched_domain_topology_level *tl, int cpu)
{
return sched_domains_numa_masks[tl->numa_level][cpu_to_node(cpu)];
}

static void sched_numa_warn(const char *str)
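Passing the topology level into the mask callback is what makes the old sched_domains_curr_level global (and the hack of setting it right before the call) unnecessary: sd_numa_mask() can read tl->numa_level directly. A small sketch of that refactor shape, with invented toy types:

#include <stdio.h>

struct toy_level {
        int numa_level;
        const char *(*mask)(const struct toy_level *tl, int cpu);
};

/* Old style: needs a global set up just before the call. */
static int curr_level;
static const char *mask_via_global(const struct toy_level *tl, int cpu)
{
        (void)tl; (void)cpu;
        return curr_level ? "remote nodes" : "local node";
}

/* New style: all the state travels with the level itself. */
static const char *mask_via_tl(const struct toy_level *tl, int cpu)
{
        (void)cpu;
        return tl->numa_level ? "remote nodes" : "local node";
}

int main(void)
{
        struct toy_level tl = { .numa_level = 1, .mask = mask_via_tl };

        curr_level = tl.numa_level;   /* the extra step the refactor deletes */
        printf("old: %s\n", mask_via_global(&tl, 0));
        printf("new: %s\n", tl.mask(&tl, 0));
        return 0;
}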
@@ -2413,7 +2450,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 * breaks the linking done for an earlier span.
 */
for_each_cpu(cpu, cpu_map) {
const struct cpumask *tl_cpu_mask = tl->mask(tl, cpu);
int id;

/* lowest bit set in this mask is used as a unique id */

@@ -2421,7 +2458,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
if (cpumask_test_cpu(id, id_seen)) {
/* First CPU has already been seen, ensure identical spans */
if (!cpumask_equal(tl->mask(tl, id), tl_cpu_mask))
return false;
} else {
/* First CPU hasn't been seen before, ensure it's a completely new span */