mirror of
https://github.com/nxp-imx/linux-imx.git
synced 2025-09-03 02:16:09 +02:00
lf-6.6.y
3886 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
65457b74aa |
sched/psi: Rename existing poll members in preparation
Renaming in PSI implementation to make a clear distinction between privileged and unprivileged triggers code to be implemented in the next patch. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20230330105418.77061-3-cerasuolodomenico@gmail.com |
||
![]() |
7fab21fa0d |
sched/psi: Rearrange polling code in preparation
Move a few functions up in the file to avoid forward declaration needed in the patch implementing unprivileged PSI triggers. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20230330105418.77061-2-cerasuolodomenico@gmail.com |
||
![]() |
39afe5d6fc |
sched/fair: Fix inaccurate tally of ttwu_move_affine
There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.
When wake_affine_idle() returns prev_cpu which doesn't equal to
nr_cpumask_bits, it will slip through the check: target == nr_cpumask_bits
in wake_affine() and be counted as if target == this_cpu in schedstats.
Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.
Fixes:
|
||
![]() |
cd8fe5b6db |
Merge 6.3-rc5 into driver-core-next
We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
aa464ba9a1 |
lazy tlb: introduce lazy tlb mm refcount helper functions
Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
68e2d17c9e |
trace: Add trace_ipi_send_cpu()
Because copying cpumasks around when targeting a single CPU is a bit daft... Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net |
||
![]() |
68f4ff04db |
sched, smp: Trace smp callback causing an IPI
Context ======= The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter which so far has only been fed with NULL. While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing struct layout (meaning their callback func can be accessed without caring about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function attached to its struct. This means we need to check the type of a CSD before eventually dereferencing its associated callback. This isn't as trivial as it sounds: the CSD type is stored in __call_single_node.u_flags, which get cleared right before the callback is executed via csd_unlock(). This implies checking the CSD type before it is enqueued on the call_single_queue, as the target CPU's queue can be flushed before we get to sending an IPI. Furthermore, send_call_function_single_ipi() only has a CPU parameter, and would need to have an additional argument to trickle down the invoked function. This is somewhat silly, as the extra argument will always be pushed down to the function even when nothing is being traced, which is unnecessary overhead. Changes ======= send_call_function_single_ipi() is only used by smp.c, and is defined in sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of a CPU's idle task). Split it into two parts: the scheduler bits remain in sched/core.c, and the actual IPI emission is moved into smp.c. This lets us define an __always_inline helper function that can take the related callback as parameter without creating useless register pressure in the non-traced path which only gains a (disabled) static branch. Do the same thing for the multi IPI case. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com |
||
![]() |
cc9cb0a717 |
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
send_call_function_single_ipi() is the thing that sends IPIs at the bottom of smp_call_function*() via either generic_exec_single() or smp_call_function_many_cond(). Give it an IPI-related tracepoint. Note that this ends up tracing any IPI sent via __smp_call_single_queue(), which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free". Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com |
||
![]() |
e3ff7c609f |
livepatch,sched: Add livepatch task switching to cond_resched()
There have been reports [1][2] of live patches failing to complete within a reasonable amount of time due to CPU-bound kthreads. Fix it by patching tasks in cond_resched(). There are four different flavors of cond_resched(), depending on the kernel configuration. Hook into all of them. A more elegant solution might be to use a preempt notifier. However, non-ORC unwinders can't unwind a preempted task reliably. [1] https://lore.kernel.org/lkml/20220507174628.2086373-1-song@kernel.org/ [2] https://lkml.kernel.org/lkml/20230120-vhost-klp-switching-v1-0-7c2b65519c43@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org |
||
![]() |
41abdba937 |
sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
CPU cfs bandwidth controller uses hrtimer. Currently there is no initial value set. Hence all period timers would align at expiry. This happens when there are multiple CPU cgroup's. There is a performance gain that can be achieved here if the timers are interleaved when the utilization of each CPU cgroup is low and total utilization of all the CPU cgroup's is less than 50%. If the timers are interleaved, then the unthrottled cgroup can run freely without many context switches and can also benefit from SMT Folding. This effect will be further amplified in SPLPAR environment. This commit adds a random offset after initializing each hrtimer. This would result in interleaving the timers at expiry, which helps in achieving the said performance gain. This was tested on powerpc platform with 8 core SMT=8. Socket power was measured when the workload. Benchmarked the stress-ng with power information. Throughput oriented benchmarks show significant gain up to 25% while power consumption increases up to 15%. Workload: stress-ng --cpu=32 --cpu-ops=50000. 1CG - 1 cgroup is running. 2CG - 2 cgroups are running together. Time taken to complete stress-ng in seconds and power is in watts. each cgroup is throttled at 25% with 100ms as the period value. 6.2-rc6 | with patch 8 core 1CG power 2CG power | 1CG power 2 CG power 27.5 80.6 40 90 | 27.3 82 32.3 104 27.5 81 40.2 91 | 27.5 81 38.7 96 27.7 80 40.1 89 | 27.6 80 29.7 106 27.7 80.1 40.3 94 | 27.6 80 31.5 105 Latency might be affected by this change. That could happen if the CPU was in a deep idle state which is possible if we interleave the timers. Used schbench for measuring the latency. Each cgroup is throttled at 25% with period value is set to 100ms. Numbers are when both the cgroups are running simultaneously. Latency values don't degrade much. Some improvement is seen in tail latencies. 6.2-rc6 with patch Groups: 16 50.0th: 39.5 42.5 75.0th: 924.0 922.0 90.0th: 972.0 968.0 95.0th: 1005.5 994.0 99.0th: 4166.0 2287.0 99.5th: 7314.0 7448.0 99.9th: 15024.0 13600.0 Groups: 32 50.0th: 819.0 463.0 75.0th: 1596.0 918.0 90.0th: 5992.0 1281.5 95.0th: 13184.0 2765.0 99.0th: 21792.0 14240.0 99.5th: 25696.0 18920.0 99.9th: 33280.0 35776.0 Groups: 64 50.0th: 4806.0 3440.0 75.0th: 31136.0 33664.0 90.0th: 54144.0 58752.0 95.0th: 66176.0 67200.0 99.0th: 84736.0 91520.0 99.5th: 97408.0 114048.0 99.9th: 136448.0 140032.0 Initial RFC PATCH, discussions and details on the problem: Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/ Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/ Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230223185153.1499710-1-sshegde@linux.vnet.ibm.com |
||
![]() |
eff6c8ce8d |
sched/core: Reduce cost of sched_move_task when config autogroup
Some sched_move_task calls are useless because that task_struct->sched_task_group maybe not changed (equals task_group of cpu_cgroup) when system enable autogroup. So do some checks in sched_move_task. sched_move_task eg: task A belongs to cpu_cgroup0 and autogroup0, it will always belong to cpu_cgroup0 when do_exit. So there is no need to do {de|en}queue. The call graph is as follow. do_exit sched_autogroup_exit_task sched_move_task dequeue_task sched_change_group A.sched_task_group = sched_get_task_group (=cpu_cgroup0) enqueue_task Performance results: =========================== 1. env cpu: bogomips=4600.00 kernel: 6.3.0-rc3 cpu_cgroup: 6:cpu,cpuacct:/user.slice 2. cmds do_exit script: for i in {0..10000}; do sleep 0 & done wait Run the above script, then use the following bpftrace cmd to get the cost of sched_move_task: bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; } kr:sched_move_task /@ts[tid]/ { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }' 3. cost time(ns): without patch: 43528033 with patch: 18541416 diff:-24986617 -57.4% As the result show, the patch will save 57.4% in the scenario. Signed-off-by: wuchi <wuchi.zero@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230321064459.39421-1-wuchi.zero@gmail.com |
||
![]() |
530bfad1d5 |
sched/core: Avoid selecting the task that is throttled to run when core-sched enable
When {rt, cfs}_rq or dl task is throttled, since cookied tasks are not dequeued from the core tree, So sched_core_find() and sched_core_next() may return throttled task, which may cause throttled task to run on the CPU. So we add checks in sched_core_find() and sched_core_next() to make sure that the return is a runnable task that is not throttled. Co-developed-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230316081806.69544-1-jiahao.os@bytedance.com |
||
![]() |
d91e15a21d |
sched/topology: Make sched_energy_mutex,update static
smatch reports kernel/sched/topology.c:212:1: warning: symbol 'sched_energy_mutex' was not declared. Should it be static? kernel/sched/topology.c:213:6: warning: symbol 'sched_energy_update' was not declared. Should it be static? These variables are only used in topology.c, so should be static Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230314144818.1453523-1-trix@redhat.com |
||
![]() |
a53ce18cac |
sched/fair: Sanitize vruntime of entity being migrated
Commit |
||
![]() |
34320745df |
sched/debug: Put sched/domains files under the verbose flag
The debug files under sched/domains can take a long time to regenerate, especially when updates are done one at a time. Move these files under the sched verbose debug flag. Allow changes to verbose to trigger generation of the files. This lets a user batch the updates but still have the information available. The detailed topology printk messages are also under verbose. Discussion that lead to this approach can be found in the link below. Simplified code to maintain use of debugfs bool routines suggested by Michael Ellerman <mpe@ellerman.id.au>. Signed-off-by: Phil Auld <pauld@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Vishal Chourasia <vishalc@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vishal Chourasia <vishalc@linux.vnet.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/ Link: https://lore.kernel.org/r/20230303183754.3076321-1-pauld@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
6015b1aca1 |
sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
The getaffinity() system call uses 'cpumask_size()' to decide how big the CPU mask is - so far so good. It is indeed the allocation size of a cpumask. But the code also assumes that the whole allocation is initialized without actually doing so itself. That's wrong, because we might have fixed-size allocations (making copying and clearing more efficient), but not all of it is then necessarily used if 'nr_cpu_ids' is smaller. Having checked other users of 'cpumask_size()', they all seem to be ok, either using it purely for the allocation size, or explicitly zeroing the cpumask before using the size in bytes to copy it. See for example the ublk_ctrl_get_queue_affinity() function that uses the proper 'zalloc_cpumask_var()' to make sure that the whole mask is cleared, whether the storage is on the stack or if it was an external allocation. Fix this by just zeroing the allocation before using it. Do the same for the compat version of sched_getaffinity(), which had the same logic. Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to access the bits. For a cpumask_var_t, it ends up being a pointer to the same data either way, but it's just a good idea to treat it like you would a 'cpumask_t'. The compat case already did that. Reported-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/ Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
071c44e427 |
sched/idle: Mark arch_cpu_idle_dead() __noreturn
Before commit 076cbf5d2163 ("x86/xen: don't let xen_pv_play_dead() return"), in Xen, when a previously offlined CPU was brought back online, it unexpectedly resumed execution where it left off in the middle of the idle loop. There were some hacks to make that work, but the behavior was surprising as do_idle() doesn't expect an offlined CPU to return from the dead (in arch_cpu_idle_dead()). Now that Xen has been fixed, and the arch-specific implementations of arch_cpu_idle_dead() also don't return, give it a __noreturn attribute. This will cause the compiler to complain if an arch-specific implementation might return. It also improves code generation for both caller and callee. Also fixes the following warning: vmlinux.o: warning: objtool: do_idle+0x25f: unreachable instruction Reported-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/60d527353da8c99d4cf13b6473131d46719ed16d.1676358308.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
![]() |
dfb0f170ca |
sched/idle: Make sure weak version of arch_cpu_idle_dead() doesn't return
arch_cpu_idle_dead() should never return. Make it so. Link: https://lore.kernel.org/r/cf5ad95eef50f7704bb30e7770c59bfe23372af7.1676358308.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
![]() |
c8b4accf86 |
More power management updates for 6.3-rc1
- Fix error handling in the apple-soc cpufreq driver (Dan Carpenter). - Change the log level of a message in the amd-pstate cpufreq driver so it is more visible to users (Kai-Heng Feng). - Adjust the balance_performance EPP value for Sapphire Rapids in the intel_pstate cpufreq driver (Srinivas Pandruvada). - Remove MODULE_LICENSE from 3 pieces of non-modular code (Nick Alcock). - Make a read-only kobj_type structure in the schedutil cpufreq governor constant (Thomas Weißschuh). - Add Add Power Limit4 support for Meteor Lake SoC to the Intel RAPL power capping driver (Sumeet Pawnikar). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmQCNm0SHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRx0P8QAJdcjEg4cY/OJq9VTqiUo92mt63aqWXr vOMYgp5QDuDRPKiVMMuCJNCJSrA/xPwAz4pcSOj/59CZn1UZqzg6Ap/vdmV7pQV1 vDp6N9E2sgCL0QMQqueyv3yEN9meleGrQPnAulOZf29Q9PC0rGiAUdivvgDXZq/9 9wo+/11EzZpKyaDTfLzotIvNOkmmkp5FtxaCR64D5w3e6gehGrL/wL3FgSefDnsa fNlhxgi66FYuamSgXqQTkuIuig0Rbvlp0fmhllPaIOkNMI4o7rvP5rB7FbAKZm1E XI+M3aVlZsImPpEEJ1dTqbc4y9WU9HakLfRRSiUnnWHSXpwBI8ncEwP0oulqqVoF elA9kd7Sv1DwLiUGMy3GaOwscTN6NDUwICH4UJeISWlMZfrP7YR3Bb7gVtzlCrKC b99oA92OFazzWliYPSzzSsFQTI1PczNLZKnelzgF1+5g76q3sIUa3tu6Ts6Z84UK rHCLDVw8TUFpsEQxKOvM3oLUdpvE0mmQpdG8fSeHVxon47jVzmkbDjgOBFp1i19F HBI7LfOhDkkzO35qb6+DfkQoij0mpn8ldbVVLd1XQe4F0WjfSe8D4oGTgB3ErTK7 8cOKGoez9FsQRHbSVCaZcaTSemDHdB3UMRUHG2hvU1jbwXOaQcLfMAF5McyECJiN U8uI28JVGVPI =lRKS -----END PGP SIGNATURE----- Merge tag 'pm-6.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These update power capping (new hardware support and cleanup) and cpufreq (bug fixes, cleanups and intel_pstate adjustment for a new platform). Specifics: - Fix error handling in the apple-soc cpufreq driver (Dan Carpenter) - Change the log level of a message in the amd-pstate cpufreq driver so it is more visible to users (Kai-Heng Feng) - Adjust the balance_performance EPP value for Sapphire Rapids in the intel_pstate cpufreq driver (Srinivas Pandruvada) - Remove MODULE_LICENSE from 3 pieces of non-modular code (Nick Alcock) - Make a read-only kobj_type structure in the schedutil cpufreq governor constant (Thomas Weißschuh) - Add Add Power Limit4 support for Meteor Lake SoC to the Intel RAPL power capping driver (Sumeet Pawnikar)" * tag 'pm-6.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: apple-soc: Fix an IS_ERR() vs NULL check powercap: remove MODULE_LICENSE in non-modules cpufreq: intel_pstate: remove MODULE_LICENSE in non-modules powercap: RAPL: Add Power Limit4 support for Meteor Lake SoC cpufreq: amd-pstate: remove MODULE_LICENSE in non-modules cpufreq: schedutil: make kobj_type structure constant cpufreq: amd-pstate: Let user know amd-pstate is disabled cpufreq: intel_pstate: Adjust balance_performance EPP for Sapphire Rapids |
||
![]() |
3822a7c409 |
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ... |
||
![]() |
70ba26cbe0 |
cpufreq: schedutil: make kobj_type structure constant
Since commit
|
||
![]() |
5b7c4cabbb |
Networking changes for 6.3.
Core ---- - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols --------- - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF --- - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter --------- - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt. races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API ---------- - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers ---------------------- - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers ------- - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - enetc: support XDP_REDIRECT for XDP non-linear buffers - enetc: improve reconfig, avoid link flap and waiting for idle - enetc: support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation. Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmP1VIYACgkQMUZtbf5S IrvsChAApz0rNL/sPKxXTEfxZ1tN7D3sYxYKQPomxvl5BV+MvicrLddJy3KmzEFK nnJNO3nuRNuH422JQ/ylZ4mGX1opa6+5QJb0UINImXUI7Fm8HHBIuPGkv7d5CheZ 7JexFqjPJXUy9nPyh1Rra+IA9AcRd2U7jeGEZR38wb99bHJQj5Bzdk20WArEB0el n44aqg49LXH71bSeXRz77x5SjkwVtYiccQxLcnmTbjLU2xVraLvI2J+wAhHnVXWW 9lrU1+V4Ex2Xcd1xR0L0cHeK+meP1TrPRAeF+JDpVI3a/zJiE7cZjfHdG/jH5xWl leZJqghVozrZQNtewWWO7XhUFhMDgFu3W/1vNLjSHPZEqaz1JpM67J1+ql6s63l4 LMWoXbcYZz+SL9ZRCoPkbGue/5fKSHv8/Jl9Sh58+eTS+c/zgN8uFGRNFXLX1+EP n8uvt985PxMd6x1+dHumhOUzxnY4Sfi1vjitSunTsNFQ3Cmp4SO0IfBVJWfLUCuC xz5hbJGJJbSpvUsO+HWyCg83E5OWghRE/Onpt2jsQSZCrO9HDg4FRTEf3WAMgaqc edb5KfbRZPTJQM08gWdluXzSk1nw3FNP2tXW4XlgUrEbjb+fOk0V9dQg2gyYTxQ1 Nhvn8ZQPi6/GMMELHAIPGmmW1allyOGiAzGlQsv8EmL+OFM6WDI= =xXhC -----END PGP SIGNATURE----- Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ... |
||
![]() |
8cc01d43f8 |
RCU pull request for v6.3
This pull request contains the following branches: doc.2023.01.05a: Documentation updates. fixes.2023.01.23a: Miscellaneous fixes, perhaps most notably: o Throttling callback invocation based on the number of callbacks that are now ready to invoke instead of on the total number of callbacks. o Several patches that suppress false-positive boot-time diagnostics, for example, due to lockdep not yet being initialized. o Make expedited RCU CPU stall warnings dump stacks of any tasks that are blocking the stalled grace period. (Normal RCU CPU stall warnings have doen this for mnay years.) o Lazy-callback fixes to avoid delays during boot, suspend, and resume. (Note that lazy callbacks must be explicitly enabled, so this should not (yet) affect production use cases.) kvfree.2023.01.03a: Cause kfree_rcu() and friends to take advantage of polled grace periods, thus reducing memory footprint by almost two orders of magnitude, admittedly on a microbenchmark. This series also begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p). This transition was motivated by bugs where kfree_rcu(p), which can block, was typed instead of the intended kfree_rcu(p, rh). srcu.2023.01.03a: SRCU updates, perhaps most notably fixing a bug that causes SRCU to fail when booted on a system with a non-zero boot CPU. This surprising situation actually happens for kdump kernels on the powerpc architecture. It also adds an srcu_down_read() and srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side critical section to be handed off from one task to another. srcu-always.2023.02.02a: Cleans up the now-useless SRCU Kconfig option. There are a few more commits that are not yet acked or pulled into maintainer trees, and these will be in a pull request for a later merge window. tasks.2023.01.03a: RCU-tasks updates, perhaps most notably these fixes: o A strange interaction between PID-namespace unshare and the RCU-tasks grace period that results in a low-probability but very real hang. o A race between an RCU tasks rude grace period on a single-CPU system and CPU-hotplug addition of the second CPU that can result in a too-short grace period. o A race between shrinking RCU tasks down to a single callback list and queuing a new callback to some other CPU, but where that queuing is delayed for more than an RCU grace period. This can result in that callback being stranded on the non-boot CPU. torture.2023.01.05a: Torture-test updates and fixes. torturescript.2023.01.03a: Torture-test scripting updates and fixes. stall.2023.01.09a: Provide additional RCU CPU stall-warning information in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute timeout limit for expedited RCU CPU stall warnings. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq29UTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jAhVEACEAKJY1VJ9IUqz7CwzAYkzgRJfiygh oDUXmlqtm6ew9pr2GdLUVCVsUSldzBc0K7Djb/G1niv4JPs+v7YwupIV33+UbStU Qxt6ztTdxc4lKospLm1+2vF9ZdzVEmiP4wVCc4iDarv5FM3FpWSTNc8+L7qmlC+X myjv+GqMTxkXZBvYJOgJGFjDwN8noTd7Fr3mCCVLFm3PXMDa7tcwD6HRP5AqD2N8 qC5M6LEqepKVGmz0mYMLlSN1GPaqIsEcexIFEazRsPEivPh/iafyQCQ/cqxwhXmV vEt7u+dXGZT/oiDq9cJ+/XRDS2RyKIS6dUE14TiiHolDCn1ONESahfA/gXWKykC2 BaGPfjWXrWv/hwbeZ+8xEdkAvTIV92tGpXir9Fby1Z5PjP3balvrnn6hs5AnQBJb NdhRPLzy/dCnEF+CweAYYm1qvTo8cd5nyiNwBZHn7rEAIu3Axrecag1rhFl3AJ07 cpVMQXZtkQVa2X8aIRTUC+ijX6yIqNaHlu0HqNXgIUTDzL4nv5cMjOMzpNQP9/dZ FwAMZYNiOk9IlMiKJ8ZiVcxeiA8ouIBlkYM3k6vGrmiONZ7a/EV/mSHoJqI8bvqr AxUIJ2Ayhg3bxPboL5oKgCiLql0A7ZVvz6quX6McitWGMgaSvel1fDzT3TnZd41e 4AFBFd/+VedUGg== =bBYK -----END PGP SIGNATURE----- Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes, perhaps most notably: - Throttling callback invocation based on the number of callbacks that are now ready to invoke instead of on the total number of callbacks - Several patches that suppress false-positive boot-time diagnostics, for example, due to lockdep not yet being initialized - Make expedited RCU CPU stall warnings dump stacks of any tasks that are blocking the stalled grace period. (Normal RCU CPU stall warnings have done this for many years) - Lazy-callback fixes to avoid delays during boot, suspend, and resume. (Note that lazy callbacks must be explicitly enabled, so this should not (yet) affect production use cases) - Make kfree_rcu() and friends take advantage of polled grace periods, thus reducing memory footprint by almost two orders of magnitude, admittedly on a microbenchmark This also begins the transition from kfree_rcu(p) to kfree_rcu_mightsleep(p). This transition was motivated by bugs where kfree_rcu(p), which can block, was typed instead of the intended kfree_rcu(p, rh) - SRCU updates, perhaps most notably fixing a bug that causes SRCU to fail when booted on a system with a non-zero boot CPU. This surprising situation actually happens for kdump kernels on the powerpc architecture This also adds an srcu_down_read() and srcu_up_read(), which act like srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side critical section to be handed off from one task to another - Clean up the now-useless SRCU Kconfig option There are a few more commits that are not yet acked or pulled into maintainer trees, and these will be in a pull request for a later merge window - RCU-tasks updates, perhaps most notably these fixes: - A strange interaction between PID-namespace unshare and the RCU-tasks grace period that results in a low-probability but very real hang - A race between an RCU tasks rude grace period on a single-CPU system and CPU-hotplug addition of the second CPU that can result in a too-short grace period - A race between shrinking RCU tasks down to a single callback list and queuing a new callback to some other CPU, but where that queuing is delayed for more than an RCU grace period. This can result in that callback being stranded on the non-boot CPU - Torture-test updates and fixes - Torture-test scripting updates and fixes - Provide additional RCU CPU stall-warning information in kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute timeout limit for expedited RCU CPU stall warnings * tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits) rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep() kernel/notifier: Remove CONFIG_SRCU init: Remove "select SRCU" fs/quota: Remove "select SRCU" fs/notify: Remove "select SRCU" fs/btrfs: Remove "select SRCU" fs: Remove CONFIG_SRCU drivers/pci/controller: Remove "select SRCU" drivers/net: Remove "select SRCU" drivers/md: Remove "select SRCU" drivers/hwtracing/stm: Remove "select SRCU" drivers/dax: Remove "select SRCU" drivers/base: Remove CONFIG_SRCU rcu: Disable laziness if lazy-tracking says so rcu: Track laziness during boot and suspend rcu: Remove redundant call to rcu_boost_kthread_setaffinity() rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts rcu: Align the output of RCU CPU stall warning messages rcu: Add RCU stall diagnosis information sched: Add helper nr_context_switches_cpu() ... |
||
![]() |
1f2d9ffc7a |
Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic with large number of CPUs. - Fix & rework various cpuidle routines, simplify interaction with the generic scheduler code. Add __cpuidle methods as noinstr to objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks. - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query previously issued registrations. - Limit scheduler slice duration to the sysctl_sched_latency period, to improve scheduling granularity with a large number of SCHED_IDLE tasks. - Debuggability enhancement on sys_exit(): warn about disabled IRQs, but also enable them to prevent a cascade of followup problems and repeat warnings. - Fix the rescheduling logic in prio_changed_dl(). - Micro-optimize cpufreq and sched-util methods. - Micro-optimize ttwu_runnable() - Micro-optimize the idle-scanning in update_numa_stats(), select_idle_capacity() and steal_cookie_task(). - Update the RSEQ code & self-tests - Constify various scheduler methods - Remove unused methods - Refine __init tags - Documentation updates - ... Misc other cleanups, fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8 7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1 eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5 skkQXoONNaM= =l1nN -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Improve the scalability of the CFS bandwidth unthrottling logic with large number of CPUs. - Fix & rework various cpuidle routines, simplify interaction with the generic scheduler code. Add __cpuidle methods as noinstr to objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks. - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query previously issued registrations. - Limit scheduler slice duration to the sysctl_sched_latency period, to improve scheduling granularity with a large number of SCHED_IDLE tasks. - Debuggability enhancement on sys_exit(): warn about disabled IRQs, but also enable them to prevent a cascade of followup problems and repeat warnings. - Fix the rescheduling logic in prio_changed_dl(). - Micro-optimize cpufreq and sched-util methods. - Micro-optimize ttwu_runnable() - Micro-optimize the idle-scanning in update_numa_stats(), select_idle_capacity() and steal_cookie_task(). - Update the RSEQ code & self-tests - Constify various scheduler methods - Remove unused methods - Refine __init tags - Documentation updates - Misc other cleanups, fixes * tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) sched/rt: pick_next_rt_entity(): check list_entry sched/deadline: Add more reschedule cases to prio_changed_dl() sched/fair: sanitize vruntime of entity being placed sched/fair: Remove capacity inversion detection sched/fair: unlink misfit task from cpu overutilized objtool: mem*() are not uaccess safe cpuidle: Fix poll_idle() noinstr annotation sched/clock: Make local_clock() noinstr sched/clock/x86: Mark sched_clock() noinstr x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read() x86/atomics: Always inline arch_atomic64*() cpuidle: tracing, preempt: Squash _rcuidle tracing cpuidle: tracing: Warn about !rcu_is_watching() cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG cpuidle: drivers: firmware: psci: Dont instrument suspend code KVM: selftests: Fix build of rseq test exit: Detect and fix irq disabled state in oops cpuidle, arm64: Fix the ARM64 cpuidle logic cpuidle: mvebu: Fix duplicate flags assignment sched/fair: Limit sched slice duration ... |
||
![]() |
01bb11ad82 |
sched/topology: fix KASAN warning in hop_cmp()
Despite that prev_hop is used conditionally on cur_hop
is not the first hop, it's initialized unconditionally.
Because initialization implies dereferencing, it might happen
that the code dereferences uninitialized memory, which has been
spotted by KASAN. Fix it by reorganizing hop_cmp() logic.
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Fixes:
|
||
![]() |
c2dbe32d5d |
sched/psi: Fix use-after-free in ep_remove_wait_queue()
If a non-root cgroup gets removed when there is a thread that registered trigger and is polling on a pressure file within the cgroup, the polling waitqueue gets freed in the following path: do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy However, the polling thread still has a reference to the pressure file and will access the freed waitqueue when the file is closed or upon exit: fput ep_eventpoll_release ep_free ep_remove_wait_queue remove_wait_queue This results in use-after-free as pasted below. The fundamental problem here is that cgroup_file_release() (and consequently waitqueue's lifetime) is not tied to the file's real lifetime. Using wake_up_pollfree() here might be less than ideal, but it is in line with the comment at commit |
||
![]() |
df14b7f9ef |
sched/core: Fix a missed update of user_cpus_ptr
Since commit |
||
![]() |
7c4a5b89a0 |
sched/rt: pick_next_rt_entity(): check list_entry
Commit |
||
![]() |
7ea98dfa44 |
sched/deadline: Add more reschedule cases to prio_changed_dl()
I've been tracking down an issue on a ~5.17ish kernel where:
CPUx CPUy
<DL task p0 owns an rtmutex M>
<p0 depletes its runtime, gets throttled>
<rq switches to the idle task>
<DL task p1 blocks on M, boost/replenish p0>
<No call to resched_curr() happens here>
[idle task keeps running here until *something*
accidentally sets TIF_NEED_RESCHED]
On that kernel, it is quite easy to trigger using rt-tests's deadline_test
[1] with the test running on isolated CPUs (this reduces the chance of
something unrelated setting TIF_NEED_RESCHED on the idle tasks, making the
issue even more obvious as the hung task detector chimes in).
I haven't been able to reproduce this using a mainline kernel, even if I
revert
|
||
![]() |
829c1651e9 |
sched/fair: sanitize vruntime of entity being placed
When a scheduling entity is placed onto cfs_rq, its vruntime is pulled to the base level (around cfs_rq->min_vruntime), so that the entity doesn't gain extra boost when placed backwards. However, if the entity being placed wasn't executed for a long time, its vruntime may get too far behind (e.g. while cfs_rq was executing a low-weight hog), which can inverse the vruntime comparison due to s64 overflow. This results in the entity being placed with its original vruntime way forwards, so that it will effectively never get to the cpu. To prevent that, ignore the vruntime of the entity being placed if it didn't execute for much longer than the characteristic sheduler time scale. [rkagan: formatted, adjusted commit log, comments, cutoff value] Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Co-developed-by: Roman Kagan <rkagan@amazon.de> Signed-off-by: Roman Kagan <rkagan@amazon.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230130122216.3555094-1-rkagan@amazon.de |
||
![]() |
a2e90611b9 |
sched/fair: Remove capacity inversion detection
Remove the capacity inversion detection which is now handled by util_fits_cpu() returning -1 when we need to continue to look for a potential CPU with better performance. This ends up almost reverting patches below except for some comments: commit |
||
![]() |
e5ed0550c0 |
sched/fair: unlink misfit task from cpu overutilized
By taking into account uclamp_min, the 1:1 relation between task misfit and cpu overutilized is no more true as a task with a small util_avg may not fit a high capacity cpu because of uclamp_min constraint. Add a new state in util_fits_cpu() to reflect the case that task would fit a CPU except for the uclamp_min hint which is a performance requirement. Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we can use this new value to take additional action to select the best CPU that doesn't match uclamp_min hint. When util_fits_cpu() returns -1, we will continue to look for a possible CPU with better performance, which replaces Capacity Inversion detection with capacity_orig_of() - thermal_load_avg to detect a capacity inversion. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-and-tested-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Kajetan Puchalski <kajetan.puchalski@arm.com> Link: https://lore.kernel.org/r/20230201143628.270912-2-vincent.guittot@linaro.org |
||
![]() |
214dbc4281 |
sched: convert to vma iterator
Use the vma iterator so that the iterator can be invalidated or updated to avoid each caller doing so. Link: https://lkml.kernel.org/r/20230120162650.984577-23-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9feae65845 |
sched/topology: Introduce sched_numa_hop_mask()
Tariq has pointed out that drivers allocating IRQ vectors would benefit from having smarter NUMA-awareness - cpumask_local_spread() only knows about the local node and everything outside is in the same bucket. sched_domains_numa_masks is pretty much what we want to hand out (a cpumask of CPUs reachable within a given distance budget), introduce sched_numa_hop_mask() to export those cpumasks. Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
cd7f55359c |
sched: add sched_numa_find_nth_cpu()
The function finds Nth set CPU in a given cpumask starting from a given node. Leveraging the fact that each hop in sched_domains_numa_masks includes the same or greater number of CPUs than the previous one, we can use binary search on hops instead of linear walk, which makes the overall complexity of O(log n) in terms of number of cpumask_weight() calls. Signed-off-by: Yury Norov <yury.norov@gmail.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Peter Lafreniere <peter@n8pjl.ca> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
776f22913b |
sched/clock: Make local_clock() noinstr
With sched_clock() noinstr, provide a noinstr implementation of local_clock(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126151323.760767043@infradead.org |
||
![]() |
57a30218fa |
Linux 6.2-rc6
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh SDc/Y/c= =zpaD -----END PGP SIGNATURE----- Merge tag 'v6.2-rc6' into sched/core, to pick up fixes Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
5657c11678 |
sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
The kernel commit |
||
![]() |
79ba1e607d |
sched/fair: Limit sched slice duration
In presence of a lot of small weight tasks like sched_idle tasks, normal or high weight tasks can see their ideal runtime (sched_slice) to increase to hundreds ms whereas it normally stays below sysctl_sched_latency. 2 normal tasks running on a CPU will have a max sched_slice of 12ms (half of the sched_period). This means that they will make progress every sysctl_sched_latency period. If we now add 1000 idle tasks on the CPU, the sched_period becomes 3006 ms and the ideal runtime of the normal tasks becomes 609 ms. It will even become 1500ms if the idle tasks belongs to an idle cgroup. This means that the scheduler will look for picking another waiting task after 609ms running time (1500ms respectively). The idle tasks change significantly the way the 2 normal tasks interleave their running time slot whereas they should have a small impact. Such long sched_slice can delay significantly the release of resources as the tasks can wait hundreds of ms before the next running slot just because of idle tasks queued on the rq. Cap the ideal_runtime to sysctl_sched_latency to make sure that tasks will regularly make progress and will not be significantly impacted by idle/background tasks queued on the rq. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20230113133613.257342-1-vincent.guittot@linaro.org |
||
![]() |
89b3098703 |
arch/idle: Change arch_cpu_idle() behavior: always exit with IRQs disabled
Current arch_cpu_idle() is called with IRQs disabled, but will return with IRQs enabled. However, the very first thing the generic code does after calling arch_cpu_idle() is raw_local_irq_disable(). This means that architectures that can idle with IRQs disabled end up doing a pointless 'enable-disable' dance. Therefore, push this IRQ disabling into the idle function, meaning that those architectures can avoid the pointless IRQ state flipping. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org |
||
![]() |
a01353cf18 |
cpuidle: Fix ct_idle_*() usage
The whole disable-RCU, enable-IRQS dance is very intricate since changing IRQ state is traced, which depends on RCU. Add two helpers for the cpuidle case that mirror the entry code: ct_cpuidle_enter() ct_cpuidle_exit() And fix all the cases where the enter/exit dance was buggy. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.130014793@infradead.org |
||
![]() |
da07d2f9c1 |
sched/fair: Fixes for capacity inversion detection
Traversing the Perf Domains requires rcu_read_lock() to be held and is
conditional on sched_energy_enabled(). Ensure right protections applied.
Also skip capacity inversion detection for our own pd; which was an
error.
Fixes:
|
||
![]() |
e26fd28db8 |
sched/uclamp: Fix a uninitialized variable warnings
Addresses the following warnings:
> config: riscv-randconfig-m031-20221111
> compiler: riscv64-linux-gcc (GCC) 12.1.0
>
> smatch warnings:
> kernel/sched/fair.c:7263 find_energy_efficient_cpu() error: uninitialized symbol 'util_min'.
> kernel/sched/fair.c:7263 find_energy_efficient_cpu() error: uninitialized symbol 'util_max'.
Fixes:
|
||
![]() |
9a5418bc48 |
sched/core: Use kfree_rcu() in do_set_cpus_allowed()
Commit |
||
![]() |
87ca4f9efb |
sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
Since commit |
||
![]() |
7fb3ff22ad |
sched/core: Fix arch_scale_freq_tick() on tickless systems
In order for the scheduler to be frequency invariant we measure the
ratio between the maximum CPU frequency and the actual CPU frequency.
During long tickless periods of time the calculations that keep track
of that might overflow, in the function scale_freq_tick():
if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
goto error;
eventually forcing the kernel to disable the feature for all CPUs,
and show the warning message:
"Scheduler frequency invariance went wobbly, disabling!".
Let's avoid that by limiting the frequency invariant calculations
to CPUs with regular tick.
Fixes:
|
||
![]() |
544a4f2ecd |
sched/membarrier: Introduce MEMBARRIER_CMD_GET_REGISTRATIONS
Provide a method to query previously issued registrations. Signed-off-by: Michal Clapinski <mclapinski@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20221207164338.1535591-2-mclapinski@google.com |
||
![]() |
948fb4c4e9 |
cpufreq, sched/util: Optimize operations with single CPU capacity lookup
The max CPU capacity is the same for all CPUs sharing frequency domain. There is a way to avoid heavy operations in a loop for each CPU by leveraging this knowledge. Thus, simplify the looping code in the sugov_next_freq_shared() and drop heavy multiplications. Instead, use simple max() to get the highest utilization from these CPUs. This is useful for platforms with many (4 or 6) little CPUs. We avoid heavy 2*PD_CPU_NUM multiplications in that loop, which is called billions of times, since it's not limited by the schedutil time delta filter in sugov_should_update_freq(). When there was no need to change frequency the code bailed out, not updating the sg_policy::last_freq_update_time. Then every visit after delta_ns time longer than the sg_policy::freq_update_delay_ns goes through and triggers the next frequency calculation code. Although, if the next frequency, as outcome of that, would be the same as current frequency, we won't update the sg_policy::last_freq_update_time and the story will be repeated (in a very short period, sometimes a few microseconds). The max CPU capacity must be fetched every time we are called, due to difficulties during the policy setup, where we are not able to get the normalized CPU capacity at the right time. The fetched CPU capacity value is than used in sugov_iowait_apply() to calculate the right boost. This required a few changes in the local functions and arguments. The capacity value should hopefully be fetched once when needed and then passed over CPU registers to those functions. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221208160256.859-2-lukasz.luba@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> |
||
![]() |
160fb0d83f |
sched/core: Reorganize ttwu_do_wakeup() and ttwu_do_activate()
ttwu_do_activate() is used for a complete wakeup, in which we will activate_task() and use ttwu_do_wakeup() to mark the task runnable and perform wakeup-preemption, also call class->task_woken() callback and update the rq->idle_stamp. Since ttwu_runnable() is not a complete wakeup, don't need all those done in ttwu_do_wakeup(), so we can move those to ttwu_do_activate() to simplify ttwu_do_wakeup(), making it only mark the task runnable to be reused in ttwu_runnable() and try_to_wake_up(). This patch should not have any functional changes. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221223103257.4962-2-zhouchengming@bytedance.com |
||
![]() |
efe0938586 |
sched/core: Micro-optimize ttwu_runnable()
ttwu_runnable() is used as a fast wakeup path when the wakee task is running on CPU or runnable on RQ, in both cases we can just set its state to TASK_RUNNING to prevent a sleep. If the wakee task is on_cpu running, we don't need to update_rq_clock() or check_preempt_curr(). But if the wakee task is on_rq && !on_cpu (e.g. an IRQ hit before the task got to schedule() and the task been preempted), we should check_preempt_curr() to see if it can preempt the current running. This also removes the class->task_woken() callback from ttwu_runnable(), which wasn't required per the RT/DL implementations: any required push operation would have been queued during class->set_next_task() when p got preempted. ttwu_runnable() also loses the update to rq->idle_stamp, as by definition the rq cannot be idle in this scenario. Suggested-by: Valentin Schneider <vschneid@redhat.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20221223103257.4962-1-zhouchengming@bytedance.com |
||
![]() |
7c182722a0 |
sched: Add helper nr_context_switches_cpu()
Add a function nr_context_switches_cpu() that returns number of context switches since boot on the specified CPU. This information will be used to diagnose RCU CPU stalls. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> |
||
![]() |
ef90cf2281 |
sched/topology: Add __init for sched_init_domains()
sched_init_domains() is only used in initialization Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230105014943.9857-1-huangbing775@126.com |
||
![]() |
bbd0b03150 |
sched/rseq: Fix concurrency ID handling of usermodehelper kthreads
sched_mm_cid_after_execve() does not expect NULL t->mm, but it may happen
if a usermodehelper kthread fails when attempting to execute a binary.
sched_mm_cid_fork() can be issued from a usermodehelper kthread, which
has t->flags PF_KTHREAD set.
Fixes:
|
||
![]() |
c89970202a |
cputime: remove cputime_to_nsecs fallback
The archs that use cputime_to_nsecs() internally provide their own definition and don't need the fallback. cputime_to_usecs() unused except in this fallback, and is not defined anywhere. This removes the final remnant of the cputime_t code from the kernel. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com |
||
![]() |
8589018acc |
sched/core: Adjusting the order of scanning CPU
When select_idle_capacity() starts scanning for an idle CPU, it starts with target CPU that has already been checked in select_idle_sibling(). So we start checking from the next CPU and try the target CPU at the end. Similarly for task_numa_assign(), we have just checked numa_migrate_on of dst_cpu, so start from the next CPU. This also works for steal_cookie_task(), the first scan must fail and start directly from the next one. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com |
||
![]() |
feaed76376 |
sched/numa: Stop an exhastive search if an idle core is found
In update_numa_stats() we try to find an idle cpu on the NUMA node, preferably an idle core. we can stop looking for the next idle core or idle cpu after finding an idle core. But we can't stop the whole loop of scanning the CPU, because we need to calculate approximate NUMA stats at a point in time. For example, the src and dst nr_running is needed by task_numa_find_cpu(). Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-2-jiahao.os@bytedance.com |
||
![]() |
904cbab71d |
sched: Make const-safe
With a modified container_of() that preserves constness, the compiler finds some pointers which should have been marked as const. task_of() also needs to become const-preserving for the !FAIR_GROUP_SCHED case so that cfs_rq_of() can take a const argument. No change to generated code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org |
||
![]() |
af7f588d8f |
sched: Introduce per-memory-map concurrency ID
This feature allows the scheduler to expose a per-memory map concurrency ID to user-space. This concurrency ID is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory map. If a memory map has fewer threads than cores, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures. This feature is meant to be exposed by a new rseq thread area field. The primary purpose of this feature is to do the heavy-lifting needed by memory allocators to allow them to use per-cpu data structures efficiently in the following situations: - Single-threaded applications, - Multi-threaded applications on large systems (many cores) with limited cpu affinity mask, - Multi-threaded applications on large systems (many cores) with restricted cgroup cpuset per container. One of the key concern from scheduler maintainers is the overhead associated with additional spin locks or atomic operations in the scheduler fast-path. This is why the following optimization is implemented. On context switch between threads belonging to the same memory map, transfer the mm_cid from prev to next without any atomic ops. This takes care of use-cases involving frequent context switch between threads belonging to the same memory map. Additional optimizations can be done if the spin locks added when context switching between threads belonging to different memory maps end up being a performance bottleneck. Those are left out of this patch though. A performance impact would have to be clearly demonstrated to justify the added complexity. The credit goes to Paul Turner (Google) for the original virtual cpu id idea. This feature is implemented based on the discussions with Paul Turner and Peter Oskolkov (Google), but I took the liberty to implement scheduler fast-path optimizations and my own NUMA-awareness scheme. The rumor has it that Google have been running a rseq vcpu_id extension internally in production for a year. The tcmalloc source code indeed has comments hinting at a vcpu_id prototype extension to the rseq system call [1]. The following benchmarks do not show any significant overhead added to the scheduler context switch by this feature: * perf bench sched messaging (process) Baseline: 86.5±0.3 ms With mm_cid: 86.7±2.6 ms * perf bench sched messaging (threaded) Baseline: 84.3±3.0 ms With mm_cid: 84.7±2.6 ms * hackbench (process) Baseline: 82.9±2.7 ms With mm_cid: 82.9±2.9 ms * hackbench (threaded) Baseline: 85.2±2.6 ms With mm_cid: 84.4±2.9 ms [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com |
||
![]() |
8ad075c2eb |
sched: Async unthrottling for cfs bandwidth
CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's inline in an hrtimer callback. Runtime distribution is a per-cpu operation, and unthrottling is a per-cgroup operation, since a tg walk is required. On machines with a large number of cpus and large cgroup hierarchies, this cpus*cgroups work can be too much to do in a single hrtimer callback: since IRQ are disabled, hard lockups may easily occur. Specifically, we've found this scalability issue on configurations with 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high memory bandwidth usage. To fix this, we can instead unthrottle cfs_rq's asynchronously via a CSD. Each cpu is responsible for unthrottling itself, thus sharding the total work more fairly across the system, and avoiding hard lockups. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221117005418.3499691-1-joshdon@google.com |
||
![]() |
9a5322db46 |
sched/topology: Add __init for init_defrootdomain
init_defrootdomain is only used in initialization Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20221118034208.267330-1-huangbing775@126.com |
||
![]() |
48ea09cdda |
hardening updates for v6.2-rc1
- Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook). - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions. - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook). - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking. - Improve code generation for strscpy() and update str*() kern-doc. - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests. - Always use a non-NULL argument for prepare_kernel_cred(). - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell). - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li). - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu). - Fix um vs FORTIFY warnings for always-NULL arguments. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+ I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw 0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs AHXcibguPRQBPAdiPQ== =yaaN -----END PGP SIGNATURE----- Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ... |
||
![]() |
8fa37a6835 |
sysctl changes for v6.2-rc1
Only step forward on the sysctl cleanups for this cycle. This has been on linux-next since September and this time it goes with a "Yeah, think so, it just moves stuff around a bit" from Peter Zijlstra. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmOYC3sSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinVEYQAL6/3nRt854jULd3zRwrWDyJZd5yxbnc R8jJBTt3q4CKwtMqd59uQqVYLpSqOCx/GsArfsXkmY4x7KYhlaSKcC4LHmFS8Z/u dofyVKumIFqtXMI+hYuTyqNqfGoK9UKXUftrqYb8pK+K3h73uqYbrDgSex4G9GJo Au0/WeDjTzLlgqt7RPN7n0PL2jMtfWVQkr3001OCQOWW9sdrOjtprn/3bDTUnW5q KukKB5saU0CvUzrTn2DaweQiRCJxQfCQfy3DZfhDRHVuWFYMV9b1okaGEoVmQlQT I9/urfdf3aLCdBBxCQG5W6uRxZwZ2Yb93M+rijZNWNFMC6WHrMCmSiADwz9LJzIK iQV7LoolGe1TFTEVJbsde5xKSF6BeId0IF5mmPQuokAx3TPE9279HNgluaB/38c8 p3P4+mP6qE12mMPyhpwDwNOzEWgUnLsGSIE5n/WPwxCiGNa7UsN2lzMDP1cJejp5 NlRg1hRKmgt30d9+t9sHeKMcWhrjxyPGsyUMwBJTuMCHbjqizGyBsB8DzyK95OoF aN66pyRqwsK0+IUivd8VfLgfriE2gDrQD5VqkJ8lfWBx9pq8RMEq7zQ1eE9IbCff hzbfG+7k9R3o4SPfJYmCBXtp6fcq+ovjbLYSvGGCJk0zfFe6SQE21rZ3hCQPq3v5 xKFh05xUfbRF =M48U -----END PGP SIGNATURE----- Merge tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "Only a small step forward on the sysctl cleanups for this cycle" * tag 'sysctl-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sched: Move numa_balancing sysctls to its own file |
||
![]() |
ce8a79d560 |
for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9 2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI 0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7 Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr wxzFGfvHfw== =k/vv -----END PGP SIGNATURE----- Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ... |
||
![]() |
8702f2c611 |
Non-MM patches for 6.2-rc1.
- A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line. - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files. - A series from Yang Yingliang to address rapido memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t(). - And a whole shower of singleton patches all over the place. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5efRgAKCRDdBJ7gKXxA jgvdAP0al6oFDtaSsshIdNhrzcMwfjt6PfVxxHdLmNhF1hX2dwD/SVluS1bPSP7y 0sZp7Ustu3YTb8aFkMl96Y9m9mY1Nwg= =ga5B -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ... |
||
![]() |
bf57ae2165 |
Scheduler changes for v6.2:
- Implement persistent user-requested affinity: introduce affinity_context::user_mask and unconditionally preserve the user-requested CPU affinity masks, for long-lived tasks to better interact with cpusets & CPU hotplug events over longer timespans, without destroying the original affinity intent if the underlying topology changes. - Uclamp updates: fix relationship between uclamp and fits_capacity() - PSI fixes - Misc fixes & updates. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXkmgRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j/dQ//WYW/JaBpydqnVxDu6C21z0w3+fHDdlsN nQ6jyLPlouFjI2Ink1E7i7Iq8C73sdewCgD7Jq3xGa1GhsPEJIrPAaBgacxYjOqc x9HHZoygSkAihTfrVvzq37YttD2t/gQQxc81tBziMBVP2A+gb9z44u+ezMlxjiGz irgE07qNfiLyTeD/dJhEU2EOsPJm/gestW3+Cd8uwYAe6pj0X4FE3n8ipmr0BzNZ 6nxFJaSspwAkREjpAIZVENEArq7XrkGqUFKgKpYqWn0HAnuTWgFcW8E2NrDw7Qbf 4aAdBuzimbWdbkqRoX9r7++r5wqc3KW+is8Y97aEUsc0zhrXHAW1Hn2w7en5XxiQ btaPi77Boi69sHvOrfMy3i6UZ895yh2sROIkYBDT485w57BR75HsMLkk2LNIm7qE mATrrZ65bbGAgAxZouqlnQk40BUlniIfDlfZyReyFtXkW8UH5tTNX6qtpWzzdwfy posrm+XvgDcP96/7DIczZwT6VEJE5GBZbPvk2Vw4GNq6/QeW7g9GPhYTaV6CXzzW lCk/MV1n+IWCUqjkGXXCTS53TIyC6WZh2ehegcsh1KYyWcVijEs42S6eqXZI9cO7 F4oU7sehg4vlhMm1uE5JgaABfYqqzzKlvZySdwXbne2Vjt4nsWlWoe6u6JAdA4EB PRwmUDRMyEE= =aao/ -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement persistent user-requested affinity: introduce affinity_context::user_mask and unconditionally preserve the user-requested CPU affinity masks, for long-lived tasks to better interact with cpusets & CPU hotplug events over longer timespans, without destroying the original affinity intent if the underlying topology changes. - Uclamp updates: fix relationship between uclamp and fits_capacity() - PSI fixes - Misc fixes & updates * tag 'sched-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Clear ttwu_pending after enqueue_task() sched/psi: Use task->psi_flags to clear in CPU migration sched/psi: Stop relying on timer_pending() for poll_work rescheduling sched/psi: Fix avgs_work re-arm in psi_avgs_work() sched/psi: Fix possible missing or delayed pending event sched: Always clear user_cpus_ptr in do_set_cpus_allowed() sched: Enforce user requested affinity sched: Always preserve the user requested cpumask sched: Introduce affinity_context sched: Add __releases annotations to affine_move_task() sched/fair: Check if prev_cpu has highest spare cap in feec() sched/fair: Consider capacity inversion in util_fits_cpu() sched/fair: Detect capacity inversion sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition sched/uclamp: Make cpu_overutilized() use util_fits_cpu() sched/uclamp: Make asym_fits_capacity() use util_fits_cpu() sched/uclamp: Make select_idle_capacity() use util_fits_cpu() sched/uclamp: Fix fits_capacity() check in feec() sched/uclamp: Make task_fits_capacity() use util_fits_cpu() sched/uclamp: Fix relationship between uclamp and migration margin |
||
![]() |
79cc1ba7ba |
panic: Consolidate open-coded panic_on_warn checks
Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org |
||
![]() |
cdcc5ef26b |
Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
This reverts commit |
||
![]() |
0dff89c448 |
sched: Move numa_balancing sysctls to its own file
The sysctl_numa_balancing_promote_rate_limit and sysctl_numa_balancing are part of sched, move them to its own file. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
![]() |
8baceabca6 |
sched/fair: use try_cmpxchg in task_numa_work
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in task_numa_work. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Link: https://lkml.kernel.org/r/20220822173956.82525-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ee7dc86b6d |
wait: Return number of exclusive waiters awaken
Sbitmap code will need to know how many waiters were actually woken for its batched wakeups implementation. Return the number of woken exclusive waiters from __wake_up() to facilitate that. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221115224553.23594-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
d6962c4fe8 |
sched: Clear ttwu_pending after enqueue_task()
We found a long tail latency in schbench whem m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit
|
||
![]() |
91dabf33ae |
sched: Fix race in task_call_func()
There is a very narrow race between schedule() and task_call_func().
CPU0 CPU1
__schedule()
rq_lock();
prev_state = READ_ONCE(prev->__state);
if (... && prev_state) {
deactivate_tasl(rq, prev, ...)
prev->on_rq = 0;
task_call_func()
raw_spin_lock_irqsave(p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb();
if (... || p->on_rq) // false!!!
rq = __task_rq_lock()
ret = func();
next = pick_next_task();
rq = context_switch(prev, next)
prepare_lock_switch()
spin_release(&__rq_lockp(rq)->dep_map...)
So while the task is on it's way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.
Avoid this by waiting for p->on_cpu to get cleared, which guarantees
the task is fully finished on the old CPU.
( While arguably the fixes tag is 'wrong' -- none of the previous
task_call_func() users appears to care for this case. )
Fixes:
|
||
![]() |
52b33d87b9 |
sched/psi: Use task->psi_flags to clear in CPU migration
The commit
|
||
![]() |
710ffe671e |
sched/psi: Stop relying on timer_pending() for poll_work rescheduling
Psi polling mechanism is trying to minimize the number of wakeups to run psi_poll_work and is currently relying on timer_pending() to detect when this work is already scheduled. This provides a window of opportunity for psi_group_change to schedule an immediate psi_poll_work after poll_timer_fn got called but before psi_poll_work could reschedule itself. Below is the depiction of this entire window: poll_timer_fn wake_up_interruptible(&group->poll_wait); psi_poll_worker wait_event_interruptible(group->poll_wait, ...) psi_poll_work psi_schedule_poll_work if (timer_pending(&group->poll_timer)) return; ... mod_timer(&group->poll_timer, jiffies + delay); Prior to |
||
![]() |
2fcd7bbae9 |
sched/psi: Fix avgs_work re-arm in psi_avgs_work()
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all. Because PSI_NONIDLE condition would be observed in
psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
only the kworker running avgs_work on the CPU.
Although commit
|
||
![]() |
e38f89af6a |
sched/psi: Fix possible missing or delayed pending event
When a pending event exists and growth is less than the threshold, the current logic is to skip this trigger without generating event. However, from |
||
![]() |
851a723e45 |
sched: Always clear user_cpus_ptr in do_set_cpus_allowed()
The do_set_cpus_allowed() function is used by either kthread_bind() or select_fallback_rq(). In both cases the user affinity (if any) should be destroyed too. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-6-longman@redhat.com |
||
![]() |
da01903281 |
sched: Enforce user requested affinity
It was found that the user requested affinity via sched_setaffinity() can be easily overwritten by other kernel subsystems without an easy way to reset it back to what the user requested. For example, any change to the current cpuset hierarchy may reset the cpumask of the tasks in the affected cpusets to the default cpuset value even if those tasks have pre-existing user requested affinity. That is especially easy to trigger under a cgroup v2 environment where writing "+cpuset" to the root cgroup's cgroup.subtree_control file will reset the cpus affinity of all the processes in the system. That is problematic in a nohz_full environment where the tasks running in the nohz_full CPUs usually have their cpus affinity explicitly set and will behave incorrectly if cpus affinity changes. Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr() and use it to restrcit the given cpumask unless there is no overlap. In that case, it will fallback to the given one. The SCA_USER flag is reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr masking should be skipped. In addition, masking should also be skipped if any of the SCA_MIGRATE_* flag is set. All callers of set_cpus_allowed_ptr() will be affected by this change. A scratch cpumask is added to percpu runqueues structure for doing additional masking when user_cpus_ptr is set. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-4-longman@redhat.com |
||
![]() |
8f9ea86fdf |
sched: Always preserve the user requested cpumask
Unconditionally preserve the user requested cpumask on sched_setaffinity() calls. This allows using it outside of the fairly narrow restrict_cpus_allowed_ptr() use-case and fix some cpuset issues that currently suffer destruction of cpumasks. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-3-longman@redhat.com |
||
![]() |
713a2e21a5 |
sched: Introduce affinity_context
In order to prepare for passing through additional data through the affinity call-chains, convert the mask and flags argument into a structure. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-5-longman@redhat.com |
||
![]() |
5584e8ac2c |
sched: Add __releases annotations to affine_move_task()
affine_move_task() assumes task_rq_lock() has been called and it does an implicit task_rq_unlock() before returning. Add the appropriate __releases annotations to make this clear. A typo error in comment is also fixed. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220922180041.1768141-2-longman@redhat.com |
||
![]() |
ad841e569f |
sched/fair: Check if prev_cpu has highest spare cap in feec()
When evaluating the CPU candidates in the perf domain (pd) containing the previously used CPU (prev_cpu), find_energy_efficient_cpu() evaluates the energy of the pd: - without the task (base_energy) - with the task placed on prev_cpu (if the task fits) - with the task placed on the CPU with the highest spare capacity, prev_cpu being excluded from this set If prev_cpu is already the CPU with the highest spare capacity, max_spare_cap_cpu will be the CPU with the second highest spare capacity. On an Arm64 Juno-r2, with a workload of 10 tasks at a 10% duty cycle, when prev_cpu and max_spare_cap_cpu are both valid candidates, prev_spare_cap > max_spare_cap at ~82%. Thus the energy of the pd when placing the task on max_spare_cap_cpu is computed with no possible positive outcome 82% most of the time. Do not consider max_spare_cap_cpu as a valid candidate if prev_spare_cap > max_spare_cap. Signed-off-by: Pierre Gondois <pierre.gondois@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20221006081052.3862167-2-pierre.gondois@arm.com |
||
![]() |
aa69c36f31 |
sched/fair: Consider capacity inversion in util_fits_cpu()
We do consider thermal pressure in util_fits_cpu() for uclamp_min only. With the exception of the biggest cores which by definition are the max performance point of the system and all tasks by definition should fit. Even under thermal pressure, the capacity of the biggest CPU is the highest in the system and should still fit every task. Except when it reaches capacity inversion point, then this is no longer true. We can handle this by using the inverted capacity as capacity_orig in util_fits_cpu(). Which not only addresses the problem above, but also ensure uclamp_max now considers the inverted capacity. Force fitting a task when a CPU is in this adverse state will contribute to making the thermal throttling last longer. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-10-qais.yousef@arm.com |
||
![]() |
44c7b80bff |
sched/fair: Detect capacity inversion
Check each performance domain to see if thermal pressure is causing its capacity to be lower than another performance domain. We assume that each performance domain has CPUs with the same capacities, which is similar to an assumption made in energy_model.c We also assume that thermal pressure impacts all CPUs in a performance domain equally. If there're multiple performance domains with the same capacity_orig, we will trigger a capacity inversion if the domain is under thermal pressure. The new cpu_in_capacity_inversion() should help users to know when information about capacity_orig are not reliable and can opt in to use the inverted capacity as the 'actual' capacity_orig. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-9-qais.yousef@arm.com |
||
![]() |
d81304bc61 |
sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
If the utilization of the woken up task is 0, we skip the energy
calculation because it has no impact.
But if the task is boosted (uclamp_min != 0) will have an impact on task
placement and frequency selection. Only skip if the util is truly
0 after applying uclamp values.
Change uclamp_task_cpu() signature to avoid unnecessary additional calls
to uclamp_eff_get(). feec() is the only user now.
Fixes:
|
||
![]() |
c56ab1b350 |
sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
So that it is now uclamp aware.
This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping
the system in overutilized state which disables EAS and leads to wasting
energy in the long run.
Without this patch running a busy background activity like JIT
compilation on Pixel 6 causes the system to be in overutilized state
74.5% of the time.
With this patch this goes down to 9.79%.
It also fixes another problem when long running tasks that have their
UCLAMP_MIN changed while running such that they need to upmigrate to
honour the new UCLAMP_MIN value. The upmigration doesn't get triggered
because overutilized state never gets set in this state, hence misfit
migration never happens at tick in this case until the task wakes up
again.
Fixes:
|
||
![]() |
a2e7f03ed2 |
sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now.
Fixes:
|
||
![]() |
b759caa1d9 |
sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.
Fixes:
|
||
![]() |
244226035a |
sched/uclamp: Fix fits_capacity() check in feec()
As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.
The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.
We open code uclamp_rq_util_with() except for the clamp() part,
util_fits_cpu() needs the 'raw' values to be passed to it.
Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.
[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html
Fixes:
|
||
![]() |
b48e16a697 |
sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
So that the new uclamp rules in regard to migration margin and capacity
pressure are taken into account correctly.
Fixes:
|
||
![]() |
48d5e9daa8 |
sched/uclamp: Fix relationship between uclamp and migration margin
fits_capacity() verifies that a util is within 20% margin of the
capacity of a CPU, which is an attempt to speed up upmigration.
But when uclamp is used, this 20% margin is problematic because for
example if a task is boosted to 1024, then it will not fit on any CPU
according to fits_capacity() logic.
Or if a task is boosted to capacity_orig_of(medium_cpu). The task will
end up on big instead on the desired medium CPU.
Similar corner cases exist for uclamp and usage of capacity_of().
Slightest irq pressure on biggest CPU for example will make a 1024
boosted task look like it can't fit.
What we really want is for uclamp comparisons to ignore the migration
margin and capacity pressure, yet retain them for when checking the
_actual_ util signal.
For example, task p:
p->util_avg = 300
p->uclamp[UCLAMP_MIN] = 1024
Will fit a big CPU. But
p->util_avg = 900
p->uclamp[UCLAMP_MIN] = 1024
will not, this should trigger overutilized state because the big CPU is
now *actually* being saturated.
Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:
p->util_avg = 1024
p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)
Should fit the task on medium cpus without triggering overutilized
state.
Inlined comments expand more on desired behavior in more scenarios.
Introduce new util_fits_cpu() function which encapsulates the new logic.
The new function is not used anywhere yet, but will be used to update
various users of fits_capacity() in later patches.
Fixes:
|
||
![]() |
8e5bad7dcc |
sched: Introduce struct balance_callback to avoid CFI mismatches
Introduce distinct struct balance_callback instead of performing function pointer casting which will trip CFI. Avoids warnings as found by Clang's future -Wcast-function-type-strict option: In file included from kernel/sched/core.c:84: kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict] head->func = (void (*)(struct callback_head *))func; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No binary differences result from this change. This patch is a cleanup based on Brad Spengler/PaX Team's modifications to sched code in their last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1724 Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org |
||
![]() |
e705968dd6 |
sched/core: Fix comparison in sched_group_cookie_match()
In commit |
||
![]() |
bd9a3dba18 |
PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW SIZD/S/hzdI= =91tB -----END PGP SIGNATURE----- Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull PSI updates from Ingo Molnar: - Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. * tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Per-cgroup PSI accounting disable/re-enable interface sched/psi: Cache parent psi_group to speed up group iteration sched/psi: Consolidate cgroup_psi() sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure sched/psi: Remove NR_ONCPU task accounting sched/psi: Optimize task switch inside shared cgroups again sched/psi: Move private helpers to sched/stats.h sched/psi: Save percpu memory when !psi_cgroups_enabled sched/psi: Don't create cgroup PSI files when psi_disabled sched/psi: Fix periodic aggregation shut off |
||
![]() |
27bc50fc90 |
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ... |
||
![]() |
d4013bc4d4 |
bitmap patches for v6.1-rc1
From Phil Auld: drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES From me: cpumask: cleanup nr_cpu_ids vs nr_cpumask_bits mess This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. From me: lib: optimize find_bit() functions Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. From me: lib/find: add find_nth_bit() Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. From me: cpumask: repair cpumask_check() After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. From Valentin Schneider: bitmap,cpumask: Add for_each_cpu_andnot() and for_each_cpu_andnot() Extends the API with one more function and applies it in sched/core. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmNBwmUACgkQsUSA/Tof vshPRwv+KlqnZlKtuSPgbo/Kgswworpi/7TqfnN9GWlb8AJ2uhjBKI3GFwv4TDow 7KV6wdKdXYLr4pktcIhWy3qLrT+bDDExfarHRo3QI1A1W42EJ+ZiUaGnQGcnVMzD 5q/K1YMJYq0oaesHEw5PVUh8mm6h9qRD8VbX1u+riW/VCWBj3bho9Dp4mffQ48Q6 hVy/SnMGgClQwNYp+sxkqYx38xUqUGYoU5MzeziUmoS6pZQh+4lF33MULnI3EKmc /ehXilPPtOV/Tm0RovDWFfm3rjNapV9FXHu8Ob2z/c+1A29EgXnE3pwrBDkAx001 TQrL9qbCANRDGPLzWQHw0dwFIaXvTdrSttCsfYYfU5hI4JbnJEe0Pqkaaohy7jqm r0dW/TlyOG5T+k8Kwdx9w9A+jKs8TbKKZ8HOaN8BpkXswVnpbzpQbj3TITZI4aeV 6YR4URBQ5UkrVLEXFXbrOzwjL2zqDdyNoBdTJmGLJ+5b/n0HHzmyMVkegNIwLLM3 GR7sMQae =Q/+F -----END PGP SIGNATURE----- Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linux Pull bitmap updates from Yury Norov: - Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld) - cleanup nr_cpu_ids vs nr_cpumask_bits mess (me) This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. - optimize find_bit() functions (me) Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. - add find_nth_bit() (me) Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. - repair cpumask_check() (me) After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. - Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin Schneider) Extends the API with one more function and applies it in sched/core. * tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits) sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() lib/test_cpumask: Add for_each_cpu_and(not) tests cpumask: Introduce for_each_cpu_andnot() lib/find_bit: Introduce find_next_andnot_bit() cpumask: fix checking valid cpu range lib/bitmap: add tests for for_each() loops lib/find: optimize for_each() macros lib/bitmap: introduce for_each_set_bit_wrap() macro lib/find_bit: add find_next{,_and}_bit_wrap cpumask: switch for_each_cpu{,_not} to use for_each_bit() net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and} cpumask: add cpumask_nth_{,and,andnot} lib/bitmap: remove bitmap_ord_to_pos lib/bitmap: add tests for find_nth_bit() lib: add find_nth{,_and,_andnot}_bit() lib/bitmap: add bitmap_weight_and() lib/bitmap: don't call __bitmap_weight() in kernel code tools: sync find_bit() implementation lib/find_bit: optimize find_next_bit() functions lib/find_bit: create find_first_zero_bit_le() ... |
||
![]() |
30c999937f |
Scheduler changes for v6.1:
- Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities - Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code - Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. - Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() - Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+ LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq 4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i TeSLCxQxXlU= =KjMD -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Debuggability: - Change most occurances of BUG_ON() to WARN_ON_ONCE() - Reorganize & fix TASK_ state comparisons, turn it into a bitmap - Update/fix misc scheduler debugging facilities Load-balancing & regular scheduling: - Improve the behavior of the scheduler in presence of lot of SCHED_IDLE tasks - in particular they should not impact other scheduling classes. - Optimize task load tracking, cleanups & fixes - Clean up & simplify misc load-balancing code Freezer: - Rewrite the core freezer to behave better wrt thawing and be simpler in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting all the fallout. Deadline scheduler: - Fix the DL capacity-aware code - Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period() - Relax/optimize locking in task_non_contending() Cleanups: - Factor out the update_current_exec_runtime() helper - Various cleanups, simplifications" * tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) sched: Fix more TASK_state comparisons sched: Fix TASK_state comparisons sched/fair: Move call to list_last_entry() in detach_tasks sched/fair: Cleanup loop_max and loop_break sched/fair: Make sure to try to detach at least one movable task sched: Show PF_flag holes freezer,sched: Rewrite core freezer logic sched: Widen TAKS_state literals sched/wait: Add wait_event_state() sched/completion: Add wait_for_completion_state() sched: Add TASK_ANY for wait_task_inactive() sched: Change wait_task_inactive()s match_state freezer,umh: Clean up freezer/initrd interaction freezer: Have {,un}lock_system_sleep() save/restore flags sched: Rename task_running() to task_on_cpu() sched/fair: Cleanup for SIS_PROP sched/fair: Default to false in test_idle_cores() sched/fair: Remove useless check in select_idle_core() sched/fair: Avoid double search on same cpu sched/fair: Remove redundant check in select_idle_smt() ... |
||
![]() |
513389809e |
for-6.1/block-2022-10-03
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67XkQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiHoD/9eN+6YnNRPu5+2zeGnnm1Nlwic6YMZeORr KFIeC0COMWoFhNBIPFkgAKT+0qIH+uGt5UsHSM3Y5La7wMR8yLxD4PAnvTZ/Ijtt yxVIOmonJoQ0OrQ2kTbvDXL/9OCUrzwXXyUIEPJnH0Ca1mxeNOgDHbE7VGF6DMul 0D3pI8qs2WLnHlDi1V/8kH5qZ6WoAJSDcb8sTzOUVnyveZPNaZhGQJuHA2XAYMtg fqKMDJqgmNk6jdTMUgdF5B+rV64PQoCy28I7fXqGkEe+RE5TBy57vAa0XY84V8XR /a8CEuwMts2ypk1hIcJG8Vv8K6u5war9yPM5MTngKsoMpzNIlhrhaJQVyjKdcs+E Ixwzexu6xTYcrcq+mUARgeTh79FzTBM/uXEdbCG2G3S6HPd6UZWUJZGfxw/l0Aem V4xB7lj6SQaJDU1iJCYUaHcekNXhQAPvyVG+R2ED1SO3McTpTPIM1aeigxw6vj7u bH3Kfdr94Z8HNuoLuiS6YYfjNt2Shf4LEB6GxKJ9TYHtyhdOyO0H64jGHpygrWqN cSnkWPUqUUNpF7srKM0ZgbliCshvmyJc4aMOFd0gBY/kXf5J/j7IXvh8TFCi9rHH 0KyZH3/3Zsu9geUn3ynznlr4FXU+BcqE6boaa/iWb9sN1m+Rvaahv8cSch/dh44a vQNj/iOBQA== =R05e -----END PGP SIGNATURE----- Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... |
||
![]() |
585463f0d5 |
sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
This removes the second use of the sched_core_mask temporary mask. Suggested-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
c79e6fa98c |
Power management updates for 6.1-rc1
- Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug Smythies). - Update the AMD P-state driver (Perry Yuan): * Fix wrong lowest perf fetch. * Map desired perf into pstate scope for powersave governor. * Update pstate frequency transition delay time. * Fix initial highest_perf value. * Clean up. - Move max CPU capacity to sugov_policy in the schedutil cpufreq governor (Lukasz Luba). - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski). - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen, and Yang Yingliang). - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan, and Viresh Kumar). - Minor cleanups around functions called at module_init() (Xiu Jianfeng). - Use module_init and add module_exit for bmips driver (Zhang Jianhua). - Add AlderLake-N support to intel_idle (Zhang Rui). - Replace strlcpy() with unused retval with strscpy() in intel_idle (Wolfram Sang). - Remove redundant check from cpuidle_switch_governor() (Yu Liao). - Replace strlcpy() with unused retval with strscpy() in the powernv cpuidle driver (Wolfram Sang). - Drop duplicate word from a comment in the coupled cpuidle driver (Jason Wang). - Make rpm_resume() return -EINPROGRESS if RPM_NOWAIT is passed to it in the flags and the device is about to resume (Rafael Wysocki). - Add extra debugging statement for multiple active IRQs to system wakeup handling code (Mario Limonciello). - Replace strlcpy() with unused retval with strscpy() in the core system suspend support code (Wolfram Sang). - Update the intel_rapl power capping driver: * Use standard Energy Unit for SPR Dram RAPL domain (Zhang Rui). * Add support for RAPTORLAKE_S (Zhang Rui). * Fix UBSAN shift-out-of-bounds issue (Chao Qin). - Handle -EPROBE_DEFER when regulator is not probed on mtk-ci-devfreq.c (AngeloGioacchino Del Regno). - Fix message typo and use dev_err_probe() in rockchip-dfi.c (Christophe JAILLET). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmM7OrYSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxeKAP/jFiZ1lhTGRngiVLMV6a6SSSy5xzzXZZ b/V0oqsuUvWWo6CzVmfU4QfmKGr55+77NgI9Yh5qN6zJTEJmunuCYwVD80KdxPDJ 8SjMUNCACiVwfryLR1gFJlO+0BN4CWTxvto2gjGxzm0l1UQBACf71wm9MQCP8b7A gcBNuOtM7o5NLywDB+/528SiF9AXfZKjkwXhJACimak5yQytaCJaqtOWtcG2KqYF USunmqSB3IIVkAa5LJcwloc8wxHYo5mTPaWGGuSA65hfF42k3vJQ2/b8v8oTVza7 bKzhegErIYtL6B9FjB+P1FyknNOvT7BYr+4RSGLvaPySfjMn1bwz9fM1Epo59Guk Azz3ExpaPixDh+x7b89W1Gb751FZU/zlWT+h1CNy5sOP/ChfxgCEBHw0mnWJ2Y0u CPcI/Ch0FNQHG+PdbdGlyfvORHVh7te/t6dOhoEHXBue+1r3VkOo8tRGY9x+2IrX /JB968u1r0oajF0btGwaDdbbWlyMRTzjrxVl3bwsuz/Kv/0JxsryND2JT0zkKAMZ qYT29HQxhdE0Duw1chgAK6X+BsgP58Bu6LeM3mVcwnGPZE9QvcFa0GQh7z+H71AW 3yOGNmMVMqQSThBYFC6GDi7O2N1UEsLOMV9+ThTRh6D11nU4uiITM5QVIn8nWZGR z3IZ52Jg0oeJ =+3IL -----END PGP SIGNATURE----- Merge tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These add support for some new hardware, extend the existing hardware support, fix some issues and clean up code Specifics: - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug Smythies) - Update the AMD P-state driver (Perry Yuan): - Fix wrong lowest perf fetch - Map desired perf into pstate scope for powersave governor - Update pstate frequency transition delay time - Fix initial highest_perf value - Clean up - Move max CPU capacity to sugov_policy in the schedutil cpufreq governor (Lukasz Luba) - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski) - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen, and Yang Yingliang) - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan, and Viresh Kumar) - Minor cleanups around functions called at module_init() (Xiu Jianfeng) - Use module_init and add module_exit for bmips driver (Zhang Jianhua) - Add AlderLake-N support to intel_idle (Zhang Rui) - Replace strlcpy() with unused retval with strscpy() in intel_idle (Wolfram Sang) - Remove redundant check from cpuidle_switch_governor() (Yu Liao) - Replace strlcpy() with unused retval with strscpy() in the powernv cpuidle driver (Wolfram Sang) - Drop duplicate word from a comment in the coupled cpuidle driver (Jason Wang) - Make rpm_resume() return -EINPROGRESS if RPM_NOWAIT is passed to it in the flags and the device is about to resume (Rafael Wysocki) - Add extra debugging statement for multiple active IRQs to system wakeup handling code (Mario Limonciello) - Replace strlcpy() with unused retval with strscpy() in the core system suspend support code (Wolfram Sang) - Update the intel_rapl power capping driver: - Use standard Energy Unit for SPR Dram RAPL domain (Zhang Rui). - Add support for RAPTORLAKE_S (Zhang Rui). - Fix UBSAN shift-out-of-bounds issue (Chao Qin) - Handle -EPROBE_DEFER when regulator is not probed on mtk-ci-devfreq.c (AngeloGioacchino Del Regno) - Fix message typo and use dev_err_probe() in rockchip-dfi.c (Christophe JAILLET)" * tag 'pm-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (29 commits) cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh cpufreq: Add __init annotation to module init funcs cpufreq: tegra194: change tegra239_cpufreq_soc to static PM / devfreq: rockchip-dfi: Fix an error message PM / devfreq: mtk-cci: Handle sram regulator probe deferral powercap: intel_rapl: Use standard Energy Unit for SPR Dram RAPL domain PM: runtime: Return -EINPROGRESS from rpm_resume() in the RPM_NOWAIT case intel_idle: Add AlderLake-N support powercap: intel_rapl: fix UBSAN shift-out-of-bounds issue cpufreq: tegra194: Add support for Tegra239 cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode powercap: intel_rapl: Add support for RAPTORLAKE_S cpufreq: amd-pstate: Fix initial highest_perf value cpuidle: Remove redundant check in cpuidle_switch_governor() PM: wakeup: Add extra debugging statement for multiple active IRQs cpufreq: tegra194: Remove the unneeded result variable PM: suspend: move from strlcpy() with unused retval to strscpy() intel_idle: move from strlcpy() with unused retval to strscpy() cpuidle: powernv: move from strlcpy() with unused retval to strscpy() ... |
||
![]() |
0766fa2e8a |
Merge branch 'pm-cpufreq'
Merge cpufreq changes for 6.1-rc1: - Add isupport for Tiger Lake in no-HWP mode to intel_pstate (Doug Smythies). - Update the AMD P-state driver (Perry Yuan): * Fix wrong lowest perf fetch. * Map desired perf into pstate scope for powersave governor. * Update pstate frequency transition delay time. * Fix initial highest_perf value. * Clean up. - Move max CPU capacity to sugov_policy in the schedutil cpufreq governor (Lukasz Luba). - Add SM6115 to cpufreq-dt blocklist (Adam Skladowski). - Add support for Tegra239 and minor cleanups (Sumit Gupta, ye xingchen, and Yang Yingliang). - Add freq qos for qcom cpufreq driver and minor cleanups (Xuewen Yan, and Viresh Kumar). - Minor cleanups around functions called at module_init() (Xiu Jianfeng). - Use module_init and add module_exit for bmips driver (Zhang Jianhua). * pm-cpufreq: cpufreq: qcom-cpufreq-hw: Add cpufreq qos for LMh cpufreq: Add __init annotation to module init funcs cpufreq: tegra194: change tegra239_cpufreq_soc to static cpufreq: tegra194: Add support for Tegra239 cpufreq: qcom-cpufreq-hw: Fix uninitialized throttled_freq warning cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode cpufreq: amd-pstate: Fix initial highest_perf value cpufreq: tegra194: Remove the unneeded result variable cpufreq: amd-pstate: update pstate frequency transition delay time cpufreq: amd_pstate: map desired perf into pstate scope for powersave governor cpufreq: amd_pstate: fix wrong lowest perf fetch cpufreq: amd-pstate: fix white-space cpufreq: amd-pstate: simplify cpudata pointer assignment cpufreq: bmips-cpufreq: Use module_init and add module_exit cpufreq: schedutil: Move max CPU capacity to sugov_policy cpufreq: Add SM6115 to cpufreq-dt-platdev blocklist |
||
![]() |
890f242084 |
RCU pull request for v6.1
This pull request contains the following branches: doc.2022.08.31b: Documentation updates. This is the first in a series from an ongoing review of the RCU documentation. "Why are people thinking -that- about RCU? Oh. Because that is an entirely reasonable interpretation of its documentation." fixes.2022.08.31b: Miscellaneous fixes. kvfree.2022.08.31b: Improved memory allocation and heuristics. nocb.2022.09.01a: Improve rcu_nocbs diagnostic output. poll.2022.08.31b: Add full-sized polled RCU grace period state values. These are the same size as an rcu_head structure, which is double that of the traditional unsigned long state values that may still be obtained from et_state_synchronize_rcu(). The added size avoids missing overlapping grace periods. This benefit is that call_rcu() can be replaced by polling, which can be attractive in situations where RCU-protected data is aged out of memory. Early in the series, the size of this state value is three unsigned longs. Later in the series, the synchronize_rcu() and synchronize_rcu_expedited() fastpaths are reworked to permit the full state to be represented by only two unsigned longs. This reworking slows these two functions down in SMP kernels running either on single-CPU systems or on systems with all but one CPU offlined, but this should not be a significant problem. And if it somehow becomes a problem in some yet-as-unforeseen situations, three-value state values can be provided for only those situations. Finally, a pair of functions named same_state_synchronize_rcu() and same_state_synchronize_rcu_full() allow grace-period state values to be compared for equality. This permits users to maintain lists of data structures having the same state value, removing the need for per-data-structure grace-period state values, thus decreasing memory footprint. poll-srcu.2022.08.31b: Polled SRCU grace-period updates, including adding tests to rcutorture and reducing the incidence of Tiny SRCU grace-period-state counter wrap. tasks.2022.08.31b: Improve Tasks RCU diagnostics and quiescent-state detection. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmM3bxQTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jC0zD/0cPe9Nl3LPKVTqDN8wWG6SUOHcwQrg dLBUo1pbBh3mK3HcHzwl1iIF7gd2nmKN7UT3m+C+qv+N3Q9ej9K+MutGThCiRvNT A56TDYU9I1xfqoQ25E9TL7nqty818rtYYMl36Rw8epcLKHo/It9MFODb5kEBY5ir P5UaIK2D4heHfJL6Di8JDq9vC5a/NlNIIkiIj7lUB+px0FpVW0dUqmnWbIOE74YH OBGJ/Mxn6KDO4WeFO0v0DxVaBTLd+khu6W0JspI0szOO6iyTqiDCGE5EqkEdcs5I Fk9WCifdo9nrQG0LPIuEBv0YnwNfGbe5nYXupAmGFb3tdCbjkM+W0UBUE032nXog 3E6m5FEBD1XGQttScFHm70kYssa+xI7khGb9/ZFoYN/QW28oWqwfnx6+eAZGxPNS AZx6pc2bebg8sOUhkz/Sv+qMH7CQgIgcMR66SKl5SdT1Onaig45sgdUuC23BshgG oEdDxvK7vexFQT6q0oqU8LAO/CVKdyVIswt3pB6CUmn8yNgSo+qDZzlEHt0gPdMY 4Xa1jnNtOHobDnI4g0JMdVqAujByrRq74ZsVW96hdedKrA0r9y462jnVBm9tqW68 lu0Lw9WLif2kw0lMY8Q59zqTL+fB8TdNiZrHoqefwvQ/ZrvinfHGSblcrS8zAhX3 4oVwCUs9pPQRMA== =AZPZ -----END PGP SIGNATURE----- Merge tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates. This is the first in a series from an ongoing review of the RCU documentation. "Why are people thinking -that- about RCU? Oh. Because that is an entirely reasonable interpretation of its documentation." - Miscellaneous fixes. - Improved memory allocation and heuristics. - Improve rcu_nocbs diagnostic output. - Add full-sized polled RCU grace period state values. These are the same size as an rcu_head structure, which is double that of the traditional unsigned long state values that may still be obtained from et_state_synchronize_rcu(). The added size avoids missing overlapping grace periods. This benefit is that call_rcu() can be replaced by polling, which can be attractive in situations where RCU-protected data is aged out of memory. Early in the series, the size of this state value is three unsigned longs. Later in the series, the fastpaths in synchronize_rcu() and synchronize_rcu_expedited() are reworked to permit the full state to be represented by only two unsigned longs. This reworking slows these two functions down in SMP kernels running either on single-CPU systems or on systems with all but one CPU offlined, but this should not be a significant problem. And if it somehow becomes a problem in some yet-as-unforeseen situations, three-value state values can be provided for only those situations. Finally, a pair of functions named same_state_synchronize_rcu() and same_state_synchronize_rcu_full() allow grace-period state values to be compared for equality. This permits users to maintain lists of data structures having the same state value, removing the need for per-data-structure grace-period state values, thus decreasing memory footprint. - Polled SRCU grace-period updates, including adding tests to rcutorture and reducing the incidence of Tiny SRCU grace-period-state counter wrap. - Improve Tasks RCU diagnostics and quiescent-state detection. * tag 'rcu.2022.09.30a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (55 commits) rcutorture: Use the barrier operation specified by cur_ops rcu-tasks: Make RCU Tasks Trace check for userspace execution rcu-tasks: Ensure RCU Tasks Trace loops have quiescent states rcu-tasks: Convert RCU_LOCKDEP_WARN() to WARN_ONCE() srcu: Make Tiny SRCU use full-sized grace-period counters srcu: Make Tiny SRCU poll_state_synchronize_srcu() more precise srcu: Add GP and maximum requested GP to Tiny SRCU rcutorture output rcutorture: Make "srcud" option also test polled grace-period API rcutorture: Limit read-side polling-API testing rcu: Add functions to compare grace-period state values rcutorture: Expand rcu_torture_write_types() first "if" statement rcutorture: Use 1-suffixed variable in rcu_torture_write_types() check rcu: Make synchronize_rcu() fastpath update only boot-CPU counters rcutorture: Adjust rcu_poll_need_2gp() for rcu_gp_oldstate field removal rcu: Remove ->rgos_polled field from rcu_gp_oldstate structure rcu: Make synchronize_rcu_expedited() fast path update .expedited_sequence rcu: Remove expedited grace-period fast-path forward-progress helper rcu: Make synchronize_rcu() fast path update ->gp_seq counters rcu-tasks: Remove grace-period fast-path rcu-tasks helper rcu: Set rcu_data structures' initial ->gpwrap value to true ... |
||
![]() |
5aec788aeb |
sched: Fix TASK_state comparisons
Task state is fundamentally a bitmask; direct comparisons are probably
not working as intended. Specifically the normal wait-state have
a number of possible modifiers:
TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE
TASK_INTERRUPTIBLE: TASK_FREEZABLE
Specifically, the addition of TASK_FREEZABLE wrecked
__wait_is_interruptible(). This however led to an audit of direct
comparisons yielding the rest of the changes.
Fixes:
|
||
![]() |
0cd4d02c32 |
sched: use maple tree iterator to walk VMAs
The linked list is slower than walking the VMAs using the maple tree. We can't use the VMA iterator here because it doesn't support moving to an earlier position. Link: https://lkml.kernel.org/r/20220906194824.2110408-49-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
467b171af8 |
mm/demotion: update node_is_toptier to work with memory tiers
With memory tier support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower(slower) memory tiers added we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier [sj@kernel.org: include missed header file, memory-tiers.h] Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h] [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers] Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
bd74fdaea1 |
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[8, 10]% Ops/sec KB/sec patch1-7: 1147696.57 44640.29 patch1-8: 1245274.91 48435.66 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 48.16% lzo1x_1_do_compress (real work) 8.20% page_vma_mapped_walk (overhead) 7.06% _raw_spin_unlock_irq 2.92% ptep_clear_flush 2.53% __zram_bvec_write 2.11% do_raw_spin_lock 2.02% memmove 1.93% lru_gen_look_around 1.56% free_unref_page_list 1.40% memset patch1-8 49.44% lzo1x_1_do_compress (real work) 6.19% page_vma_mapped_walk (overhead) 5.97% _raw_spin_unlock_irq 3.13% get_pfn_folio 2.85% ptep_clear_flush 2.42% __zram_bvec_write 2.08% do_raw_spin_lock 1.92% memmove 1.44% alloc_zspage 1.36% memset Configurations: no change Thanks to the following developers for their efforts [3]. kernel test robot <lkp@intel.com> [1] https://lwn.net/Articles/23732/ [2] https://llvm.org/docs/ScudoHardenedAllocator.html [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/ Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
527eb453bb |
sched/psi: export psi_memstall_{enter,leave}
To properly account for all refaults from file system logic, file systems need to call psi_memstall_enter directly, so export it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220915094200.139713-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
7e9518baed |
sched/fair: Move call to list_last_entry() in detach_tasks
Move the call to list_last_entry() in detach_tasks() after testing loop_max and loop_break. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-4-vincent.guittot@linaro.org |
||
![]() |
c59862f826 |
sched/fair: Cleanup loop_max and loop_break
sched_nr_migrate_break is set to a fix value and never changes so we can replace it by a define SCHED_NR_MIGRATE_BREAK. Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value of sysctl_sched_nr_migrate which can be init to different values. Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate. The behavior stays unchanged unless you modify sysctl_sched_nr_migrate trough debugfs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org |
||
![]() |
b0defa7ae0 |
sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max time to move a task. But it can happen that the loop_max LRU tasks (ie tail of the cfs_tasks list) can't be moved to dst_cpu because of affinity. In this case, loop in the list until we found at least one. The maximum of detached tasks remained the same as before. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org |
||
![]() |
c959924b0d |
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration dependent. So in this patch, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page, and the page is called the candidate promotion page. If the number of the candidate promotion pages in the statistics interval is much more than the promotion rate limit, the hot threshold will be decreased to reduce the number of the candidate promotion pages. Otherwise, the hot threshold will be increased to increase the number of the candidate promotion pages. To make the above method works, in each statistics interval, the total number of the pages to check (on which the hint page faults occur) and the hot/cold distribution need to be stable. Because the page tables are scanned linearly in NUMA balancing, but the hot/cold distribution isn't uniform along the address usually, the statistics interval should be larger than the NUMA balancing scan period. So in the patch, the max scan period is used as statistics interval and it works well in our tests. Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
c6833e1000 |
memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
33024536ba |
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4. To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and hint page fault. Which is a kind of mostly frequently accessed (MFU) algorithm. In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism. The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. We used the pmbench memory accessing benchmark tested the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows, pmbench score promote rate (accesses/s) MB/s ------------- ------------ base 146887704.1 725.6 hot selection 165695601.2 544.0 rate limit 162814569.8 165.2 auto adjustment 170495294.0 136.9 From the results above, With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel. With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch. With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch. Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G). sysbench /usr/share/sysbench/oltp_read_write.lua \ ...... --tables=200 \ --table-size=1000000 \ --report-interval=10 \ --threads=16 \ --time=120 The tps can be improved about 5%. This patch (of 3): To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm. Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
e35be05d74 |
Driver core fixes for 6.0-rc5
Here are some small driver core and debugfs fixes for 6.0-rc5. Included in here are: - multiple attempts to get the arch_topology code to work properly on non-cluster SMT systems. First attempt caused build breakages in linux-next and 0-day, second try worked. - debugfs fixes for a long-suffering memory leak. The pattern of debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so add debugfs_lookup_and_remove() to fix this problem. Also fix up the scheduler debug code that highlighted this problem. Fixes for other subsystems will be trickling in over the next few months for this same issue once the debugfs function is merged. All of these have been in linux-next since Wednesday with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYxuERw8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylPqwCgjU6xlN2y/80HH+66k+yyzlxocE8AoLPgnGrA dJZIGWFXExzO26tvMT52 =zGHA -----END PGP SIGNATURE----- Merge tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are some small driver core and debugfs fixes for 6.0-rc5. Included in here are: - multiple attempts to get the arch_topology code to work properly on non-cluster SMT systems. First attempt caused build breakages in linux-next and 0-day, second try worked. - debugfs fixes for a long-suffering memory leak. The pattern of debugfs_remove(debugfs_lookup(...)) turns out to leak dentries, so add debugfs_lookup_and_remove() to fix this problem. Also fix up the scheduler debug code that highlighted this problem. Fixes for other subsystems will be trickling in over the next few months for this same issue once the debugfs function is merged. All of these have been in linux-next since Wednesday with no reported problems" * tag 'driver-core-6.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: arch_topology: Make cluster topology span at least SMT CPUs sched/debug: fix dentry leak in update_sched_domain_debugfs debugfs: add debugfs_lookup_and_remove() driver core: fix driver_set_override() issue with empty strings Revert "arch_topology: Make cluster topology span at least SMT CPUs" arch_topology: Make cluster topology span at least SMT CPUs |
||
![]() |
34f26a1561 |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit
|
||
![]() |
dc86aba751 |
sched/psi: Cache parent psi_group to speed up group iteration
We use iterate_groups() to iterate each level psi_group to update PSI stats, which is a very hot path. In current code, iterate_groups() have to use multiple branches and cgroup_parent() to get parent psi_group for each level, which is not very efficient. This patch cache parent psi_group in struct psi_group, only need to get psi_group of task itself first, then just use group->parent to iterate. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com |
||
![]() |
52b1364ba0 |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com |
||
![]() |
71dbdde791 |
sched/psi: Remove NR_ONCPU task accounting
We put all fields updated by the scheduler in the first cacheline of struct psi_group_cpu for performance. Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure, we need to reclaim space first. This patch remove NR_ONCPU task accounting in struct psi_group_cpu, use one bit in state_mask to track instead. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com |
||
![]() |
65176f59a1 |
sched/psi: Optimize task switch inside shared cgroups again
Way back when PSI_MEM_FULL was accounted from the timer tick, task
switching could simply iterate next and prev to the common ancestor to
update TSK_ONCPU and be done.
Then memstall ticks were replaced with checking curr->in_memstall
directly in psi_group_change(). That meant that now if the task switch
was between a memstall and a !memstall task, we had to iterate through
the common ancestors at least ONCE to fix up their state_masks.
We added the identical_state filter to make sure the common ancestor
elimination was skipped in that case. It seems that was always a
little too eager, because it caused us to walk the common ancestors
*twice* instead of the required once: the iteration for next could
have stopped at the common ancestor; prev could have updated TSK_ONCPU
up to the common ancestor, then finish to the root without changing
any flags, just to get the new curr->in_memstall into the state_masks.
This patch recognizes this and makes it so that we walk to the root
exactly once if state_mask needs updating, which is simply catching up
on a missed optimization that could have been done in commit
|
||
![]() |
d79ddb069c |
sched/psi: Move private helpers to sched/stats.h
This patch move psi_task_change/psi_task_switch declarations out of PSI public header, since they are only needed for implementing the PSI stats tracking in sched/stats.h psi_task_switch is obvious, psi_task_change can't be public helper since it doesn't check psi_disabled static key. And there is no any user now, so put it in sched/stats.h too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-5-zhouchengming@bytedance.com |
||
![]() |
e2ad8ab04c |
sched/psi: Save percpu memory when !psi_cgroups_enabled
We won't use cgroup psi_group when !psi_cgroups_enabled, so don't bother to alloc percpu memory and init for it. Also don't need to migrate task PSI stats between cgroups in cgroup_move_task(). Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-4-zhouchengming@bytedance.com |
||
![]() |
c530a3c716 |
sched/psi: Fix periodic aggregation shut off
We don't want to wake periodic aggregation work back up if the task change is the aggregation worker itself going to sleep, or we'll ping-pong forever. Previously, we would use psi_task_change() in psi_dequeue() when task going to sleep, so this check was put in psi_task_change(). But commit |
||
![]() |
f5d39b0208 |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org |
||
![]() |
929659acea |
sched/completion: Add wait_for_completion_state()
Allows waiting with a custom @state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220822114648.922711674@infradead.org |
||
![]() |
f9fc8cad97 |
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net |
||
![]() |
9204a97f7a |
sched: Change wait_task_inactive()s match_state
Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org |
||
![]() |
0b9d46fc5e |
sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net |
||
![]() |
96c1c0cfe4 |
sched/fair: Cleanup for SIS_PROP
The sched-domain of this cpu is only used for some heuristics when SIS_PROP is enabled, and it should be irrelevant whether the local sd_llc is valid or not, since all we care about is target sd_llc if !SIS_PROP. Access the local domain only when there is a need. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220907112000.1854-6-wuyun.abel@bytedance.com |
||
![]() |
398ba2b0cc |
sched/fair: Default to false in test_idle_cores()
It's uncertain whether idle cores exist or not if shared sched- domains are not ready, so returning "no idle cores" usually makes sense. While __update_idle_core() is an exception, it checks status of this core and set hint to shared sched-domain if necessary. So the whole logic of this function depends on the existence of shared sched-domain, and can certainly bail out early if it is not available. It's somehow a little tricky, and as Josh suggested that it should be transient while the domain isn't ready. So remove the self-defined default value to make things more clearer. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-5-wuyun.abel@bytedance.com |
||
![]() |
8eeeed9c4a |
sched/fair: Remove useless check in select_idle_core()
The function select_idle_core() only gets called when has_idle_cores is true which can be possible only when sched_smt_present is enabled. This change also aligns select_idle_core() with select_idle_smt() in the way that the caller do the check if necessary. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-4-wuyun.abel@bytedance.com |
||
![]() |
b9bae70440 |
sched/fair: Avoid double search on same cpu
The prev cpu is checked at the beginning of SIS, and it's unlikely to be idle before the second check in select_idle_smt(). So we'd better focus on its SMT siblings. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-3-wuyun.abel@bytedance.com |
||
![]() |
3e6efe87cd |
sched/fair: Remove redundant check in select_idle_smt()
If two cpus share LLC cache, then the two cores they belong to are also in the same LLC domain. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-2-wuyun.abel@bytedance.com |
||
![]() |
c2e4065965 |
sched/debug: fix dentry leak in update_sched_domain_debugfs
Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup()) leaks a dentry and with a hotplug stress test, the machine eventually runs out of memory. Fix this up by using the newly created debugfs_lookup_and_remove() call instead which properly handles the dentry reference counting logic. Cc: Major Chen <major.chen@samsung.com> Cc: stable <stable@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
33f9352579 |
sched/deadline: Move __dl_clear_params out of dl_bw lock
As members in sched_dl_entity are independent with dl_bw, move __dl_clear_params out of dl_bw lock. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220827020911.30641-1-shangxiaojing@huawei.com |
||
![]() |
96458e7f7d |
sched/deadline: Add replenish_dl_new_period helper
Wrap repeated code in helper function replenish_dl_new_period, which set the deadline and runtime of input dl_se based on pi_of(dl_se). Note that setup_new_dl_entity originally set the deadline and runtime base on dl_se, which should equals to pi_of(dl_se) for non-boosted task. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826100037.12146-1-shangxiaojing@huawei.com |
||
![]() |
973bee493a |
sched/deadline: Add dl_task_is_earliest_deadline helper
Wrap repeated code in helper function dl_task_is_earliest_deadline, which return true if there is no deadline task on the rq at all, or task's deadline earlier than the whole rq. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826083453.698-1-shangxiaojing@huawei.com |
||
![]() |
bc1cca97e6 |
sched/debug: Show the registers of 'current' in dump_cpu_task()
The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
e73dfe3093 |
sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()
The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> |
||
![]() |
53aa930dc4 |
Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit
Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
5531ecffa4 |
sched: Add update_current_exec_runtime helper
Wrap repeated code in helper function update_current_exec_runtime for update the exec time of the current. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220824082856.15674-1-shangxiaojing@huawei.com |
||
![]() |
8238b45798 |
wait_on_bit: add an acquire memory barrier
There are several places in the kernel where wait_on_bit is not followed by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read). On architectures with weak memory ordering, it may happen that memory accesses that follow wait_on_bit are reordered before wait_on_bit and they may return invalid data. Fix this class of bugs by introducing a new function "test_bit_acquire" that works like test_bit, but has acquire memory ordering semantics. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
6d5afdc97e |
cpufreq: schedutil: Move max CPU capacity to sugov_policy
There is no need to keep the max CPU capacity in the per_cpu instance. Furthermore, there is no need to check and update that variable (sg_cpu->max) every time in the frequency change request, which is part of hot path. Instead use struct sugov_policy to store that information. Initialize the max CPU capacity during the setup and start callback. We can do that since all CPUs in the same frequency domain have the same max capacity (capacity setup and thermal pressure are based on that). Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
![]() |
e4fe074d6c |
sched/fair: Don't init util/runnable_avg for !fair task
post_init_entity_util_avg() init task util_avg according to the cpu util_avg at the time of fork, which will decay when switched_to_fair() some time later, we'd better to not set them at all in the case of !fair task. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-10-zhouchengming@bytedance.com |
||
![]() |
d6531ab6e5 |
sched/fair: Move task sched_avg attach to enqueue_task_fair()
When wake_up_new_task(), we use post_init_entity_util_avg() to init util_avg/runnable_avg based on cpu's util_avg at that time, and attach task sched_avg to cfs_rq. Since enqueue_task_fair() -> enqueue_entity() -> update_load_avg() loop will do attach, we can move this work to update_load_avg(). wake_up_new_task(p) post_init_entity_util_avg(p) attach_entity_cfs_rq() --> (1) activate_task(rq, p) enqueue_task() := enqueue_task_fair() enqueue_entity() loop update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH) if (!se->avg.last_update_time && (flags & DO_ATTACH)) attach_entity_load_avg() --> (2) This patch move attach from (1) to (2), update related comments too. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-9-zhouchengming@bytedance.com |
||
![]() |
df16b71c68 |
sched/fair: Allow changing cgroup of new forked task
commit
|
||
![]() |
7e2edaf618 |
sched/fair: Fix another detach on unattached task corner case
commit
|
||
![]() |
e1f078f504 |
sched/fair: Combine detach into dequeue when migrating task
When we are migrating task out of the CPU, we can combine detach and propagation into dequeue_entity() to save the detach_entity_cfs_rq() in migrate_task_rq_fair(). This optimization is like combining DO_ATTACH in the enqueue_entity() when migrating task to the CPU. So we don't have to traverse the CFS tree extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(), which wouldn't be called anymore with this patch's change. detach_task() deactivate_task() dequeue_task_fair() for_each_sched_entity(se) dequeue_entity() update_load_avg() /* (1) */ detach_entity_load_avg() set_task_cpu() migrate_task_rq_fair() detach_entity_cfs_rq() /* (2) */ update_load_avg(); detach_entity_load_avg(); propagate_entity_cfs_rq(); for_each_sched_entity() update_load_avg() This patch save the detach_entity_cfs_rq() called in (2) by doing the detach_entity_load_avg() for a CPU migrating task inside (1) (the task being the first se in the loop) Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com |
||
![]() |
859f206290 |
sched/fair: Update comments in enqueue/dequeue_entity()
When reading the sched_avg related code, I found the comments in enqueue/dequeue_entity() are not updated with the current code. We don't add/subtract entity's runnable_avg from cfs_rq->runnable_avg during enqueue/dequeue_entity(), those are done only for attach/detach. This patch updates the comments to reflect the current code working. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-5-zhouchengming@bytedance.com |
||
![]() |
5d6da83c44 |
sched/fair: Reset sched_avg last_update_time before set_task_rq()
set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked task's sched_avg when migrate, which is not needed for already detached task. task_change_group_fair() will detached the task sched_avg from prev cfs_rq first, so reset sched_avg last_update_time before set_task_rq() to avoid that. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220818124805.601-4-zhouchengming@bytedance.com |
||
![]() |
39c4261191 |
sched/fair: Remove redundant cpu_cgrp_subsys->fork()
We use cpu_cgrp_subsys->fork() to set task group for the new fair task
in cgroup_post_fork().
Since commit
|
||
![]() |
78b6b15770 |
sched/fair: Maintain task se depth in set_task_rq()
Previously we only maintain task se depth in task_move_group_fair(), if a !fair task change task group, its se depth will not be updated, so commit |
||
![]() |
76b079ef4c |
sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
2b97cf7628 |
sched/psi: Zero the memory of struct psi_group
After commit |
||
![]() |
09348d75a6 |
sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com |
||
![]() |
cac03ac368 |
Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuvmwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gONQ/+KkkPTeKgGDvrahTfeYZlmRyvcI1R78r9 yooa8v+DtifznBW2eXDBc8WTruzqr78VyUY+1YSjfKS6FRQWYMficJ3qk3hxgBru 998KZbvl3jXBBlRkqgGeFlF5Ty2KaryEZgX97a7IF/0xWDgpm972jFkJ/KCo/YTY WSQrzutz2FKe71EjK4cAplYxPZIiy/zo2hSGTbsso4M7bO5VLc1Y4qMtFGcCZ7JB s9JYkj2Rfz+AS5wioDRcGuec4A4SrroxKszZA6QDDBuhMJukqexO02xs/fxZ2W4Z DF4U5MFOrtz9AWSGsf1P6XXbgJO8qTgQXZchFsEcJwypV13w8U0IViXQfD/Pvx2X y+WHdnZVIO2sDwOJ15ew7IuoJZ2LsVygrBNFJJaIFOtIz3RzprI0BJN7LeWFALOa IPmbtiY8hVwhKmjRgMHWDwJhMEHLuhGx3idiD89w1pknzTUnKDiwLyEUtyynxeGd ft9uCvPefrYQVx9AiH7wf0W+fg334FCccC+0f8LyduyftUyQCfZIZY6LUSKuKded Odm7k0ngLDPbdZwAHs0Nf/ilRwd91Z7b6hGt5U3ptx+8BPMKB+/k1VoKog7OISPc zGaP7DrtuC4sEdX4X6bqX+mEQhpkLcQw15gVGxhKoHqygWNSZrV634aSSXwfVXJx eT5m/K9a7L0= =CYl5 -----END PGP SIGNATURE----- Merge tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various fixes: a deadline scheduler fix, a migration fix, a Sparse fix and a comment fix" * tag 'sched-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Do not requeue task on CPU excluded from cpus_mask sched/rt: Fix Sparse warnings due to undefined rt.c declarations exit: Fix typo in comment: s/sub-theads/sub-threads sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed |
||
![]() |
751d4cbc43 |
sched/core: Do not requeue task on CPU excluded from cpus_mask
The following warning was triggered on a large machine early in boot on a distribution kernel but the same problem should also affect mainline. WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440 Call Trace: <TASK> rescuer_thread+0x1f6/0x360 kthread+0x156/0x180 ret_from_fork+0x22/0x30 </TASK> Commit |
||
![]() |
8648f92a66 |
sched/core: Remove superfluous semicolon
Signed-off-by: Xin Gao <gaoxin@cdjrlc.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220719111044.7095-1-gaoxin@cdjrlc.com |
||
![]() |
18c31c9711 |
sched/fair: Make per-cpu cpumasks static
The load_balance_mask and select_rq_mask percpu variables are only used in kernel/sched/fair.c. Make them static and move their allocation into init_sched_fair_class(). Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL class (local_cpu_mask_dl in init_sched_dl_class()). [ mingo: Tidied up changelog & touched up the code. ] Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com |
||
![]() |
d985ee9f44 |
sched/fair: Remove unused parameter idle of _nohz_idle_balance()
After commit
|
||
![]() |
b6bb70f9ab |
Several core optimizations:
* threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). * threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. * psi no longer allocates memory when disabled. along with some code cleanups. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo= =sucy -----END PGP SIGNATURE----- Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several core optimizations: - threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). - threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. - psi no longer allocates memory when disabled. ... along with some code cleanups" * tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Skip subtree root in cgroup_update_dfl_csses() cgroup: remove "no" prefixed mount options cgroup: Make !percpu threadgroup_rwsem operations optional cgroup: Add "no" prefixed mount options cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes psi: dont alloc memory for psi by default |
||
![]() |
87514b2c24 |
sched/rt: Fix Sparse warnings due to undefined rt.c declarations
There are several symbols defined in kernel/sched/sched.h but get wrapped in CONFIG_CGROUP_SCHED, even though dummy versions get built in rt.c and therefore trigger Sparse warnings: kernel/sched/rt.c:309:6: warning: symbol 'unregister_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:311:6: warning: symbol 'free_rt_sched_group' was not declared. Should it be static? kernel/sched/rt.c:313:5: warning: symbol 'alloc_rt_sched_group' was not declared. Should it be static? Fix this by moving them outside the CONFIG_CGROUP_SCHED block. [ mingo: Refreshed to the latest scheduler tree, tweaked changelog. ] Signed-off-by: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220721145155.358366-1-ben-linux@fluff.org |
||
![]() |
b6e8d40d43 |
sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:
[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
:
[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
:
[80468.207946] Call Trace:
[80468.208947] cpuset_can_attach+0xa0/0x140
[80468.209953] cgroup_migrate_execute+0x8c/0x490
[80468.210931] cgroup_update_dfl_csses+0x254/0x270
[80468.211898] cgroup_subtree_control_write+0x322/0x400
[80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
[80468.213777] new_sync_write+0x11f/0x1b0
[80468.214689] vfs_write+0x1eb/0x280
[80468.215592] ksys_write+0x5f/0xe0
[80468.216463] do_syscall_64+0x5c/0x80
[80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.
Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.
Fixes:
|
||
![]() |
7d9d077c78 |
RCU pull request for v5.20 (or whatever)
This pull request contains the following branches: doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms. poll.2022.07.21a: Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods. rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead. torture.2022.06.21a: Torture-test updates. ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt 0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5 7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0 Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r vX60+QNxvUBLwA== =vUNm -----END PGP SIGNATURE----- Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates - Miscellaneous fixes - Callback-offload updates, perhaps most notably a new RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be offloaded at boot time, regardless of kernel boot parameters. This is useful to battery-powered systems such as ChromeOS and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot parameter prevents offloaded callbacks from interfering with real-time workloads and with energy-efficiency mechanisms - Polled grace-period updates, perhaps most notably making these APIs account for both normal and expedited grace periods - Tasks RCU updates, perhaps most notably reducing the CPU overhead of RCU tasks trace grace periods by more than a factor of two on a system with 15,000 tasks. The reduction is expected to increase with the number of tasks, so it seems reasonable to hypothesize that a system with 150,000 tasks might see a 20-fold reduction in CPU overhead - Torture-test updates - Updates that merge RCU's dyntick-idle tracking into context tracking, thus reducing the overhead of transitioning to kernel mode from either idle or nohz_full userspace execution for kernels that track context independently of RCU. This is expected to be helpful primarily for kernels built with CONFIG_NO_HZ_FULL=y * tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits) rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings rcu: Diagnose extended sync_rcu_do_polled_gp() loops rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings rcutorture: Test polled expedited grace-period primitives rcu: Add polled expedited grace-period primitives rcutorture: Verify that polled GP API sees synchronous grace periods rcu: Make Tiny RCU grace periods visible to polled APIs rcu: Make polled grace-period API account for expedited grace periods rcu: Switch polled grace-period APIs to ->gp_seq_polled rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty rcu/nocb: Add option to opt rcuo kthreads out of RT priority rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread() rcu/nocb: Add an option to offload all CPUs on boot rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order rcu/nocb: Add/del rdp to iterate from rcuog itself rcu/tree: Add comment to describe GP-done condition in fqs loop rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs() rcu/kvfree: Remove useless monitor_todo flag rcu: Cleanup RCU urgency state for offline CPU ... |
||
![]() |
b349b1181d |
for-5.20/io_uring-2022-07-29
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm5gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmKMD/4l3QIrLbjYIxlfrzQcHbmYuUkbQtj3SbZg 6ejbnGVhCs1P9DdXH8MgE2BxgpiXQE0CqOK7vbSoo5ep2n2UTLI2DIxAl74SMIo7 0wmJXtUJySuViKr3NYVHqlN180MkQYddBz0nGElhkQBPBCMhW8CrtPCeURr/YyHp 2RxSYBXiUx2gRyig+klnp6oPEqelcBZJUyNHdA9yVrgl/RhB/t2rKj7D++8ukQM3 Zuyh8WIkTeTfUz9hdGG7fuCEdZN4DlO2CCEc7uy0cKi6VRCKH4hYUCqClJ+/cfd2 43dUI2O7B6D1t/ObFh8AGIDXBDqVA6ePQohQU6gooRkfQiBPKkc9d0ts4yIhRqca AjkzNM+0Eve3A01loJ8J84w8oZnvNpYEv5n8/sZVLWcyU3UIs0I88nC2OBiFtoRq d77CtFLwOTo+r3STtAhnZOqez90rhS6BqKtqlUP346PCuFItl6/MbGtwdTbLYEFj CVNIb2pERWSr2NxGv4lFyXaX/cRwruxojWH7yc3rRYjr4Ykevd1pe/fMGNiMAnKw 5em/3QU3qq0ZVcXLMihksKeHHFIQwGDRMuyuv/fktV10+yYXQ0t16WzkJT3aR8Xo cqs0r8+6Jnj3uYcOMzj/FoLcpEPr21hnwAtzLto1mG1Wh4JRn/D7Nx5zqxPLxcW+ NiU6VihPOw== =gxeV -----END PGP SIGNATURE----- Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: - As per (valid) complaint in the last merge window, fs/io_uring.c has grown quite large these days. io_uring isn't really tied to fs either, as it supports a wide variety of functionality outside of that. Move the code to io_uring/ and split it into files that either implement a specific request type, and split some code into helpers as well. The code is organized a lot better like this, and io_uring.c is now < 4K LOC (me). - Deprecate the epoll_ctl opcode. It'll still work, just trigger a warning once if used. If we don't get any complaints on this, and I don't expect any, then we can fully remove it in a future release (me). - Improve the cancel hash locking (Hao) - kbuf cleanups (Hao) - Efficiency improvements to the task_work handling (Dylan, Pavel) - Provided buffer improvements (Dylan) - Add support for recv/recvmsg multishot support. This is similar to the accept (or poll) support for have for multishot, where a single SQE can trigger everytime data is received. For applications that expect to do more than a few receives on an instantiated socket, this greatly improves efficiency (Dylan). - Efficiency improvements for poll handling (Pavel) - Poll cancelation improvements (Pavel) - Allow specifiying a range for direct descriptor allocations (Pavel) - Cleanup the cqe32 handling (Pavel) - Move io_uring types to greatly cleanup the tracing (Pavel) - Tons of great code cleanups and improvements (Pavel) - Add a way to do sync cancelations rather than through the sqe -> cqe interface, as that's a lot easier to use for some use cases (me). - Add support to IORING_OP_MSG_RING for sending direct descriptors to a different ring. This avoids the usually problematic SCM case, as we disallow those. (me) - Make the per-command alloc cache we use for apoll generic, place limits on it, and use it for netmsg as well (me). - Various cleanups (me, Michal, Gustavo, Uros) * tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits) io_uring: ensure REQ_F_ISREG is set async offload net: fix compat pointer in get_compat_msghdr() io_uring: Don't require reinitable percpu_ref io_uring: fix types in io_recvmsg_multishot_overflow io_uring: Use atomic_long_try_cmpxchg in __io_account_mem io_uring: support multishot in recvmsg net: copy from user before calling __get_compat_msghdr net: copy from user before calling __copy_msghdr io_uring: support 0 length iov in buffer select in compat io_uring: fix multishot ending when not polled io_uring: add netmsg cache io_uring: impose max limit on apoll cache io_uring: add abstraction around apoll cache io_uring: move apoll cache to poll.c io_uring: consolidate hash_locked io-wq handling io_uring: clear REQ_F_HASH_LOCKED on hash removal io_uring: don't race double poll setting REQ_F_ASYNC_DATA io_uring: don't miss setting REQ_F_DOUBLE_POLL io_uring: disable multishot recvmsg io_uring: only trace one of complete or overflow ... |
||
![]() |
0f03d6805b |
sched/debug: Print each field value left-aligned in sched_show_task()
Currently, the values of some fields are printed right-aligned, causing the field value to be next to the next field name rather than next to its own field name. So print each field value left-aligned, to make it more readable. Before: stack: 0 pid: 307 ppid: 2 flags:0x00000008 After: stack:0 pid:308 ppid:2 flags:0x0000000a This also makes them print in the same style as the other two fields: task:demo0 state:R running task Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com |
||
![]() |
b3f53daacc |
sched/deadline: Use sched_dl_entity's dl_density in dl_task_fits_capacity()
Save a multiplication in dl_task_fits_capacity() by using already maintained per-sched_dl_entity (i.e. per-task) `dl_runtime/dl_deadline` (dl_density). cap_scale(dl_deadline, cap) >= dl_runtime dl_deadline * cap >> SCHED_CAPACITY_SHIFT >= dl_runtime cap >= dl_runtime << SCHED_CAPACITY_SHIFT / dl_deadline cap >= (dl_runtime << BW_SHIFT / dl_deadline) >> BW_SHIFT - SCHED_CAPACITY_SHIFT cap >= dl_density >> BW_SHIFT - SCHED_CAPACITY_SHIFT __sched_setscheduler()->__checkparam_dl() ensures that the 2 corner cases (if conditions) `runtime == RUNTIME_INF (-1)` and `period == 0` of to_ratio(deadline, runtime) are not met when setting dl_density in __sched_setscheduler()-> __setscheduler_params()->__setparam_dl(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-4-dietmar.eggemann@arm.com |
||
![]() |
6092478bcb |
sched/deadline: Make dl_cpuset_cpumask_can_shrink() capacity-aware
dl_cpuset_cpumask_can_shrink() is used to validate whether there is still enough CPU capacity for DL tasks in the reduced cpuset. Currently it still operates on `# remaining CPUs in the cpuset` (1). Change this to use the already capacity-aware DL admission control __dl_overflow() for the `cpumask can shrink` test. dl_b->bw = sched_rt_period << BW_SHIFT / sched_rt_period dl_b->bw * (1) >= currently allocated bandwidth in root_domain (rd) Replace (1) w/ `\Sum CPU capacity in rd >> SCHED_CAPACITY_SHIFT` Adapt __dl_bw_capacity() to take a cpumask instead of a CPU number argument so that `rd->span` and `cpumask of the reduced cpuset` can be used here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-3-dietmar.eggemann@arm.com |
||
![]() |
740cf8a760 |
sched/core: Introduce sched_asym_cpucap_active()
Create an inline helper for conditional code to be only executed on asymmetric CPU capacity systems. This makes these (currently ~10 and future) conditions a lot more readable. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com |
||
![]() |
b167fdffe9 |
This cycle's scheduler updates for v6.0 are:
Load-balancing improvements: ============================ - Improve NUMA balancing on AMD Zen systems for affine workloads. - Improve the handling of reduced-capacity CPUs in load-balancing. - Energy Model improvements: fix & refine all the energy fairness metrics (PELT), and remove the conservative threshold requiring 6% energy savings to migrate a task. Doing this improves power efficiency for most workloads, and also increases the reliability of energy-efficiency scheduling. - Optimize/tweak select_idle_cpu() to spend (much) less time searching for an idle CPU on overloaded systems. There's reports of several milliseconds spent there on large systems with large workloads ... [ Since the search logic changed, there might be behavioral side effects. ] - Improve NUMA imbalance behavior. On certain systems with spare capacity, initial placement of tasks is non-deterministic, and such an artificial placement imbalance can persist for a long time, hurting (and sometimes helping) performance. The fix is to make fork-time task placement consistent with runtime NUMA balancing placement. Note that some performance regressions were reported against this, caused by workloads that are not memory bandwith limited, which benefit from the artificial locality of the placement bug(s). Mel Gorman's conclusion, with which we concur, was that consistency is better than random workload benefits from non-deterministic bugs: "Given there is no crystal ball and it's a tradeoff, I think it's better to be consistent and use similar logic at both fork time and runtime even if it doesn't have universal benefit." - Improve core scheduling by fixing a bug in sched_core_update_cookie() that caused unnecessary forced idling. - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly woken tasks. - Fix a newidle balancing bug that introduced unnecessary wakeup latencies. ABI improvements/fixes: ======================= - Do not check capabilities and do not issue capability check denial messages when a scheduler syscall doesn't require privileges. (Such as increasing niceness.) - Add forced-idle accounting to cgroups too. - Fix/improve the RSEQ ABI to not just silently accept unknown flags. (No existing tooling is known to have learned to rely on the previous behavior.) - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags. Optimizations: ============== - Optimize & simplify leaf_cfs_rq_list() - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg(). Misc fixes & cleanups: ====================== - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems. - Fix a full-NOHZ bug that can in some cases result in the tick not being re-enabled when the last SCHED_RT task is gone from a runqueue but there's still SCHED_OTHER tasks around. - Various PREEMPT_RT related fixes. - Misc cleanups & smaller fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42 HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT /Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4 nzDVOob1LgE= =xr2b -----END PGP SIGNATURE----- Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Load-balancing improvements: - Improve NUMA balancing on AMD Zen systems for affine workloads. - Improve the handling of reduced-capacity CPUs in load-balancing. - Energy Model improvements: fix & refine all the energy fairness metrics (PELT), and remove the conservative threshold requiring 6% energy savings to migrate a task. Doing this improves power efficiency for most workloads, and also increases the reliability of energy-efficiency scheduling. - Optimize/tweak select_idle_cpu() to spend (much) less time searching for an idle CPU on overloaded systems. There's reports of several milliseconds spent there on large systems with large workloads ... [ Since the search logic changed, there might be behavioral side effects. ] - Improve NUMA imbalance behavior. On certain systems with spare capacity, initial placement of tasks is non-deterministic, and such an artificial placement imbalance can persist for a long time, hurting (and sometimes helping) performance. The fix is to make fork-time task placement consistent with runtime NUMA balancing placement. Note that some performance regressions were reported against this, caused by workloads that are not memory bandwith limited, which benefit from the artificial locality of the placement bug(s). Mel Gorman's conclusion, with which we concur, was that consistency is better than random workload benefits from non-deterministic bugs: "Given there is no crystal ball and it's a tradeoff, I think it's better to be consistent and use similar logic at both fork time and runtime even if it doesn't have universal benefit." - Improve core scheduling by fixing a bug in sched_core_update_cookie() that caused unnecessary forced idling. - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly woken tasks. - Fix a newidle balancing bug that introduced unnecessary wakeup latencies. ABI improvements/fixes: - Do not check capabilities and do not issue capability check denial messages when a scheduler syscall doesn't require privileges. (Such as increasing niceness.) - Add forced-idle accounting to cgroups too. - Fix/improve the RSEQ ABI to not just silently accept unknown flags. (No existing tooling is known to have learned to rely on the previous behavior.) - Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags. Optimizations: - Optimize & simplify leaf_cfs_rq_list() - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg(). Misc fixes & cleanups: - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems. - Fix a full-NOHZ bug that can in some cases result in the tick not being re-enabled when the last SCHED_RT task is gone from a runqueue but there's still SCHED_OTHER tasks around. - Various PREEMPT_RT related fixes. - Misc cleanups & smaller fixes" * tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) rseq: Kill process when unknown flags are encountered in ABI structures rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags sched/core: Fix the bug that task won't enqueue into core tree when update cookie nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt() sched/core: Always flush pending blk_plug sched/fair: fix case with reduced capacity CPU sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling sched/core: add forced idle accounting for cgroups sched/fair: Remove the energy margin in feec() sched/fair: Remove task_util from effective utilization in feec() sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() sched/fair: Rename select_idle_mask to select_rq_mask sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() sched/fair: Decay task PELT values during wakeup migration sched/fair: Provide u64 read for 32-bits arch helper sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg sched: only perform capability check on privileged operation sched: Remove unused function group_first_cpu() sched/fair: Remove redundant word " *" selftests/rseq: check if libc rseq support is registered ... |
||
![]() |
ed29b0b4fd |
io_uring: move to separate directory
In preparation for splitting io_uring up a bit, move it into its own top level directory. It didn't really belong in fs/ anyway, as it's not a file system only API. This adds io_uring/ and moves the core files in there, and updates the MAINTAINERS file for the new location. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
34bc7b454d |
Merge branch 'ctxt.2022.07.05a' into HEAD
ctxt.2022.07.05a: Linux-kernel memory model development branch. |
||
![]() |
91caa5ae24 |
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
In function sched_core_update_cookie(), a task will enqueue into the core tree only when it enqueued before, that is, if an uncookied task is cookied, it will not enqueue into the core tree until it enqueue again, which will result in unnecessary force idle. Here follows the scenario: CPU x and CPU y are a pair of SMT siblings. 1. Start task a running on CPU x without sleeping, and task b and task c running on CPU y without sleeping. 2. We create a cookie and share it to task a and task b, and then we create another cookie and share it to task c. 3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched And we will find out that core_forceidle_sum of task a takes 30% time of the sampling period, which shouldn't happen as task a and b have the same cookie. Then we migrate task a to CPU x', migrate task b and c to CPU y', where CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we will found out that core_forceidle_sum of task a and b are almost zero. To solve this problem, we enqueue the task into the core tree if it's on rq. Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com |
||
![]() |
5c66d1b9b3 |
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:
dequeue_task_rt()
dequeue_rt_entity()
dequeue_rt_stack()
dequeue_top_rt_rq()
sub_nr_running() // decrements rq->nr_running
sched_update_tick_dependency()
sched_can_stop_tick() // checks rq->rt.rt_nr_running,
...
__dequeue_rt_entity()
dec_rt_tasks() // decrements rq->rt.rt_nr_running
...
Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
Fixes:
|
||
![]() |
ddfc710395 |
sched/deadline: Fix BUG_ON condition for deboosted tasks
Tasks the are being deboosted from SCHED_DEADLINE might enter
enqueue_task_dl() one last time and hit an erroneous BUG_ON condition:
since they are not boosted anymore, the if (is_dl_boosted()) branch is
not taken, but the else if (!dl_prio) is and inside this one we
BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered)
otherwise we had entered the if branch above. Long story short, the
current condition doesn't make sense and always leads to triggering of a
BUG.
Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has
to be present, but additional flags are not a problem.
Fixes:
|
||
![]() |
401e4963bf |
sched/core: Always flush pending blk_plug
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit
|
||
![]() |
c82a69629c |
sched/fair: fix case with reduced capacity CPU
The capacity of the CPU available for CFS tasks can be reduced because of other activities running on the latter. In such case, it's worth trying to move CFS tasks on a CPU with more available capacity. The rework of the load balance has filtered the case when the CPU is classified to be fully busy but its capacity is reduced. Check if CPU's capacity is reduced while gathering load balance statistic and classify it group_misfit_task instead of group_fully_busy so we can try to move the load on another CPU. Reported-by: David Chen <david.chen@nutanix.com> Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Chen <david.chen@nutanix.com> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org |
||
![]() |
e67198cc05 |
context_tracking: Take idle eqs entrypoints over RCU
The RCU dynticks counter is going to be merged into the context tracking subsystem. Start with moving the idle extended quiescent states entrypoints to context tracking. For now those are dumb redirections to existing RCU calls. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> |
||
![]() |
c02d5546ea |
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag,
so this change saves a compare after cmpxchg.
The definition of cmpxchg based fetch_or was changed in the
same way as atomic_fetch_##op definitions were changed
in
|
||
![]() |
1fcf54deb7 |
sched/core: add forced idle accounting for cgroups
|
||
![]() |
24a9c54182 |
context_tracking: Split user tracking Kconfig
Context tracking is going to be used not only to track user transitions but also idle/IRQs/NMIs. The user tracking part will then become a separate feature. Prepare Kconfig for that. [ frederic: Apply Max Filippov feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> |
||
![]() |
b812fc9768 |
sched/fair: Remove the energy margin in feec()
find_energy_efficient_cpu() integrates a margin to protect tasks from bouncing back and forth from a CPU to another. This margin is set as being 6% of the total current energy estimated on the system. This however does not work for two reasons: 1. The energy estimation is not a good absolute value: compute_energy() used in feec() is a good estimation for task placement as it allows to compare the energy with and without a task. The computed delta will give a good overview of the cost for a certain task placement. It, however, doesn't work as an absolute estimation for the total energy of the system. First it adds the contribution to idle CPUs into the energy, second it mixes util_avg with util_est values. util_avg contains the near history for a CPU usage, it doesn't tell at all what the current utilization is. A system that has been quite busy in the near past will hold a very high energy and then a high margin preventing any task migration to a lower capacity CPU, wasting energy. It even creates a negative feedback loop: by holding the tasks on a less efficient CPU, the margin contributes in keeping the energy high. 2. The margin handicaps small tasks: On a system where the workload is composed mostly of small tasks (which is often the case on Android), the overall energy will be high enough to create a margin none of those tasks can cross. On a Pixel4, a small utilization of 5% on all the CPUs creates a global estimated energy of 140 joules, as per the Energy Model declaration of that same device. This means, after applying the 6% margin that any migration must save more than 8 joules to happen. No task with a utilization lower than 40 would then be able to migrate away from the biggest CPU of the system. The 6% of the overall system energy was brought by the following patch: ( |
||
![]() |
3e8c6c9aac |
sched/fair: Remove task_util from effective utilization in feec()
The energy estimation in find_energy_efficient_cpu() (feec()) relies on the computation of the effective utilization for each CPU of a perf domain (PD). This effective utilization is then used as an estimation of the busy time for this pd. The function effective_cpu_util() which gives this value, scales the utilization relative to IRQ pressure on the CPU to take into account that the IRQ time is hidden from the task clock. The IRQ scaling is as follow: effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of the CPU and irq the IRQ avg time. If now we take as an example a task placement which doesn't raise the OPP on the candidate CPU, we can write the energy delta as: delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) - effective_cpu_util(cpu_util)) = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util We end-up with an energy delta depending on the IRQ avg time, which is a problem: first the time spent on IRQs by a CPU has no effect on the additional energy that would be consumed by a task. Second, we don't want to favour a CPU with a higher IRQ avg time value. Nonetheless, we need to take the IRQ avg time into account. If a task placement raises the PD's frequency, it will increase the energy cost for the entire time where the CPU is busy. A solution is to only use effective_cpu_util() with the CPU contribution part. The task contribution is added separately and scaled according to prev_cpu's IRQ time. No change for the FREQUENCY_UTIL component of the energy estimation. We still want to get the actual frequency that would be selected after the task placement. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com |
||
![]() |
9b340131a4 |
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays invariant after Energy Model creation, i.e. it is not updated after CPU hotplug operations. That's why the PD mask is used in conjunction with the cpu_online_mask (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched multiple times (in compute_energy()) during a run-queue selection for a task. cpu_online_mask may change during this time which can lead to wrong energy calculations. To be able to avoid this, use the select_rq_mask per-cpu cpumask to create a cpumask out of PD cpumask and cpu_online_mask and pass it through the function calls of the EAS run-queue selection path. The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection (find_energy_efficient_cpu()) is now ANDed not only with the SD mask but also with the cpu_online_mask. This is fine since this cpumask has to be in syc with the one used for energy computation (compute_energy()). An exclusive cpuset setup with at least one asymmetric CPU capacity island (hence the additional AND with the SD cpumask) is the obvious exception here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com |
||
![]() |
ec4fc801a0 |
sched/fair: Rename select_idle_mask to select_rq_mask
On 21/06/2022 11:04, Vincent Donnefort wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
that this patch doesn't build anymore (on tip sched/core or linux-next)
because of commit
|
||
![]() |
bb44799949 |
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com |
||
![]() |
e2f3e35f1f |
sched/fair: Decay task PELT values during wakeup migration
Before being migrated to a new CPU, a task sees its PELT values synchronized with rq last_update_time. Once done, that same task will also have its sched_avg last_update_time reset. This means the time between the migration and the last clock update will not be accounted for in util_avg and a discontinuity will appear. This issue is amplified by the PELT clock scaling. It takes currently one tick after the CPU being idle to let clock_pelt catching up clock_task. This is especially problematic for asymmetric CPU capacity systems which need stable util_avg signals for task placement and energy estimation. Ideally, this problem would be solved by updating the runqueue clocks before the migration. But that would require taking the runqueue lock which is quite expensive [1]. Instead estimate the missing time and update the task util_avg with that value. To that end, we need sched_clock_cpu() but it is a costly function. Limit the usage to the case where the source CPU is idle as we know this is when the clock is having the biggest risk of being outdated. See comment in migrate_se_pelt_lag() for more details about how the PELT value is estimated. Notice though this estimation doesn't take into account IRQ and Paravirt time. [1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com |
||
![]() |
d05b43059d |
sched/fair: Provide u64 read for 32-bits arch helper
Introducing macro helpers u64_u32_{store,load}() to factorize lockless accesses to u64 variables for 32-bits architectures. Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To accommodate the later where the copy lies outside of the structure (cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy), use the _copy() version of those helpers. Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and therefore, have a small penalty for 32-bits machines in set_task_rq_fair() and init_cfs_rq(). Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com |
||
![]() |
70fb5ccf2e |
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:
@usecs:
[0] 533 | |
[1] 5495 | |
[2, 4) 12008 | |
[4, 8) 239252 | |
[8, 16) 4041924 |@@@@@@@@@@@@@@ |
[16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 4507667 |@@@@@@@@@@@@@@@ |
[512, 1K) 2600472 |@@@@@@@@@ |
[1K, 2K) 927912 |@@@ |
[2K, 4K) 218720 | |
[4K, 8K) 98161 | |
[8K, 16K) 37722 | |
[16K, 32K) 6715 | |
[32K, 64K) 477 | |
[64K, 128K) 7 | |
netperf latency usecs:
=======
case load Lat_99th std%
TCP_RR thread-224 257.39 ( 0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:
SIS Search Efficiency(se_eff%):
A ratio expressed as a percentage of runqueues scanned versus
idle CPUs found. A 100% efficiency indicates that the target,
prev or recent CPU of a task was idle at wakeup. The lower the
efficiency, the more runqueues were scanned before an idle CPU
was found.
SIS Domain Search Efficiency(dom_eff%):
Similar, except only for the slower SIS
patch.
SIS Fast Success Rate(fast_rate%):
Percentage of SIS that used target, prev or
recent CPUs.
SIS Success rate(success_rate%):
Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
TCP_RR 28 threads 99.978 18.535 99.995 100.000
TCP_RR 56 threads 99.397 5.671 99.964 100.000
TCP_RR 84 threads 21.721 6.818 73.632 100.000
TCP_RR 112 threads 12.500 5.533 59.000 100.000
TCP_RR 140 threads 8.524 4.535 49.020 100.000
TCP_RR 168 threads 6.438 3.945 40.309 99.999
TCP_RR 196 threads 5.397 3.718 32.320 99.982
TCP_RR 224 threads 4.874 3.661 25.775 99.767
UDP_RR 28 threads 99.988 17.704 99.997 100.000
UDP_RR 56 threads 99.528 5.977 99.970 100.000
UDP_RR 84 threads 24.219 6.992 76.479 100.000
UDP_RR 112 threads 13.907 5.706 62.538 100.000
UDP_RR 140 threads 9.408 4.699 52.519 100.000
UDP_RR 168 threads 7.095 4.077 44.352 100.000
UDP_RR 196 threads 5.757 3.775 35.764 99.991
UDP_RR 224 threads 5.124 3.704 28.748 99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
normal 1 mthread 99.152 6.400 99.941 100.000
normal 2 mthreads 97.844 4.003 99.908 100.000
normal 3 mthreads 96.395 2.118 99.917 99.998
normal 4 mthreads 55.288 1.451 98.615 99.804
normal 5 mthreads 7.004 1.870 45.597 61.036
normal 6 mthreads 3.354 1.346 20.777 34.230
normal 7 mthreads 2.183 1.028 11.257 21.055
normal 8 mthreads 1.653 0.825 7.849 15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
process-pipe 1 group 99.991 7.692 99.999 100.000
process-pipe 2 groups 99.934 4.615 99.997 100.000
process-pipe 3 groups 99.597 3.198 99.987 100.000
process-pipe 4 groups 98.378 2.464 99.958 100.000
process-pipe 5 groups 27.474 3.653 89.811 99.800
process-pipe 6 groups 20.201 4.098 82.763 99.570
process-pipe 7 groups 16.423 4.156 77.398 99.316
process-pipe 8 groups 13.165 3.920 72.232 98.828
process-sockets 1 group 99.977 5.882 99.999 100.000
process-sockets 2 groups 99.927 5.505 99.996 100.000
process-sockets 3 groups 99.397 3.250 99.980 100.000
process-sockets 4 groups 79.680 4.258 98.864 99.998
process-sockets 5 groups 7.673 2.503 63.659 92.115
process-sockets 6 groups 4.642 1.584 58.946 88.048
process-sockets 7 groups 3.493 1.379 49.816 81.164
process-sockets 8 groups 3.015 1.407 40.845 75.500
threads-pipe 1 group 99.997 0.000 100.000 100.000
threads-pipe 2 groups 99.894 2.932 99.997 100.000
threads-pipe 3 groups 99.611 4.117 99.983 100.000
threads-pipe 4 groups 97.703 2.624 99.937 100.000
threads-pipe 5 groups 22.919 3.623 87.150 99.764
threads-pipe 6 groups 18.016 4.038 80.491 99.557
threads-pipe 7 groups 14.663 3.991 75.239 99.247
threads-pipe 8 groups 12.242 3.808 70.651 98.644
threads-sockets 1 group 99.990 6.667 99.999 100.000
threads-sockets 2 groups 99.940 5.114 99.997 100.000
threads-sockets 3 groups 99.469 4.115 99.977 100.000
threads-sockets 4 groups 87.528 4.038 99.400 100.000
threads-sockets 5 groups 6.942 2.398 59.244 88.337
threads-sockets 6 groups 4.359 1.954 49.448 87.860
threads-sockets 7 groups 2.845 1.345 41.198 77.102
threads-sockets 8 groups 2.871 1.404 38.512 74.312
schedstat_parse.py -f tbench_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
loopback 28 threads 99.976 18.369 99.995 100.000
loopback 56 threads 99.222 7.799 99.934 100.000
loopback 84 threads 19.723 6.819 70.215 100.000
loopback 112 threads 11.283 5.371 55.371 99.999
loopback 140 threads 0.000 0.000 0.000 0.000
loopback 168 threads 0.000 0.000 0.000 0.000
loopback 196 threads 0.000 0.000 0.000 0.000
loopback 224 threads 0.000 0.000 0.000 0.000
According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.
[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.
In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.
Introduce the quadratic function:
y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE
x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:
nr_scan = llc_weight * y'
Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
and the number of CPUs to be scanned. Use heuristic scan for now.
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...
For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.
When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.
This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit
|
||
![]() |
700a78335f |
sched: only perform capability check on privileged operation
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler() a CAP_SYS_NICE audit event unconditionally, even when the requested operation does not require that capability / is unprivileged, i.e. for reducing niceness. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to allow the task the capability CAP_SYS_NICE, while it does not need it, violating the principle of least privilege. Conduct privilged/unprivileged categorization first and perform a capable test (and at most once) only if needed. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com |
||
![]() |
c64b551f6a |
sched: Remove unused function group_first_cpu()
As of commit
|
||
![]() |
fb95a5a04d |
sched/fair: Remove redundant word " *"
" *" is redundant. so remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com |
||
![]() |
e386b67257 |
rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs
Currently, the RCU Tasks Trace grace-period kthread IPIs each online CPU using smp_call_function_single() in order to track any tasks currently in RCU Tasks Trace read-side critical sections during which the corresponding task has neither blocked nor been preempted. These IPIs are annoying and are also not strictly necessary because any task that blocks or is preempted within its current RCU Tasks Trace read-side critical section will be tracked on one of the per-CPU rcu_tasks_percpu structure's ->rtp_blkd_tasks list. So the only time that this is a problem is if one of the CPUs runs through a long-duration RCU Tasks Trace read-side critical section without a context switch. Note that the task_call_func() function cannot help here because there is no safe way to identify the target task. Of course, the task_call_func() function will be very useful later, when processing the list of tasks, but it needs to know the task. This commit therefore creates a cpu_curr_snapshot() function that returns a pointer the task_struct structure of some task that happened to be running on the specified CPU more or less during the time that the cpu_curr_snapshot() function was executing. If there was no context switch during this time, this function will return a pointer to the task_struct structure of the task that was running throughout. If there was a context switch, then the outgoing task will be taken care of by RCU's context-switch hook, and the incoming task was either already taken care during some previous context switch, or it is not currently within an RCU Tasks Trace read-side critical section. And in this latter case, the grace period already started, so there is no need to wait on this task. This new cpu_curr_snapshot() function is invoked on each CPU early in the RCU Tasks Trace grace-period processing, and the resulting tasks are queued for later quiescent-state inspection. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: KP Singh <kpsingh@kernel.org> |
||
![]() |
f3dd3f6745 |
sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle
Wakelist can help avoid cache bouncing and offload the overhead of waker
cpu. So far, using wakelist within the same llc only happens on
WF_ON_CPU, and this limitation could be removed to further improve
wakeup performance.
The commit
|
||
![]() |
28156108fe |
sched: Fix the check of nr_running at queue wakelist
The commit
|
||
![]() |
792b9f65a5 |
sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to arrive, such as with pending wakeups. newidle_balance() already accounts for this by exiting the sched_domain load_balance() iteration if it detects these cases. This is very important for minimizing wakeup latency. However, if we are already in load_balance(), we may stay there for a while before returning back to newidle_balance(). This is most exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A very straightforward workaround to this is to adjust should_we_balance() to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are detected. This was tested with the following reproduction: - two threads that take turns sleeping and waking each other up are affined to two cores - a large number of threads with 100% utilization are pinned to all other cores Without this patch, wakeup latency was ~120us for the pair of threads, almost entirely spent in load_balance(). With this patch, wakeup latency is ~6us. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com |
||
![]() |
2ed81e7654 |
sched/deadline: Use proc_douintvec_minmax() limit minimum value
sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned integer, but proc_dointvec() wouldn't return error even if we set a negative number. Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min. It's just an optimization for match data and proc_handler in struct ctl_table. The 'if (period < min || period > max)' in __checkparam_dl() will work fine even if there hasn't this patch. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev |
||
![]() |
51bf903b64 |
sched/fair: Optimize and simplify rq leaf_cfs_rq_list
We notice the rq leaf_cfs_rq_list has two problems when do bugfix backports and some test profiling. 1. cfs_rqs under throttled subtree could be added to the list, and make their fully decayed ancestors on the list, even though not needed. 2. #1 also make the leaf_cfs_rq_list management complex and error prone, this is the list of related bugfix so far: commit |
||
![]() |
f5b2eeb499 |
sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
In the case of systems containing multiple LLCs per socket, like AMD Zen systems, users want to spread bandwidth hungry applications across multiple LLCs. Stream is one such representative workload where the best performance is obtained by limiting one stream thread per LLC. To ensure this, users are known to pin the tasks to a specify a subset of the CPUs consisting of one CPU per LLC while running such bandwidth hungry tasks. Suppose we kickstart a multi-threaded task like stream with 8 threads using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3 server where each socket contains 128 CPUs (0-63,128-191 in one socket, 64-127,192-255 in another socket) Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8 Here each CPU in the list is from a different LLC and 4 of those LLCs are on one socket, while the other 4 are on another socket. Ideally we would prefer that each stream thread runs on a different CPU from the allowed list of CPUs. However, the current heuristics in find_idlest_group() do not allow this during the initial placement. Suppose the first socket (0-63,128-191) is our local group from which we are kickstarting the stream tasks. The first four stream threads will be placed in this socket. When it comes to placing the 5th thread, all the allowed CPUs are from the local group (0,16,32,48) would have been taken. However, the current scheduler code simply checks if the number of tasks in the local group is fewer than the allowed numa-imbalance threshold. This threshold was previously 25% of the NUMA domain span (in this case threshold = 32) but after the v6 of Mel's patchset "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip, Commit: |
||
![]() |
026b98a93b |
sched/numa: Adjust imb_numa_nr to a better approximation of memory channels
For a single LLC per node, a NUMA imbalance is allowed up until 25% of CPUs sharing a node could be active. One intent of the cut-off is to avoid an imbalance of memory channels but there is no topological information based on active memory channels. Furthermore, there can be differences between nodes depending on the number of populated DIMMs. A cut-off of 25% was arbitrary but generally worked. It does have a severe corner cases though when an parallel workload is using 25% of all available CPUs over-saturates memory channels. This can happen due to the initial forking of tasks that get pulled more to one node after early wakeups (e.g. a barrier synchronisation) that is not quickly corrected by the load balancer. The LB may fail to act quickly as the parallel tasks are considered to be poor migrate candidates due to locality or cache hotness. On a range of modern Intel CPUs, 12.5% appears to be a better cut-off assuming all memory channels are populated and is used as the new cut-off point. A minimum of 1 is specified to allow a communicating pair to remain local even for CPUs with low numbers of cores. For modern AMDs, there are multiple LLCs and are not affected. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net |
||
![]() |
cb29a5c19d |
sched/numa: Apply imbalance limitations consistently
The imbalance limitations are applied inconsistently at fork time and at runtime. At fork, a new task can remain local until there are too many running tasks even if the degree of imbalance is larger than NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance figure used during load balancing is different to the one used at NUMA placement. Load balancing uses the number of tasks that must move to restore imbalance where as NUMA balancing uses the total imbalance. In combination, it is possible for a parallel workload that uses a small number of CPUs without applying scheduler policies to have very variable run-to-run performance. [lkp@intel.com: Fix build breakage for arc-allyesconfig] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net |