mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
commit 0acc67c4030c39f39ac90413cc5d0abddd3a9527 upstream.

Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly
high number of softlockups.  Further investigation showed that the zone
lock in free_pcppages_bulk was being held for long stretches, and that
the function was called to free 2k+ pages over 100 times during boot
alone.

This starves other processes of the zone lock, which can lead to the
system stalling as multiple threads cannot make progress without it.  We
can see these issues manifesting as warnings:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu:     20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:              hardirqs   softirqs   csw/system
[ 4512.638793] rcu:      number:        0        145            0
[ 4512.651177] rcu:     cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu:     (t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention.  To prevent starvation
in both locks, batch the freeing of pages using pcp->batch.

Because free_pcppages_bulk is called with the pcp lock held and acquires
the zone lock itself, relinquishing and reacquiring the locks is only
effective when both of them are broken together (unless the system was
built with queued spinlocks).  Thus, instead of modifying
free_pcppages_bulk to break both locks, batch the freeing from its
callers.
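
As a rough sketch of the caller-side pattern, under the assumption that
the pcp lock is modeled as a plain spinlock (the real callers use the
pcp_spin_lock() helpers, and drain_zone_pages_batched is a hypothetical
name for illustration, not the actual patched function):

/*
 * Hypothetical sketch of the batching pattern described above; the
 * locking helpers and surrounding context are simplified relative to
 * the real mm/page_alloc.c callers.
 */
static void drain_zone_pages_batched(struct zone *zone,
				     struct per_cpu_pages *pcp)
{
	int count;

	do {
		spin_lock(&pcp->lock);
		count = pcp->count;
		if (count) {
			/* Free at most pcp->batch pages per lock hold. */
			int todo = min(count, pcp->batch);

			free_pcppages_bulk(zone, todo, pcp, 0);
			count -= todo;
		}
		/*
		 * free_pcppages_bulk() takes and drops the zone lock
		 * internally, so releasing the pcp lock here leaves both
		 * locks free between batches, giving other CPUs a chance
		 * to make progress.
		 */
		spin_unlock(&pcp->lock);
	} while (count);
}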

A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======
The following are a few synthetic benchmarks, run on three machines. The
first is a large machine with 754GiB of memory and 316 processors.
The second is a relatively smaller machine with 251GiB of memory and 176
processors. The third is the smallest of the three, with 62GiB of memory
and 36 processors.

On all machines, I kick off a kernel build with -j$(nproc).
A negative delta is better (faster compilation).

Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.8070 |  -1.4865 |
| user       |        0.2823 |  +0.4081 |
| sys        |        5.0267 | -11.8737 |
+------------+---------------+----------+

Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.2806 |  +0.0351 |
| user       |        0.0994 |  +0.3170 |
| sys        |        0.6229 |  -0.6277 |
+------------+---------------+----------+

Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.1503 |  -2.6585 |
| user       |        0.0431 |  -2.2984 |
| sys        |        0.1870 |  -3.2013 |
+------------+---------------+----------+

Here, variation is the coefficient of variation, i.e.  standard deviation
/ mean; for example, a sys variation of 5.0267% on the large machine
means the run-to-run standard deviation was about 5% of the mean sys
time.

Based on these results, the degree to which this reduces lock contention
varies by machine.  For the largest and smallest machines that I ran the
tests on, the reduction is significant.  Some of the gains are also
visible from userspace.

Interestingly, the performance gains don't scale with the size of the
machine; instead, there is a dip in the gain for the medium-sized
machine.  One possible theory is that because the high watermark depends
on both memory and the number of local CPUs, what impacts zone contention
the most is not these individual values, but rather the mem:processors
ratio.


This patch (of 5):

Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates.  Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.

However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.

Simplify the code by returning a bool instead, indicating whether any
updates were made.

In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
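
In sketch form, the interface change looks roughly like this (abridged;
the vmstat_update body below is paraphrased from upstream rather than
quoted from the patch):

/* Before: an int change count that no caller fully consumed. */
static int refresh_cpu_vm_stats(bool do_pagesets);

/* After: a bool is enough for the "did anything change?" check. */
static bool refresh_cpu_vm_stats(bool do_pagesets);

static void vmstat_update(struct work_struct *w)
{
	if (refresh_cpu_vm_stats(true)) {
		/*
		 * Changes were folded back this round, so more may
		 * follow; re-queue the deferred vmstat work.
		 */
		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
				this_cpu_ptr(&vmstat_work),
				round_jiffies_relative(sysctl_stat_interval));
	}
}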

Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>