linux-yocto/mm
Herton R. Krzesinski 61a9f2e5c4 mm/debug_vm_pgtable: clear page table entries at destroy_args()
commit dde30854bddfb5d69f30022b53c5955a41088b33 upstream.

The mm/debug_vm_pgtable test manually allocates page table entries for
the tests it runs, using its own manually allocated mm_struct.  That in
itself is fine, but when the test exits, destroy_args() fails to clear
those entries with the *_clear functions.

The problem is that this leaves stale entries.  If another process
allocates an mm_struct with a pgd at the same address, it may run into the
stale entry.  This happens in practice on a debug kernel with
CONFIG_DEBUG_VM_PGTABLE=y.  For example, this is the output with some extra
debugging I added (it prints a warning trace if pgtables_bytes goes
negative, in addition to the warning in the check_mm() function):

[    2.539353] debug_vm_pgtable: [get_random_vaddr         ]: random_vaddr is 0x7ea247140000
[    2.539366] kmem_cache info
[    2.539374] kmem_cachep 0x000000002ce82385 - freelist 0x0000000000000000 - offset 0x508
[    2.539447] debug_vm_pgtable: [init_args                ]: args->mm is 0x000000002267cc9e
(...)
[    2.552800] WARNING: CPU: 5 PID: 116 at include/linux/mm.h:2841 free_pud_range+0x8bc/0x8d0
[    2.552816] Modules linked in:
[    2.552843] CPU: 5 UID: 0 PID: 116 Comm: modprobe Not tainted 6.12.0-105.debug_vm2.el10.ppc64le+debug #1 VOLUNTARY
[    2.552859] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW910.00 (VL910_062) hv:phyp pSeries
[    2.552872] NIP:  c0000000007eef3c LR: c0000000007eef30 CTR: c0000000003d8c90
[    2.552885] REGS: c0000000622e73b0 TRAP: 0700   Not tainted  (6.12.0-105.debug_vm2.el10.ppc64le+debug)
[    2.552899] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822  XER: 0000000a
[    2.552954] CFAR: c0000000008f03f0 IRQMASK: 0
[    2.552954] GPR00: c0000000007eef30 c0000000622e7650 c000000002b1ac00 0000000000000001
[    2.552954] GPR04: 0000000000000008 0000000000000000 c0000000007eef30 ffffffffffffffff
[    2.552954] GPR08: 00000000ffff00f5 0000000000000001 0000000000000048 0000000000004000
[    2.552954] GPR12: 00000003fa440000 c000000017ffa300 c0000000051d9f80 ffffffffffffffdb
[    2.552954] GPR16: 0000000000000000 0000000000000008 000000000000000a 60000000000000e0
[    2.552954] GPR20: 4080000000000000 c0000000113af038 00007fffcf130000 0000700000000000
[    2.552954] GPR24: c000000062a6a000 0000000000000001 8000000062a68000 0000000000000001
[    2.552954] GPR28: 000000000000000a c000000062ebc600 0000000000002000 c000000062ebc760
[    2.553170] NIP [c0000000007eef3c] free_pud_range+0x8bc/0x8d0
[    2.553185] LR [c0000000007eef30] free_pud_range+0x8b0/0x8d0
[    2.553199] Call Trace:
[    2.553207] [c0000000622e7650] [c0000000007eef30] free_pud_range+0x8b0/0x8d0 (unreliable)
[    2.553229] [c0000000622e7750] [c0000000007f40b4] free_pgd_range+0x284/0x3b0
[    2.553248] [c0000000622e7800] [c0000000007f4630] free_pgtables+0x450/0x570
[    2.553274] [c0000000622e78e0] [c0000000008161c0] exit_mmap+0x250/0x650
[    2.553292] [c0000000622e7a30] [c0000000001b95b8] __mmput+0x98/0x290
[    2.558344] [c0000000622e7a80] [c0000000001d1018] exit_mm+0x118/0x1b0
[    2.558361] [c0000000622e7ac0] [c0000000001d141c] do_exit+0x2ec/0x870
[    2.558376] [c0000000622e7b60] [c0000000001d1ca8] do_group_exit+0x88/0x150
[    2.558391] [c0000000622e7bb0] [c0000000001d1db8] sys_exit_group+0x48/0x50
[    2.558407] [c0000000622e7be0] [c00000000003d810] system_call_exception+0x1e0/0x4c0
[    2.558423] [c0000000622e7e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
(...)
[    2.558892] ---[ end trace 0000000000000000 ]---
[    2.559022] BUG: Bad rss-counter state mm:000000002267cc9e type:MM_ANONPAGES val:1
[    2.559037] BUG: non-zero pgtables_bytes on freeing mm: -6144

Here the modprobe process ended up with an mm_struct allocated from the
mm_struct slab that was previously used by the debug_vm_pgtable test.  That
by itself is not a problem, since the mm_struct is initialized again.
However, if it also ends up using the same pgd table, it bumps into the
stale entry when clearing/freeing the page table entries, so it tries to
free an entry that is already gone (the one allocated by the
debug_vm_pgtable test).  That also explains the negative pgtables_bytes:
the process is accounting for entries it never allocated.

As far as I looked, pgd_{alloc,free} etc. do not clear entries; the
entries are cleared explicitly in the free_pgtables->
free_pgd_range->free_p4d_range->free_pud_range->free_pmd_range->
free_pte_range path.  However, the debug_vm_pgtable test does not call
free_pgtables, since it allocates the mm_struct and entries manually for
its tests and does not, e.g., go through page faults.  So it should also
clear the entries manually before exiting, at destroy_args().
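
The failure mode described above can be illustrated with a small userspace
model.  This is only a sketch and not kernel code: every name in it
(toy_pgd, toy_mm, toy_destroy_args, toy_free_pgtables, toy_simulate) is
hypothetical, the table size is arbitrary, and a single static directory
stands in for an allocator that hands the same pgd memory to the next
mm_struct without zeroing it.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS       4
#define TOY_TABLE_BYTES 4096L	/* arbitrary, not taken from any architecture */

/* Toy top-level directory: a non-zero slot means "a lower table is installed". */
struct toy_pgd { unsigned long slot[TOY_SLOTS]; };

/* Toy mm: a directory pointer plus page table accounting in bytes. */
struct toy_mm { struct toy_pgd *pgd; long pgtables_bytes; };

/* Assumption: the same directory memory is reused and never zeroed on reuse. */
static struct toy_pgd recycled_pgd;

static void toy_mm_init(struct toy_mm *mm)
{
	mm->pgd = &recycled_pgd;
	mm->pgtables_bytes = 0;
}

/* Install one lower-level table by hand, the way the test does. */
static void toy_populate(struct toy_mm *mm, int idx)
{
	assert(idx >= 0 && idx < TOY_SLOTS);
	mm->pgd->slot[idx] = 1;
	mm->pgtables_bytes += TOY_TABLE_BYTES;
}

/* Manual teardown: un-account the table, clear the slot only when asked to. */
static void toy_destroy_args(struct toy_mm *mm, int idx, int clear_entries)
{
	assert(idx >= 0 && idx < TOY_SLOTS);
	mm->pgtables_bytes -= TOY_TABLE_BYTES;
	if (clear_entries)
		mm->pgd->slot[idx] = 0;
}

/* Regular teardown: every populated slot is cleared and un-accounted. */
static void toy_free_pgtables(struct toy_mm *mm)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		if (!mm->pgd->slot[i])
			continue;
		mm->pgd->slot[i] = 0;
		mm->pgtables_bytes -= TOY_TABLE_BYTES;
	}
}

/* Returns the accounting of a second mm that reuses the test's directory. */
long toy_simulate(int clear_entries)
{
	struct toy_mm test_mm, next_mm;

	memset(&recycled_pgd, 0, sizeof(recycled_pgd));

	toy_mm_init(&test_mm);
	toy_populate(&test_mm, 0);
	toy_destroy_args(&test_mm, 0, clear_entries);

	toy_mm_init(&next_mm);		/* gets the same directory */
	toy_free_pgtables(&next_mm);	/* may trip over a stale slot */
	return next_mm.pgtables_bytes;
}
```

In this model toy_simulate(0) returns a negative value, because the second
mm un-accounts a table it never allocated, while toy_simulate(1) returns 0
once the manual teardown clears its slot.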

This problem was noticed during a repeated-reboot test on
a powerpc host, with a debug kernel with CONFIG_DEBUG_VM_PGTABLE enabled.
It depends on the system, but in a loop of 100 reboots the problem could
manifest once or twice, if a process ends up getting the right mm->pgd
entry with the stale entries used by mm/debug_vm_pgtable.  After applying
this patch, I could no longer reproduce the problem.  I was also
able to reproduce the problem on the latest upstream kernel (6.16).

I also modified destroy_args() to use mmput() instead of mmdrop(): there
is no reason to hold the mm_users reference and not release the mm_struct
entirely.  In the output above with my debugging prints I had already
patched it to use mmput().  That did not fix the problem, but it helped
with debugging.

Link: https://lkml.kernel.org/r/20250731214051.4115182-1-herton@redhat.com
Fixes: 3c9b84f044 ("mm/debug_vm_pgtable: introduce struct pgtable_debug_args")
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28 16:31:05 +02:00
..
damon mm/damon/ops-common: ignore migration request to invalid nodes 2025-08-28 16:31:03 +02:00
kasan kasan: use vmalloc_dump_obj() for vmalloc error reports 2025-08-01 09:48:43 +01:00
kfence kfence: skip __GFP_THISNODE allocations on NUMA systems 2025-02-17 10:05:31 +01:00
kmsan dma: kmsan: export kmsan_handle_dma() for modules 2025-03-13 13:01:58 +01:00
backing-dev.c writeback: support retrieving per group debug writeback stats of bdi 2024-05-05 17:53:51 -07:00
balloon_compaction.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
bootmem_info.c
cma_debug.c
cma_sysfs.c
cma.c mm/cma: add cma_{alloc,free}_folio() 2024-09-03 21:15:36 -07:00
cma.h
compaction.c mm/compaction: fix bug in hugetlb handling pathway 2025-04-25 10:47:53 +02:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: clear page table entries at destroy_args() 2025-08-28 16:31:05 +02:00
debug.c mm: open-code page_folio() in dump_page() 2024-12-14 20:03:33 +01:00
dmapool_test.c mm/dmapool: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
dmapool.c
early_ioremap.c
execmem.c mm/execmem, arch: convert remaining overrides of module_alloc to execmem 2024-05-14 00:31:43 -07:00
fadvise.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c readahead: fix return value of page_cache_next_miss() when no hole is found 2025-08-28 16:30:58 +02:00
folio-compat.c mm: remove putback_lru_page() 2024-09-09 16:38:59 -07:00
gup_test.c
gup_test.h
gup.c mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked" 2025-07-06 11:01:43 +02:00
highmem.c mm/highmem: make nr_free_highpages() return "unsigned long" 2024-07-03 19:30:06 -07:00
hmm.c mm/hmm: move pmd_to_hmm_pfn_flags() to the respective #ifdeffery 2025-08-15 12:14:13 +02:00
huge_memory.c mm/huge_memory: fix dereferencing invalid pmd migration entry 2025-05-18 08:24:51 +02:00
hugetlb_cgroup.c mm: memcg: don't call propagate_protected_usage() needlessly 2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hugetlb.c mm/hugetlb: unshare page tables during VMA split, not before 2025-06-27 11:11:40 +01:00
hwpoison-inject.c mm/hwpoison: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
init-mm.c
internal.h mm: fix folio_pte_batch() on XEN PV 2025-05-18 08:24:51 +02:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig resource: remove dependency on SPARSEMEM from GET_FREE_REGION 2024-10-28 21:40:39 -07:00
Kconfig.debug slub: Introduce CONFIG_SLUB_RCU_DEBUG 2024-08-27 14:12:51 +02:00
khugepaged.c mm: khugepaged: fix call hpage_collapse_scan_file() for anonymous vma 2025-08-01 09:48:47 +01:00
kmemleak.c mm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lock 2025-08-20 18:30:55 +02:00
ksm.c mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show() 2025-08-01 09:48:42 +01:00
list_lru.c mm: list_lru: fix UAF for memory cgroup 2024-08-07 18:33:56 -07:00
maccess.c
madvise.c mm: close theoretical race where stale TLB entries could linger 2025-06-27 11:11:38 +01:00
Makefile mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
mapping_dirty_helpers.c
memblock.c memblock: Accept allocated memory before use in memblock_double_array() 2025-05-18 08:24:54 +02:00
memcontrol-v1.c mm/thp: fix deferred split unqueue naming and locking 2024-11-05 16:49:54 -08:00
memcontrol-v1.h mm: memcg: declare do_memsw_account inline 2024-12-14 20:03:33 +01:00
memcontrol.c memcg: always call cond_resched() after fn() 2025-05-29 11:03:22 +02:00
memfd.c mm: reinstate ability to map write-sealed memfd mappings read-only 2025-01-09 13:33:54 +01:00
memory_hotplug.c mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper 2025-04-20 10:15:50 +02:00
memory-failure.c mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list 2025-08-01 09:48:44 +01:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm: fix apply_to_existing_page_range() 2025-04-25 10:47:53 +02:00
mempolicy.c mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MM 2024-12-14 20:03:32 +01:00
mempool.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
memremap.c mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() 2024-05-05 17:53:49 -07:00
memtest.c
migrate_device.c mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize() 2025-02-27 04:30:22 -08:00
migrate.c mm/migrate: fix shmem xarray update during migration 2025-03-28 22:03:30 +01:00
mincore.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
mlock.c mm/mlock: set the correct prev on failure 2024-11-07 14:14:58 -08:00
mm_init.c mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSION 2024-09-03 21:15:28 -07:00
mm_slot.h
mmap_lock.c mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer 2024-07-03 19:30:26 -07:00
mmap.c mm: reinstate ability to map write-sealed memfd mappings read-only 2025-01-09 13:33:54 +01:00
mmu_gather.c
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm: refactor map_deny_write_exec() 2024-11-05 16:49:55 -08:00
mremap.c mm/mremap: correctly handle partial mremap() of VMA starting at 0 2025-04-20 10:15:49 +02:00
mseal.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
msync.c
nommu.c nommu: pass NULL argument to vma_iter_prealloc() 2024-11-11 17:20:23 -08:00
numa_emulation.c mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
numa_memblks.c mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id 2024-10-28 21:40:40 -07:00
numa.c mm: make range-to-target_node lookup facility a part of numa_memblks 2024-09-03 21:15:32 -07:00
oom_kill.c memcg: fix soft lockup in the OOM process 2025-02-08 09:58:19 +01:00
page_alloc.c page_pool: Move pp_magic check into helper functions 2025-06-19 15:31:42 +02:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_idle.c
page_io.c mm: count zeromap read and set for swapout and swapin 2024-11-11 00:00:37 -08:00
page_isolation.c mm/hugetlb: wait for hugetlb folios to be freed 2025-03-22 12:54:28 -07:00
page_owner.c mm/page-owner: use gfp_nested_mask() instead of open coded masking 2024-05-19 14:40:44 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c mm: make page_mapped_in_vma() hugetlb walk aware 2025-04-20 10:15:49 +02:00
page-writeback.c mm: fix ratelimit_pages update error in dirty_ratio_handler() 2025-06-27 11:11:22 +01:00
pagewalk.c mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present check 2024-10-28 21:40:38 -07:00
percpu-internal.h mm: remove CONFIG_MEMCG_KMEM 2024-07-10 12:14:54 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c percpu: remove pcpu_alloc_size() 2024-09-01 20:26:04 -07:00
pgalloc-track.h
pgtable-generic.c mm: fix race between __split_huge_pmd_locked() and GUP-fast 2024-05-07 10:37:00 -07:00
process_vm_access.c
ptdump.c mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd() 2025-08-20 18:30:55 +02:00
readahead.c mm/readahead: fix large folio support in async readahead 2025-01-09 13:33:54 +01:00
rmap.c mm/rmap: reject hugetlb folios in folio_make_device_exclusive() 2025-04-20 10:15:49 +02:00
rodata_test.c
secretmem.c fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass 2025-07-10 16:05:09 +02:00
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
shmem.c mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper 2025-04-20 10:15:50 +02:00
show_mem.c mm/show_mem.c: report alloc tags in human readable units 2024-09-17 01:07:00 -07:00
shrinker_debug.c mm: shrinker: use min() to improve shrinker_debugfs_scan_write() 2024-09-03 21:15:40 -07:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shuffle.c
shuffle.h
slab_common.c slab: Fix too strict alignment check in create_cache() 2024-12-09 10:41:07 +01:00
slab.h mm/slub: Avoid list corruption when removing a slab from the full list 2024-12-09 10:41:04 +01:00
slub.c mm, slab: restore NUMA policy support for large kmalloc 2025-08-20 18:30:55 +02:00
sparse-vmemmap.c LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space 2024-10-21 22:11:19 +08:00
sparse.c A set of X86 fixes: 2024-09-01 14:43:08 -07:00
swap_cgroup.c mm: attempt to batch free swap entries for zap_pte_range() 2024-09-03 21:15:33 -07:00
swap_slots.c
swap_state.c mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios 2024-09-17 01:07:01 -07:00
swap.c mm: page_alloc: move mlocked flag clearance into free_pages_prepare() 2024-11-11 17:20:23 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swapfile.c mm: swap: fix potential buffer overflow in setup_clusters() 2025-08-15 12:14:14 +02:00
truncate.c mm: Fix missing folio invalidation calls during truncation 2024-08-24 16:09:16 +02:00
usercopy.c
userfaultfd.c userfaultfd: fix a crash in UFFDIO_MOVE when PMD is a migration entry 2025-08-20 18:30:54 +02:00
util.c mm: only enforce minimum stack gap size if it's sensible 2024-09-01 20:26:02 -07:00
vma_internal.h mm/hugetlb: unshare page tables during VMA split, not before 2025-06-27 11:11:40 +01:00
vma.c mm/vma: reset VMA iterator on commit_merge() OOM failure 2025-07-06 11:01:48 +02:00
vma.h mm/vma: add give_up_on_oom option on modify/merge, use in uffd release 2025-04-25 10:48:06 +02:00
vmalloc.c mm/vmalloc: leave lazy MMU mode on PTE mapping error 2025-07-17 18:37:14 +02:00
vmpressure.c
vmscan.c mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list 2025-08-01 09:48:44 +01:00
vmstat.c vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event 2024-12-09 10:41:01 +01:00
workingset.c cachestat: do not flush stats in recency check 2024-07-03 22:40:37 -07:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n 2025-08-01 09:48:44 +01:00
zswap.c mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead() 2025-04-10 14:39:40 +02:00