linux-yocto/mm
Feng Tang c635a42d9b mm/page_alloc: detect allocation forbidden by cpuset and bail out early
[ Upstream commit 8ca1b5a498 ]

There was a report that starting an Ubuntu container in docker while
using cpuset to bind it to movable nodes (a node that has only a movable
zone, such as a hotplug node or a Persistent Memory node in normal
usage) fails with a memory allocation failure, after which the OOM
killer is invoked and many other innocent processes get killed.

It can be reproduced with the command:

    $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(where node 4 is a movable node)

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that in this case the target cpuset nodes have only a
movable zone, while creating an OS in docker sometimes needs to
allocate memory from non-movable zones (dma/dma32/normal), for example
with GFP_HIGHUSER. The cpuset limit forbids that allocation, so the OOM
killer is invoked even though both normal nodes and movable nodes have
plenty of free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope.  The only reasonable
measure to take is to fail the allocation right away and have the caller
deal with it.

So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.
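
The check described above can be illustrated with a small userspace
model. This is only a sketch under stated assumptions, not the kernel
code: struct node_model, allocation_possible() and the bitmask encoding
of mems_allowed are hypothetical names invented for the illustration.
It asks whether any node allowed by the cpuset has memory in a zone the
request may use; if none does, neither reclaim nor the OOM killer can
help and the allocation can only fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Zones in ascending order, loosely modeled on the kernel's zone ordering. */
enum zone_idx { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

#define MAX_NODES 32 /* mems_allowed is a 32-bit mask in this model */

/* Hypothetical model of a NUMA node: which zones contain memory. */
struct node_model {
    bool populated[NR_ZONES];
};

/*
 * Return true if at least one node in mems_allowed has memory in a zone
 * at or below highest_zoneidx, i.e. the request could ever be satisfied
 * inside the cpuset.
 */
static bool allocation_possible(const struct node_model *nodes, size_t nr_nodes,
                                unsigned int mems_allowed,
                                enum zone_idx highest_zoneidx)
{
    assert(nr_nodes <= MAX_NODES);

    for (size_t n = 0; n < nr_nodes; n++) {
        if (!(mems_allowed & (1u << n)))
            continue; /* node is outside the cpuset */
        for (int z = 0; z <= (int)highest_zoneidx; z++) {
            if (nodes[n].populated[z])
                return true;
        }
    }
    return false; /* no usable zone: bail out instead of invoking OOM */
}
```

In this model, a request limited to ZONE_NORMAL and below (the case
reported for GFP_HIGHUSER) with mems_allowed covering only a
movable-only node yields false, which corresponds to returning NULL
early.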

As page allocation is one of the hottest paths in the kernel, this check
would hurt all users with a sane cpuset configuration. So add a static
branch check and detect the abnormal config in the cpuset memory binding
setup, so that the extra check cost in page allocation is not paid by
everyone.
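
The gating idea can be sketched in userspace as follows. This is a
model under stated assumptions, not the kernel implementation: a plain
boolean stands in for the static branch key (named
cpusets_insane_config_key in the Stable-dep-of line below), and
on_mems_binding_update(), must_fail_early() and slow_checks_run are
hypothetical names. Detection happens at the rare configuration step,
and the allocation path skips the scan entirely until an abnormal
binding has been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum zone_idx { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

#define MAX_NODES 32 /* mems_allowed is a 32-bit mask in this model */

struct node_model {
    bool populated[NR_ZONES];
};

/* Stand-in for the static key: a plain flag, never cleared in this sketch. */
static bool insane_config_flag;
/* Counts how often the allocation path actually paid for the scan. */
static unsigned long slow_checks_run;

/* True if any allowed node has memory below ZONE_MOVABLE. */
static bool has_non_movable_memory(const struct node_model *nodes,
                                   size_t nr_nodes, unsigned int mems_allowed)
{
    assert(nr_nodes <= MAX_NODES);

    for (size_t n = 0; n < nr_nodes; n++) {
        if (!(mems_allowed & (1u << n)))
            continue;
        for (int z = 0; z < ZONE_MOVABLE; z++) {
            if (nodes[n].populated[z])
                return true;
        }
    }
    return false;
}

/* Configuration time (rare): flag a binding with no non-movable memory. */
static void on_mems_binding_update(const struct node_model *nodes,
                                   size_t nr_nodes, unsigned int mems_allowed)
{
    if (!insane_config_flag &&
        !has_non_movable_memory(nodes, nr_nodes, mems_allowed))
        insane_config_flag = true;
}

/* Allocation path (hot): the scan runs only once the flag has been set. */
static bool must_fail_early(const struct node_model *nodes, size_t nr_nodes,
                            unsigned int mems_allowed, bool needs_non_movable)
{
    if (!insane_config_flag)
        return false; /* sane systems pay only for this one test */
    slow_checks_run++;
    return needs_non_movable &&
           !has_non_movable_memory(nodes, nr_nodes, mems_allowed);
}
```

A real static branch goes further than a flag test by patching the
branch out of the instruction stream, but the cost model is the same:
systems that never configure a movable-only binding never run the scan.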

[thanks to Michal Hocko and David Rientjes for suggesting not handling
 it inside OOM code, adding cpuset check, refining comments]

Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 65f97cc81b0a ("cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28 16:24:37 +02:00
damon mm/damon/vaddr: fix issue in damon_va_evenly_split_region() 2024-12-14 19:51:45 +01:00
kasan kasan: disable kasan_non_canonical_hook() for HW tags 2023-12-23 10:42:00 +01:00
kfence kfence: skip __GFP_THISNODE allocations on NUMA systems 2025-03-13 12:50:50 +01:00
backing-dev.c writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs 2023-05-11 23:00:18 +09:00
balloon_compaction.c
bootmem_info.c bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem 2022-08-31 17:16:48 +02:00
cleancache.c
cma_debug.c
cma_sysfs.c
cma.c mm/cma: drop incorrect alignment check in cma_init_reserved_mem 2024-07-05 09:14:13 +02:00
cma.h
compaction.c mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations 2024-04-10 16:19:37 +02:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: clear page table entries at destroy_args() 2025-08-28 16:24:34 +02:00
debug.c mm/debug: sync up latest migrate_reason to migrate_reason_names 2021-09-24 16:13:35 -07:00
dmapool.c
early_ioremap.c mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
fadvise.c
failslab.c
filemap.c mm: drop the assumption that VM_SHARED always implies writable 2025-08-28 16:24:30 +02:00
frontswap.c
gup_test.c
gup_test.h
gup.c mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable() 2025-05-02 07:44:13 +02:00
highmem.c highmem: fix checks in __kmap_local_sched_{in,out} 2022-04-13 20:59:21 +02:00
hmm.c mm/hmm: move pmd_to_hmm_pfn_flags() to the respective #ifdeffery 2025-08-28 16:24:14 +02:00
huge_memory.c mm/huge_memory: fix dereferencing invalid pmd migration entry 2025-06-27 11:05:37 +01:00
hugetlb_cgroup.c
hugetlb_vmemmap.c
hugetlb_vmemmap.h
hugetlb.c mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race 2025-06-27 11:05:35 +01:00
hwpoison-inject.c mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler 2022-07-12 16:35:05 +02:00
init-mm.c
internal.h mm: unconditionally close VMAs on error 2024-12-14 19:50:38 +01:00
interval_tree.c
io-mapping.c
ioremap.c mm: move ioremap_page_range to vmalloc.c 2021-09-08 11:50:24 -07:00
Kconfig kmap_local: don't assume kmap PTEs are linear arrays in memory 2021-11-25 09:48:43 +01:00
Kconfig.debug
khugepaged.c mm/khugepaged: check again on anon uffd-wp during isolation 2023-04-26 13:51:52 +02:00
kmemleak.c mm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lock 2025-08-28 16:24:27 +02:00
ksm.c mm/ksm: remove old GCC 4.9+ check 2021-09-13 10:18:28 -07:00
list_lru.c
maccess.c maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() 2022-11-26 09:24:47 +01:00
madvise.c mm: drop the assumption that VM_SHARED always implies writable 2025-08-28 16:24:30 +02:00
Makefile mm: introduce Data Access MONitor (DAMON) 2021-09-08 11:50:24 -07:00
mapping_dirty_helpers.c
memblock.c memblock: allow to specify flags with memblock_add_node() 2023-12-20 15:17:33 +01:00
memcontrol.c memcg: always call cond_resched() after fn() 2025-06-04 14:38:06 +02:00
memfd.c mm: reinstate ability to map write-sealed memfd mappings read-only 2025-08-28 16:24:30 +02:00
memory_hotplug.c memblock: allow to specify flags with memblock_add_node() 2023-12-20 15:17:33 +01:00
memory-failure.c mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn 2025-08-28 16:24:37 +02:00
memory.c mm: fix apply_to_existing_page_range() 2025-05-02 07:44:24 +02:00
mempolicy.c mm/numa_balancing: teach mpol_to_str about the balancing mode 2024-08-19 05:45:17 +02:00
mempool.c
memremap.c mm/memremap.c: map FS_DAX device memory as decrypted 2022-11-16 09:58:27 +01:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-04-10 16:18:42 +02:00
migrate.c mm/migrate: set swap entry values of THP tail pages properly. 2024-04-10 16:19:31 +02:00
mincore.c
mlock.c mm/mlock: fix potential imbalanced rlimit ucounts adjustment 2022-05-15 20:18:53 +02:00
mm_init.c
mmap_lock.c mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer 2024-08-19 05:45:10 +02:00
mmap.c mm: reinstate ability to map write-sealed memfd mappings read-only 2025-08-28 16:24:30 +02:00
mmu_gather.c mm/khugepaged: fix GUP-fast interaction by sending IPI 2022-12-14 11:37:17 +01:00
mmu_notifier.c mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() 2022-04-27 14:38:58 +02:00
mmzone.c
mprotect.c mm: don't try to NUMA-migrate COW pages that have other uses 2022-02-23 12:03:03 +01:00
mremap.c mm/mremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0) 2022-04-13 20:59:22 +02:00
msync.c
nommu.c mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling 2024-12-14 19:50:38 +01:00
oom_kill.c memcg: fix soft lockup in the OOM process 2025-03-13 12:50:48 +01:00
page_alloc.c mm/page_alloc: detect allocation forbidden by cpuset and bail out early 2025-08-28 16:24:37 +02:00
page_counter.c
page_ext.c mm/migrate: add CPU hotplug to demotion #ifdef 2021-10-18 20:22:02 -10:00
page_idle.c mm/idle_page_tracking: make PG_idle reusable 2021-09-08 11:50:24 -07:00
page_io.c mm: fix unexpected zeroed page mapping with zram swap 2022-04-20 09:34:18 +02:00
page_isolation.c Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
page_owner.c mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE 2021-09-08 11:50:22 -07:00
page_poison.c
page_reporting.c
page_reporting.h
page_vma_mapped.c
page-writeback.c mm: fix ratelimit_pages update error in dirty_ratio_handler() 2025-06-27 11:05:26 +01:00
pagewalk.c mm: pagewalk: Fix race between unmap and page walker 2022-09-08 12:28:05 +02:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c memblock: drop memblock_free_early_nid() and memblock_free_early() 2025-03-13 12:50:07 +01:00
pgalloc-track.h
pgtable-generic.c
process_vm_access.c
ptdump.c mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd() 2025-08-28 16:24:32 +02:00
readahead.c vfs: fix readahead(2) on block devices 2023-11-20 11:08:13 +01:00
rmap.c mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse 2022-09-05 10:30:07 +02:00
rodata_test.c
secretmem.c secretmem: disable memfd_secret() if arch cannot set direct map 2024-10-22 15:40:41 +02:00
shmem.c mm: update memfd seal write check to include F_SEAL_WRITE 2025-08-28 16:24:30 +02:00
shuffle.c
shuffle.h
slab_common.c mm, slab: remove duplicate kernel-doc comment for ksize() 2025-04-10 14:32:06 +02:00
slab.c slab: Introduce kmalloc_size_roundup() 2025-04-10 14:31:49 +02:00
slab.h mm, kfence: support kmem_dump_obj() for KFENCE objects 2022-04-27 14:38:51 +02:00
slob.c slab: Introduce kmalloc_size_roundup() 2025-04-10 14:31:49 +02:00
slub.c mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context. 2022-09-28 11:11:44 +02:00
sparse-vmemmap.c
sparse.c memblock: drop memblock_free_early_nid() and memblock_free_early() 2025-03-13 12:50:07 +01:00
swap_cgroup.c
swap_slots.c
swap_state.c mm: swap: get rid of livelock in swapin readahead 2022-03-23 09:16:41 +01:00
swap.c mm: fs: invalidate bh_lrus for only cold path 2021-09-24 16:13:35 -07:00
swapfile.c mm/swapfile: skip HugeTLB pages for unuse_vma 2024-10-22 15:40:41 +02:00
truncate.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
usercopy.c mm/usercopy: return 1 from hardened_usercopy __setup() handler 2022-04-08 14:24:14 +02:00
userfaultfd.c userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb 2024-03-01 13:21:43 +01:00
util.c mm: unconditionally close VMAs on error 2024-12-14 19:50:38 +01:00
vmacache.c
vmalloc.c mm/vmalloc: leave lazy MMU mode on PTE mapping error 2025-08-28 16:24:03 +02:00
vmpressure.c net-memcg: Fix scope of sockmem pressure indicators 2023-09-19 12:22:33 +02:00
vmscan.c mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock 2025-05-02 07:44:04 +02:00
vmstat.c vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event 2024-12-14 19:50:33 +01:00
workingset.c memcg: sync flush only if periodic flush is delayed 2022-04-27 14:38:57 +02:00
z3fold.c
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n 2025-08-28 16:24:05 +02:00
zswap.c