Go to file
Feng Tang c635a42d9b mm/page_alloc: detect allocation forbidden by cpuset and bail out early
[ Upstream commit 8ca1b5a498 ]

There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.

It can be reproduced with command:

    $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(where node 4 is a movable node)

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that in this case, the target cpuset nodes only have
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal) like
GFP_HIGHUSER, and the cpuset limit forbids the allocation, then
out-of-memory killing is involved even when normal nodes and movable
nodes both have many free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope.  The only reasonable
measure to take is to fail the allocation right away and have the caller
to deal with it.

So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.

As page allocation is one of the hottest path in kernel, this check will
hurt all users with sane cpuset configuration, add a static branch check
and detect the abnormal config in cpuset memory binding setup so that
the extra check cost in page allocation is not paid by everyone.

[thanks to Micho Hocko and David Rientjes for suggesting not handling
 it inside OOM code, adding cpuset check, refining comments]

Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 65f97cc81b0a ("cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28 16:24:37 +02:00
arch x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper 2025-08-28 16:24:37 +02:00
block block: avoid possible overflow for chunk_sectors check in blk_stack_limits() 2025-08-28 16:24:25 +02:00
certs certs/blacklist_hashes.c: fix const confusion in certs blacklist 2022-06-22 14:22:01 +02:00
crypto crypto: xts - Only add ecb if it is not already there 2025-06-27 11:05:10 +01:00
Documentation asm-generic: Add memory barrier dma_mb() 2025-08-28 16:24:36 +02:00
drivers iio: light: as73211: Ensure buffer holes are zeroed 2025-08-28 16:24:37 +02:00
fs f2fs: fix to avoid out-of-boundary access in dnode page 2025-08-28 16:24:36 +02:00
include mm/page_alloc: detect allocation forbidden by cpuset and bail out early 2025-08-28 16:24:37 +02:00
init sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP 2025-05-02 07:44:36 +02:00
io_uring io_uring: fix possible deadlock in io_register_iowq_max_workers() 2024-11-17 15:06:25 +01:00
ipc ipc: fix to protect IPCS lookups using RCU 2025-06-27 11:05:26 +01:00
kernel mm/page_alloc: detect allocation forbidden by cpuset and bail out early 2025-08-28 16:24:37 +02:00
lib lib: bitmap: Introduce node-aware alloc API 2025-08-28 16:24:02 +02:00
LICENSES
mm mm/page_alloc: detect allocation forbidden by cpuset and bail out early 2025-08-28 16:24:37 +02:00
net mptcp: disable add_addr retransmission when timeout is 0 2025-08-28 16:24:35 +02:00
samples samples: mei: Fix building on musl libc 2025-08-28 16:24:07 +02:00
scripts kconfig: lxdialog: fix 'space' to (de)select options 2025-08-28 16:24:25 +02:00
security securityfs: don't pin dentries twice, once is enough... 2025-08-28 16:24:17 +02:00
sound ALSA: hda/realtek: Add support for HP EliteBook x360 830 G6 and EliteBook 830 G6 2025-08-28 16:24:34 +02:00
tools selftests: mptcp: pm: check flush doesn't reset limits 2025-08-28 16:24:37 +02:00
usr kbuild: hdrcheck: fix cross build with clang 2025-06-27 11:05:22 +01:00
virt KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() 2024-10-22 15:40:41 +02:00
.clang-format
.cocciconfig
.get_maintainer.ignore
.gitattributes
.gitignore Remove *.orig pattern from .gitignore 2024-10-17 15:11:10 +02:00
.mailmap
COPYING
CREDITS
Kbuild
Kconfig
MAINTAINERS trace: Relocate event helper files 2024-04-10 16:19:24 +02:00
Makefile kbuild: userprogs: use correct linker when mixing clang and GNU ld 2025-08-28 16:24:33 +02:00
README

Linux kernel

There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first.

In order to build the documentation, use make htmldocs or make pdfdocs. The formatted documentation can also be read online at:

https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory, several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result by upgrading your kernel.