Nhat Pham 56e5a103a7 zsmalloc: prefer the original page's node for compressed data
Currently, zsmalloc, zswap's and zram's backend memory allocator, does not
enforce any policy for the allocation of memory for the compressed data,
instead just adopting the memory policy of the task entering reclaim, or
the default policy (prefer local node) if no such policy is specified. 
This can lead to several pathological behaviors in multi-node NUMA
systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap/zram: the coldest pages demoted to the CXL tier
   can return to the high tier when they are reclaimed to compressed swap,
   creating memory pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference.  If it ventures into remote nodes, the memory it compresses
   there should stay there.  Trying to shift those contents over to the
   reclaiming thread's preferred node further *increases* its local
   pressure, and provokes more spills.  The remote node is also the most
   likely to refault this data again.  This undesirable behavior was
   pointed out by Johannes Weiner in [1].

3. For zswap writeback, the zswap entries are organized in
   node-specific LRUs, based on the node placement of the original pages,
   allowing for targeted zswap writeback for specific nodes.

   However, the compressed data of a zswap entry can be placed on a
   different node from the LRU it is placed on.  This means that reclaim
   targeted at one node might not free up memory used for zswap entries in
   that node, but instead reclaim memory in a different node.

All of these issues will be resolved if the compressed data go to the same
node as the original page.  This patch encourages this behavior by having
zswap and zram pass the node of the original page to zsmalloc, and having
zsmalloc prefer the specified node if we need to allocate new (zs)pages
for the compressed data.
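
The change in node selection described above can be sketched as a small
userspace model. This is not the kernel code: the `toy_*` names, the two-node
topology, and the reduction of "memory policy" to a single local node are all
assumptions made for illustration.

```c
#include <assert.h>

#define NR_NODES 2 /* assumed toy topology: node 0 = high tier, node 1 = CXL tier */

/* Hypothetical stand-ins for a page and a reclaiming task. */
struct toy_page { int nid; };        /* node the original page lives on */
struct toy_task { int local_nid; };  /* node the reclaiming task prefers */

/*
 * Status quo (simplified): with the default policy, the backing memory
 * for the compressed data follows the reclaiming task's local node.
 */
static int toy_target_node_before(const struct toy_task *reclaimer,
				  const struct toy_page *page)
{
	(void)page; /* the page's own node is ignored */
	assert(reclaimer->local_nid >= 0 && reclaimer->local_nid < NR_NODES);
	return reclaimer->local_nid;
}

/*
 * With the patch (simplified): the caller passes the original page's
 * node down, and the allocator prefers that node.
 */
static int toy_target_node_after(const struct toy_task *reclaimer,
				 const struct toy_page *page)
{
	(void)reclaimer; /* the task's placement no longer decides */
	assert(page->nid >= 0 && page->nid < NR_NODES);
	return page->nid;
}
```

In this model, a page demoted to node 1 and reclaimed by a task running on
node 0 is compressed onto node 0 before the change and onto node 1 after it,
which is the inversion described in item 1 above.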

Note that we are not strictly binding the allocation to the preferred
node.  We still allow the allocation to fall back to other nodes when the
preferred node is full, or if we have zspages with slots available on a
different node.  This is OK, and still a strict improvement over the
status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over compressed swapping, and only swap when pages have
   already gone to the lowest tier.  This patch should achieve the desired
   effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   go to other nodes can be better than the alternative (OOMs, keeping
   cold memory unreclaimed, disk swapping, etc.).

3. If the allocation goes to a separate node because we have a zspage
   with slots available, at least we're not creating extra immediate
   memory pressure (since the space is already allocated).

4. While there can be mixings, we generally reclaim pages in same-node
   batches, which encourages zspage grouping that is more likely to go to
   the right node.

5. A strict binding would require partitioning zsmalloc by node, which
   is more complicated, and more prone to regression, since it reduces the
   storage density of zsmalloc.  We need to evaluate the tradeoff and
   benchmark carefully before adopting such an involved solution.
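
The non-strict preference discussed in the list above can be sketched as a
userspace model: reuse a free slot in any existing zspage first, and only when
a new zspage is needed prefer the original page's node, falling back to
another node if that one is full. The `toy_*` names, slot count, and pool
layout are hypothetical, and the real zsmalloc additionally groups zspages by
size class, which this sketch leaves out.

```c
#include <assert.h>

#define NR_NODES 2          /* assumed toy topology */
#define SLOTS_PER_ZSPAGE 4  /* hypothetical number of objects per zspage */
#define MAX_ZSPAGES 8

struct toy_zspage {
	int nid;   /* node backing this zspage */
	int used;  /* slots currently in use */
};

struct toy_pool {
	struct toy_zspage zspages[MAX_ZSPAGES];
	int nr_zspages;
	int free_pages[NR_NODES]; /* free backing pages per node */
};

/* Pick a node for a new zspage: preferred node first, then any other. */
static int toy_pick_node(const struct toy_pool *p, int nid)
{
	if (p->free_pages[nid] > 0)
		return nid;
	for (int i = 0; i < NR_NODES; i++)
		if (p->free_pages[i] > 0)
			return i;
	return -1;
}

/*
 * Place one compressed object whose original page lived on `nid`.
 * Returns the node the object landed on, or -1 on failure.
 */
static int toy_zs_malloc(struct toy_pool *p, int nid)
{
	int target;

	assert(nid >= 0 && nid < NR_NODES);

	/* 1. Reuse a free slot in an existing zspage, whatever its node. */
	for (int i = 0; i < p->nr_zspages; i++) {
		if (p->zspages[i].used < SLOTS_PER_ZSPAGE) {
			p->zspages[i].used++;
			return p->zspages[i].nid;
		}
	}

	/* 2. Otherwise back a new zspage, preferring the page's node. */
	target = toy_pick_node(p, nid);
	if (target < 0 || p->nr_zspages == MAX_ZSPAGES)
		return -1;

	p->free_pages[target]--;
	p->zspages[p->nr_zspages].nid = target;
	p->zspages[p->nr_zspages].used = 1;
	p->nr_zspages++;
	return target;
}
```

In this model an object can land on a non-preferred node in exactly two ways:
a slot was already available there (no new page is consumed), or the
preferred node had no free pages left. A strict binding would instead need one
pool per node, which is the density tradeoff noted in item 5.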

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

[senozhatsky@chromium.org: coding-style fixes]
  Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge
Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram, zsmalloc]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>	[zswap/zsmalloc]
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:48:06 -07:00
arch arch: remove mk_pmd() 2025-05-11 17:48:04 -07:00
block block-6.15-20250509 2025-05-09 10:34:50 -07:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto crypto: scompress - increment scomp_scratch_users when already allocated 2025-04-25 10:33:30 +08:00
Documentation Input updates for v6.15-rc5 2025-05-11 10:29:29 -07:00
drivers zsmalloc: prefer the original page's node for compressed data 2025-05-11 17:48:06 -07:00
fs mm: add folio_mk_pmd() 2025-05-11 17:48:04 -07:00
include zsmalloc: prefer the original page's node for compressed data 2025-05-11 17:48:06 -07:00
init rust: clean Rust 1.88.0's unnecessary_transmutes lint 2025-05-07 00:11:47 +02:00
io_uring io_uring/sqpoll: Increase task_work submission batch size 2025-05-09 07:56:53 -06:00
ipc treewide: const qualify ctl_tables where applicable 2025-01-28 13:48:37 +01:00
kernel kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() 2025-05-11 17:26:06 -07:00
lib iov_iter: convert iov_iter_extract_xarray_pages() to use folios 2025-05-11 17:48:05 -07:00
LICENSES LICENSES: add 0BSD license text 2024-09-01 20:43:24 -07:00
mm zsmalloc: prefer the original page's node for compressed data 2025-05-11 17:48:06 -07:00
net net: export a helper for adding up queue stats 2025-05-08 11:56:12 +02:00
rust rust: clean Rust 1.88.0's clippy::uninlined_format_args lint 2025-05-07 00:11:47 +02:00
samples samples/bpf: Fix compilation failure for samples/bpf on LoongArch Fedora 2025-04-25 09:32:02 -07:00
scripts scripts: Do not strip .rela.dyn section 2025-05-08 12:01:01 +00:00
security Landlock fix for v6.15-rc4 2025-04-24 12:59:05 -07:00
sound ASoC: Fixes for v6.15 2025-05-01 10:22:20 +02:00
tools ARM: 2025-05-11 11:30:13 -07:00
usr kbuild: hdrcheck: fix cross build with clang 2025-03-05 04:06:45 +09:00
virt ARM: 2025-04-08 13:47:55 -07:00
.clang-format clang-format: Update the ForEachMacros list for v6.15-rc1 2025-04-13 11:03:59 +02:00
.clippy.toml rust: clean Rust 1.88.0's warning about clippy::disallowed_macros configuration 2025-05-07 00:11:47 +02:00
.cocciconfig
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes
.gitignore kbuild: Create intermediate vmlinux build with relocations preserved 2025-03-17 00:29:50 +09:00
.mailmap Input updates for v6.15-rc5 2025-05-11 10:29:29 -07:00
.rustfmt.toml
COPYING
CREDITS MAINTAINERS: update SLAB ALLOCATOR maintainers 2025-04-17 20:10:06 -07:00
Kbuild drm: ensure drm headers are self-contained and pass kernel-doc 2025-02-12 10:44:43 +02:00
Kconfig io_uring: Rename KConfig to Kconfig 2025-02-19 14:53:27 -07:00
MAINTAINERS MAINTAINERS: add mm GUP section 2025-05-11 17:26:07 -07:00
Makefile Linux 6.15-rc6 2025-05-11 14:54:11 -07:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel

There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first.

In order to build the documentation, use make htmldocs or make pdfdocs. The formatted documentation can also be read online at:

https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory, several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result from upgrading your kernel.