Go to file
Kairui Song 5faad2d566 mm/shmem, swap: improve cached mTHP handling and fix potential hang
commit 5c241ed8d0 upstream.

The current swap-in code assumes that, when a swap entry in shmem mapping
is order 0, its cached folios (if present) must be order 0 too, which
turns out not always correct.

The problem is shmem_split_large_entry is called before verifying the
folio will eventually be swapped in, one possible race is:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0.  The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
return false causing swap-in to return -EEXIST.

And this looks fragile.  So fix this up by allowing seeing a larger folio
in swap cache, and check the whole shmem mapping range covered by the
swapin have the right swap value upon inserting the folio.  And drop the
redundant tree walks before the insertion.

This will actually improve performance, as it avoids two redundant Xarray
tree walks in the hot path, and the only side effect is that in the
failure path, shmem may redundantly reallocate a few folios causing
temporary slight memory pressure.

And worth noting, it may seems the order and value check before inserting
might help reducing the lock contention, which is not true.  The swap
cache layer ensures raced swapin will either see a swap cache folio or
failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is
bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention.  The chance that a
folio passes the swap entry value check but the shmem mapping slot has
changed should be very low.

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com
Fixes: 809bc86517 ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20 18:41:41 +02:00
arch mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd() 2025-08-20 18:41:41 +02:00
block block: Introduce bio_needs_zone_write_plugging() 2025-08-20 18:41:35 +02:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto crypto: jitter - fix intermediary handling 2025-08-20 18:41:24 +02:00
Documentation sphinx: kernel_abi: fix performance regression with O=<dir> 2025-08-20 18:41:22 +02:00
drivers i2c: core: Fix double-free of fwnode in i2c_unregister_device() 2025-08-20 18:41:41 +02:00
fs ocfs2: reset folio to NULL when get folio fails 2025-08-20 18:41:40 +02:00
include block: Introduce bio_needs_zone_write_plugging() 2025-08-20 18:41:35 +02:00
init io_uring: fix breakage in EXPERT menu 2025-08-15 16:38:23 +02:00
io_uring io_uring/net: commit partial buffers on retry 2025-08-20 18:40:44 +02:00
ipc - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
kernel futex: Use user_write_access_begin/_end() in futex_put_value() 2025-08-20 18:41:35 +02:00
lib lib/sbitmap: convert shallow_depth from one word to the whole sbitmap 2025-08-20 18:41:31 +02:00
LICENSES LICENSES: add CC0-1.0 license text 2025-05-21 14:54:17 +02:00
mm mm/shmem, swap: improve cached mTHP handling and fix potential hang 2025-08-20 18:41:41 +02:00
net net/sched: ets: use old 'nbands' while purging unused classes 2025-08-20 18:41:40 +02:00
rust rust: workaround rustdoc target modifiers bug 2025-08-20 18:41:35 +02:00
samples samples/damon/mtier: support boot time enable setup 2025-08-20 18:41:35 +02:00
scripts kconfig: lxdialog: fix 'space' to (de)select options 2025-08-20 18:41:31 +02:00
security apparmor: fix x_table_lookup when stacking is not the first entry 2025-08-20 18:41:29 +02:00
sound ASoC: fsl_sai: replace regmap_write with regmap_update_bits 2025-08-20 18:41:33 +02:00
tools tools/power turbostat: Handle cap_get_proc() ENOSYS 2025-08-20 18:41:31 +02:00
usr usr/include: openrisc: don't HDRTEST bpf_perf_event.h 2025-05-12 15:03:17 +09:00
virt KVM: Allow CPU to reschedule while setting per-page memory attributes 2025-06-24 12:20:17 -07:00
.clang-format Linux 6.15-rc5 2025-05-06 16:39:25 +10:00
.clippy.toml rust: clean Rust 1.88.0's warning about clippy::disallowed_macros configuration 2025-05-07 00:11:47 +02:00
.cocciconfig
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes
.gitignore gitignore: allow .pylintrc to be tracked 2025-08-15 16:39:03 +02:00
.mailmap 11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues 2025-07-24 19:13:30 -07:00
.pylintrc docs: add a .pylintrc file with sys path for docs scripts 2025-04-09 12:10:33 -06:00
.rustfmt.toml
COPYING
CREDITS mm: update MAINTAINERS entry for HMM 2025-07-19 19:26:16 -07:00
Kbuild drm: ensure drm headers are self-contained and pass kernel-doc 2025-02-12 10:44:43 +02:00
Kconfig io_uring: Rename KConfig to Kconfig 2025-02-19 14:53:27 -07:00
MAINTAINERS 11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues 2025-07-24 19:13:30 -07:00
Makefile Linux 6.16.1 2025-08-15 16:39:37 +02:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel

There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first.

In order to build the documentation, use make htmldocs or make pdfdocs. The formatted documentation can also be read online at:

https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory, several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result by upgrading your kernel.