Go to file
Hugh Dickins fdac0a3f58 mm/gup: check ref_count instead of lru before migration
commit 98c6d259319ecf6e8d027abd3f14b81324b8c0ad upstream.

Patch series "mm: better GUP pin lru_add_drain_all()", v2.

Series of lru_add_drain_all()-related patches, arising from recent mm/gup
migration report from Will Deacon.


This patch (of 5):

Will Deacon reports:-

When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not be
longterm pinned, for example because they reside in a CMA region or
movable zone.  This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the list
off to migrate_pages() for the actual migration.

It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller.  Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve stage-2
page faults from the guest) on a 6.15-based Pixel 6 device and this
results in the VM terminating prematurely.

In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning.  Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not present
and so the faulting logic allocates a new folio, mlocks it with
mlock_folio() and maps it in the page-table.

Since commit 2fbb0c10d1 ("mm/munlock: mlock_page() munlock_page() batch
by pagevec"), mlock/munlock operations on a folio (formerly page), are
deferred.  For example, mlock_folio() takes an additional reference on the
target folio before placing it into a per-cpu 'folio_batch' for later
processing by mlock_folio_batch(), which drops the refcount once the
operation is complete.  Processing of the batches is coupled with the LRU
batch logic and can be forcefully drained with lru_add_drain_all() but as
long as a folio remains unprocessed on the batch, its refcount will be
elevated.

This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration code
fails to migrate a folio due to the elevated refcount from the pending
mlock operation.

Hugh Dickins adds:-

!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references to
make the folio migratable: the LRU flag may be set even while the folio is
held with an extra reference in a per-CPU LRU cache.

5.18 commit 2fbb0c10d1 may have made it more unreliable.  Then 6.11
commit 33dfe9204f ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing; but
missed the mlock/munlock batches, so still unreliable as reported.

And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective when easily suppressed while !LRU.

Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.

Note on PG_private_2: ceph and nfs are still using the deprecated
PG_private_2 flag, with the aid of netfs and filemap support functions.
Although it is consistently matched by an increment of folio ref_count,
folio_expected_ref_count() intentionally does not recognize it, and ceph
folio migration currently depends on that for PG_private_2 folios to be
rejected.  New references to the deprecated flag are discouraged, so do
not add it into the collect_longterm_unpinnable_folios() calculation: but
longterm pinning of transiently PG_private_2 ceph and nfs folios (an
uncommon case) may invoke a redundant lru_add_drain_all().  And this makes
easy the backport to earlier releases: up to and including 6.12, btrfs
also used PG_private_2, but without a ref_count increment.

Note for stable backports: requires 6.16 commit 86ebd50224 ("mm:
add folio_expected_ref_count() for reference count calculation").

Link: https://lkml.kernel.org/r/41395944-b0e3-c3ac-d648-8ddd70451d28@google.com
Link: https://lkml.kernel.org/r/bd1f314a-fca1-8f19-cac0-b936c9614557@google.com
Fixes: 9a4e9f3b2d ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Keir Fraser <keirf@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: yangge <yangge1116@126.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25 11:16:46 +02:00
arch um: Fix FD copy size in os_rcv_fd_msg() 2025-09-25 11:16:42 +02:00
block block: don't silently ignore metadata for sync read/write 2025-09-19 16:37:26 +02:00
certs sign-file,extract-cert: use pkcs11 provider for OPENSSL MAJOR >= 3 2024-09-20 19:52:48 +03:00
crypto crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg 2025-09-25 11:16:45 +02:00
Documentation doc/netlink: Fix typos in operation attributes 2025-09-25 11:16:44 +02:00
drivers dm-stripe: fix a possible integer overflow 2025-09-25 11:16:46 +02:00
fs btrfs: initialize inode::file_extent_tree after i_mode has been set 2025-09-25 11:16:46 +02:00
include crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg 2025-09-25 11:16:45 +02:00
init io_uring: fix breakage in EXPERT menu 2025-08-15 16:38:23 +02:00
io_uring fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availability 2025-09-19 16:37:26 +02:00
ipc - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
kernel Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()" 2025-09-25 11:16:46 +02:00
lib lib/sbitmap: convert shallow_depth from one word to the whole sbitmap 2025-08-20 18:41:31 +02:00
LICENSES LICENSES: add CC0-1.0 license text 2025-05-21 14:54:17 +02:00
mm mm/gup: check ref_count instead of lru before migration 2025-09-25 11:16:46 +02:00
net tls: make sure to abort the stream if headers are bogus 2025-09-25 11:16:45 +02:00
rust rust: mm: mark VmaNew as transparent 2025-09-09 19:02:29 +02:00
samples ftrace/samples: Fix function size computation 2025-09-19 16:37:28 +02:00
scripts rust: support Rust >= 1.91.0 target spec 2025-09-09 19:02:35 +02:00
security apparmor: Fix 8-byte alignment for initial dfa blob streams 2025-08-28 16:34:16 +02:00
sound ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported 2025-09-25 11:16:42 +02:00
tools selftests: mptcp: sockopt: fix error messages 2025-09-25 11:16:43 +02:00
usr usr/include: openrisc: don't HDRTEST bpf_perf_event.h 2025-05-12 15:03:17 +09:00
virt KVM: Allow CPU to reschedule while setting per-page memory attributes 2025-06-24 12:20:17 -07:00
.clang-format Linux 6.15-rc5 2025-05-06 16:39:25 +10:00
.clippy.toml rust: clean Rust 1.88.0's warning about clippy::disallowed_macros configuration 2025-05-07 00:11:47 +02:00
.cocciconfig
.editorconfig .editorconfig: remove trim_trailing_whitespace option 2024-06-13 16:47:52 +02:00
.get_maintainer.ignore MAINTAINERS: Retire Ralf Baechle 2024-11-12 15:48:59 +01:00
.gitattributes
.gitignore gitignore: allow .pylintrc to be tracked 2025-08-15 16:39:03 +02:00
.mailmap 11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues 2025-07-24 19:13:30 -07:00
.pylintrc docs: add a .pylintrc file with sys path for docs scripts 2025-04-09 12:10:33 -06:00
.rustfmt.toml
COPYING
CREDITS mm: update MAINTAINERS entry for HMM 2025-07-19 19:26:16 -07:00
Kbuild drm: ensure drm headers are self-contained and pass kernel-doc 2025-02-12 10:44:43 +02:00
Kconfig io_uring: Rename KConfig to Kconfig 2025-02-19 14:53:27 -07:00
MAINTAINERS 11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues 2025-07-24 19:13:30 -07:00
Makefile Linux 6.16.8 2025-09-19 16:37:39 +02:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel

There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first.

In order to build the documentation, use make htmldocs or make pdfdocs. The formatted documentation can also be read online at:

https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory, several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result by upgrading your kernel.