mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2025-08-22 00:42:01 +02:00
v6.6.100
1006 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
af6cfcd0ef |
mm/hugetlb: unshare page tables during VMA split, not before
commit |
||
![]() |
fe68429041 |
mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
commit |
||
![]() |
b5681a8b99 |
mm/hugetlb: move hugetlb_sysctl_init() to the __init section
commit
|
||
![]() |
c04035ce80 |
mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
commit |
||
![]() |
56b274473d |
mm: hugetlb: independent PMD page table shared count
[ Upstream commit |
||
![]() |
ec500230d3 |
mm/hugetlb: enforce that PMD PT sharing has split PMD PT locks
[ Upstream commit |
||
![]() |
0275e4021b |
mm: always initialise folio->_deferred_list
commit |
||
![]() |
26273f5f4c |
mm: gup: stop abusing try_grab_folio
commit |
||
![]() |
2ae4d58218 |
mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
commit |
||
![]() |
99a49b670e |
mm/hugetlb: fix possible recursive locking detected warning
commit |
||
![]() |
c311d65129 |
hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
commit
|
||
![]() |
cb3ea7684a |
mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid
commit |
||
![]() |
7e0a322877 |
mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
commit |
||
![]() |
2431b5f265 |
mm: turn folio_test_hugetlb into a PageType
commit |
||
![]() |
f6c5d21db1 |
mm/hugetlb: fix missing hugetlb_lock for resv uncharge
commit |
||
![]() |
db01bfbddd |
mm/userfaultfd: allow hugetlb change protection upon poison entry
commit |
||
![]() |
512b420aaf |
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
commit |
||
![]() |
18576d128e |
mm/hugetlb: use nth_page() in place of direct struct page manipulation
commit |
||
![]() |
2820b0f09b |
hugetlbfs: close race between MADV_DONTNEED and page fault
Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1 CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page
Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.
Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
Fixes:
|
||
![]() |
bf4916922c |
hugetlbfs: extend hugetlb_vma_lock to private VMAs
Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.
Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.
Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
Fixes:
|
||
![]() |
92fe9dcbe4 |
hugetlbfs: clear resv_map pointer if mmap fails
Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1 CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page
Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of the
MADV_DONTNEED VMA until the huge pages have actually been freed.
This patch (of 3):
Hugetlbfs leaves a dangling pointer in the VMA if mmap fails. This has
not been a problem so far, but other code in this patch series tries to
follow that pointer.
Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
Fixes:
|
||
![]() |
935d4f0c6d |
mm: hugetlb: add huge page size param to set_huge_pte_at()
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit |
||
![]() |
8cfd014efd |
hugetlb: add documentation for vma_kernel_pagesize()
This is an exported symbol, so it should have kernel-doc. Update it to mention folios, and point out that they might be larger than the supported page size for this VMA. Link: https://lkml.kernel.org/r/20230822172459.4190699-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6c14197308 |
hugetlb: clear flags in tail pages that will be freed individually
hugetlb manually creates and destroys compound pages. As such it makes assumptions about struct page layout. Commit |
||
![]() |
9c5ccf2db0 |
mm: remove HUGETLB_PAGE_DTOR
We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
454a00c40a |
mm: convert free_huge_page() to free_huge_folio()
Pass a folio instead of the head page to save a few instructions. Update the documentation, at least in English. Link: https://lkml.kernel.org/r/20230816151201.3655946-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5994eabf3b | merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes | ||
![]() |
e727bfd5e7 |
mm: replace mmap with vma write lock assertions when operating on a vma
Vma write lock assertion always includes mmap write lock assertion and additional vma lock checks when per-VMA locks are enabled. Replace weaker mmap_assert_write_locked() assertions with stronger vma_assert_write_locked() ones when we are operating on a vma which is expected to be locked. Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6c1aa2d37f |
mm/hugetlb.c: use helper macro K()
Use helper macro K() to improve code readability. No functional modification involved. Link: https://lkml.kernel.org/r/20230804012559.2617515-8-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f720b471fd |
mm: hugetlb: use flush_hugetlb_tlb_range() in move_hugetlb_page_tables()
Archs may need to do special things when flushing hugepage tlb, so use the
more applicable flush_hugetlb_tlb_range() instead of flush_tlb_range().
Link: https://lkml.kernel.org/r/20230801023145.17026-2-wangkefeng.wang@huawei.com
Fixes:
|
||
![]() |
4ec31152a8 |
mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
Handle a little more of the page fault path outside the mmap sem. The hugetlb path doesn't need to check whether the VMA is anonymous; the VM_HUGETLB flag is only set on hugetlbfs VMAs. There should be no performance change from the previous commit; this is simply a step to ease bisection of any problems. Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
1af5a81099 |
mmu_notifiers: rename invalidate_range notifier
There are two main use cases for mmu notifiers. One is by KVM which uses mmu_notifier_invalidate_range_start()/end() to manage a software TLB. The other is to manage hardware TLBs which need to use the invalidate_range() callback because HW can establish new TLB entries at any time. Hence using start/end() can lead to memory corruption as these callbacks happen too soon/late during page unmap. mmu notifier users should therefore either use the start()/end() callbacks or the invalidate_range() callbacks. To make this usage clearer rename the invalidate_range() callback to arch_invalidate_secondary_tlbs() and update documention. Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ec8832d007 |
mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
Secondary TLBs are now invalidated from the architecture specific TLB invalidation functions. Therefore there is no need to explicitly notify or invalidate as part of the range end functions. This means we can remove mmu_notifier_invalidate_range_end_only() and some of the ptep_*_notify() functions. Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
affd26b1fb |
mm/hugetlb: get rid of page_hstate()
Convert the last page_hstate() user to use folio_hstate() so page_hstate() can be safely removed. Link: https://lkml.kernel.org/r/20230719184145.301911-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
8a13897fb0 |
mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
The behavior here is the same as it is for anon/shmem. This is done separately because hugetlb pte marker handling is a bit different. Link: https://lkml.kernel.org/r/20230707215540.2324998-6-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
af19487f00 |
mm: make PTE_MARKER_SWAPIN_ERROR more general
Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD", v4. This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4 for a detailed description of the feature. This patch (of 8): Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement UFFDIO_POISON, so make some various preparations for that: First, rename it to just PTE_MARKER_POISONED. The "SWAPIN" can be confusing since we're going to re-use it for something not really related to swap. This can be particularly confusing for things like hugetlbfs, which doesn't support swap whatsoever. Also rename some various helper functions. Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap. But, since we're going to re-use it, we want it to go ahead and copy it just like non-hugetlbfs memory does today. Since the code to do this is more complicated now, pull it out into a helper which can be re-used in both places. While we're at it, also make it slightly more explicit in its handling of e.g. uffd wp markers. For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an error entry, return VM_FAULT_HWPOISON. For most cases this change doesn't matter, e.g. a userspace program would receive a SIGBUS either way. But for UFFDIO_POISON, this change will let KVM guests get an MCE out of the box, instead of giving a SIGBUS to the hypervisor and requiring it to somehow inject an MCE. Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today because the lack of swap support means we'll never end up with such a PTE anyway, but this behavior will be needed once such entries *can* show up via UFFDIO_POISON. Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
4849807114 |
mm/gup: retire follow_hugetlb_page()
Now __get_user_pages() should be well prepared to handle thp completely, as long as hugetlb gup requests even without the hugetlb's special path. Time to retire follow_hugetlb_page(). Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal. Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5502ea44f5 |
mm/hugetlb: add page_mask for hugetlb_follow_page_mask()
follow_page() doesn't need it, but we'll start to need it when unifying gup for hugetlb. Link: https://lkml.kernel.org/r/20230628215310.73782-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
458568c929 |
mm/hugetlb: prepare hugetlb_follow_page_mask() for FOLL_PIN
follow_page() doesn't use FOLL_PIN, meanwhile hugetlb seems to not be the target of FOLL_WRITE either. However add the checks. Namely, either the need to CoW due to missing write bit, or proper unsharing on !AnonExclusive pages over R/O pins to reject the follow page. That brings this function closer to follow_hugetlb_page(). So we don't care before, and also for now. But we'll care if we switch over slow-gup to use hugetlb_follow_page_mask(). We'll also care when to return -EMLINK properly, as that's the gup internal api to mean "we should unshare". Not really needed for follow page path, though. When at it, switching the try_grab_page() to use WARN_ON_ONCE(), to be clear that it just should never fail. When error happens, instead of setting page==NULL, capture the errno instead. Link: https://lkml.kernel.org/r/20230628215310.73782-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
dd767aaa2f |
mm/hugetlb: handle FOLL_DUMP well in follow_page_mask()
Patch series "mm/gup: Unify hugetlb, speed up thp", v4. Hugetlb has a special path for slow gup that follow_page_mask() is actually skipped completely along with faultin_page(). It's not only confusing, but also duplicating a lot of logics that generic gup already has, making hugetlb slightly special. This patchset tries to dedup the logic, by first touching up the slow gup code to be able to handle hugetlb pages correctly with the current follow page and faultin routines (where we're mostly there.. due to 10 years ago we did try to optimize thp, but half way done; more below), then at the last patch drop the special path, then the hugetlb gup will always go the generic routine too via faultin_page(). Note that hugetlb is still special for gup, mostly due to the pgtable walking (hugetlb_walk()) that we rely on which is currently per-arch. But this is still one small step forward, and the diffstat might be a proof too that this might be worthwhile. Then for the "speed up thp" side: as a side effect, when I'm looking at the chunk of code, I found that thp support is actually partially done. It doesn't mean that thp won't work for gup, but as long as **pages pointer passed over, the optimization will be skipped too. Patch 6 should address that, so for thp we now get full speed gup. For a quick number, "chrt -f 1 ./gup_test -m 512 -t -L -n 1024 -r 10" gives me 13992.50us -> 378.50us. Gup_test is an extreme case, but just to show how it affects thp gups. This patch (of 8): Firstly, the no_page_table() is meaningless for hugetlb which is a no-op there, because a hugetlb page always satisfies: - vma_is_anonymous() == false - vma->vm_ops->fault != NULL So we can already safely remove it in hugetlb_follow_page_mask(), alongside with the page* variable. Meanwhile, what we do in follow_hugetlb_page() actually makes sense for a dump: we try to fault in the page only if the page cache is already allocated. Let's do the same here for follow_page_mask() on hugetlb. It should so far has zero effect on real dumps, because that still goes into follow_hugetlb_page(). But this may start to influence a bit on follow_page() users who mimics a "dump page" scenario, but hopefully in a good way. This also paves way for unifying the hugetlb gup-slow. Link: https://lkml.kernel.org/r/20230628215310.73782-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230628215310.73782-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
32c877191e |
hugetlb: do not clear hugetlb dtor until allocating vmemmap
Patch series "Fix hugetlb free path race with memory errors".
In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags non-existant page structs.
A much more detailed description is in the first patch.
The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation. This could lead to slowdowns if one is freeing a large
number of hugetlb pages.
The second path optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.
The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression. It can be combined with
the first, but was done separately make reviewing easier.
This patch (of 2):
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes:
|
||
![]() |
191fcdb6c9 |
mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
The following crash happens for me when running the -mm selftests (below). Specifically, it happens while running the uffd-stress subtests: kernel BUG at mm/hugetlb.c:7249! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 3238 Comm: uffd-stress Not tainted 6.4.0-hubbard-github+ #109 Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018 RIP: 0010:huge_pte_alloc+0x12c/0x1a0 ... Call Trace: <TASK> ? __die_body+0x63/0xb0 ? die+0x9f/0xc0 ? do_trap+0xab/0x180 ? huge_pte_alloc+0x12c/0x1a0 ? do_error_trap+0xc6/0x110 ? huge_pte_alloc+0x12c/0x1a0 ? handle_invalid_op+0x2c/0x40 ? huge_pte_alloc+0x12c/0x1a0 ? exc_invalid_op+0x33/0x50 ? asm_exc_invalid_op+0x16/0x20 ? __pfx_put_prev_task_idle+0x10/0x10 ? huge_pte_alloc+0x12c/0x1a0 hugetlb_fault+0x1a3/0x1120 ? finish_task_switch+0xb3/0x2a0 ? lock_is_held_type+0xdb/0x150 handle_mm_fault+0xb8a/0xd40 ? find_vma+0x5d/0xa0 do_user_addr_fault+0x257/0x5d0 exc_page_fault+0x7b/0x1f0 asm_exc_page_fault+0x22/0x30 That happens because a BUG() statement in huge_pte_alloc() attempts to check that a pte, if present, is a hugetlb pte, but it does so in a non-lockless-safe manner that leads to a false BUG() report. We got here due to a couple of bugs, each of which by itself was not quite enough to cause a problem: First of all, before commit c33c794828f2("mm: ptep_get() conversion"), the BUG() statement in huge_pte_alloc() was itself fragile: it relied upon compiler behavior to only read the pte once, despite using it twice in the same conditional. Next, commit |
||
![]() |
fd4aed8d98 |
hugetlb: revert use of page_cache_next_miss()
Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
Closes tag. The issue showed up after the conversion of hugetlb page
cache lookup code to use page_cache_next_miss. User visible effects are:
- hugetlbfs fallocate incorrectly returns -EEXIST if pages are presnet
in the file.
- hugetlb pages will not be included in core dumps if they need to be
brought in via GUP.
- userfaultfd UFFDIO_COPY will not notice pages already present in the
cache. It may try to allocate a new page and potentially return
ENOMEM as opposed to EEXIST.
Revert the use page_cache_next_miss() in hugetlb code.
IMPORTANT NOTE FOR STABLE BACKPORTS:
This patch will apply cleanly to v6.3. However, due to the change of
filemap_get_folio() return values, it will not function correctly. This
patch must be modified for stable backports.
[dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()]
Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain
Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com
Fixes:
|
||
![]() |
c33c794828 |
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
349d167000 |
mm/hugetlb: fix pgtable lock on pmd sharing
Huge pmd sharing operates on PUD not PMD, huge_pte_lock() is not suitable in this case because it should only work for last level pte changes, while pmd sharing is always one level higher. Meanwhile, here we're locking over the spte pgtable lock which is even not a lock for current mm but someone else's. It seems even racy on operating on the lock, as after put_page() of the spte pgtable page logically the page can be released, so at least the spin_unlock() needs to be done after the put_page(). No report I am aware, I'm not even sure whether it'll just work on taking the spte pmd lock, because while we're holding i_mmap read lock it probably means the vma interval tree is frozen, all pte allocators over this pud entry could always find the specific svma and spte page, so maybe they'll serialize on this spte page lock? Even so, doesn't seem to be expected. It just seems to be an accident of |
||
![]() |
e3b7bf972d |
mm/folio: avoid special handling for order value 0 in folio_set_order
folio_set_order(folio, 0) is used in kernel at two places
__destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
Currently, It is called to clear out the folio->_folio_nr_pages and
folio->_folio_order.
For __destroy_compound_gigantic_folio:
In past, folio_set_order(folio, 0) was needed because page->mapping used
to overlap with _folio_nr_pages and _folio_order. So if these fields were
left uncleared during freeing gigantic hugepages, they were causing
"BUG: bad page state" due to non-zero page->mapping. Now, After
Commit
|
||
![]() |
061e62e818 |
mm/hugetlb: use a folio in hugetlb_fault()
We can replace seven implicit calls to compound_head() with one by using folio. [akpm@linux-foundation.org: update comment, per Sidhartha] Link: https://lkml.kernel.org/r/20230606062013.2947002-4-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
959a78b6dd |
mm/hugetlb: use a folio in hugetlb_wp()
We can replace nine implict calls to compound_head() with one by using old_folio. The page we get back is always a head page, so we just convert old_page to old_folio. Link: https://lkml.kernel.org/r/20230606062013.2947002-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ad27ce206a |
mm/hugetlb: use a folio in copy_hugetlb_page_range()
Patch series "Convert several functions in hugetlb.c to use a folio", v2. This patch series converts three functions in hugetlb.c to use a folio, which can remove several implicit calls to compound_head(). This patch (of 3): We can replace five implict calls to compound_head() with one by using pte_folio. The page we get back is always a head page, so we just convert ptepage to pte_folio. Link: https://lkml.kernel.org/r/20230606062013.2947002-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20230606062013.2947002-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b2cac24819 |
mm/gup: remove vmas array from internal GUP functions
Now we have eliminated all callers to GUP APIs which use the vmas parameter, eliminate it altogether. This eliminates a class of bugs where vmas might have been kept around longer than the mmap_lock and thus we need not be concerned about locks being dropped during this operation leaving behind dangling pointers. This simplifies the GUP API and makes it considerably clearer as to its purpose - follow flags are applied and if pinning, an array of pages is returned. Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |