Commit Graph

2259 Commits

Author SHA1 Message Date
Linus Torvalds
fcd0bb8e99 vfs-6.16-rc2.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaD1hUAAKCRCRxhvAZXjc
 ohIBAQCJQYP7bzg+A7C7K6cA+5LwC1u4aZ424B5puIZrLiEEDQEAxQli95/rTIeE
 m2DWBDl5rMrKlfmpqGvjbbJldU75swo=
 =2EhD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know
   how to decode a connected non-dir dentry fail the request

 - Use repr(transparent) to ensure identical layout between the C and
   Rust implementation of struct file

 - Add a missing xas_pause() into the dax code employing
   wait_entry_unlocked_exclusive()

 - Fix FOP_DONTCACHE which we disabled for v6.15.

   A folio could get redirtied and/or scheduled for writeback after the
   initial dropbehind test. Change the test accordingly to handle these
   cases so we can re-enable FOP_DONTCACHE again

* tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exportfs: require ->fh_to_parent() to encode connectable file handles
  rust: file: improve safety comments
  rust: file: mark `LocalFile` as `repr(transparent)`
  fs/dax: Fix "don't skip locked entries when scanning entries"
  iomap: don't lose folio dropbehind state for overwrites
  mm/filemap: unify dropbehind flag testing and clearing
  mm/filemap: unify read/write dropbehind naming
  Revert "Disable FOP_DONTCACHE for now due to bugs"
  mm/filemap: use filemap_end_dropbehind() for read invalidation
  mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
2025-06-02 12:49:16 -07:00
Linus Torvalds
00c010e130 - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a
   folio and reduces the amount of plumbing which architecture must
   implement to provide this.
 
 - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
   is a shower of largely unrelated folio infrastructure changes which
   clean things up and better prepare us for future work.
 
 - The 3 patch series "memory,x86,acpi: hotplug memory alignment
   advisement" from Gregory Price adds early-init code to prevent x86 from
   leaving physical memory unused when physical address regions are not
   aligned to memory block size.
 
 - The 2 patch series "mm/compaction: allow more aggressive proactive
   compaction" from Michal Clapinski provides some tuning of the (sadly,
   hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
   of proactive compaction.  In a simple test case, the reduction of a guest
   VM's memory consumption was dramatic.
 
 - The 8 patch series "Minor cleanups and improvements to swap freeing
   code" from Kemeng Shi provides some code cleaups and a small efficiency
   improvement to this part of our swap handling code.
 
 - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
   from Dmitry Levin adds the ability for a ptracer to modify syscalls
   arguments.  At this time we can alter only "system call information that
   are used by strace system call tampering, namely, syscall number,
   syscall arguments, and syscall return value.
 
   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.
 
 - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
   guard regions" from Andrei Vagin extends the info returned by the
   PAGEMAP_SCAN ioctl against /proc/pid/pagemap.  This permits CRIU to more
   efficiently get at the info about guard regions.
 
 - The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
   from Gavin Shan implements that fix.  No runtime effect is expected
   because validate_page_before_insert() happens to fix up this error.
 
 - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
   rewrite" from David Hildenbrand basically brings uprobe text poking into
   the current decade.  Remove a bunch of hand-rolled implementation in
   favor of using more current facilities.
 
 - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
   from Anshuman Khandual provides enhancements and generalizations to the
   pte dumping code.  This might be needed when 128-bit Page Table
   Descriptors are enabled for ARM.
 
 - The 12 patch series "Always call constructor for kernel page tables"
   from Kevin Brodsky "ensures that the ctor/dtor is always called for
   kernel pgtables, as it already is for user pgtables".  This permits the
   addition of more functionality such as "insert hooks to protect page
   tables".  This change does result in various architectures performing
   unnecesary work, but this is fixed up where it is anticipated to occur.
 
 - The 9 patch series "Rust support for mm_struct, vm_area_struct, and
   mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
   structures.
 
 - The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
   from Lorenzo Stoakes takes advantage of some VMA merging opportunities
   which we've been missing for 15 years.
 
 - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
   and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
   flushing.  Instead of flushing each address range in the provided iovec,
   we batch the flushing across all the iovec entries.  The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.
 
 - The 6 patch series "Track node vacancy to reduce worst case allocation
   counts" from Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.  stress-ng mmap performance increased by single-digit
   percentages and the amount of unnecessarily preallocated memory was
   dramaticelly reduced.
 
 - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
   Baoquan He removes a few unnecessary things which Baoquan noted when
   reading the code.
 
 - The 3 patch series ""Enhance sysfs handling for memory hotplug in
   weighted interleave" from Rakie Kim "enhances the weighted interleave
   policy in the memory management subsystem by improving sysfs handling,
   fixing memory leaks, and introducing dynamic sysfs updates for memory
   hotplug support".  Fixes things on error paths which we are unlikely to
   hit.
 
 - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
   including tiered memory" from SeongJae Park introduces new DAMOS quota
   goal metrics which eliminate the manual tuning which is required when
   utilizing DAMON for memory tiering.
 
 - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
   Baoquan He provides cleanups and small efficiency improvements which
   Baoquan found via code inspection.
 
 - The 2 patch series "vmscan: enforce mems_effective during demotion"
   from Gregory Price "changes reclaim to respect cpuset.mems_effective
   during demotion when possible".  because "presently, reclaim explicitly
   ignores cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated." "This is useful for isolating workloads on a
   multi-tenant system from certain classes of memory more consistently."
 
 - The 2 patch series ""Clean up split_huge_pmd_locked() and remove
   unnecessary folio pointers" from Gavin Guo provides minor cleanups and
   efficiency gains in in the huge page splitting and migrating code.
 
 - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
   creates a slab cache for `struct mem_cgroup', yielding improved memory
   utilization.
 
 - The 4 patch series "add max arg to swappiness in memory.reclaim and
   lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
   argument for memory.reclaim MGLRU's lru_gen.  This directs proactive
   reclaim to reclaim from only anon folios rather than file-backed folios.
 
 - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
   Rapoport is the first step on the path to permitting the kernel to
   maintain existing VMs while replacing the host kernel via file-based
   kexec.  At this time only memblock's reserve_mem is preserved.
 
 - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
   Woodhouse provides and uses a smarter way of looping over a pfn range.
   By skipping ranges of invalid pfns.
 
 - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
   one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
   VMA scanning when a task is pinned a single NUMA mode.  Dramatic
   performance benefits were seen in some real world cases.
 
 - The 2 patch series "JFS: Implement migrate_folio for
   jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
   during memory compaction when using JFS.
 
 - The 4 patch series "move all VMA allocation, freeing and duplication
   logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
   into the more appropriate mm/vma.c.
 
 - The 6 patch series "mm, swap: clean up swap cache mapping helper" from
   Kairui Song provides code consolidation and cleanups related to the
   folio_index() function.
 
 - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
   Moola does that.
 
 - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
   Waiman Long addresses some bogus failures which are being reported by
   the test_memcontrol selftest.
 
 - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
   hook" from Lorenzo Stoakes commences the deprecation of
   file_operations.mmap() in favor of the new
   file_operations.mmap_prepare().  The latter is more restrictive and
   prevents drivers from messing with things in ways which, amongst other
   problems, may defeat VMA merging.
 
 - The 4 patch series "memcg: decouple memcg and objcg stocks"" from
   Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
   one.  This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.
 
 - The 6 patch series "mm/damon: minor fixups and improvements for code,
   tests, and documents" from SeongJae Park is "yet another batch of
   miscellaneous DAMON changes.  Fix and improve minor problems in code,
   tests and documents."
 
 - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
   Butt converts memcg stats to be irq safe.  Another step along the way to
   making memcg charging and stats updates NMI-safe, a BPF requirement.
 
 - The 4 patch series "Let unmap_hugepage_range() and several related
   functions take folio instead of page" from Fan Ni provides folio
   conversions in the hugetlb code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
 ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
 Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
 =tYYm
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Linus Torvalds
0f70f5b08a automount wart removal
Calling conventions of ->d_automount() made saner (flagday change)
 vfs_submount() is gone - its sole remaining user (trace_automount) had
 been switched to saner primitives.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
 6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
 EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
 =GXLi
 -----END PGP SIGNATURE-----

Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull automount updates from Al Viro:
 "Automount wart removal

  A bunch of odd boilerplate gone from instances - the reason for
  those was the need to protect the yet-to-be-attched mount from
  mark_mounts_for_expiry() deciding to take it out.

  But that's easy to detect and take care of in mark_mounts_for_expiry()
  itself; no need to have every instance simulate mount being busy by
  grabbing an extra reference to it, with finish_automount() undoing
  that once it attaches that mount.

  Should've done it that way from the very beginning... This is a
  flagday change, thankfully there are very few instances.

  vfs_submount() is gone - its sole remaining user (trace_automount)
  had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 15:38:29 -07:00
Jens Axboe
7b2b67dbd4
Revert "Disable FOP_DONTCACHE for now due to bugs"
This reverts commit 478ad02d68.

Both the read and write side dirty && writeback races should be resolved
now, revert the commit that disabled FOP_DONTCACHE for filesystems.

Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/20250527133255.452431-4-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27 21:06:24 +02:00
Linus Torvalds
f83fcb87f8 xfs: New code for 6.16
Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCaDQXTQAKCRBcsMJ8RxYu
 YwUHAYDYYm9oit6AIr0AgTXBMJ+DHyqaszBy0VT2jQUP+yXxyrQc46QExXKU9YQV
 ffmGRAsBgN7ZdDI8D5qWySyOynB3b1Jn3/0jY82GscFK0k0oX3EtxbN9MdrovbgK
 qyO66BVx7w==
 =pG5y
 -----END PGP SIGNATURE-----

Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Carlos Maiolino:

 - Atomic writes for XFS

 - Remove experimental warnings for pNFS, scrub and parent pointers

* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  xfs: add inode to zone caching for data placement
  xfs: free the item in xfs_mru_cache_insert on failure
  xfs: remove the EXPERIMENTAL warning for pNFS
  xfs: remove some EXPERIMENTAL warnings
  xfs: Remove deprecated xfs_bufd sysctl parameters
  xfs: stop using set_blocksize
  xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  xfs: update atomic write limits
  xfs: add xfs_calc_atomic_write_unit_max()
  xfs: add xfs_file_dio_write_atomic()
  xfs: commit CoW-based atomic writes atomically
  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  xfs: add xfs_atomic_write_cow_iomap_begin()
  xfs: refine atomic write size check in xfs_file_write_iter()
  xfs: refactor xfs_reflink_end_cow_extent()
  xfs: allow block allocator to take an alignment hint
  xfs: ignore HW which cannot atomic write a single block
  xfs: add helpers to compute transaction reservation for finishing intent items
  xfs: add helpers to compute log item overhead
  xfs: separate out setting buftarg atomic writes limits
  ...
2025-05-26 12:56:01 -07:00
Linus Torvalds
6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezeing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Linus Torvalds
8dd53535f1 vfs-6.16-rc1.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 oi3BAQD/IBxTbAZIe7vEAsuLlBoKbWrzPGvxzd4UeMGo6OY18wEAvvyJM+arQy51
 jS0ZErDOJnPNe7jps+Gh+WDx6d3NMAY=
 =lqAG
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs freezing updates from Christian Brauner:
 "This contains various filesystem freezing related work for this cycle:

   - Allow the power subsystem to support filesystem freeze for suspend
     and hibernate.

     Now all the pieces are in place to actually allow the power
     subsystem to freeze/thaw filesystems during suspend/resume.
     Filesystems are only frozen and thawed if the power subsystem does
     actually own the freeze.

     If the filesystem is already frozen by the time we've frozen all
     userspace processes we don't care to freeze it again. That's
     userspace's job once the process resumes. We only actually freeze
     filesystems if we absolutely have to and we ignore other failures
     to freeze.

     We could bubble up errors and fail suspend/resume if the error
     isn't EBUSY (aka it's already frozen) but I don't think that this
     is worth it. Filesystem freezing during suspend/resume is
     best-effort. If the user has 500 ext4 filesystems mounted and 4
     fail to freeze for whatever reason then we simply skip them.

     What we have now is already a big improvement and let's see how we
     fare with it before making our lives even harder (and uglier) than
     we have to.

   - Allow efivars to support freeze and thaw

     Allow efivarfs to partake to resync variable state during system
     hibernation and suspend. Add freeze/thaw support.

     This is a pretty straightforward implementation. We simply add
     regular freeze/thaw support for both userspace and the kernel.
     efivars is the first pseudofilesystem that adds support for
     filesystem freezing and thawing.

     The simplicity comes from the fact that we simply always resync
     variable state after efivarfs has been frozen. It doesn't matter
     whether that's because of suspend, userspace initiated freeze or
     hibernation. Efivars is simple enough that it doesn't matter that
     we walk all dentries. There are no directories and there aren't
     insane amounts of entries and both freeze/thaw are already
     heavy-handed operations. If userspace initiated a freeze/thaw cycle
     they would need CAP_SYS_ADMIN in the initial user namespace (as
     that's where efivarfs is mounted) so it can't be triggered by
     random userspace. IOW, we really really don't care"

* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  f2fs: fix freezing filesystem during resize
  kernfs: add warning about implementing freeze/thaw
  efivarfs: support freeze/thaw
  power: freeze filesystems during suspend/resume
  libfs: export find_next_child()
  super: add filesystem freezing helpers for suspend and hibernate
  gfs2: pass through holder from the VFS for freeze/thaw
  super: use common iterator (Part 2)
  super: use a common iterator (Part 1)
  super: skip dying superblocks early
  super: simplify user_get_super()
  super: remove pointless s_root checks
  fs: allow all writers to be frozen
  locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26 09:33:44 -07:00
Linus Torvalds
181d8e399f vfs-6.16-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
 om0+AQDMxKLweJXplqQQ7jxuvW2dEa60YpE2EalEKWGg9YA3KgEA3nI4kyKMKn7Y
 PRFXgIcKvhs62oJLKsq8SGQUqExqvAE=
 =atEw
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual selections of misc updates for this cycle.

  Features:

   - Use folios for symlinks in the page cache

     FUSE already uses folios for its symlinks. Mirror that conversion
     in the generic code and the NFS code. That lets us get rid of a few
     folio->page->folio conversions in this path, and some of the few
     remaining users of read_cache_page() / read_mapping_page()

   - Try and make a few filesystem operations killable on the VFS
     inode->i_mutex level

   - Add sysctl vfs_cache_pressure_denom for bulk file operations

     Some workloads need to preserve more dentries than we currently
     allow through out sysctl interface

     A HDFS servers with 12 HDDs per server, on a HDFS datanode startup
     involves scanning all files and caching their metadata (including
     dentries and inodes) in memory. Each HDD contains approximately 2
     million files, resulting in a total of ~20 million cached dentries
     after initialization

     To minimize dentry reclamation, they set vfs_cache_pressure to 1.
     Despite this configuration, memory pressure conditions can still
     trigger reclamation of up to 50% of cached dentries, reducing the
     cache from 20 million to approximately 10 million entries. During
     the subsequent cache rebuild period, any HDFS datanode restart
     operation incurs substantial latency penalties until full cache
     recovery completes

     To maintain service stability, more dentries need to be preserved
     during memory reclamation. The current minimum reclaim ratio (1/100
     of total dentries) remains too aggressive for such workload. This
     patch introduces vfs_cache_pressure_denom for more granular cache
     pressure control

     The configuration [vfs_cache_pressure=1,
     vfs_cache_pressure_denom=10000] effectively maintains the full 20
     million dentry cache under memory pressure, preventing datanode
     restart performance degradation

   - Avoid some jumps in inode_permission() using likely()/unlikely()

   - Avid a memory access which is most likely a cache miss when
     descending into devcgroup_inode_permission()

   - Add fastpath predicts for stat() and fdput()

   - Anonymous inodes currently don't come with a proper mode causing
     issues in the kernel when we want to add useful VFS debug assert.
     Fix that by giving them a proper mode and masking it off when we
     report it to userspace which relies on them not having any mode

   - Anonymous inodes currently allow to change inode attributes because
     the VFS falls back to simple_setattr() if i_op->setattr isn't
     implemented. This means the ownership and mode for every single
     user of anon_inode_inode can be changed. Block that as it's either
     useless or actively harmful. If specific ownership is needed the
     respective subsystem should allocate anonymous inodes from their
     own private superblock

   - Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock

   - Add proper tests for anonymous inode behavior

   - Make it easy to detect proper anonymous inodes and to ensure that
     we can detect them in codepaths such as readahead()

  Cleanups:

   - Port pidfs to the new anon_inode_{g,s}etattr() helpers

   - Try to remove the uselib() system call

   - Add unlikely branch hint return path for poll

   - Add unlikely branch hint on return path for core_sys_select

   - Don't allow signals to interrupt getdents copying for fuse

   - Provide a size hint to dir_context for during readdir()

   - Use writeback_iter directly in mpage_writepages

   - Update compression and mtime descriptions in initramfs
     documentation

   - Update main netfs API document

   - Remove useless plus one in super_cache_scan()

   - Remove unnecessary NULL-check guards during setns()

   - Add separate separate {get,put}_cgroup_ns no-op cases

  Fixes:

   - Fix typo in root= kernel parameter description

   - Use KERN_INFO for infof()|info_plog()|infofc()

   - Correct comments of fs_validate_description()

   - Mark an unlikely if condition with unlikely() in
     vfs_parse_monolithic_sep()

   - Delete macro fsparam_u32hex()

   - Remove unused and problematic validate_constant_table()

   - Fix potential unsigned integer underflow in fs_name()

   - Make file-nr output the total allocated file handles"

* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
  fs: Pass a folio to page_put_link()
  nfs: Use a folio in nfs_get_link()
  fs: Convert __page_get_link() to use a folio
  fs/read_write: make default_llseek() killable
  fs/open: make do_truncate() killable
  fs/open: make chmod_common() and chown_common() killable
  include/linux/fs.h: add inode_lock_killable()
  readdir: supply dir_context.count as readdir buffer size hint
  vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
  fuse: don't allow signals to interrupt getdents copying
  Documentation: fix typo in root= kernel parameter description
  include/cgroup: separate {get,put}_cgroup_ns no-op case
  kernel/nsproxy: remove unnecessary guards
  fs: use writeback_iter directly in mpage_writepages
  fs: remove useless plus one in super_cache_scan()
  fs: add S_ANON_INODE
  fs: remove uselib() system call
  device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
  fs/fs_parse: Remove unused and problematic validate_constant_table()
  fs: touch up predicts in inode_permission()
  ...
2025-05-26 09:02:39 -07:00
Linus Torvalds
dc76285144 vfs-6.16-rc1.writepage
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTgAKCRCRxhvAZXjc
 ovkTAP9tyN24Oo+koY/2UedYBxM54cW4BCCRsVmkzfr8NSVdwwD/dg+v6gS8+nyD
 3jlR0Z/08UyMHapB7fnAuFxPXXc8oAo=
 =e55o
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull final writepage conversion from Christian Brauner:
 "This converts vboxfs from ->writepage() to ->writepages().

  This was the last user of the ->writepage() method. So remove
  ->writepage() completely and all references to it"

* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Remove aops->writepage
  mm: Remove swap_writepage() and shmem_writepage()
  ttm: Call shmem_writeout() from ttm_backup_backup_page()
  i915: Use writeback_iter()
  shmem: Add shmem_writeout()
  writeback: Remove writeback_use_writepage()
  migrate: Remove call to ->writepage
  vboxsf: Convert to writepages
  9p: Add a migrate_folio method
2025-05-26 08:23:09 -07:00
Linus Torvalds
478ad02d68 Disable FOP_DONTCACHE for now due to bugs
This is kind of last-minute, but Al Viro reported that the new
FOP_DONTCACHE flag causes memory corruption due to use-after-free
issues.

This was triggered by commit 974c5e6139 ("xfs: flag as supporting
FOP_DONTCACHE"), but that is not the underlying bug - it is just the
first user of the flag.

Vlastimil Babka suspects the underlying problem stems from the
folio_end_writeback() logic introduced in commit fb7d3bc414
("mm/filemap: drop streaming/uncached pages when writeback completes").

The most straightforward fix would be to just revert the commit that
exposed this, but Matthew Wilcox points out that other filesystems are
also starting to enable the FOP_DONTCACHE logic, so this instead
disables that bit globally for now.

The fix will hopefully end up being trivial and we can just re-enable
this logic after more testing, but until such a time we'll have to
disable the new FOP_DONTCACHE flag.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20250525083209.GS2023217@ZenIV/
Triggered-by: 974c5e6139 ("xfs: flag as supporting FOP_DONTCACHE")
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-05-25 15:43:36 -07:00
Lorenzo Stoakes
6669d1aaa0 mm: remove WARN_ON_ONCE() in file_has_valid_mmap_hooks()
Having encountered a trinity report in linux-next (Linked in the 'Closes'
tag) it appears that there are legitimate situations where a file-backed
mapping can be acquired but no file->f_op->mmap or
file->f_op->mmap_prepare is set, at which point do_mmap() should simply
error out with -ENODEV.

Since previously we did not warn in this scenario and it appears we rely
upon this, restore this situation, while retaining a WARN_ON_ONCE() for
the case where both are set, which is absolutely incorrect and must be
addressed and thus always requires a warning.

If further work is required to chase down precisely what is causing this,
then we can later restore this, but it makes no sense to hold up this
series to do so, as this is existing and apparently expected behaviour.

Link: https://lkml.kernel.org/r/20250514084024.29148-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b ("mm: introduce new .mmap_prepare() file callback")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505141434.96ce5e5d-lkp@intel.com
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22 14:55:37 -07:00
Max Kellermann
d8c5507cd1
include/linux/fs.h: add inode_lock_killable()
Prepare for making inode operations killable while they're waiting for
the lock.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/20250513150327.1373061-1-max.kellermann@ionos.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 12:03:11 +02:00
Miklos Szeredi
e0410e956b
readdir: supply dir_context.count as readdir buffer size hint
This is a preparation for large readdir buffers in fuse.

Simply setting the fuse buffer size to the userspace buffer size should
work, the record sizes are similar (fuse's is slightly larger than libc's,
so no overflow should ever happen).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jaco Kroon <jaco@uls.co.za>
Link: https://lore.kernel.org/20250513151012.1476536-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:26:05 +02:00
Miklos Szeredi
8d9117009d
fuse: don't allow signals to interrupt getdents copying
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit().  This
second phase is non-blocking as long as the userspace buffer is not paged
out, making it interruptible makes zero sense.

Overload d_type as flags, since it only uses 4 bits from 32.

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/20250513112335.1473177-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-15 11:12:11 +02:00
Lorenzo Stoakes
c84bf6dd2b mm: introduce new .mmap_prepare() file callback
Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2.

During the mmap() of a file-backed mapping, we invoke the underlying
driver file's mmap() callback in order to perform driver/file system
initialisation of the underlying VMA.

This has been a source of issues in the past, including a significant
security concern relating to unwinding of error state discovered by Jann
Horn, as fixed in commit 5de195060b ("mm: resolve faulty mmap_region()
error path behaviour") which performed the recent, significant, rework of
mmap() as a whole.

However, we have had a fly in the ointment remain - drivers have a great
deal of freedom in the .mmap() hook to manipulate VMA state (as well as
page table state).

This can be problematic, as we can no longer reason sensibly about VMA
state once the call is complete (the ability to do - anything - here does
rather interfere with that).

In addition, callers may choose to do odd or unusual things which might
interfere with subsequent steps in the mmap() process, and it may do so
and then raise an error, requiring very careful unwinding of state about
which we can make no assumptions.

Rather than providing such an open-ended interface, this series provides
an alternative, far more restrictive one - we expose a whitelist of fields
which can be adjusted by the driver, along with immutable state upon which
the driver can make such decisions:

struct vm_area_desc {
	/* Immutable state. */
	struct mm_struct *mm;
	unsigned long start;
	unsigned long end;

	/* Mutable fields. Populated with initial state. */
	pgoff_t pgoff;
	struct file *file;
	vm_flags_t vm_flags;
	pgprot_t page_prot;

	/* Write-only fields. */
	const struct vm_operations_struct *vm_ops;
	void *private_data;
};

The mmap logic then updates the state used to either merge with a VMA or
establish a new VMA based upon this logic.

This is achieved via new file hook .mmap_prepare(), which is, importantly,
invoked very early on in the mmap() process.

If an error arises, we can very simply abort the operation with very
little unwinding of state required.

The existing logic contains another, related, peccadillo - since the
.mmap() callback might do anything, it may also cause a previously
unmergeable VMA to become mergeable with adjacent VMAs.

Right now the logic will retry a merge like this only if the driver
changes VMA flags, and changes them in such a way that a merge might
succeed (that is, the flags are not 'special', that is do not contain any
of the flags specified in VM_SPECIAL).

This has also been the source of a great deal of pain - it's hard to
reason about an .mmap() callback that might do - anything - but it's also
hard to reason about setting up a VMA and writing to the maple tree, only
to do it again utilising a great deal of shared state.

Since .mmap_prepare() sets fields before the first merge is even
attempted, the use of this callback obviates the need for this retry merge
logic.

A driver may only specify .mmap_prepare() or the deprecated .mmap()
callback.  In future we may add futher callbacks beyond .mmap_prepare() to
faciliate all use cass as we convert drivers.

In researching this change, I examined every .mmap() callback, and
discovered only a very few that set VMA state in such a way that a.  the
VMA flags changed and b.  this would be mergeable.

In the majority of cases, it turns out that drivers are mapping kernel
memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other
unmergeable VM_SPECIAL flags.

Of those that remain I identified a number of cases which are only
applicable in DAX, setting the VM_HUGEPAGE flag:

* dax_mmap()
* erofs_file_mmap()
* ext4_file_mmap()
* xfs_file_mmap()

For this remerge to not occur and to impact users, each of these cases
would require a user to mmap() files using DAX, in parts, immediately
adjacent to one another.

This is a very unlikely usecase and so it does not appear to be worthwhile
to adjust this functionality accordingly.

We can, however, very quickly do so if needed by simply adding an
.mmap_prepare() callback to these as required.

There are two further non-DAX cases I idenitfied:

* orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with
  VM_SEQ_READ.
* usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.

Both of these cases again seem very unlikely to be mmap()'d immediately
adjacent to one another in a fashion that would result in a merge.

Finally, we are left with a viable case:

* secretmem_mmap() - Set VM_LOCKED, VM_DONTDUMP.

This is viable enough that the mm selftests trigger the logic as a matter
of course.  Therefore, this series replace the .secretmem_mmap() hook with
.secret_mmap_prepare().


This patch (of 3):

Provide a means by which drivers can specify which fields of those
permitted to be changed should be altered to prior to mmap()'ing a range
(which may either result from a merge or from mapping an entirely new
VMA).

Doing so is substantially safer than the existing .mmap() calback which
provides unrestricted access to the part-constructed VMA and permits
drivers and file systems to do 'creative' things which makes it hard to
reason about the state of the VMA after the function returns.

The existing .mmap() callback's freedom has caused a great deal of issues,
especially in error handling, as unwinding the mmap() state has proven to
be non-trivial and caused significant issues in the past, for instance
those addressed in commit 5de195060b ("mm: resolve faulty mmap_region()
error path behaviour").

It also necessitates a second attempt at merge once the .mmap() callback
has completed, which has caused issues in the past, is awkward, adds
overhead and is difficult to reason about.

The .mmap_prepare() callback eliminates this requirement, as we can update
fields prior to even attempting the first merge.  It is safer, as we
heavily restrict what can actually be modified, and being invoked very
early in the mmap() process, error handling can be performed safely with
very little unwinding of state required.

The .mmap_prepare() and deprecated .mmap() callbacks are mutually
exclusive, so we permit only one to be invoked at a time.

Update vma userland test stubs to account for changes.

Link: https://lkml.kernel.org/r/cover.1746792520.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-13 16:28:07 -07:00
Christian Brauner
1af3331764
super: add filesystem freezing helpers for suspend and hibernate
Allow the power subsystem to support filesystem freeze for
suspend and hibernate.

For some kernel subsystems it is paramount that they are guaranteed that
they are the owner of the freeze to avoid any risk of deadlocks. This is
the case for the power subsystem. Enable it to recognize whether it did
actually freeze the filesystem.

If userspace has 10 filesystems and suspend/hibernate manges to freeze 5
and then fails on the 6th for whatever odd reason (current or future)
then power needs to undo the freeze of the first 5 filesystems. It can't
just walk the list again because while it's unlikely that a new
filesystem got added in the meantime it still cannot tell which
filesystems the power subsystem actually managed to get a freeze
reference count on that needs to be dropped during thaw.

There's various ways out of this ugliness. For example, record the
filesystems the power subsystem managed to freeze on a temporary list in
the callbacks and then walk that list backwards during thaw to undo the
freezing or make sure that the power subsystem just actually exclusively
freezes things it can freeze and marking such filesystems as being owned
by power for the duration of the suspend or resume cycle. I opted for
the latter as that seemed the clean thing to do even if it means more
code changes.

If hibernation races with filesystem freezing (e.g. DM reconfiguration),
then hibernation need not freeze a filesystem because it's already
frozen but userspace may thaw the filesystem before hibernation actually
happens.

If the race happens the other way around, DM reconfiguration may
unexpectedly fail with EBUSY.

So allow FREEZE_EXCL to nest with other holders. An exclusive freezer
cannot be undone by any of the other concurrent freezers.

Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 12:41:02 +02:00
John Garry
5d894321c4 fs: add atomic write unit max opt to statx
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.

Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.

The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.

XFS will support a hybrid mode, where HW offload method will be used when
possible, i.e. HW offload is used when the length of the write is
supported, and for other times FS-based atomic writes will be used.

As such, there is an atomic write length at which the user may experience
appreciably slower performance.

Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.

When zero, it means that there is no such performance boundary.

Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
Al Viro
2dbf6e0df4 kill vfs_submount()
The last remaining user of vfs_submount() (tracefs) is easy to convert
to fs_context_for_submount(); do that and bury that thing, along with
SB_SUBMOUNT

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-06 12:49:07 -04:00
Christoph Hellwig
732f25a289 fs: add a write stream field to the kiocb
Prepare for io_uring passthrough of write streams. The write stream
field in the kiocb structure fits into an existing 2-byte hole, so its
size is not changed.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:46:43 -06:00
Christian Brauner
19bbfe7b5f
fs: add S_ANON_INODE
This makes it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead().

Readahead on anonymous inodes didn't work because they didn't have a
proper mode. Now that they have we need to retain EINVAL being returned
otherwise LTP will fail.

We also need to ensure that ioctls aren't simply fired like they are for
regular files so things like inotify inodes continue to correctly call
their own ioctl handlers as in [1].

Reported-by: Xilin Wu <sophon@radxa.com>
Link: https://lore.kernel.org/3A9139D5CD543962+89831381-31b9-4392-87ec-a84a5b3507d8@radxa.com [1]
Link: https://lore.kernel.org/7a1a7076-ff6b-4cb0-94e7-7218a0a44028@sirena.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-21 13:20:14 +02:00
Christian Brauner
b47e42d10e
super: use common iterator (Part 2)
Use a common iterator for all callbacks. We could go for something even
more elaborate (advance step-by-step similar to iov_iter) but I really
don't think this is warranted.

Link: https://lore.kernel.org/r/20250329-work-freeze-v2-5-a47af37ecc3d@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:37:17 +02:00
Christian Brauner
2992476528
super: use a common iterator (Part 1)
Use a common iterator for all callbacks.

Link: https://lore.kernel.org/r/20250329-work-freeze-v2-4-a47af37ecc3d@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:37:17 +02:00
James Bottomley
f73bae8367
fs: allow all writers to be frozen
During freeze/thaw we need to be able to freeze all writers during
suspend/hibernate. Otherwise tasks such as systemd-journald that mmap a
file and write to it will not be frozen after we've already frozen the
filesystem.

This has some risk of not being able to freeze processes in case a
process has acquired SB_FREEZE_PAGEFAULT under mmap_sem or
SB_FREEZE_INTERNAL under some other filesytem specific lock. If the
filesystem is frozen, a task can block on the frozen filesystem with
e.g., mmap_sem held. If some other task then blocks on grabbing that
mmap_sem, hibernation ill fail because it is unable to hibernate a task
holding mmap_sem. This could be fixed by making a range of filesystem
related locks use freezable sleeping. That's impractical and not
warranted just for suspend/hibernate. Assume that this is an infrequent
problem and we've given userspace a way to skip filesystem freezing
through a sysfs file.

Link: https://lore.kernel.org/r/20250402-work-freeze-v2-2-6719a97b52ac@kernel.org
Link: https://lore.kernel.org/r/20250327140613.25178-3-James.Bottomley@HansenPartnership.com
[brauner: make all freeze levels set TASK_FREEZABLE and rewrite commit message]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:37:16 +02:00
Matthew Wilcox (Oracle)
6b0dfabb35
fs: Remove aops->writepage
All callers and implementations are now removed, so remove the operation
and update the documentation to match.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/r/20250402150005.2309458-10-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:36:50 +02:00
Linus Torvalds
5c2a430e85 Ext4 bug fixes and cleanups, including:
* hardening against maliciously fuzzed file systems
   * backwards compatibility for the brief period when we attempted to
      ignore zero-width characters
   * avoid potentially BUG'ing if there is a file system corruption found
     during the file system unmount
   * fix free space reporting by statfs when project quotas are enabled
     and the free space is less than the remaining project quota
 
 Also improve performance when replaying a journal with a very large
 number of revoke records (applicable for Lustre volumes).
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmflfY4ACgkQ8vlZVpUN
 gaMx7Qf/akTELvyBZ7iPCCHh2HwayuO8qLhPNqrU0TmYMFvgwgYUPcQ3BLn8CE+/
 j5UeT8XxNaLU4GJn3z+q6yW6PnNHfqZqKry9j/iPc3s1mjTslntr/xENlgu6i4Bp
 Q58xc7Pj45vdmP+xmYhRnJcefgsZMvB/N1SEHxwIP8bntZqsEvP9pI82r9Ouc8SA
 ZLQ1/K4OADmk7f3GhlPr9AtgH7O0CjlAas30h/AW77DXBQl7ZgbDsGDlgTwaGqkR
 jHcvfr6hLnWy+MUVGmlNZ2HY6iUgBPItWlYCP/fsrUdnc+CONyl5E17JPSl1QQtR
 CLYlo4xV8j1+zJ094DjhDWMKI2G7jw==
 =oudL
 -----END PGP SIGNATURE-----

Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Ext4 bug fixes and cleanups, including:

   - hardening against maliciously fuzzed file systems

   - backwards compatibility for the brief period when we attempted to
     ignore zero-width characters

   - avoid potentially BUG'ing if there is a file system corruption
     found during the file system unmount

   - fix free space reporting by statfs when project quotas are enabled
     and the free space is less than the remaining project quota

  Also improve performance when replaying a journal with a very large
  number of revoke records (applicable for Lustre volumes)"

* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
  ext4: fix OOB read when checking dotdot dir
  ext4: on a remount, only log the ro or r/w state when it has changed
  ext4: correct the error handle in ext4_fallocate()
  ext4: Make sb update interval tunable
  ext4: avoid journaling sb update on error if journal is destroying
  ext4: define ext4_journal_destroy wrapper
  ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
  jbd2: add a missing data flush during file and fs synchronization
  ext4: don't over-report free space or inodes in statvfs
  ext4: clear DISCARD flag if device does not support discard
  jbd2: remove jbd2_journal_unfile_buffer()
  ext4: reorder capability check last
  ext4: update the comment about mb_optimize_scan
  jbd2: fix off-by-one while erasing journal
  ext4: remove references to bh->b_page
  ext4: goto right label 'out_mmap_sem' in ext4_setattr()
  ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
  ext4: introduce ITAIL helper
  jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
  ext4: remove redundant function ext4_has_metadata_csum
  ...
2025-03-27 13:27:08 -07:00
Linus Torvalds
e41170cc5e vfs-6.15-rc1.pagesize
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rxAAKCRCRxhvAZXjc
 ooIPAQCwMjDjtWegvBy8kefiRw+fa4z3ZWHrwRT9DJrD/K9WyAD+JVd0ou27SVpQ
 jKpRSRct2eTbyxdYiGydHQGm5F5sLg4=
 =0FyQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagesize updates from Christian Brauner:
 "This enables block sizes greater than the page size for block devices.

  With this we can start supporting block devices with logical block
  sizes larger than 4k.

  It also allows to lift the device cache sector size support to 64k.
  This allows filesystems which can use larger sector sizes up to 64k to
  ensure that the filesystem will not generate writes that are smaller
  than the specified sector size"

* tag 'vfs-6.15-rc1.pagesize' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
  bdev: use bdev_io_min() for statx block size
  block/bdev: lift block size restrictions to 64k
  block/bdev: enable large folio support for large logical block sizes
  fs/buffer fs/mpage: remove large folio restriction
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  fs/mpage: avoid negative shift for large blocksize
  fs/buffer: remove batching from async read
  fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-03-24 12:01:29 -07:00
Linus Torvalds
130e696aa6 vfs-6.15-rc1.mount.namespace
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90r2wAKCRCRxhvAZXjc
 ouC6AQCk3MoqskN0WeNcaZT23dB7dHbEhf/7YXOFC9MFRMKXqQD9Fbn95+GuIe3U
 nBVPbVyQfDtfXE08ml6gbDJrCsbkkQI=
 =Xm1C
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount namespace updates from Christian Brauner:
 "This expands the ability of anonymous mount namespaces:

   - Creating detached mounts from detached mounts

     Currently, detached mounts can only be created from attached
     mounts. This limitaton prevents various use-cases. For example, the
     ability to mount a subdirectory without ever having to make the
     whole filesystem visible first.

     The current permission modelis:

      (1) Check that the caller is privileged over the owning user
          namespace of it's current mount namespace.

      (2) Check that the caller is located in the mount namespace of the
          mount it wants to create a detached copy of.

     While it is not strictly necessary to do it this way it is
     consistently applied in the new mount api. This model will also be
     used when allowing the creation of detached mount from another
     detached mount.

     The (1) requirement can simply be met by performing the same check
     as for the non-detached case, i.e., verify that the caller is
     privileged over its current mount namespace.

     To meet the (2) requirement it must be possible to infer the origin
     mount namespace that the anonymous mount namespace of the detached
     mount was created from.

     The origin mount namespace of an anonymous mount is the mount
     namespace that the mounts that were copied into the anonymous mount
     namespace originate from.

     In order to check the origin mount namespace of an anonymous mount
     namespace the sequence number of the original mount namespace is
     recorded in the anonymous mount namespace.

     With this in place it is possible to perform an equivalent check
     (2') to (2). The origin mount namespace of the anonymous mount
     namespace must be the same as the caller's mount namespace. To
     establish this the sequence number of the caller's mount namespace
     and the origin sequence number of the anonymous mount namespace are
     compared.

     The caller is always located in a non-anonymous mount namespace
     since anonymous mount namespaces cannot be setns()ed into. The
     caller's mount namespace will thus always have a valid sequence
     number.

     The owning namespace of any mount namespace, anonymous or
     non-anonymous, can never change. A mount attached to a
     non-anonymous mount namespace can never change mount namespace.

     If the sequence number of the non-anonymous mount namespace and the
     origin sequence number of the anonymous mount namespace match, the
     owning namespaces must match as well.

     Hence, the capability check on the owning namespace of the caller's
     mount namespace ensures that the caller has the ability to copy the
     mount tree.

   - Allow mount detached mounts on detached mounts

     Currently, detached mounts can only be mounted onto attached
     mounts. This limitation makes it impossible to assemble a new
     private rootfs and move it into place. Instead, a detached tree
     must be created, attached, then mounted open and then either moved
     or detached again. Lift this restriction.

     In order to allow mounting detached mounts onto other detached
     mounts the same permission model used for creating detached mounts
     from detached mounts can be used (cf. above).

     Allowing to mount detached mounts onto detached mounts leaves three
     cases to consider:

      (1) The source mount is an attached mount and the target mount is
          a detached mount. This would be equivalent to moving a mount
          between different mount namespaces. A caller could move an
          attached mount to a detached mount. The detached mount can now
          be freely attached to any mount namespace. This changes the
          current delegatioh model significantly for no good reason. So
          this will fail.

      (2) Anonymous mount namespaces are always attached fully, i.e., it
          is not possible to only attach a subtree of an anoymous mount
          namespace. This simplifies the implementation and reasoning.

          Consequently, if the anonymous mount namespace of the source
          detached mount and the target detached mount are the identical
          the mount request will fail.

      (3) The source mount's anonymous mount namespace is different from
          the target mount's anonymous mount namespace.

          In this case the source anonymous mount namespace of the
          source mount tree must be freed after its mounts have been
          moved to the target anonymous mount namespace. The source
          anonymous mount namespace must be empty afterwards.

     By allowing to mount detached mounts onto detached mounts a caller
     may do the following:

       fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
       fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)

     fd_tree1 and fd_tree2 refer to two different detached mount trees
     that belong to two different anonymous mount namespace.

     It is important to note that fd_tree1 and fd_tree2 both refer to
     the root of their respective anonymous mount namespaces.

     By allowing to mount detached mounts onto detached mounts the
     caller may now do:

         move_mount(fd_tree1, "", fd_tree2, "",
                    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)

     This will cause the detached mount referred to by fd_tree1 to be
     mounted on top of the detached mount referred to by fd_tree2.

     Thus, the detached mount fd_tree1 is moved from its separate
     anonymous mount namespace into fd_tree2's anonymous mount
     namespace.

     It also means that while fd_tree2 continues to refer to the root of
     its respective anonymous mount namespace fd_tree1 doesn't anymore.

     This has the consequence that only fd_tree2 can be moved to another
     anonymous or non-anonymous mount namespace. Moving fd_tree1 will
     now fail as fd_tree1 doesn't refer to the root of an anoymous mount
     namespace anymore.

     Now fd_tree1 and fd_tree2 refer to separate detached mount trees
     referring to the same anonymous mount namespace.

     This is conceptually fine. The new mount api does allow for this to
     happen already via:

       mount -t tmpfs tmpfs /mnt
       mkdir -p /mnt/A
       mount -t tmpfs tmpfs /mnt/A

       fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE)
       fd_tree4 = open_tree(-EBADF, "/mnt/A", 0)

     Both fd_tree3 and fd_tree4 refer to two different detached mount
     trees but both detached mount trees refer to the same anonymous
     mount namespace. An as with fd_tree1 and fd_tree2, only fd_tree3
     may be moved another mount namespace as fd_tree3 refers to the root
     of the anonymous mount namespace just while fd_tree4 doesn't.

     However, there's an important difference between the
     fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example.

     Closing fd_tree4 and releasing the respective struct file will have
     no further effect on fd_tree3's detached mount tree.

     However, closing fd_tree3 will cause the mount tree and the
     respective anonymous mount namespace to be destroyed causing the
     detached mount tree of fd_tree4 to be invalid for further mounting.

     By allowing to mount detached mounts on detached mounts as in the
     fd_tree1/fd_tree2 example both struct files will affect each other.

     Both fd_tree1 and fd_tree2 refer to struct files that have
     FMODE_NEED_UNMOUNT set.

     To handle this we use the fact that @fd_tree1 will have a parent
     mount once it has been attached to @fd_tree2.

     When dissolve_on_fput() is called the mount that has been passed in
     will refer to the root of the anonymous mount namespace. If it
     doesn't it would mean that mounts are leaked. So before allowing to
     mount detached mounts onto detached mounts this would be a bug.

     Now that detached mounts can be mounted onto detached mounts it
     just means that the mount has been attached to another anonymous
     mount namespace and thus dissolve_on_fput() must not unmount the
     mount tree or free the anonymous mount namespace as the file
     referring to the root of the namespace hasn't been closed yet.

     If it had been closed yet it would be obvious because the mount
     namespace would be NULL, i.e., the @fd_tree1 would have already
     been unmounted. If @fd_tree1 hasn't been unmounted yet and has a
     parent mount it is safe to skip any cleanup as closing @fd_tree2
     will take care of all cleanup operations.

   - Allow mount propagation for detached mount trees

     In commit ee2e3f5062 ("mount: fix mounting of detached mounts
     onto targets that reside on shared mounts") I fixed a bug where
     propagating the source mount tree of an anonymous mount namespace
     into a target mount tree of a non-anonymous mount namespace could
     be used to trigger an integer overflow in the non-anonymous mount
     namespace causing any new mounts to fail.

     The cause of this was that the propagation algorithm was unable to
     recognize mounts from the source mount tree that were already
     propagated into the target mount tree and then reappeared as
     propagation targets when walking the destination propagation mount
     tree.

     When fixing this I disabled mount propagation into anonymous mount
     namespaces. Make it possible for anonymous mount namespace to
     receive mount propagation events correctly. This is now also a
     correctness issue now that we allow mounting detached mount trees
     onto detached mount trees.

     Mark the source anonymous mount namespace with MNTNS_PROPAGATING
     indicating that all mounts belonging to this mount namespace are
     currently in the process of being propagated and make the
     propagation algorithm discard those if they appear as propagation
     targets"

* tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
  selftests: test subdirectory mounting
  selftests: add test for detached mount tree propagation
  fs: namespace: fix uninitialized variable use
  mount: handle mount propagation for detached mount trees
  fs: allow creating detached mounts from fsmount() file descriptors
  selftests: seventh test for mounting detached mounts onto detached mounts
  selftests: sixth test for mounting detached mounts onto detached mounts
  selftests: fifth test for mounting detached mounts onto detached mounts
  selftests: fourth test for mounting detached mounts onto detached mounts
  selftests: third test for mounting detached mounts onto detached mounts
  selftests: second test for mounting detached mounts onto detached mounts
  selftests: first test for mounting detached mounts onto detached mounts
  fs: mount detached mounts onto detached mounts
  fs: support getname_maybe_null() in move_mount()
  selftests: create detached mounts from detached mounts
  fs: create detached mounts from detached mounts
  fs: add may_copy_tree()
  fs: add fastpath for dissolve_on_fput()
  fs: add assert for move_mount()
  fs: add mnt_ns_empty() helper
  ...
2025-03-24 11:41:41 -07:00
Linus Torvalds
26d8e43079 vfs-6.15-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90rNwAKCRCRxhvAZXjc
 onBJAP9Z8Ywmlb5KQ1E3HvDmkwyY6yOSyZ9/CmbzrkCJ8ywYkQD/d9/xt0EP/O/q
 N8YtzXArHWt7u0YbcVpy9WK3F72BdwU=
 =VJgY
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs async dir updates from Christian Brauner:
 "This contains cleanups that fell out of the work from async directory
  handling:

   - Change kern_path_locked() and user_path_locked_at() to never return
     a negative dentry. This simplifies the usability of these helpers
     in various places

   - Drop d_exact_alias() from the remaining place in NFS where it is
     still used. This also allows us to drop the d_exact_alias() helper
     completely

   - Drop an unnecessary call to fh_update() from nfsd_create_locked()

   - Change i_op->mkdir() to return a struct dentry

     Change vfs_mkdir() to return a dentry provided by the filesystems
     which is hashed and positive. This allows us to reduce the number
     of cases where the resulting dentry is not positive to very few
     cases. The code in these places becomes simpler and easier to
     understand.

   - Repack DENTRY_* and LOOKUP_* flags"

* tag 'vfs-6.15-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: fix inline emphasis warning
  VFS: Change vfs_mkdir() to return the dentry.
  nfs: change mkdir inode_operation to return alternate dentry if needed.
  fuse: return correct dentry for ->mkdir
  ceph: return the correct dentry on mkdir
  hostfs: store inode in dentry after mkdir if possible.
  Change inode_operations.mkdir to return struct dentry *
  nfsd: drop fh_update() from S_IFDIR branch of nfsd_create_locked()
  nfs/vfs: discard d_exact_alias()
  VFS: add common error checks to lookup_one_qstr_excl()
  VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  VFS: repack LOOKUP_ bit flags.
  VFS: repack DENTRY_ flags.
2025-03-24 10:47:14 -07:00
Linus Torvalds
99c21beaab vfs-6.15-rc1.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90p4AAKCRCRxhvAZXjc
 ojMIAP9atkG3u7+490+NGWLdulQlaHnD51Owa9MiW87UfKpsTQEArwi/NrJqXJNT
 PFQ2xIa5TxG+9haChR89w3kjZ6b/hgs=
 =iDkx
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Add CONFIG_DEBUG_VFS infrastucture:
      - Catch invalid modes in open
      - Use the new debug macros in inode_set_cached_link()
      - Use debug-only asserts around fd allocation and install

   - Place f_ref to 3rd cache line in struct file to resolve false
     sharing

Cleanups:

   - Start using anon_inode_getfile_fmode() helper in various places

   - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by
     f_pos_lock

   - Add unlikely() to kcmp()

   - Remove legacy ->remount_fs method from ecryptfs after port to the
     new mount api

   - Remove invalidate_inodes() in favour of evict_inodes()

   - Simplify ep_busy_loopER by removing unused argument

   - Avoid mmap sem relocks when coredumping with many missing pages

   - Inline getname()

   - Inline new_inode_pseudo() and de-staticize alloc_inode()

   - Dodge an atomic in putname if ref == 1

   - Consistently deref the files table with rcu_dereference_raw()

   - Dedup handling of struct filename init and refcounts bumps

   - Use wq_has_sleeper() in end_dir_add()

   - Drop the lock trip around I_NEW wake up in evict()

   - Load the ->i_sb pointer once in inode_sb_list_{add,del}

   - Predict not reaching the limit in alloc_empty_file()

   - Tidy up do_sys_openat2() with likely/unlikely

   - Call inode_sb_list_add() outside of inode hash lock

   - Sort out fd allocation vs dup2 race commentary

   - Turn page_offset() into a wrapper around folio_pos()

   - Remove locking in exportfs around ->get_parent() call

   - try_lookup_one_len() does not need any locks in autofs

   - Fix return type of several functions from long to int in open

   - Fix return type of several functions from long to int in ioctls

  Fixes:

   - Fix watch queue accounting mismatch"

* tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  fs: sort out fd allocation vs dup2 race commentary, take 2
  fs: call inode_sb_list_add() outside of inode hash lock
  fs: tidy up do_sys_openat2() with likely/unlikely
  fs: predict not reaching the limit in alloc_empty_file()
  fs: load the ->i_sb pointer once in inode_sb_list_{add,del}
  fs: drop the lock trip around I_NEW wake up in evict()
  fs: use wq_has_sleeper() in end_dir_add()
  VFS/autofs: try_lookup_one_len() does not need any locks
  fs: dedup handling of struct filename init and refcounts bumps
  fs: consistently deref the files table with rcu_dereference_raw()
  exportfs: remove locking around ->get_parent() call.
  fs: use debug-only asserts around fd allocation and install
  fs: dodge an atomic in putname if ref == 1
  vfs: Remove invalidate_inodes()
  ecryptfs: remove NULL remount_fs from super_operations
  watch_queue: fix pipe accounting mismatch
  fs: place f_ref to 3rd cache line in struct file to resolve false sharing
  epoll: simplify ep_busy_loop by removing always 0 argument
  fs: Turn page_offset() into a wrapper around folio_pos()
  kcmp: improve performance adding an unlikely hint to task comparisons
  ...
2025-03-24 09:13:50 -07:00
Linus Torvalds
c4cff1ea37 vfs-6.15-rc1.mount.api
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ90q+gAKCRCRxhvAZXjc
 ol6uAQCGxtqC3w0bXLF1SKsRlDlYvmAJzRVHMGIxhwpNzeyGAQEApK52+WovJy+1
 zozWTCiGXF8N/IhlqVlrMb6GANmjWAQ=
 =CcBh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs mount API updates from Christian Brauner:
 "This converts the remaining pseudo filesystems to the new mount api.

  The sysv conversion is a bit gratuitous because we remove sysv in
  another pull request. But if we have to revert the removal we at least
  will have it converted to the new mount api already"

* tag 'vfs-6.15-rc1.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  sysv: convert sysv to use the new mount api
  vfs: remove some unused old mount api code
  devtmpfs: replace ->mount with ->get_tree in public instance
  vfs: Convert devpts to use the new mount API
  pstore: convert to the new mount API
2025-03-24 08:49:48 -07:00
Mateusz Guzik
611851010c
fs: dedup handling of struct filename init and refcounts bumps
No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250313142744.1323281-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-18 15:34:27 +01:00
Luis Chamberlain
a64e5a5960
bdev: add back PAGE_SIZE block size validation for sb_set_blocksize()
The commit titled "block/bdev: lift block size restrictions to 64k"
lifted the block layer's max supported block size to 64k inside the
helper blk_validate_block_size() now that we support large folios.
However in lifting the block size we also removed the silly use
cases many filesystems have to use sb_set_blocksize() to *verify*
that the block size <= PAGE_SIZE. The call to sb_set_blocksize() was
used to check the block size <= PAGE_SIZE since historically we've
always supported userspace to create for example 64k block size
filesystems even on 4k page size systems, but what we didn't allow
was mounting them. Older filesystems have been using the check with
sb_set_blocksize() for years.

While, we could argue that such checks should be filesystem specific,
there are much more users of sb_set_blocksize() than LBS enabled
filesystem on upstream, so just do the easier thing and bring back
the PAGE_SIZE check for sb_set_blocksize() users and only skip it
for LBS enabled filesystems.

This will ensure that tests such as generic/466 when run in a loop
against say, ext4, won't try to try to actually mount a filesystem with
a block size larger than your filesystem supports given your PAGE_SIZE
and in the worst case crash.

Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250307020403.3068567-1-mcgrof@kernel.org
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-07 12:56:05 +01:00
NeilBrown
c54b386969
VFS: Change vfs_mkdir() to return the dentry.
vfs_mkdir() does not guarantee to leave the child dentry hashed or make
it positive on success, and in many such cases the filesystem had to use
a different dentry which it can now return.

This patch changes vfs_mkdir() to return the dentry provided by the
filesystems which is hashed and positive when provided.  This reduces
the number of cases where the resulting dentry is not positive to a
handful which don't deserve extra efforts.

The only callers of vfs_mkdir() which are interested in the resulting
inode are in-kernel filesystem clients: cachefiles, nfsd, smb/server.
The only filesystems that don't reliably provide the inode are:
- kernfs, tracefs which these clients are unlikely to be interested in
- cifs in some configurations would need to do a lookup to find the
  created inode, but doesn't.  cifs cannot be exported via NFS, is
  unlikely to be used by cachefiles, and smb/server only has a soft
  requirement for the inode, so this is unlikely to be a problem in
  practice.
- hostfs, nfs, cifs may need to do a lookup (rarely for NFS) and it is
  possible for a race to make that lookup fail.  Actual failure
  is unlikely and providing callers handle negative dentries graceful
  they will fail-safe.

So this patch removes the lookup code in nfsd and smb/server and adjusts
them to fail safe if a negative dentry is provided:
- cache-files already fails safe by restarting the task from the
  top - it still does with this change, though it no longer calls
  cachefiles_put_directory() as that will crash if the dentry is
  negative.
- nfsd reports "Server-fault" which it what it used to do if the lookup
  failed. This will never happen on any file-systems that it can actually
  export, so this is of no consequence.  I removed the fh_update()
  call as that is not needed and out-of-place.  A subsequent
  nfsd_create_setattr() call will call fh_update() when needed.
- smb/server only wants the inode to call ksmbd_smb_inherit_owner()
  which updates ->i_uid (without calling notify_change() or similar)
  which can be safely skipping on cifs (I hope).

If a different dentry is returned, the first one is put.  If necessary
the fact that it is new can be determined by comparing pointers.  A new
dentry will certainly have a new pointer (as the old is put after the
new is obtained).
Similarly if an error is returned (via ERR_PTR()) the original dentry is
put.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-7-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05 11:52:50 +01:00
Christian Brauner
f9fde814de
fs: support getname_maybe_null() in move_mount()
Allow move_mount() to work with NULL path arguments.

Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-8-dbcfcb98c676@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04 09:29:53 +01:00
Pan Deng
e249056c91
fs: place f_ref to 3rd cache line in struct file to resolve false sharing
When running syscall pread in a high core count system, f_ref contends
with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the
same cache line.

This change places f_ref to the 3rd cache line where fields are not
updated as frequently as the 1st cache line, and the contention is
grealy reduced according to tests. In addition, the size of file
object is kept in 3 cache lines.

This change has been tested with rocksdb benchmark readwhilewriting case
in 1 socket 64 physical core 128 logical core baremetal machine, with
build config CONFIG_RANDSTRUCT_NONE=y
Command:
./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60
The throughput(ops/s) is improved up to ~21%.
=====
thread		baseline	compare
16		 100%		 +1.3%
32		 100%		 +2.2%
64		 100%		 +7.2%
128		 100%		 +20.9%

It was also tested with UnixBench: syscall, fsbuffer, fstime,
fsdisk cases that has been used for file struct layout tuning, no
regression was observed.

Signed-off-by: Pan Deng <pan.deng@intel.com>
Link: https://lore.kernel.org/r/20250228020059.3023375-1-pan.deng@intel.com
Tested-by: Lipeng Zhu <lipeng.zhu@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-01 11:02:35 +01:00
NeilBrown
88d5baf690
Change inode_operations.mkdir to return struct dentry *
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g.  on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns.  For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.

This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes.  Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.

This lookup-after-mkdir requires that the directory remains locked to
avoid races.  Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.

To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
  NULL - the directory was created and no other dentry was used
  ERR_PTR() - an error occurred
  non-NULL - this other dentry was spliced in

This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations.  Subsequent patches will make
further changes to some file-systems to return a correct dentry.

Not all filesystems reliably result in a positive hashed dentry:

- NFS, cifs, hostfs will sometimes need to perform a lookup of
  the name to get inode information.  Races could result in this
  returning something different. Note that this lookup is
  non-atomic which is what we are trying to avoid.  Placing the
  lookup in filesystem code means it only happens when the filesystem
  has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
  operation ensures that lookup will be called to correctly populate
  the dentry.  This could be fixed but I don't think it is important
  to any of the users of vfs_mkdir() which look at the dentry.

The recommendation to use
    d_drop();d_splice_alias()
is ugly but fits with current practice.  A planned future patch will
change this.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27 20:00:17 +01:00
Jingbo Xu
8510edf191
mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()
iocb->ki_pos has been updated with the number of written bytes since
generic_perform_write().

Besides __filemap_fdatawrite_range() accepts the inclusive end of the
data range.

Fixes: 1d44575765 ("mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250218120209.88093-2-jefflexu@linux.alibaba.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 14:09:47 +01:00
Mateusz Guzik
1479be6258
vfs: inline new_inode_pseudo() and de-staticize alloc_inode()
The former is a no-op wrapper with the same argument.

I left it in place to not lose the information who needs it -- one day
"pseudo" inodes may start differing from what alloc_inode() returns.

In the meantime no point taking a detour.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250212180459.1022983-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:32 +01:00
Mateusz Guzik
1bb772565f
vfs: inline getname()
It is merely a trivial wrapper around getname_flags which adds a zeroed
argument, no point paying for an extra call.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250206000105.432528-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:32 +01:00
Yuichiro Tsuji
f326565c44
ioctl: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu
al behavior. These functions only return int values. This change improves
type consistency across the filesystem code and aligns the function signatu
re with its existing implementation and usage.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com>
Link: https://lore.kernel.org/r/20250121070844.4413-3-yuichtsu@amazon.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:32 +01:00
Yuichiro Tsuji
29d80d506b
open: Fix return type of several functions from long to int
Fix the return type of several functions from long to int to match its actu
al behavior. These functions only return int values. This change improves
type consistency across the filesystem code and aligns the function signatu
re with its existing implementation and usage.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com>
Link: https://lore.kernel.org/r/20250121070844.4413-2-yuichtsu@amazon.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:25:32 +01:00
Mateusz Guzik
3eb7e95104
vfs: use the new debug macros in inode_set_cached_link()
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250209185523.745956-4-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:23:53 +01:00
Mateusz Guzik
8b17e54096
vfs: add initial support for CONFIG_DEBUG_VFS
Small collection of macros taken from mmdebug.h

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250209185523.745956-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 10:23:53 +01:00
Miklos Szeredi
b4c173dfbb
fuse: don't truncate cached, mutated symlink
Fuse allows the value of a symlink to change and this property is exploited
by some filesystems (e.g. CVMFS).

It has been observed, that sometimes after changing the symlink contents,
the value is truncated to the old size.

This is caused by fuse_getattr() racing with fuse_reverse_inval_inode().
fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which
results in fuse_change_attributes() exiting before updating the cached
attributes

This is okay, as the cached attributes remain invalid and the next call to
fuse_change_attributes() will likely update the inode with the correct
values.

The reason this causes problems is that cached symlinks will be
returned through page_get_link(), which truncates the symlink to
inode->i_size.  This is correct for filesystems that don't mutate
symlinks, but in this case it causes bad behavior.

The solution is to just remove this truncation.  This can cause a
regression in a filesystem that relies on supplying a symlink larger than
the file size, but this is unlikely.  If that happens we'd need to make
this behavior conditional.

Reported-by: Laura Promberger <laura.promberger@cern.ch>
Tested-by: Sam Lewis <samclewis@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-20 15:48:17 +01:00
Theodore Ts'o
9e28059d56 ext4: introduce linear search for dentries
This patch addresses an issue where some files in case-insensitive
directories become inaccessible due to changes in how the kernel
function, utf8_casefold(), generates case-folded strings from the
commit 5c26d2f1d3 ("unicode: Don't special case ignorable code
points").

There are good reasons why this change should be made; it's actually
quite stupid that Unicode seems to think that the characters ❤ and ❤️
should be casefolded.  Unfortimately because of the backwards
compatibility issue, this commit was reverted in 231825b2e1.

This problem is addressed by instituting a brute-force linear fallback
if a lookup fails on case-folded directory, which does result in a
performance hit when looking up files affected by the changing how
thekernel treats ignorable Uniode characters, or when attempting to
look up non-existent file names.  So this fallback can be disabled by
setting an encoding flag if in the future, the system administrator or
the manufacturer of a mobile handset or tablet can be sure that there
was no opportunity for a kernel to insert file names with incompatible
encodings.

Fixes: 5c26d2f1d3 ("unicode: Don't special case ignorable code points")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
2025-02-13 15:05:53 -05:00
Mateusz Guzik
37d11cfc63
vfs: sanity check the length passed to inode_set_cached_link()
This costs a strlen() call when instatianating a symlink.

Preferably it would be hidden behind VFS_WARN_ON (or compatible), but
there is no such facility at the moment. With the facility in place the
call can be patched out in production kernels.

In the meantime, since the cost is being paid unconditionally, use the
result to a fixup the bad caller.

This is not expected to persist in the long run (tm).

Sample splat:
bad length passed for symlink [/tmp/syz-imagegen43743633/file0/file0] (got 131109, expected 37)
[rest of WARN blurp goes here]

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250204213207.337980-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:29:59 +01:00
Amir Goldstein
95101401bb
fsnotify: use accessor to set FMODE_NONOTIFY_*
The FMODE_NONOTIFY_* bits are a 2-bits mode.  Open coding manipulation
of those bits is risky.  Use an accessor file_set_fsnotify_mode() to
set the mode.

Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers()
to make way for the simple accessor name.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20250203223205.861346-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07 10:27:26 +01:00
Eric Sandeen
bdfa77e7c6
vfs: remove some unused old mount api code
Remove reconfigure_single, mount_single, and compare_single now
that no users remain.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/20250205213931.74614-5-sandeen@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06 11:51:43 +01:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Jens Axboe
1d44575765 mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
When a buffered write submitted with IOCB_DONTCACHE has been successfully
submitted, call filemap_fdatawrite_range_kick() to kick off the IO.  File
systems call generic_write_sync() for any successful buffered write
submission, hence add the logic here rather than needing to modify the
file system.

Link: https://lkml.kernel.org/r/20241220154831.1086649-12-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:44 -08:00