mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2026-01-27 12:47:24 +01:00
master
14636 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8341374f67 |
for-6.18-rc5-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmkS/s4ACgkQxWXV+ddt WDvm6RAApWCEpZXx6RFCZOY2VoogkSWVRY+KcRzkJGjmgy0Llxcqq6SGDHEyUIs+ fQ2AjjOInCrsEVrO80vdpazYizU0jVCoupLDMYw2FulI225dWq2Vsicq4Yv7Zubq HerTJgrowVP+CMPRIPpXY1W9O7zYz+T6irdamsedphORZ3yhs5XhRhUH2lrEjZSA O3LBlVQMslFWSZ2/u+XvgD2D4RA8kkZmRM8oUN3rPvjfrBgrRnnjvLDfxV3vRM0F CKd1SYMu2jtglmBa+9L8uO9RKLiIdszArcJSN9tYPmrbZOYN5Sa5jfm4D65SEnC1 pTrWydGyJZCXbBXYgvUa/SBgurNPjeo0yh9nspqNflBsqvYvqqQjNq4/BLCPBxkd vShbWSqU/sj86jSkIc6bzeQBg4m6UsSCeyARqsrII6eqQuHqXzeMAnZEozd3Q7Fj Xc7d568GF6oTo0towpYVbAmeZAKyYBcHcVE0xjx5zLW0bonVvtvV7BDp0kS0ibot 3JADPAQcaC1aDrZ0ZY+Hfdru2kcl1Yrg7xcAIc48hHaBGwETxb4RQZV3+1ldsaoA qGzvGCxNhLQzx3MOR1AqrnUIsEFW6ItoS6KLfyRgWMHyLlChjjFu132sVewB/9gN oSqEz8pOxjPqhUB9i+CwOxheJ6V5wxp2hJe0b4NiG7JMdPzAtvQ= =jVku -----END PGP SIGNATURE----- Merge tag 'for-6.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix new inode name tracking in tree-log - fix conventional zone and stripe calculations in zoned mode - fix bio reference counts on error paths in relocation and scrub * tag 'for-6.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: release root after error in data_reloc_print_warning_inode() btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe() btrfs: do not update last_log_commit when logging inode due to a new name btrfs: zoned: fix stripe width calculation btrfs: zoned: fix conventional zone capacity calculation |
||
|
|
c367af440e |
btrfs: release root after error in data_reloc_print_warning_inode()
data_reloc_print_warning_inode() calls btrfs_get_fs_root() to obtain
local_root, but fails to release its reference when paths_from_inode()
returns an error. This causes a potential memory leak.
Add a missing btrfs_put_root() call in the error path to properly
decrease the reference count of local_root.
Fixes:
|
||
|
|
5fea61aa1c |
btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe()
scrub_raid56_parity_stripe() allocates a bio with bio_alloc(), but
fails to release it on some error paths, leading to a potential
memory leak.
Add the missing bio_put() calls to properly drop the bio reference
in those error cases.
Fixes:
|
||
|
|
bfe3d755ef |
btrfs: do not update last_log_commit when logging inode due to a new name
When logging that a new name exists, we skip updating the inode's
last_log_commit field to prevent a later explicit fsync against the inode
from doing nothing (as updating last_log_commit makes btrfs_inode_in_log()
return true). We are detecting, at btrfs_log_inode(), that logging a new
name is happening by checking the logging mode is not LOG_INODE_EXISTS,
but that is not enough because we may log parent directories when logging
a new name of a file in LOG_INODE_ALL mode - we need to check that the
logging_new_name field of the log context too.
An example scenario where this results in an explicit fsync against a
directory not persisting changes to the directory is the following:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/foo
$ sync
$ mkdir /mnt/dir
# Write some data to our file and fsync it.
$ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/foo
# Add a new link to our file. Since the file was logged before, we
# update it in the log tree by calling btrfs_log_new_name().
$ ln /mnt/foo /mnt/dir/bar
# fsync the root directory - we expect it to persist the dentry for
# the new directory "dir".
$ xfs_io -c "fsync" /mnt
<power fail>
After mounting the fs the entry for directory "dir" does not exists,
despite the explicit fsync on the root directory.
Here's why this happens:
1) When we fsync the file we log the inode, so that it's present in the
log tree;
2) When adding the new link we enter btrfs_log_new_name(), and since the
inode is in the log tree we proceed to updating the inode in the log
tree;
3) We first set the inode's last_unlink_trans to the current transaction
(early in btrfs_log_new_name());
4) We then eventually enter btrfs_log_inode_parent(), and after logging
the file's inode, we call btrfs_log_all_parents() because the inode's
last_unlink_trans matches the current transaction's ID (updated in the
previous step);
5) So btrfs_log_all_parents() logs the root directory by calling
btrfs_log_inode() for the root's inode with a log mode of LOG_INODE_ALL
so that new dentries are logged;
6) At btrfs_log_inode(), because the log mode is LOG_INODE_ALL, we
update root inode's last_log_commit to the last transaction that
changed the inode (->last_sub_trans field of the inode), which
corresponds to the current transaction's ID;
7) Then later when user space explicitly calls fsync against the root
directory, we enter btrfs_sync_file(), which calls skip_inode_logging()
and that returns true, since its call to btrfs_inode_in_log() returns
true and there are no ordered extents (it's a directory, never has
ordered extents). This results in btrfs_sync_file() returning without
syncing the log or committing the current transaction, so all the
updates we did when logging the new name, including logging the root
directory, are not persisted.
So fix this by but updating the inode's last_log_commit if we are sure
we are not logging a new name (if ctx->logging_new_name is false).
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/03c5d7ec-5b3d-49d1-95bc-8970a7f82d87@gmail.com/
Fixes:
|
||
|
|
6a1ab50135 |
btrfs: zoned: fix stripe width calculation
The stripe offset calculation in the zoned code for raid0 and raid10
wrongly uses map->stripe_size to calculate it. In fact, map->stripe_size is
the size of the device extent composing the block group, which always is
the zone_size on the zoned setup.
Fix it by using BTRFS_STRIPE_LEN and BTRFS_STRIPE_LEN_SHIFT. Also, optimize
the calculation a bit by doing the common calculation only once.
Fixes:
|
||
|
|
94f54924b9 |
btrfs: zoned: fix conventional zone capacity calculation
When a block group contains both conventional zone and sequential zone, the
capacity of the block group is wrongly set to the block group's full
length. The capacity should be calculated in btrfs_load_block_group_* using
the last allocation offset.
Fixes:
|
||
|
|
c9cfc122f0 |
for-6.18-rc4-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmkJgRkACgkQxWXV+ddt WDslFg/8C0uoyWYhC7sA/xYS9rKnAbNWekrsDB0BCAEyDBv+DAlbHAxn5rODY4If EMnzDkkq8t1lU21K2BSXB3rco0oO1yV6b8ldJy/q/GzES08cEURfZ02ZX6ZQngNa arrz1AqXglyqFuoR5unB486aLxDJPisYklQU0ND6PiFZVSnvpZddyWFmvu0e6i6c YJBCKzKEDlMF3mAHU3w9f1mzkbkUJcZR4jMEYkuNGpRva7tuF+nB1diJpTpm0mvk O0UWHtexAf3fFVOoWX0f6COGliJdhXw657A+TXSCFw292EdraGLieXEOzxCzBhKX FByafgDFkrgdVgglsTaSx65Zey+Wn7jDJJGgXDb45/vp8YJEaMboNkmV2/+S3XVo wl8XowAPNB1dJVmTAa3NtQ6GQfam4vM9DkOvCkzRumvjRMIGwtBQASYj1tMwDgGf YbZrHXrO/dGN+9sDrPMLLceu7qcFkCngQSDRxiRAKxeHy20fSd/hgpLkBCeHtI0E uz4wr5onTgMjjDTxijxO9zgYz5SPEJ9suq5GHvm7nCqMJ4TtW4SXMSpcWbfLbHY0 Yk6Dy6/h9nWt4OlU8U6BDFDY0Gk8OQjUeOXyJPtybrSUjxie07iGR1wH2+dddTnA xZjS4fudYUYEQxpAg16Jq1BKb8alkS96WcqDLb12Gtp0Ckah66c= =/xXr -----END PGP SIGNATURE----- Merge tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix memory leak in qgroup relation ioctl when qgroup levels are invalid - don't write back dirty metadata on filesystem with errors - properly log renamed links - properly mark prealloc extent range beyond inode size as dirty (when no-noles is not enabled) * tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: mark dirty extent range for out of bound prealloc extents btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation btrfs: ensure no dirty metadata is written back for an fs with errors |
||
|
|
3b1a4a59a2 |
btrfs: mark dirty extent range for out of bound prealloc extents
In btrfs_fallocate(), when the allocated range overlaps with a prealloc extent and the extent starts after i_size, the range doesn't get marked dirty in file_extent_tree. This results in persisting an incorrect disk_i_size for the inode when not using the no-holes feature. This is reproducible since commit |
||
|
|
953902e4fb |
btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree. We are currently doing it when adding a new link but we are missing it when renaming. An example where this makes a new name not persisted: 1) create symlink with name foo in directory A 2) fsync directory A, which persists the symlink 3) rename the symlink from foo to bar 4) fsync directory A to persist the new symlink name Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo". This is because when we first fsync directoy A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directory to the 'log_extents' label. Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f260c6aff0 |
btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
When btrfs_add_qgroup_relation() is called with invalid qgroup levels
(src >= dst), the function returns -EINVAL directly without freeing the
preallocated qgroup_list structure passed by the caller. This causes a
memory leak because the caller unconditionally sets the pointer to NULL
after the call, preventing any cleanup.
The issue occurs because the level validation check happens before the
mutex is acquired and before any error handling path that would free
the prealloc pointer. On this early return, the cleanup code at the
'out' label (which includes kfree(prealloc)) is never reached.
In btrfs_ioctl_qgroup_assign(), the code pattern is:
prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
prealloc = NULL; // Always set to NULL regardless of return value
...
kfree(prealloc); // This becomes kfree(NULL), does nothing
When the level check fails, 'prealloc' is never freed by either the
callee or the caller, resulting in a 64-byte memory leak per failed
operation. This can be triggered repeatedly by an unprivileged user
with access to a writable btrfs mount, potentially exhausting kernel
memory.
Fix this by freeing prealloc before the early return, ensuring prealloc
is always freed on all error paths.
Fixes:
|
||
|
|
2618849f31 |
btrfs: ensure no dirty metadata is written back for an fs with errors
[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers(). It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free. [CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state. But there are some metadata modifications before that error, and they are still in the btree inode page cache. Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty. And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata. And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free. [FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them. Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues. The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
942048d46a |
for-6.18-rc2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmj50JgACgkQxWXV+ddt WDvXYg//SPHH0s7DEnPgMEz/RE9wQXVtvJhzzICDzTk2hawc05K9P1p/2mB2Etio CFya4Cuzs9sfw5HW5BYZcyi1bzMkAOMA5lPI+sHOqGdKhomYXj/5TlxMVNuh83uP Bk/Ycw4k5pIQRtIlen6Km9ViM9Zi7xFqn41c5c8V7GtYXrcgB5ob2LVL5UIEw/y1 RpaXlNZ+/DVRiH+UcvdqvnsdasjzVeugqb7aGiEhSIhm851D6UvBb+X5lqxWwcxY GEBUqU8+FnkuTnru1V3Y/WlHE7NnXyDPYJaLkta3Wru9f2faOEdT4rvnIe4uQHPc NNZ4B/fxRjqzFWCZ+T9s/Rm7jOxIpgPj1UZPqDMiq87bqy5AWxH3PIq+rmsu8nVF rVB7qNdP3iW0CtA7U68D0Y4xYOQIEsVHfx7LY+Q+P/4epTFOZBix51vPoCF73CCY uoQ6gMbeUp109CzQAYa8rGPv+nn7cFLDkl4wSWWg2LssAqEKQLkHK0goHjLCu6mp qHW1gw97qonTtmoUURIzJUYTuMWOlvC90r/e4C46ZTOhM9ndfuZJc28thRCJPsvD P+suVPMDuBBwdcGIOCy+uS9vOnm+KfZp10zQ0uAnYU35foUnE8V/4uaxHlzE0jmb wPFIu07Rb/O3hx4cvSSckevrjBbK8M0TCG5JuhQIEYqXiU1utP0= =YN21 -----END PGP SIGNATURE----- Merge tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in send, fix duplicated rmdir operations when using extrefs (hardlinks), receive can fail with ENOENT - fixup of error check when reading extent root in ref-verify and damaged roots are allowed by mount option (found by smatch) - fix freeing partially initialized fs info (found by syzkaller) - fix use-after-free when printing ref_tracking status of delayed inodes * tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree() btrfs: fix delayed_node ref_tracker use after free btrfs: send: fix duplicated rmdir operations when using extrefs btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots() |
||
|
|
ada7d45b56 |
btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()
btrfs_extent_root()/btrfs_global_root() does not return error pointers,
it returns NULL on error.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/all/aNJfvxj0anEnk9Dm@stanley.mountain/
Fixes :
|
||
|
|
0fd7e7a1ad |
btrfs: fix delayed_node ref_tracker use after free
Move the print before releasing the delayed node.
In my initial testing there was a bug that was causing delayed_nodes
to not get freed which is why I put the print after the release. This
obviously neglects the case where the delayed node is properly freed.
Add condition to make sure we only print if we have more than one
reference to the delayed_node to prevent printing when we only have
the reference taken in btrfs_kill_all_delayed_nodes().
Fixes:
|
||
|
|
1fabe43b4e |
btrfs: send: fix duplicated rmdir operations when using extrefs
Commit
|
||
|
|
17679ac6df |
btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()
If fs_info->super_copy or fs_info->super_for_commit allocated failed in
btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info().
Otherwise btrfs_check_leaked_roots() would access NULL pointer because
fs_info->allocated_roots had not been initialised.
syzkaller reported the following information:
------------[ cut here ]------------
BUG: unable to handle page fault for address: fffffffffffffbb0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline]
RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline]
RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline]
RIP: 0010:refcount_read include/linux/refcount.h:170 [inline]
RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230
[...]
Call Trace:
<TASK>
btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280
btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029
btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097
vfs_get_tree+0x98/0x320 fs/super.c:1759
do_new_mount+0x357/0x660 fs/namespace.c:3899
path_mount+0x716/0x19c0 fs/namespace.c:4226
do_mount fs/namespace.c:4239 [inline]
__do_sys_mount fs/namespace.c:4450 [inline]
__se_sys_mount fs/namespace.c:4427 [inline]
__x64_sys_mount+0x28c/0x310 fs/namespace.c:4427
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f032eaffa8d
[...]
Fixes:
|
||
|
|
9f388a653c |
for-6.18-rc1-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjxIRUACgkQxWXV+ddt WDuzyQ//X7a7IlmhrFYVLUbxhcjD+6gWddNl+zu2zTe0YUEh7rVtzq1W8rr2ku/3 HD7fa18LoKJOQOitamPZqPkHarorlhEfnx7ukl4dgsvLE8jvWvfYnJq+jnfkXtfO d5mT0JKiTuD2MUAPpPjphLjr4Ok7pdeZ85k01D0sYPJ4gFLOJ37ORE7uSodVjPph /ry3vEQOtKOEimcyYPaLhs4MiAsg2Y5YgVk+2GjGihWp/nECFPUvfjolxguowNzN dw0zCyTzE5qu6rE/2zEXWYNHmgfNbzeqIKFwUCFJY2Mfanw8YR26z8xCr18bRmfz kwebmv7BWp0KSSpq06tSArRQZdPFFYKXiltsmA0CkvLR3CZx995Un3AQ7em6EnUz ln8nd+0liaK1ZT2Lrr8aRv0+dTTXBzTXRiJtqM+3eNMLgDTwEnhrAa2GOkfGNt/W M/iDGeMoQhbX4NoEH9Ac6OQuMztGa07w0u6pJ/sXjg3m17piy5bU1i+lvyQWebhw 8R4fCpmUB45xgtdN7tT+G9pveU3/Ugy46Re0UZwb3s4whDkrzprgzCjyw5WGgxeb JLkx9+LctdMZM1Rcf6cbn4waERwM3iOUXtUPFljFXm/tw6i+CB4uR5GKBfLVes7u 5hPDzOJgClUnWt2yleP+4rDNrPsmnO7qFnXcF+SJ1SuIQke59g0= =D6mR -----END PGP SIGNATURE----- Merge tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in tree-checker fix extref bounds check - reorder send context structure to avoid -Wflex-array-member-not-at-end warning - fix extent readahead length for compressed extents - fix memory leaks on error paths (qgroup assign ioctl, zone loading with raid stripe tree enabled) - fix how device specific mount options are applied, in particular the 'ssd' option will be set unexpectedly - fix tracking of relocation state when tasks are running and cancellation is attempted - adjust assertion condition for folios allocated for scrub - remove incorrect assertion checking for block group when populating free space tree * tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx btrfs: tree-checker: fix bounds check in check_inode_extref() btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST btrfs: fix incorrect readahead expansion length btrfs: do not assert we found block group item when creating free space tree btrfs: do not use folio_test_partial_kmap() in ASSERT()s btrfs: only set the device specific options after devices are opened btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running |
||
|
|
8aec9dbf2d |
btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx
The warning -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Fix the following warning: fs/btrfs/send.c:181:24: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] and move the declaration of send_ctx::cur_inode_path to the end. Notice that struct fs_path contains a flexible array member inline_buf, but also a padding array and a limit calculated for the usable space of inline_buf (FS_PATH_INLINE_SIZE). It is not the pattern where flexible array is in the middle of a structure and could potentially overwrite other members. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e92c294120 |
btrfs: tree-checker: fix bounds check in check_inode_extref()
The parentheses for the unlikely() annotation were put in the wrong
place so it means that the condition is basically never true and the
bounds checking is skipped.
Fixes:
|
||
|
|
fec9b9d3ce |
btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST
At the end of btrfs_load_block_group_zone_info() the first thing we do
is to ensure that if the mapping type is not a SINGLE one and there is
no RAID stripe tree, then we return early with an error.
Doing that, though, prevents the code from running the last calls from
this function which are about freeing memory allocated during its
run. Hence, in this case, instead of returning early, we set the ret
value and fall through the rest of the cleanup code.
Fixes:
|
||
|
|
8ab2fa6969 |
btrfs: fix incorrect readahead expansion length
The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.
Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:
[ C ]
[ N1 ][ C1 ][ N2 ][ C2 ]
In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable, we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.
Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.
Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes:
|
||
|
|
a5a51bf4e9 |
btrfs: do not assert we found block group item when creating free space tree
Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.
The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).
So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.
Syzbot reported this with the following stack trace:
BTRFS info (device loop3 state M): rebuilding free space tree
assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
------------[ cut here ]------------
kernel BUG at fs/btrfs/free-space-tree.c:1115!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
Code: ff ff e8 d3 (...)
RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
FS: 00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
Call Trace:
<TASK>
btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
reconfigure_super+0x227/0x890 fs/super.c:1076
do_remount fs/namespace.c:3279 [inline]
path_mount+0xd1a/0xfe0 fs/namespace.c:4027
do_mount fs/namespace.c:4048 [inline]
__do_sys_mount fs/namespace.c:4236 [inline]
__se_sys_mount+0x313/0x410 fs/namespace.c:4213
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f424e39066a
Code: d8 64 89 02 (...)
RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/
Fixes:
|
||
|
|
42d3a055d9 |
btrfs: do not use folio_test_partial_kmap() in ASSERT()s
[BUG]
Syzbot reported an ASSERT() triggered inside scrub:
BTRFS info (device loop0): scrub: started on devid 1
assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
------------[ cut here ]------------
kernel BUG at fs/btrfs/scrub.c:697!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
Call Trace:
<TASK>
scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:597 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
---[ end trace 0000000000000000 ]---
Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.
[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.
And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.
But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.
[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those highmem specific debuging options.
Fixes:
|
||
|
|
b7fdfd29a1 |
btrfs: only set the device specific options after devices are opened
[BUG] With v6.17-rc kernels, btrfs will always set 'ssd' mount option even if the block device is not a rotating one: # cat /sys/block/sdd/queue/rotational 1 # cat /etc/fstab: LABEL=DATA2 /data2 btrfs rw,relatime,space_cache=v2,subvolid=5,subvol=/,nofail,nosuid,nodev 0 0 # mount [...] /dev/sdd on /data2 type btrfs (rw,nosuid,nodev,relatime,ssd,space_cache=v2,subvolid=5,subvol=/) [CAUSE] The 'ssd' mount option is set by set_device_specific_options(), and it expects that if there is any rotating device in the btrfs, it will set fs_devices::rotating. However after commit |
||
|
|
53a4acbfc1 |
btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl
On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as
provided by the user, which is kfree'd in the end. But this was not the
case when allocating memory for 'prealloc'. In this case, if it somehow
failed, then the previous code would go directly into calling
'mnt_drop_write_file', without freeing the string duplicated from the
user space.
Fixes:
|
||
|
|
7e5a5983ed |
btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running
When starting relocation, at reloc_chunk_start(), if we happen to find
the flag BTRFS_FS_RELOC_RUNNING is already set we return an error
(-EINPROGRESS) to the callers, however the callers call reloc_chunk_end()
which will clear the flag BTRFS_FS_RELOC_RUNNING, which is wrong since
relocation was started by another task and still running.
Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected
scenario, but still our current behaviour is not correct.
Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has
returned an error, which is what logically makes sense, since the general
widespread pattern is to have end functions called only if the counterpart
start functions succeeded. This requires changing reloc_chunk_start() to
clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request.
Fixes:
|
||
|
|
c746c3b516 |
for-6.18-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjj18wACgkQxWXV+ddt WDvotw//Z50Clttkc87HLmrRbLC/A2xm0YRtaDA6HvaQGMtI17+QegdNZB+yiC7K wSXeh/Je+kdb97YGW2NjBtY5bwMaIpSaAxxvumPd1xsKwrgGRx/BbwKAeOMn4KWq 3lTynA0vKkL7F3GfsYY0W8IPvUsw4zb0sigSNK6/eDAGCRf6m3jWZCC9EBvaVb84 VumPKsGK/KNCRo2xvXEqcqCA4KVrE3JUa/jMyddkTjIPm7tqD2l0J8P0v99AG5+s OaqJU3pUEukpzw41va1JF3eaOVQ8PfFhwGAHvVBKoGy3ownRXikwf05inq93+EBX cbcb4j2mxjv+H6Zy415GBg6kK9A/y9KTI1us/eQgOZtKLXwrySLLocA7aYw/AdxS kGTgwBgUc3BPbpa/mLj6HOemRl+fHNC6EgdsX1/PN+vx9cend7kOJsw1YCqe5c7E 5s8YHFbBca53HzeUS78yqikjG9B3KRY+rKyO8wO0dwozTkDD9Dy3SFO3dtMNaTdm ipxqOjMNYgUDUauKQzDiCdZiQ1yHb3NjTANP6sP35PlBdmjXHqp9TsgsxSNezv6V GBsgcSyZtawyaZlUvg0c4MLYLx3kOsY97ywn7n/dbcZC/EC07tauwD6XnlbARqJu nYQ/drZIImvuDekrIBLxvSpJOinsZ60jYlkRhREbEXT47jUE9zU= =SVmB -----END PGP SIGNATURE----- Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two short fixes that would be good to have before rc1" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix PAGE_SIZE format specifier in open_ctree() btrfs: avoid potential out-of-bounds in btrfs_encode_fh() |
||
|
|
ee2fe81cdc |
It has been a relatively busy cycle in docsland, with changes all over:
- Bring the kernel memory-model docs into the Sphinx build in the "literal
include" mode.
- Lots of build-infrastructure work, further cleaning up long-term
kernel-doc technical debt. The sphinx-pre-install tool has been
converted to Python and updated for current systems.
- A new tool to detect when documents have been moved and generate HTML
redirects; this can be used on kernel.org (or any other site hosting the
rendered docs) to avoid breaking links.
- Automated processing of the YAML files describing the netlink protocol.
- A significant update of the maintainer's PGP guide.
...and a seemingly endless series of typo fixes, build-problem fixes, etc.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmjbwOoACgkQF0NaE2wM
flis1gf/ZvRi3Mo5hIsuGyQfs5kw/jx0N7SG4QYf2rEnt5ZGNa5SkyOVKsWQKTgK
LesQkdaCA0xHMoUWSvZRwn2a0+acpeMm6viXjewd2mU52sSNmSkKG4WsZyxfnOYS
36fkZ1qymQkJ4uSvx5NScTiIBqZx+Qfgkj0eNnXcpJd2vYeAVSu4szsFxeUvcJFj
Ckq3+3DQ5p/dcWwgvdlLKEJGj98Q3cqLrMn8ycbNtwzo3mdVbrlP5+qqNslZC6xY
Nqpv9HXbFWNCaL6YWCybcNOZ4F5UVno1ap2F3imTD8Rp1iG77zAQV5lMlq4Gksf4
kECLc1TtTKSgmgWHmi1sgudqM4Xqpw==
=Qe3Z
-----END PGP SIGNATURE-----
Merge tag 'docs-6.18' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"It has been a relatively busy cycle in docsland, with changes all
over:
- Bring the kernel memory-model docs into the Sphinx build in the
"literal include" mode.
- Lots of build-infrastructure work, further cleaning up long-term
kernel-doc technical debt. The sphinx-pre-install tool has been
converted to Python and updated for current systems.
- A new tool to detect when documents have been moved and generate
HTML redirects; this can be used on kernel.org (or any other site
hosting the rendered docs) to avoid breaking links.
- Automated processing of the YAML files describing the netlink
protocol.
- A significant update of the maintainer's PGP guide.
... and a seemingly endless series of typo fixes, build-problem fixes,
etc"
* tag 'docs-6.18' of git://git.lwn.net/linux: (193 commits)
Documentation/features: Update feature lists for 6.17-rc7
docs: remove cdomain.py
Documentation/process: submitting-patches: fix typo in "were do"
docs: dev-tools/lkmm: Fix typo of missing file extension
Documentation: trace: histogram: Convert ftrace docs cross-reference
Documentation: trace: histogram-design: Wrap introductory note in note:: directive
Documentation: trace: historgram-design: Separate sched_waking histogram section heading and the following diagram
Documentation: trace: histogram-design: Trim trailing vertices in diagram explanation text
Documentation: trace: histogram: Fix histogram trigger subsection number order
docs: driver-api: fix spelling of "buses".
Documentation: fbcon: Use admonition directives
Documentation: fbcon: Reindent 8th step of attach/detach/unload
Documentation: fbcon: Add boot options and attach/detach/unload section headings
docs: filesystems: sysfs: add remaining top level sysfs directory descriptions
docs: filesystems: sysfs: clarify symlink destinations in dev and bus/devices descriptions
docs: filesystems: sysfs: remove top level sysfs net directory
docs: maintainer: Fix ambiguous subheading formatting
docs: kdoc: a few more dump_typedef() tweaks
docs: kdoc: remove redundant comment stripping in dump_typedef()
docs: kdoc: remove some dead code in dump_typedef()
...
|
||
|
|
8804d970fa |
Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation. - The 4 patch series "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs. - The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters. - The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps. - The 2 patch series "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code. - The 11 patch series "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code. - The 5 patch series "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero. - The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature. - The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs. - The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code. - The 7 patch series "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code. - The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations. - The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc. - The 3 patch series "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path. - The 5 patch series "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code. - The 2 patch series "test that rmap behaves as expected" from Wei Yang adds some rmap selftests. - The 3 patch series "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers. - The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues. - The 3 patch series "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks. - The 2 patch series "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code. - The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem. - The 4 patch series "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/. - The 2 patch series "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c. - The 2 patch series "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation. - The 3 patch series "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc). - The 2 patch series "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code. - The 37 patch series "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function. - The 2 patch series "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only. - The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code. - The 12 patch series "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy. - The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages(). - The 3 patch series "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver. - The 2 patch series "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code. - The 14 patch series "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations. - The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little. - The 3 patch series "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code. - The 2 patch series "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature. - The 3 patch series "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work. - The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem. - The 2 patch series "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code. - The 10 patch series "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources. - The 5 patch series "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON. - The 7 patch series "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix. - The 2 patch series "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information. - The 2 patch series "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma. - The 2 patch series "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems. - The 6 patch series "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate. - The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters. - The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1 2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA= =4mhY -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ... |
||
|
|
5832d26433 |
for-6.18/io_uring-20250929
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLEcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnEUD/4/FgfQP2LFS/88BBF5ukZjRySe4wmyyZ2Q
MFh2ehdxzkZxVXjbeA2wRAXdqjw2MbNhx8tzU9VrW7rweNDZxHbwi6jJIP7OAjxE
4ZP0goAQj7P0TFyXC2KGj7k6dP20FkAltx5gGLVwsuOWDDrQKp2EykAcRnGYAD4W
3yf+nojVr2bjHyO7dx8dM7jUDjMg7J8nmHD6zgHOlHRLblWwfzw907bhz+eBX/FI
9kYvtX2c9MgY4Isa+43rZd5qvj9S3Cs8PD6tFPbq+n+3l7yWgMBTu/y+SNI8hupT
W7CqjPcpvppFHhPkcXDA3yARnW7ccEx5aiQuvUCmRUioHtGwXvC63HMp8OjcQspV
NNoIHYFsi1alzYq2kJLxY1IleWZ8j0hUkSSU8u7al8VIvtD43LGkv51xavxQUFjg
BO9mLyS51H2agffySs4vhHJE82lZizvmh/RJfSJ0ezALzE2k42MrximX1D1rBJE6
KPOhCiPt/jqpQMyqDYnY10FgTXQVwgPIVH1JLpo611tPFHlGW8Y4YxxR1Xduh5JX
jbGLEjVREsDZ7EHrimLNLmJRAQpyQujv/yhf7k96gWBelVwVuISQLI4Ca5IeVQyk
9yifgLXNGddgAwj0POMFeKXSm2We9nrrPDYLCKrsBMSN96/3SLveJC7fkW88aUZr
ye4/K8Y3vA==
=uc/3
-----END PGP SIGNATURE-----
Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good ideas as that struct has a vastly different life time.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage, but also memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user to specify
the rx buffer length on setup.
- Syscall / synchronous bufer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
mutli-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
|
||
|
|
4335c4496b |
btrfs: fix PAGE_SIZE format specifier in open_ctree()
There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:
In file included from fs/btrfs/compression.h:17,
from fs/btrfs/extent_io.h:15,
from fs/btrfs/locking.h:13,
from fs/btrfs/ctree.h:19,
from fs/btrfs/disk-io.c:22:
fs/btrfs/disk-io.c: In function 'open_ctree':
include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
...
fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
3398 | btrfs_warn(fs_info,
| ^~~~~~~~~~
PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.
Fixes:
|
||
|
|
f3827213ab |
for-6.18-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjTk4MACgkQxWXV+ddt
WDvOHA//ajYvH7DIoFgQ09Q+UCfdawhWs/b4aW2ePpNK061tF6hvGgmGVe/Ugy8W
297kSBVxpnaLfedHkm3m91SAft6VKSfdV3oV2DNn9sxUXQoa9hC6n9qIaqeOpfd8
Nk+OvgSWpqonAHHMbsNev4C+vKZO534VRg09eFfIV7ATpQO7wxc1DKXFT5hgYP3m
nosRc0f/4gx0EGHjiXyfuG5una1A/vry4+EP7jrvzvKHY9VzYMLRXH+glNUi5X5E
GOwFXd6ADUpKDKN9Ove/Bm4DSz9jrTNu81qm/1i1mTpxS80sxBFIrD4KOil+hQDX
B82n01KS8yJkBYH32Qnpg+9Cij/ZR/0OOg88wBLGeQiDoDw7J8D9mJe1/RHWHHTC
rQ1C50CDlVGIPpnB1BftbvvdYlAPKgpnnznaaKg9Mdy3T5FtFQ3MqwZYOW/jubtY
Zo7shxrDjSvPb7MHG6GlLBNxZ8JXXGyc+seEfjZ8iiEeMGsE9vIQ1L18c0GZSmgc
/m/nQV/akycoNg/9J84HqClGLUWUApdMPaXrvOwC5CjpgOgJZ+rdUqhexqcNwmsl
O+s9fwQidtAr5fAgl6SjwqaPauqBd4VSybs7IkGbz+zyaZeRdWo5gsg5t5Hjuyd5
gJiIAztzI8bOPI1T/EheGVwSkmJTEkhnJDQvMRQcpEpo5D5K3YY=
=9wY+
-----END PGP SIGNATURE-----
Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no new features, the changes are in the core code, notably
tree-log error handling and reporting improvements, and initial
support for block size > page size.
Performance improvements:
- search data checksums in the commit root (previous transaction) to
avoid locking contention, this improves parallelism of read
heavy/low write workloads, and also reduces transaction commit
time; on real and reproducer workload the sync time went from
minutes to tens of seconds (workload and numbers are in the
changelog)
Core:
- tree-log updates:
- error handling improvements, transaction aborts
- add new error state 'O' (printed in status messages) when log
replay fails and is aborted
- reduced number of btrfs_path allocations when traversing the
tree
- 'block size > page size' support
- basic implementation with limitations, under experimental build
- limitations: no direct io, raid56, encoded read (standalone and
in send ioctl), encoded write
- preparatory work for compression, removing implicit assumptions
of page and block sizes
- compression workspaces are now per-filesystem, we cannot assume
common block size for work memory among different filesystems
- tree-checker now verifies INODE_EXTREF item (which is implementing
hardlinks)
- tree leaf pretty printer updates, there were missing data from
items, keys/items
- move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
it's a debugging feature and not needed to be enabled separately
- more struct btrfs_path auto free updates
- use ref_tracker API for tracking delayed inodes, enabled by mount
option 'ref_verify', allowing to better pinpoint leaking references
- in zoned mode, avoid selecting data relocation zoned for ordinary
data block groups
- updated and enhanced error messages
- lots of cleanups and refactoring"
* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
btrfs: add unlikely annotations to branches leading to transaction abort
btrfs: add unlikely annotations to branches leading to EIO
btrfs: add unlikely annotations to branches leading to EUCLEAN
btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
btrfs: zoned: don't fail mount needlessly due to too many active zones
btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
btrfs: enable experimental bs > ps support
btrfs: add extra ASSERT()s to catch unaligned bios
btrfs: fix symbolic link reading when bs > ps
btrfs: prepare scrub to support bs > ps cases
btrfs: prepare zlib to support bs > ps cases
btrfs: prepare lzo to support bs > ps cases
btrfs: prepare zstd to support bs > ps cases
btrfs: prepare compression folio alloc/free for bs > ps cases
btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
btrfs: remove pointless key offset setup in create_pending_snapshot()
btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
btrfs: make the rule checking more readable for should_cow_block()
btrfs: simplify inline extent end calculation at replay_one_extent()
...
|
||
|
|
b786405685 |
vfs-6.18-rc1.workqueue
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQYgAKCRCRxhvAZXjc olgGAQDWr4sD7kUt8TxifdAXsQNgyGG8qOUkb/BHHSqJ/5mKvAEAlTwJ+81tgNKT hYYdPyvWdbgW6CnWeiQLi0JjpFvUPQU= =uHwG -----END PGP SIGNATURE----- Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq |
||
|
|
56e7b31071 |
vfs-6.18-rc1.inode
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQQgAKCRCRxhvAZXjc
oud9AQD5IG4sNnzCjsvcTDpQkbX5eZW+LFIiAiiN+nztZ+OcRQEAvC2N7YovfqM3
TWpVoNDKvEPdtDc9ttFMUKqBZYvxvgE=
=sEaL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
|
||
|
|
b7ce6fa90f |
vfs-6.18-rc1.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
=1ICf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
||
|
|
dff4f9ff5d |
btrfs: avoid potential out-of-bounds in btrfs_encode_fh()
The function btrfs_encode_fh() does not properly account for the three
cases it handles.
Before writing to the file handle (fh), the function only returns to the
user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or
BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes).
However, when a parent exists and the root ID of the parent and the
inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT
(10 dwords, 40 bytes).
If *max_len is not large enough, this write goes out of bounds because
BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than
BTRFS_FID_SIZE_CONNECTABLE originally returned.
This results in an 8-byte out-of-bounds write at
fid->parent_root_objectid = parent_root_id.
A previous attempt to fix this issue was made but was lost.
https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/
Although this issue does not seem to be easily triggerable, it is a
potential memory corruption bug that should be fixed. This patch
resolves the issue by ensuring the function returns the appropriate size
for all three cases and validates that *max_len is large enough before
writing any data.
Fixes:
|
||
|
|
74c7cc79aa |
for-6.17-rc7-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjTbXwACgkQxWXV+ddt WDtj1g//ZYTmnaJi16hS7yD2XkX0ZWZi/fFGj6y0/y4GdUG7kE4ZO8ujyZjssVvk UGVNyrv6zbWLh2z+QioBMDPMsFDGT4gBrBSsT8SP2VtMD+G6ElAxYq2raDU9Wsw6 IY86UhrnWx7RFYLbpY2YrK0F6G4UhNkwz4S8brftxFGOVF5hmfCD+5mSpfCOOnoG iK6/p0G1Kf1pIwuSl4d0bl33ruTN/5r/pQZwfguWFLwVJnagE4/a0Y6DGY9B2YO5 xEFuVCv26Im/XRz9HlcZC1VbWEwSyMlNdmvhONsFCWyPkwsguFyPBTOKZO4em6fK P3QgW6vjLTwBgcLflsrcezEbmmdeQ82REQil0NpuM8x9NcD649ecHpmwDqY/2aUw XH8bIDqhekeoV/sDVkGegaWMDxJizTHCZTdhokcIMRR+wbLVRgFmAHBmFjR392SC 7APzgCbzLzjECSQuv1KviceTW+JQMiERoSdAIFUtumRoa0wDkR+5y6ve6Um9Z0Ze KXHdtH2hcsw1qat1i3DCk91F91f0fxP73aE/driCwPlAdWpHwIGFTPg0hGM/Tca3 YSKeS+cDt0LGSJKE8iB3LQrE6Nj5kAOwvMsM4SvFgHfRndjiZv5rilzkj59S6NGu qcH03hIZgBPCjtjKAJG6qfe9Krd/yy19Mq18/4Jn1XhlQahCY/8= =sOwE -----END PGP SIGNATURE----- Merge tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more regression fix for a problem in zoned mode: mounting would fail if the number of open and active zones reached a common limit that didn't use to be checked" * tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: don't fail mount needlessly due to too many active zones |
||
|
|
53de7ee4e2 |
btrfs: zoned: don't fail mount needlessly due to too many active zones
Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit |
||
|
|
45c222468d |
btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit
|
||
|
|
a929904cf7 |
btrfs: add unlikely annotations to branches leading to transaction abort
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
cc53bd2085 |
btrfs: add unlikely annotations to branches leading to EIO
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
9264d004a6 |
btrfs: add unlikely annotations to branches leading to EUCLEAN
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
4ca6f24a52 |
btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
Trivial pattern for the auto freeing with goto -> return conversions if possible. The following cases are considered trivial in this patch: 1. Cases where there are no operations between btrfs_free_path() and the function returns. 2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
c9ff83963a |
btrfs: zoned: don't fail mount needlessly due to too many active zones
Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit |
||
|
|
f08d7147da |
btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
As pointed out in the documentation, calling 'kmalloc' with open-coded arithmetic can lead to unfortunate overflows and this particular way of using it has been deprecated. Instead, it's preferred to use 'kmalloc_array' in cases where it might apply so an overflow check is performed. Note this is an API cleanup and is not fixing any overflows because in all cases the multipliers are bounded small numbers derived from number of items in leaves/nodes. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
98077f7f21 |
btrfs: enable experimental bs > ps support
With all the preparation patches, we're able to finally enable btrfs block size (sector size) larger than page size support and give it a full fstests run. And obviously this new feature is hidden behind experimental flags, and should not be considered as a core feature yet as btrfs' default block size is still 4K. But this is still a feature that will shine in the future where 16K block sized device are widely adopted. For now there are some features explicitly disabled: - Direct IO This is the most complex part to support, the root reason is we can not control the pages of iov iter passed in. User space programs can only ensure the virtual addresses are contiguous, but have no control on their physical addresses. Our bs > ps support heavily relies on large folios, and direct IO memory can easily break it. So direct IO is disabled and will always fall back to buffered IO. - RAID56 In theory we can convert RAID56 to use large folios, but it will need to be converted back to page based if we want to support direct IO in the future. So just reject it for now. - Encoded send - Encoded read Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send is utilizing vmallocated memory. Unfortunately for vmallocated memory we can not guarantee the minimal folio order. For send, it will just always fallback to regular writes, which reads from page cache and will follow the existing folio order requirement. - Encoded write Encoded write itself is allocating pages by themselves, and we can easily change it to follow the minimal order. But since encoded read is already disabled, there is no need to only enable encoded write. Finally just like what we did for bs < ps support in the past, add a warning message for bs > ps mounts. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e9bed72e88 |
btrfs: add extra ASSERT()s to catch unaligned bios
Btrfs uses btrfs_bio to handle read/write of logical address, for the incoming bs > ps support, btrfs has extra requirements: - One folio must contain at least one fs block - No fs block can cross folio boundaries This requirement is not hard to maintain, thanks to the address space's minimal folio order. But not all btrfs bios are generated through address space, e.g. compression and scrub. To catch possible unaligned bios, introduce a helper, assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio(). This will check the following things: - bv_offset is aligned to block size - bv_len is aligned to block size With a btrfs bio passing above checks, unless it's empty it will ensure the requirements for bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
67378b7546 |
btrfs: fix symbolic link reading when bs > ps
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:
# mkfs.btrfs -s 8k $dev
# mount $dev $mnt
# mkdir $mnt/dir
# ln -s dir $mnt/link
# ls $mnt/link
The call trace looks like this:
BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): enabling ssd optimizations
BTRFS info (device dm-2): enabling free space tree
------------[ cut here ]------------
kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
Call Trace:
<TASK>
btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
filemap_read_folio+0x37/0xe0
do_read_cache_folio+0x94/0x3e0
__page_get_link.isra.0+0x20/0x90
page_get_link+0x16/0x40
step_into+0x69b/0x830
path_lookupat+0xa7/0x170
filename_lookup+0xf7/0x200
? set_ptes.isra.0+0x36/0x70
vfs_statx+0x7a/0x160
do_statx+0x63/0xa0
__x64_sys_statx+0x90/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.
[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.
But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.
Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.
[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().
For symbolic links this will fix the crash as now the folio will meet
the minimal order.
For regular files this brings no change.
For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.
Fixes:
|
||
|
|
5fbaae4b85 |
btrfs: prepare scrub to support bs > ps cases
This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
e88cb48e67 |
btrfs: prepare zlib to support bs > ps cases
This involves converting the following functions to use correct folio sizes/shifts: - zlib_compress_folios() - zlib_decompress_bio() There is a special handling for s390 hardware acceleration. With bs > ps cases, we can go with 16K block size on s390 (which uses fixed 4K page size). In that case we do not need to do the buffer copy as our folio is large enough for hardware acceleration. So factor out the s390 specific and folio size check into a helper, need_special_buffer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |