linux-yocto

mirror of git://git.yoctoproject.org/linux-yocto.git synced 2025-07-14 01:29:38 +02:00

Author	SHA1	Message	Date
Qu Wenruo	af32d3632e	btrfs: check-integrity: remove btrfsic_mount() function The function btrfsic_mount() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-10-12 16:44:05 +02:00
Qu Wenruo	51cf580c23	btrfs: check-integrity: remove btrfsic_check_bio() function The function btrfsic_check_bio() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-10-12 16:44:05 +02:00
David Sterba	9580503bcb	btrfs: reformat remaining kdoc style comments Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-10-12 16:44:04 +02:00
Anand Jain	f7361d8c3f	btrfs: sipmlify uuid parameters of alloc_fs_devices() Among all the callers, only the device_list_add() function uses the second argument of alloc_fs_devices(). It passes metadata_uuid when available, otherwise, it passes NULL. And in turn, alloc_fs_devices() is designed to copy either metadata_uuid or fsid into fs_devices::metadata_uuid. So remove the second argument in alloc_fs_devices(), and always copy the fsid. In the caller device_list_add() function, we will overwrite it with metadata_uuid when it is available. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-10-12 16:44:02 +02:00
Qu Wenruo	5e0e879926	btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio [BUG] After commit `72a69cd030` ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_dirty_folio() would no longer compile. [CAUSE] If DEBUG is defined, we would do extra checks for btree_dirty_folio(), mostly to make sure the range we marked dirty has an extent buffer and that extent buffer is dirty. For subpage, we need to iterate through all the extent buffers covered by that page range, and make sure they all matches the criteria. However commit `72a69cd030` ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap") changes how we store the bitmap, we pack all the 16 bits bitmaps into a larger bitmap, which would save some space. This means we no longer have btrfs_subpage::dirty_bitmap, instead the dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a length of btrfs_subpage_info::bitmap_nr_bits. [FIX] Although I'm not sure if it still makes sense to maintain such code, at least let it compile. This patch would let us test the bits one by one through the bitmaps. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:11:04 +02:00
Josef Bacik	77d20c685b	btrfs: do not block starts waiting on previous transaction commit Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The reporter used the 'bpf offcputime' tracer to show that we would get stuck in start_transaction for 5 to 30 seconds, and were always being woken up by the transaction commit. Using my timing-everything script, which times how long a function takes and what percentage of that total time is taken up by its children, I saw several traces like this 1083 took 32812902424 ns 29929002926 ns 91.2110% wait_for_commit_duration 25568 ns 7.7920e-05% commit_fs_roots_duration 1007751 ns 0.00307% commit_cowonly_roots_duration 446855602 ns 1.36182% btrfs_run_delayed_refs_duration 271980 ns 0.00082% btrfs_run_delayed_items_duration 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration 9656 ns 2.9427e-05% switch_commit_roots_duration 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration Here I was only tracing functions that happen where we are between START_COMMIT and UNBLOCKED in order to see what would be keeping us blocked for so long. The wait_for_commit() we do is where we wait for a previous transaction that hasn't completed it's commit. This can include all of the unpin work and other cleanups, which tends to be the longest part of our transaction commit. There is no reason we should be blocking new things from entering the transaction at this point, it just adds to random latency spikes for no reason. Fix this by adding a PREP stage. This allows us to properly deal with multiple committers coming in at the same time, we retain the behavior that the winner waits on the previous transaction and the losers all wait for this transaction commit to occur. Nothing else is blocked during the PREP stage, and then once the wait is complete we switch to COMMIT_START and all of the same behavior as before is maintained. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:49 +02:00
Anand Jain	67bc5ad04b	btrfs: drop redundant check to use fs_devices::metadata_uuid fs_devices::metadata_uuid value is already updated based on the super_block::METADATA_UUID flag for either fsid or metadata_uuid as appropriate. So, fs_devices::metadata_uuid can be used directly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:48 +02:00
Anand Jain	6bfe3959b0	btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super The function btrfs_validate_super() should verify the metadata_uuid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stacks: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::metadata_uuid instead. Fix this using the correct metadata_uuid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:48 +02:00
Anand Jain	d167aa76dc	btrfs: use the correct superblock to compare fsid in btrfs_validate_super The function btrfs_validate_super() should verify the fsid in the provided superblock argument. Because, all its callers expect it to do that. Such as in the following stack: write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..) scrub_one_super() btrfs_validate_super(.., sb, ..) And check_dev_super() btrfs_validate_super(.., sb, ..) However, it currently verifies the fs_info::super_copy::fsid instead, which is not correct. Fix this using the correct fsid in the superblock argument. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:48 +02:00
Ruan Jinjie	84af994b85	btrfs: use LIST_HEAD() to initialize the list_head Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:54:46 +02:00
Naohiro Aota	a7e1ac7bdc	btrfs: zoned: reserve zones for an active metadata/system block group Ensure a metadata and system block group can be activated on write time, by leaving a certain number of active zones when trying to activate a data block group. Zones for two metadata block groups (normal and tree-log) and one system block group are reserved, according to the profile type: two zones per block group on the DUP profile and one zone per block group otherwise. The reservation must be freed once a non-data block group is allocated. If not, we over-reserve the active zones and data block group activation will suffer. For the dynamic reservation count, we need to manage the reservation count per device. The reservation count variable is protected by fs_info->zone_active_bgs_lock. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:19 +02:00
Filipe Manana	504b1596bd	btrfs: make btrfs_cleanup_fs_roots() static btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static, remove its prototype from disk-io.h and move its definition above the where it's used in disk-io.c Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	ae3364e521	btrfs: store the error that turned the fs into error state Currently when we turn the fs into an error state, typically after a transaction abort, we don't store the error anywhere, we just set a bit (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the error state. There are cases where it would be useful to have access to the specific error in order to provide a more meaningful error to users/applications. This change adds a member to struct btrfs_fs_info to store the error and removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new member (fs_error) has a value of 0, otherwise its value is a negative errno value. Followup changes will make use of this new member. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:18 +02:00
Filipe Manana	e5860f8207	btrfs: make find_first_extent_bit() return a boolean Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Filipe Manana	46d81ebd4a	btrfs: make btrfs_destroy_pinned_extent() return void Currently btrfs_destroy_pinned_extent() is always returning 0 no matter what and its caller ignores its return value (as well everything up in the call chain). This is because this is called in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion. So make btrfs_destroy_pinned_extent() return void. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Filipe Manana	aec5716c3e	btrfs: make btrfs_destroy_marked_extents() return void Currently btrfs_destroy_marked_extents() is returning the value of the last call to find_first_extent_bit(), which returns a value of 1 meaning no more ranges found the dirty pages io tree. This value is useless to the single caller of btrfs_destroy_marked_extents(), which ignores any return value from btrfs_destroy_marked_extents(). This is because it's only used in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion. So make btrfs_destroy_marked_extents() return void. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-21 14:52:12 +02:00
Qu Wenruo	6ebcd021c9	btrfs: reject invalid reloc tree root keys with stack dump [BUG] Syzbot reported a crash that an ASSERT() got triggered inside prepare_to_merge(). That ASSERT() makes sure the reloc tree is properly pointed back by its subvolume tree. [CAUSE] After more debugging output, it turns out we had an invalid reloc tree: BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17 Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM, QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree. But reloc trees can only exist for subvolumes, as for non-subvolume trees, we just COW the involved tree block, no need to create a reloc tree since those tree blocks won't be shared with other trees. Only subvolumes tree can share tree blocks with other trees (thus they have BTRFS_ROOT_SHAREABLE flag). Thus this new debug output proves my previous assumption that corrupted on-disk data can trigger that ASSERT(). [FIX] Besides the dedicated fix and the graceful exit, also let tree-checker to check such root keys, to make sure reloc trees can only exist for subvolumes. CC: stable@vger.kernel.org # 5.15+ Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-10 17:14:42 +02:00
Qu Wenruo	773e722a98	btrfs: avoid race between qgroup tree creation and relocation [BUG] Syzbot reported a weird ASSERT() triggered inside prepare_to_merge(). assertion failed: root->reloc_root == reloc_root, in fs/btrfs/relocation.c:1919 ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1919! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9904 Comm: syz-executor.3 Not tainted 6.4.0-syzkaller-08881-g533925cb7604 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:prepare_to_merge+0xbb2/0xc40 fs/btrfs/relocation.c:1919 Code: fe e9 f5 (...) RSP: 0018:ffffc9000325f760 EFLAGS: 00010246 RAX: 000000000000004f RBX: ffff888075644030 RCX: 1481ccc522da5800 RDX: ffffc90005c09000 RSI: 00000000000364ca RDI: 00000000000364cb RBP: ffffc9000325f870 R08: ffffffff816f33ac R09: 1ffff9200064bea0 R10: dffffc0000000000 R11: fffff5200064bea1 R12: ffff888075644000 R13: ffff88803b166000 R14: ffff88803b166560 R15: ffff88803b166558 FS: 00007f4e305fd700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056080679c000 CR3: 00000000193ad000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> relocate_block_group+0xa5d/0xcd0 fs/btrfs/relocation.c:3749 btrfs_relocate_block_group+0x7ab/0xd70 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3283 __btrfs_balance+0x1b06/0x2690 fs/btrfs/volumes.c:4018 btrfs_balance+0xbdb/0x1120 fs/btrfs/volumes.c:4402 btrfs_ioctl_balance+0x496/0x7c0 fs/btrfs/ioctl.c:3604 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f4e2f88c389 [CAUSE] With extra debugging, the offending reloc_root is for quota tree (rootid 8). Normally we should not use the reloc tree for quota root at all, as reloc trees are only for subvolume trees. But there is a race between quota enabling and relocation, this happens after commit `85724171b3` ("btrfs: fix the btrfs_get_global_root return value"). Before that commit, for quota and free space tree, we exit immediately if we cannot grab it from fs_info. But now we would try to read it from disk, just as if they are fs trees, this sets ROOT_SHAREABLE flags in such race: Thread A \| Thread B ---------------------------------+------------------------------ btrfs_quota_enable() \| \| \| btrfs_get_root_ref() \| \| \|- btrfs_get_global_root() \| \| \| Returned NULL \| \| \|- btrfs_lookup_fs_root() \| \| \| Returned NULL \|- btrfs_create_tree() \| \| \| Now quota root item is \| \| \| inserted \| \|- btrfs_read_tree_root() \| \| \| Got the newly inserted quota root \| \| \|- btrfs_init_fs_root() \| \| \| Set ROOT_SHAREABLE flag [FIX] Get back to the old behavior by returning PTR_ERR(-ENOENT) if the target objectid is not a subvolume tree or data reloc tree. Reported-and-tested-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Fixes: `85724171b3` ("btrfs: fix the btrfs_get_global_root return value") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-08-10 17:10:10 +02:00
Naohiro Aota	95ca6599a5	btrfs: zoned: do not enable async discard The zoned mode need to reset a zone before using it. We rely on btrfs's original discard functionality (discarding unused block group range) to do the resetting. While the commit `63a7cb1307` ("btrfs: auto enable discard=async when possible") made the discard done in an async manner, a zoned reset do not need to be async, as it is fast enough. Even worth, delaying zone rests prevents using those zones again. So, let's disable async discard on the zoned mode. Fixes: `63a7cb1307` ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update message text ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-07-20 19:18:14 +02:00
David Sterba	efcfcbc6a3	btrfs: add xxhash to fast checksum implementations The implementation of XXHASH is now CPU only but still fast enough to be considered for the synchronous checksumming, like non-generic crc32c. A userspace benchmark comparing it to various implementations (patched hash-speedtest from btrfs-progs): Block size: 4096 Iterations: 1000000 Implementation: builtin Units: CPU cycles NULL-NOP: cycles: 73384294, cycles/i 73 NULL-MEMCPY: cycles: 228033868, cycles/i 228, 61664.320 MiB/s CRC32C-ref: cycles: 24758559416, cycles/i 24758, 567.950 MiB/s CRC32C-NI: cycles: 1194350470, cycles/i 1194, 11773.433 MiB/s CRC32C-ADLERSW: cycles: 6150186216, cycles/i 6150, 2286.372 MiB/s CRC32C-ADLERHW: cycles: 626979180, cycles/i 626, 22427.453 MiB/s CRC32C-PCL: cycles: 466746732, cycles/i 466, 30126.699 MiB/s XXHASH: cycles: 860656400, cycles/i 860, 16338.188 MiB/s Comparing purely software implementation (ref), current outdated accelerated using crc32q instruction (NI), optimized implementations by M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775) and the best one that was taken from kernel using the PCLMULQDQ instruction (PCL). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:33 +02:00
Filipe Manana	99f09ce309	btrfs: make btrfs_destroy_delayed_refs() return void btrfs_destroy_delayed_refs() always returns 0 and its single caller does not check its return value, as it also returns void, and so does the callers' caller and so on. This is because we are in the transaction abort path, where we have no way to deal with errors (we are in a critical situation) and all cleanup of resources works in a best effort fashion. So make btrfs_destroy_delayed_refs() return void. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	184533e361	btrfs: remove unnecessary prototype declarations at disk-io.c We have a few static functions at disk-io.c for which we have a forward declaration of their prototype, but it's not needed because all those functions are defined before they are called, so remove them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:32 +02:00
Filipe Manana	4d34ad34d7	btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node, as we can check whether a reference is in the tree or not simply by checking its red black tree node member with RB_EMPTY_NODE(), as when we remove it from the tree we always call RB_CLEAR_NODE(). So remove that field and use RB_EMPTY_NODE(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:31 +02:00
Tejun Heo	58e814fcac	btrfs: use alloc_ordered_workqueue() to create ordered workqueues BACKGROUND ========== When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue(). However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by `4c16bd327c` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered"). Because there were users that depended on the ordered execution, `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues. While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. This patch series audits all call sites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary. BTRFS ===== * fs_info->scrub_workers initialized in scrub_workers_get() was setting @max_active to 1 when @is_dev_replace is set and it seems that the workqueue actually needs to be ordered if @is_dev_replace. Update the code so that alloc_ordered_workqueue() is used if @is_dev_replace. * fs_info->discard_ctl.discard_workers initialized in btrfs_init_workqueues() was directly using alloc_workqueue() w/ @max_active==1. Converted to alloc_ordered_workqueue(). * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue, which are allocated with btrfs_alloc_workqueue(). btrfs_workqueue implements automatic @max_active adjustment which is disabled when the specified max limit is below a certain threshold, so calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered workqueue whose @max_active won't be changed as the auto-tuning is disabled. This is rather brittle in that nothing clearly indicates that the two workqueues should be ordered or btrfs_alloc_workqueue() must disable auto-tuning when @limit_active==1. This patch factors out the common btrfs_workqueue init code into btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use the new ordered allocation interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:30 +02:00
Christoph Hellwig	25ac047c9d	btrfs: remove a pointless NULL check in btrfs_lookup_fs_root btrfs_grab_root already checks for a NULL root itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	e91909aace	btrfs: convert btrfs_get_global_root to use a switch statement Use a switch statement instead of an endless chain of if statements to make the code a little cleaner. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	85724171b3	btrfs: fix the btrfs_get_global_root return value btrfs_grab_root returns either the root or NULL, and the callers of btrfs_get_global_root expect it to return the same. But all the more recently added roots instead return an ERR_PTR, so fix this. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	25984a5ae8	btrfs: consolidate uuid comparisons in btrfs_validate_super There are three ways the fsid is validated in btrfs_validate_super(): - verify that super_copy::fsid is the same as fs_devices::fsid - if the metadata_uuid flag is set, verify if super_copy::metadata_uuid and fs_devices::metadata_uuid are the same. - a few lines below, often missed out, verify if dev_item::fsid is the same as fs_devices::metadata_uuid. The function btrfs_validate_super() contains multiple if-statements with memcmp() to check UUIDs. This patch consolidates them into a single location. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Anand Jain	413fb1bc1d	btrfs: return bool from check_tree_block_fsid instead of int Simplify the return type of check_tree_block_fsid() from int (1 or 0) to bool. Its only user is interested in knowing the success or failure. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:29 +02:00
Christoph Hellwig	9e2aff90fc	btrfs: stop using lock_extent in btrfs_buffer_uptodate The only other place that locks extents on the btree inode is read_extent_buffer_subpage while reading in the partial page for a buffer. This means locking the extent in btrfs_buffer_uptodate does not synchronize with anything on non-subpage file systems, and on subpage file systems it only waits for a parallel read(-ahead) to finish, which seems to be counter to what the callers actually expect. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	31d89399da	btrfs: remove the extent_buffer lookup in btree block checksumming The checksumming of btree blocks always operates on the entire extent_buffer, and because btree blocks are always allocated contiguously on disk they are never split by btrfs_submit_bio. Simplify the checksumming code by finding the extent_buffer in the btrfs_bio private data instead of trying to search through the bio_vec. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:28 +02:00
Christoph Hellwig	046b562b20	btrfs: use a separate end_io handler for read_extent_buffer Now that we always use a single bio to read an extent_buffer, the buffer can be passed to the end_io handler as private data. This allows implementing a much simplified dedicated end I/O handler for metadata reads. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	d87e6575e9	btrfs: merge verify_parent_transid and btrfs_buffer_uptodate verify_parent_transid is only called by btrfs_buffer_uptodate, which confusingly inverts the return value. Merge the two functions and reflow the parent_transid so that error handling is in a branch. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	aebcc1596b	btrfs: move setting the buffer uptodate out of validate_extent_buffer Setting the buffer uptodate in a function that is named as a validation helper is a it confusing. Move the call from validate_extent_buffer to the one of its two callers that didn't already have a duplicate call to set_extent_buffer_uptodate. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:27 +02:00
Christoph Hellwig	f880fe6e0b	btrfs: don't hold an extra reference for redirtied buffers When btrfs_redirty_list_add redirties a buffer, it also acquires an extra reference that is released on transaction commit. But this is not required as buffers that are dirty or under writeback are never freed (look for calls to extent_buffer_under_io())). Remove the extra reference and the infrastructure used to drop it again. History behind redirty logic: In the first place, it used releasing_list to hold all the to-be-released extent buffers, and decided which buffers to re-dirty at the commit time. Then, in a later version, the behaviour got changed to re-dirty a necessary buffer and add re-dirtied one to the list in btrfs_free_tree_block(). In short, the list was there mostly for the patch series' historical reason. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ add Naohiro's comment regarding history ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:26 +02:00
Christoph Hellwig	f18cc97845	btrfs: fix dirty_metadata_bytes for redirtied buffers dirty_metadata_bytes is decremented in both places that clear the dirty bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which means that a buffer that is redirtied using btrfs_redirty_list_add won't be added to dirty_metadata_bytes, but it will be subtracted when written out, leading an inconsistency in the counter. Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into set_extent_buffer_dirty to also account for the redirty case, and remove the now unused set_extent_buffer_dirty return value. Fixes: `d3575156f6` ("btrfs: zoned: redirty released extent buffers") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:25 +02:00
Josef Bacik	2cac5af165	btrfs: move btrfs_verify_level_key into tree-checker.c This is more a buffer validation helper, move it into the tree-checker files where it makes more sense. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:25 +02:00
Josef Bacik	85d8a826c7	btrfs: simplify btrfs_check_leaf_* helpers into a single helper We have two helpers for checking leaves, because we have an extra check for debugging in btrfs_mark_buffer_dirty(), and at that stage we may have item data that isn't consistent yet. However we can handle this case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we know the buffer should be internally consistent, otherwise we need to skip checking the item data. Simplify this helper down a single helper and handle the item data checking logic internally to the helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:24 +02:00
Christoph Hellwig	8bfec2e426	btrfs: remove hipri_workers workqueue Now that btrfs_wq_submit_bio is never called for synchronous I/O, the hipri_workers workqueue is not used anymore and can be removed. Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-19 13:59:23 +02:00
Qu Wenruo	745806fb45	btrfs: do not ASSERT() on duplicated global roots [BUG] Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot mount option on a corrupted fs. The full report can be found here: https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0 BTRFS error (device loop0: state C): failed to load root csum assertion failed: !tmp, in fs/btrfs/disk-io.c:1103 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3664! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663 RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246 RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1 R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000 R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000 FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103 load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467 load_global_roots fs/btrfs/disk-io.c:2501 [inline] btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline] init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939 open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574 btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456 btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824 legacy_get_tree+0xea/0x180 fs/fs_context.c:610 vfs_get_tree+0x88/0x270 fs/super.c:1530 fc_mount fs/namespace.c:1043 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073 btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884 [CAUSE] Since the introduction of global roots, we handle csum/extent/free-space-tree roots as global roots, even if no extent-tree-v2 feature is enabled. So for regular csum/extent/fst roots, we load them into fs_info::global_root_tree rb tree. And we should not expect any conflicts in that rb tree, thus we have an ASSERT() inside btrfs_global_root_insert(). But rescue=usebackuproot can break the assumption, as we will try to load those trees again and again as long as we have bad roots and have backup roots slot remaining. So in that case we can have conflicting roots in the rb tree, and triggering the ASSERT() crash. [FIX] We can safely remove that ASSERT(), as the caller will properly put the offending root. To make further debugging easier, also add two explicit error messages: - Error message for conflicting global roots - Error message when using backup roots slot Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com Fixes: `abed4aaae4` ("btrfs: track the csum, extent, and free space trees in a rb tree") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-13 01:21:16 +02:00
Qu Wenruo	917ac77846	btrfs: subpage: fix a crash in metadata repair path [BUG] Test case btrfs/027 would crash with subpage (64K page size, 4K sectorsize) with the following dying messages: debug: map_length=16384 length=65536 type=metadata\|raid6(0x104) assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_assertfail+0x28/0x2c [btrfs] btrfs_map_repair_block+0x150/0x2b8 [btrfs] btrfs_repair_io_failure+0xd4/0x31c [btrfs] btrfs_read_extent_buffer+0x150/0x16c [btrfs] read_tree_block+0x38/0xbc [btrfs] read_tree_root_path+0xfc/0x1bc [btrfs] btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs] open_ctree+0xa30/0x172c [btrfs] btrfs_mount_root+0x3c4/0x4a4 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec vfs_kern_mount.part.0+0x90/0xd4 vfs_kern_mount+0x14/0x28 btrfs_mount+0x114/0x418 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec path_mount+0x3e0/0xb64 __arm64_sys_mount+0x200/0x2d8 invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x60/0x11c do_el0_svc+0x38/0x98 el0_svc+0x40/0xa8 el0t_64_sync_handler+0xf4/0x120 el0t_64_sync+0x190/0x194 Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000) [CAUSE] In btrfs/027 we test RAID6 with missing devices, in this particular case, we're repairing a metadata at the end of a data stripe. But at btrfs_repair_io_failure(), we always pass a full PAGE for repair, and for subpage case this can cross stripe boundary and lead to the above BUG_ON(). This metadata repair code is always there, since the introduction of subpage support, but this can trigger BUG_ON() after the bio split ability at btrfs_map_bio(). [FIX] Instead of passing the old PAGE_SIZE, we calculate the correct length based on the eb size and page size for both regular and subpage cases. CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-06-05 19:21:57 +02:00
pengfuyuan	5ad9b4719f	btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds When compiling on a MIPS 64-bit machine we get these warnings: In file included from ./arch/mips/include/asm/cacheflush.h:13, from ./include/linux/cacheflush.h:5, from ./include/linux/highmem.h:8, from ./include/linux/bvec.h:10, from ./include/linux/blk_types.h:10, from ./include/linux/blkdev.h:9, from fs/btrfs/disk-io.c:7: fs/btrfs/disk-io.c: In function ‘csum_tree_block’: fs/btrfs/disk-io.c💯34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds] 100 \| kaddr = page_address(buf->pages[i]); \| ~~~~~~~~~~^~~ ./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’ 2135 \| #define page_address(page) lowmem_page_address(page) \| ^~~~ cc1: all warnings being treated as errors We can check if i overflows to solve the problem. However, this doesn't make much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop. In addition, i < num_pages can also ensure that buf->pages[i] will not cross the boundary. Unfortunately, this doesn't help with the problem observed here: gcc still complains. To fix this add a compile-time condition for the extent buffer page array size limit, which would eventually lead to eliminating the whole for loop. CC: stable@vger.kernel.org # 5.10+ Signed-off-by: pengfuyuan <pengfuyuan@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-05-26 23:24:55 +02:00
Josef Bacik	597441b343	btrfs: use nofs when cleaning up aborted transactions Our CI system caught a lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc7+ #1167 Not tainted ------------------------------------------------------ kswapd0/46 is trying to acquire lock: ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120 but task is already holding lock: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0xa5/0xe0 kmem_cache_alloc+0x31/0x2c0 alloc_extent_state+0x1d/0xd0 __clear_extent_bit+0x2e0/0x4f0 try_release_extent_mapping+0x216/0x280 btrfs_release_folio+0x2e/0x90 invalidate_inode_pages2_range+0x397/0x470 btrfs_cleanup_dirty_bgs+0x9e/0x210 btrfs_cleanup_one_transaction+0x22/0x760 btrfs_commit_transaction+0x3b7/0x13a0 create_subvol+0x59b/0x970 btrfs_mksubvol+0x435/0x4f0 __btrfs_ioctl_snap_create+0x11e/0x1b0 btrfs_ioctl_snap_create_v2+0xbf/0x140 btrfs_ioctl+0xa45/0x28f0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc -> #0 (sb_internal#2){++++}-{0:0}: __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 start_transaction+0x401/0x730 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 kthread+0xf0/0x120 ret_from_fork+0x29/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sb_internal#2); lock(fs_reclaim); lock(sb_internal#2); * DEADLOCK * 3 locks held by kswapd0/46: #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290 #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0 stack backtrace: CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x58/0x90 check_noncircular+0xd6/0x100 ? save_trace+0x3f/0x310 ? add_lock_to_list+0x97/0x120 __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 start_transaction+0x401/0x730 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 ? lock_release+0x134/0x270 ? __pfx_wake_bit_function+0x10/0x10 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 ? __pfx_inode_lru_isolate+0x10/0x10 ? __pfx_inode_lru_isolate+0x10/0x10 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_kswapd+0x10/0x10 kthread+0xf0/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> This happens because when we abort the transaction in the transaction commit path we call invalidate_inode_pages2_range on our block group cache inodes (if we have space cache v1) and any delalloc inodes we may have. The plain invalidate_inode_pages2_range() call passes through GFP_KERNEL, which makes sense in most cases, but not here. Wrap these two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to make sure we don't end up with the fs reclaim dependency under the transaction dependency. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-05-17 13:08:28 +02:00
Qu Wenruo	1d6a4fc857	btrfs: make clear_cache mount option to rebuild FST without disabling it Previously clear_cache mount option would simply disable free-space-tree feature temporarily then re-enable it to rebuild the whole free space tree. But this is problematic for block-group-tree feature, as we have an artificial dependency on free-space-tree feature. If we go the existing method, after clearing the free-space-tree feature, we would flip the filesystem to read-only mode, as we detect a super block write with block-group-tree but no free-space-tree feature. This patch would change the behavior by properly rebuilding the free space tree without disabling this feature, thus allowing clear_cache mount option to work with block group tree. Now we can mount a filesystem with block-group-tree feature and clear_mount option: $ mkfs.btrfs -O block-group-tree /dev/test/scratch1 -f $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache $ sudo dmesg -t \| head -n 5 BTRFS info (device dm-1): force clearing of disk cache BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): auto enabling async discard BTRFS info (device dm-1): rebuilding free space tree BTRFS info (device dm-1): checking UUID tree CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-05-10 14:51:27 +02:00
Anand Jain	7e812f2054	btrfs: use test_and_clear_bit() in wait_dev_flush() The function wait_dev_flush() tests for the BTRFS_DEV_STATE_FLUSH_SENT bit and then clears it separately. Instead, use test_and_clear_bit(). Though we don't need to do the atomic test and clear, it's following a common pattern. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Anand Jain	1b465784dc	btrfs: change wait_dev_flush() return type to bool The flush error code is maintained in btrfs_device::last_flush_error, so there is no point in returning it in wait_dev_flush() when it is not being used. Instead, we can return a boolean value. Note that even though btrfs_device::last_flush_error may not be used, we will keep it for now. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Anand Jain	de38a206ff	btrfs: open code check_barrier_error() check_barrier_error() is almost a single line function, and just calls btrfs_check_rw_degradable(). Instead, open code it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Anand Jain	bfd3ea946f	btrfs: move last_flush_error to write_dev_flush and wait_dev_flush We parallelize the flush command across devices using our own code, write_dev_flush() sends the flush command to each device and wait_dev_flush() waits for the flush to complete on all devices. Errors from each device are recorded at device->last_flush_error and reset to BLK_STS_OK in write_dev_flush() and to the error, if any, in wait_dev_flush(). These functions are called from barrier_all_devices(). This patch consolidates the use of device->last_flush_error in write_dev_flush() and wait_dev_flush() to remove it from barrier_all_devices(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:20 +02:00
Filipe Manana	a8fdc05172	btrfs: remove obsolete delayed ref throttling logic when truncating items We have this logic encapsulated in btrfs_should_throttle_delayed_refs() where we try to estimate if running the current amount of delayed references we have will take more than half a second, and if so, the caller btrfs_should_throttle_delayed_refs() should do something to prevent more and more delayed refs from being accumulated. This logic was added in commit `0a2b2a844a` ("Btrfs: throttle delayed refs better") and then further refined in commit `a79b7d4b3e` ("Btrfs: async delayed refs"). The idea back then was that the caller of btrfs_should_throttle_delayed_refs() would release its transaction handle (by calling btrfs_end_transaction()) when that function returned true, then btrfs_end_transaction() would trigger an async job to run delayed references in a workqueue, and later start/join a transaction again and do more work. However we don't run delayed references asynchronously anymore, that was removed in commit `db2462a6ad` ("btrfs: don't run delayed refs in the end transaction logic"). That makes the logic that tries to estimate how long we will take to run our current delayed references, at btrfs_should_throttle_delayed_refs(), pointless as we don't take any action to run delayed references anymore. We do have other type of throttling, which consists of checking the size and reserved space of the delayed and global block reserves, as well as if fluhsing delayed references for the current transaction was already started, etc - this is all done by btrfs_should_end_transaction(), and the only user of btrfs_should_throttle_delayed_refs() does periodically call btrfs_should_end_transaction(). So remove btrfs_should_throttle_delayed_refs() and the infrastructure that keeps track of the average time used for running delayed references, as well as adapting btrfs_truncate_inode_items() to call btrfs_check_space_for_delayed_refs() instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:19 +02:00
David Sterba	0b5485391d	btrfs: locking: use atomic for DREW lock writers The DREW lock uses percpu variable to track lock counters and for that it needs to allocate the structure. In btrfs_read_tree_root() or btrfs_init_fs_root() this may add another error case or requires the NOFS scope protection. One way is to preallocate the structure as was suggested in https://lore.kernel.org/linux-btrfs/20221214021125.28289-1-robbieko@synology.com/ We may avoid the allocation altogether if we don't use the percpu variables but an atomic for the writer counter. This should not make any difference, the DREW lock is used for truncate and NOCOW writes along with other IO operations. The percpu counter for writers has been there since the original commit `8257b2dc3c` "Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume". The reason could be to avoid hammering the same cacheline from all the readers but then the writers do that anyway. Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:17 +02:00
Anand Jain	ce4cf3793e	btrfs: remove redundant clearing of NODISCARD If no discard mount option is specified, including the NODISCARD option, we make the async discard the default option then we don't have to call the clear_opt again to clear the NODISCARD flag. Though this makes no difference, that the call is redundant has been pointed out several times so we better remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:17 +02:00
Qu Wenruo	4871c33baf	btrfs: open_ctree() error handling cleanup Currently open_ctree() still uses two variables for error handling, err and ret. This can be confusing and missing some errors and does not conform to current coding style. This patch will fix the problems by: - Use only ret for error handling - Add proper ret assignment Originally we rely on the default value (-EINVAL) of err to handle errors, but that doesn't really reflects the error. This will change it use the correct error number for the following call sites: * subpage_info allocation * btrfs_free_extra_devids() * btrfs_check_rw_degradable() * cleaner_kthread allocation * transaction_kthread allocation - Add an extra ASSERT() To make sure we error out instead of returning 0. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:16 +02:00
Christoph Hellwig	dcb2137c84	btrfs: move all btree inode initialization into btrfs_init_btree_inode Move the remaining code that deals with initializing the btree inode into btrfs_init_btree_inode instead of splitting it between that helpers and its only caller. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	6989627db0	btrfs: drop root refs properly when orphan cleanup fails When we mount the file system we do something like this: while (1) { lookup fs roots; for (i = 0; i < num_roots; i++) { ret = btrfs_orphan_cleanup(roots[i]); if (ret) break; btrfs_put_root(roots[i]); } } for (; i < num_roots; i++) btrfs_put_root(roots[i]); As you can see if we break in that inner loop we just go back to the outer loop and lose the fact that we have to drop references on the remaining roots we looked up. Fix this by making an out label and jumping to that on error so we don't leak a reference to the roots we looked up. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Christoph Hellwig	68d99ab0e9	btrfs: fix fast csum implementation detection The BTRFS_FS_CSUM_IMPL_FAST flag is currently set whenever a non-generic crc32c is detected, which is the incorrect check if the file system uses a different checksumming algorithm. Refactor the code to only check this if crc32c is actually used. Note that in an ideal world the information if an algorithm is hardware accelerated or not should be provided by the crypto API instead, but that's left for another day. CC: stable@vger.kernel.org # 5.4.x: c8a5f8ca9a9c: btrfs: print checksum type and implementation at mount time CC: stable@vger.kernel.org # 5.4.x Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-06 16:34:13 +02:00
Josef Bacik	98c8d683c2	btrfs: combine btrfs_clear_buffer_dirty and clear_extent_buffer_dirty btrfs_clear_buffer_dirty just does the test_clear_bit() and then calls clear_extent_buffer_dirty and does the dirty metadata accounting. Combine this into clear_extent_buffer_dirty and make the result btrfs_clear_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:54 +01:00
Josef Bacik	190a83391b	btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirty btrfs_clean_tree_block is a misnomer, it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it more clear it belongs with it's setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:53 +01:00
Josef Bacik	c4e54a6571	btrfs: replace clearing extent buffer dirty bit with btrfs_clean_block Now that we're passing in the trans into btrfs_clean_tree_block, we can easily roll in the handling of the !trans case and replace all occurrences of if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) clear_extent_buffer_dirty(eb); with btrfs_tree_lock(eb); btrfs_clean_tree_block(eb); btrfs_tree_unlock(eb); We need the lock because if we are actually dirty we need to make sure we aren't racing with anything that's starting writeout currently. This also makes sure that we're accounting fs_info->dirty_metadata_bytes appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:53 +01:00
Josef Bacik	ed25dab3a0	btrfs: add trans argument to btrfs_clean_tree_block We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:53 +01:00
Christoph Hellwig	35a8d7da3c	btrfs: remove now spurious bio submission helpers Call btrfs_submit_bio and btrfs_submit_compressed_read directly from submit_one_bio now that all additional functionality has moved into btrfs_submit_bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:53 +01:00
Christoph Hellwig	285599b6fe	btrfs: remove the fs_info argument to btrfs_submit_bio btrfs_submit_bio can derive it trivially from bbio->inode, so stop bothering in the callers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:53 +01:00
Christoph Hellwig	542e300e83	btrfs: support cloned bios in btree_csum_one_bio To allow splitting bios in btrfs_submit_bio, btree_csum_one_bio needs to be able to handle cloned bios. As btree_csum_one_bio is always called before handing the bio to the block layer that is trivially done by using bio_for_each_segment instead of bio_for_each_segment_all. Also switch the function to take a btrfs_bio and use that to derive the fs_info. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Christoph Hellwig	f8a53bb58e	btrfs: handle checksum generation in the storage layer Instead of letting the callers of btrfs_submit_bio deal with checksumming the (meta)data in the bio and making decisions on when to offload the checksumming to the bio, leave that to btrfs_submit_bio. Do do so the existing btrfs_submit_bio function is split into an upper and a lower half, so that the lower half can be offloaded to a workqueue. Note that this changes the behavior for direct writes to raid56 volumes so that async checksum offloading is not skipped when more I/O is expected. This runs counter to the argument explaining why it was done, although I can't measure any affects of the change. Commits later in this series will make sure the entire direct writes is offloaded to the workqueue at once and thus make sure it is sent to the raid56 code from a single thread. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Christoph Hellwig	f8c44673e5	btrfs: simplify the btrfs_csum_one_bio calling convention To prepare for further bio submission changes btrfs_csum_one_bio should be able to take all it's arguments from the btrfs_bio structure. It can always use the bbio->inode already, and once the compression code is updated to set ->file_offset that one can be used unconditionally as well instead of looking at the page mapping now that btrfs doesn't allow ordered extents to span discontiguous data ranges. The only slightly tricky bit is the one_ordered flag set by the compressed writes. Replace that one with the driver private bio flag, which gets cleared before the bio is handed off to the block layer so that we don't get in the way of driver use. Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused. But that whole mechanism will be removed in its current form in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Christoph Hellwig	deb6216fa0	btrfs: open code the submit_bio_start helpers The submit helpers are now trivial and can be called directly. Note that btree_csum_one_bio has to be moved up in the file a bit to avoid a forward declaration. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Christoph Hellwig	295fe46ff1	btrfs: remove struct btrfs_bio::is_metadata flag This flag is unused now, so remove it. Re-expand the mirror_num field to 8 bits, and move it to the I/O completion internal section of the structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:52 +01:00
Colin Ian King	67da05b3f2	btrfs: fix spelling mistakes found using codespell There quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:38:50 +01:00
Qu Wenruo	b7625f461d	btrfs: sysfs: update fs features directory asynchronously [BUG] Since the introduction of per-fs feature sysfs interface (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never updated. Thus for the following case, that directory will not show the new features like RAID56: # mkfs.btrfs -f $dev1 $dev2 $dev3 # mount $dev1 $mnt # btrfs balance start -f -mconvert=raid5 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes skinny_metadata While after unmount and mount, we got the correct features: # umount $mnt # mount $dev1 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes raid56 skinny_metadata [CAUSE] Because we never really try to update the content of per-fs features/ directory. We had an attempt to update the features directory dynamically in commit `14e46e0495` ("btrfs: synchronize incompat feature bits with sysfs files"), but unfortunately it get reverted in commit `e410e34fad` ("Revert "btrfs: synchronize incompat feature bits with sysfs files""). The problem in the original patch is, in the context of btrfs_create_chunk(), we can not afford to update the sysfs group. The exported but never utilized function, btrfs_sysfs_feature_update() is the leftover of such attempt. As even if we go sysfs_update_group(), new files will need extra memory allocation, and we have no way to specify the sysfs update to go GFP_NOFS. [FIX] This patch will address the old problem by doing asynchronous sysfs update in the cleaner thread. This involves the following changes: - Make __btrfs_(set\|clear)_fs_(incompat\|compat_ro) helpers to set BTRFS_FS_FEATURE_CHANGED flag when needed - Update btrfs_sysfs_feature_update() to use sysfs_update_group() And drop unnecessary arguments. - Call btrfs_sysfs_feature_update() in cleaner_kthread If we have the BTRFS_FS_FEATURE_CHANGED flag set. - Wake up cleaner_kthread in btrfs_commit_transaction if we have BTRFS_FS_FEATURE_CHANGED flag By this, all the previously dangerous call sites like btrfs_create_chunk() need no new changes, as above helpers would have already set the BTRFS_FS_FEATURE_CHANGED flag. The real work happens at cleaner_kthread, thus we pay the cost of delaying the update to sysfs directory, but the delayed time should be small enough that end user can not distinguish though it might get delayed if the cleaner thread is busy with removing subvolumes or defrag. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Josef Bacik	a6ca692ec2	btrfs: fix uninitialized variable warning in run_one_async_start With -Wmaybe-uninitialized compiler complains about ret being possibly uninitialized, which isn't possible as the WQ_ constants are set only from our code, however we can handle the default case and get rid of the warning. The value is set to BLK_STS_IOERR so it does not issue any IO and could be potentially detected, but this is basically a "cannot happen" error. To catch any problems during development use the assert. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ set the error in default: ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Filipe Manana	16199ad9eb	btrfs: do not abort transaction on failure to write log tree when syncing log When syncing the log, if we fail to write log tree extent buffers, we mark the log for a full commit and abort the transaction. However we don't need to abort the transaction, all we really need to do is to make sure no one can commit a superblock pointing to new log tree roots. Just because we got a failure writing extent buffers for a log tree, it does not mean we will also fail to do a transaction commit. One particular case is if due to a bug somewhere, when writing log tree extent buffers, the tree checker detects some corruption and the writeout fails because of that. Aborting the transaction can be very disruptive for a user, specially if the issue happened on a root filesystem. One example is the scenario in the Link tag below, where an isolated corruption on log tree leaves was causing transaction aborts when syncing the log. Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-01-12 15:43:30 +01:00
Qu Wenruo	2ba48b2004	btrfs: fix compat_ro checks against remount [BUG] Even with commit `81d5d61454` ("btrfs: enhance unsupported compat RO flags handling"), btrfs can still mount a fs with unsupported compat_ro flags read-only, then remount it RW: # btrfs ins dump-super /dev/loop0 \| grep compat_ro_flags -A 3 compat_ro_flags 0x403 ( FREE_SPACE_TREE \| FREE_SPACE_TREE_VALID \| unknown flag: 0x400 ) # mount /dev/loop0 /mnt/btrfs mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call. ^^^ RW mount failed as expected ^^^ # dmesg -t \| tail -n5 loop0: detected capacity change from 0 to 1048576 BTRFS: device fsid cb5b82f5-0fdd-4d81-9b4b-78533c324afa devid 1 transid 7 /dev/loop0 scanned by mount (1146) BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm BTRFS info (device loop0): using free space tree BTRFS error (device loop0): cannot mount read-write because of unknown compat_ro features (0x403) BTRFS error (device loop0): open_ctree failed # mount /dev/loop0 -o ro /mnt/btrfs # mount -o remount,rw /mnt/btrfs ^^^ RW remount succeeded unexpectedly ^^^ [CAUSE] Currently we use btrfs_check_features() to check compat_ro flags against our current mount flags. That function get reused between open_ctree() and btrfs_remount(). But for btrfs_remount(), the super block we passed in still has the old mount flags, thus btrfs_check_features() still believes we're mounting read-only. [FIX] Replace the existing @sb argument with @is_rw_mount. As originally we only use @sb to determine if the mount is RW. Now it's callers' responsibility to determine if the mount is RW, and since there are only two callers, the check is pretty simple: - caller in open_ctree() Just pass !sb_rdonly(). - caller in btrfs_remount() Pass !(*flags & SB_RDONLY), as our check should be against the new flags. Now we can correctly reject the RW remount: # mount /dev/loop0 -o ro /mnt/btrfs # mount -o remount,rw /mnt/btrfs mount: /mnt/btrfs: mount point not mounted or bad option. dmesg(1) may have more information after failed mount system call. # dmesg -t \| tail -n 1 BTRFS error (device loop0: state M): cannot mount read-write because of unknown compat_ro features (0x403) Reported-by: Chung-Chiang Cheng <shepjeng@gmail.com> Fixes: `81d5d61454` ("btrfs: enhance unsupported compat RO flags handling") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-01-03 16:22:13 +01:00
Qu Wenruo	77177ed17d	btrfs: add error message for metadata level mismatch From a recent regression report, we found that after commit `947a629988` ("btrfs: move tree block parentness check into validate_extent_buffer()") if we have a level mismatch (false alert though), there is no error message at all. This makes later debugging harder. This patch will add the proper error message for such case. Link: https://lore.kernel.org/linux-btrfs/CABXGCsNzVxo4iq-tJSGm_kO1UggHXgq6CdcHDL=z5FL4njYXSQ@mail.gmail.com/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-01-03 15:53:18 +01:00
Christoph Hellwig	bacf60e515	btrfs: move repair_io_failure to bio.c repair_io_failure ties directly into all the glory low-level details of mapping a bio with a logic address to the actual physical location. Move it right below btrfs_submit_bio to keep all the related logic together. Also move btrfs_repair_eb_io_failure to its caller in disk-io.c now that repair_io_failure is available in a header. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Christoph Hellwig	103c19723c	btrfs: split the bio submission path into a separate file The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Qu Wenruo	cb3e217bdb	btrfs: use btrfs_dev_name() helper to handle missing devices better [BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:57 +01:00
Qu Wenruo	947a629988	btrfs: move tree block parentness check into validate_extent_buffer() [BACKGROUND] Although both btrfs metadata and data has their read time verification done at endio time (btrfs_validate_metadata_buffer() and btrfs_verify_data_csum()), metadata has extra verification, mostly parentness check including first key/transid/owner_root/level, done at read_tree_block() and btrfs_read_extent_buffer(). On the other hand, all the data verification is done at endio context. [ENHANCEMENT] This patch will make a new union in btrfs_bio, taking the space of the old data checksums, thus it will not increase the memory usage. With that extra btrfs_tree_parent_check inside btrfs_bio, we can just pass the check parameter into read_extent_buffer_pages(), and before submitting the bio, we can copy the check structure into btrfs_bio. And finally at endio time, we can grab btrfs_bio::parent_check and pass it to validate_extent_buffer(), to move the remaining checks into it. This brings the following benefits: - Much simpler btrfs_read_extent_buffer() Now it only needs to iterate through all mirrors. - Simpler read-time transid check Previously we go verify_parent_transid() after reading out the extent buffer. Now the transid check is done inside the endio function, no other code can modify the content. Thus no need to use the extent lock anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:56 +01:00
Qu Wenruo	789d6a3a87	btrfs: concentrate all tree block parentness check parameters into one structure There are several different tree block parentness check parameters used across several helpers: - level Mandatory - transid Under most cases it's mandatory, but there are several backref cases which skips this check. - owner_root - first_key Utilized by most top-down tree search routine. Otherwise can be skipped. Those four members are not always mandatory checks, and some of them are the same u64, which means if some arguments got swapped compiler will not catch it. Furthermore if we're going to further expand the parentness check, we need to modify quite some helpers just to add one more parameter. This patch will concentrate all these members into a structure called btrfs_tree_parent_check, and pass that structure for the following helpers: - btrfs_read_extent_buffer() - read_tree_block() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:56 +01:00
David Sterba	35da5a7ede	btrfs: drop private_data parameter from extent_io_tree_init All callers except one pass NULL, so the parameter can be dropped and the inode::io_tree initialization can be open coded. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:54 +01:00
David Sterba	644094fd28	btrfs: pass btrfs_inode to btrfs_submit_metadata_bio The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:52 +01:00
David Sterba	5fcdadc270	btrfs: pass btrfs_inode to btrfs_wq_submit_bio The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:52 +01:00
David Sterba	bfa1706682	btrfs: pass btrfs_inode to btrfs_submit_bio_start_direct_io The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:52 +01:00
David Sterba	882681ac98	btrfs: pass btrfs_inode to btrfs_submit_bio_start The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:52 +01:00
David Sterba	da67daab8d	btrfs: switch async_submit_bio::inode to btrfs_inode The async bio submit is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	ad65ecf30b	btrfs: simplify btree_submit_bio_start and btrfs_submit_bio_start parameters After previous patches the unused parameters can be removed from btree_submit_bio_start and btrfs_submit_bio_start as they don't need to conform to the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
David Sterba	ab2072b292	btrfs: change how submit bio callback is passed to btrfs_wq_submit_bio There's a callback function parameter for btrfs_wq_submit_bio that can be one of: metadata, buffered data, direct io data. The callback abstraction is unnecessary as we have all functions available. Replace the parameter with a command that leads to a direct call in run_one_async_start. The called functions can be then simplified and we can also remove the extent_submit_bio_start_t typedef. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:51 +01:00
Qu Wenruo	1a1a285139	btrfs: remove the unused endio_raid56_workers and btrfs_raid_bio::end_io_work Since we have switched all raid56 workload to submit-and-wait method, there is no use for btrfs_fs_info::endio_raid56_workers workqueue and btrfs_raid_bio::end_io_work. Remove them to save some memory. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:49 +01:00
Josef Bacik	c03b22076b	btrfs: move super prototypes into super.h Move these out of ctree.h into super.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	2fc6822c99	btrfs: move scrub prototypes into scrub.h Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	677074792a	btrfs: move relocation prototypes into relocation.h Move these out of ctree.h into relocation.h to cut down on code in ctree.h Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:47 +01:00
Josef Bacik	c7a03b524d	btrfs: move uuid tree prototypes to uuid-tree.h Move these out of ctree.h into uuid-tree.h to cut down on the code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	59b818e064	btrfs: move defrag related prototypes to their own header Now that the defrag code is all in one file, create a defrag.h and move all the defrag related prototypes and helper out of ctree.h and into defrag.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:46 +01:00
Josef Bacik	45c40c8f95	btrfs: move root tree prototypes to their own header Move all the root-tree.c prototypes to root-tree.h, and then update all the necessary files to include the new header. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	a0231804af	btrfs: move extent-tree helpers into their own header file Move all the extent tree related prototypes to extent-tree.h out of ctree.h, and then go include it everywhere needed so everything compiles. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:44 +01:00
Josef Bacik	07e81dc944	btrfs: move accessor helpers into accessors.h This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	55e5cfd36d	btrfs: remove fs_info::pending_changes and related code Now that we're not using this code anywhere we can remove it as well as the member from fs_info. We don't have any mount options or on/off features that would utilize the pending infrastructure, the last one was inode_cache. There was a patchset [1] to enable some features from sysfs that would break things if it would be set immediately. In case we'll need that kind of logic again the patch can be reverted, but for the current use it can be replaced by the single state bit to do the commit. [1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:42 +01:00
Josef Bacik	c7f13d428e	btrfs: move fs wide helpers out of ctree.h We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
David Sterba	63a7cb1307	btrfs: auto enable discard=async when possible There's a request to automatically enable async discard for capable devices. We can do that, the async mode is designed to wait for larger freed extents and is not intrusive, with limits to iops, kbps or latency. The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . The automatic selection is done if there's at least one discard capable device in the filesystem (not capable devices are skipped). Mounting with any other discard option will honor that option, notably mounting with nodiscard will keep it disabled. Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:41 +01:00
Josef Bacik	51129b33d3	btrfs: move btrfs_get_block_group helper out of disk-io.h This inline helper calls btrfs_fs_compat_ro(), which is defined in another header. To avoid weird header dependency problems move this helper into disk-io.c with the rest of the global root helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:36 +01:00
Peng Hao	c1b078545e	btrfs: simplify cleanup after error in btrfs_create_tree Since leaf is already NULL, and no other branch will go to fail_unlock, the fail_unlock label is useless and can be removed Signed-off-by: Peng Hao <flyingpeng@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-12-05 18:00:36 +01:00
Johannes Thumshirn	a8d1b1647b	btrfs: zoned: initialize device's zone info for seeding When performing seeding on a zoned filesystem it is necessary to initialize each zoned device's btrfs_zoned_device_info structure, otherwise mounting the filesystem will cause a NULL pointer dereference. This was uncovered by fstests' testcase btrfs/163. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-11-07 14:35:24 +01:00

1 2 3 4 5 ...

1750 Commits