mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2025-07-09 07:15:21 +02:00
v6.5/base
676 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
7059c65831 |
btrfs: simplify adding and replacing references during log replay
During log replay, when adding/replacing inode references, there are two special cases that have special code for them: 1) When we have an inode with two or more hardlinks in the same directory, therefore two or more names encoded in the same inode reference item, and one of the hard links gets renamed to the old name of another hard link - that is, the index number for a name changes. This was added in commit |
||
![]() |
30b80f3ce0 |
btrfs: use delayed items when logging a directory
When logging a directory we start by flushing all its delayed items. That results in adding dir index items to the subvolume btree, for new dentries, and removing dir index items from the subvolume btree for any dentries that were deleted. This makes it straightforward to log a directory simply by iterating over all the modified subvolume btree leaves, especially when we used to log both dir index keys and dir item keys (before commit |
||
![]() |
5557a069f3 |
btrfs: skip logging parent dir when conflicting inode is not a dir
When we find a conflicting inode (an inode that had the same name and parent directory as the inode we are logging now) that was deleted in the current transaction, we always end up logging its parent directory. This is to deal with the case where the conflicting inode corresponds to a deleted subvolume/snapshot or a directory that had subvolumes/snapshots (or some subdirectory inside it had subvolumes/snapshots, etc), because we can't deal with dropping subvolumes/snapshots during log replay. So if we log the parent directory, and if we are dealing with these special cases, then we fallback to a transaction commit when logging the parent, because its last_unlink_trans will match the current transaction (which gets set and propagated when a subvolume/snapshot is deleted). This change skips the logging of the parent directory when the conflicting inode is not a directory (or a subvolume/snapshot). This is ok because in this case logging the current inode is enough to trigger an unlink of the conflicting inode during log replay. So for a case like this: $ mkdir /mnt/dir $ echo -n "first foo data" > /mnt/dir/foo $ sync $ rm -f /mnt/dir/foo $ echo -n "second foo data" > /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/foo We avoid logging parent directory "dir" when logging the new file "foo". In other cases it avoids falling back to a transaction commit, when the parent directory has a last_unlink_trans value that matches the current transaction, due to moving a file from it to some other directory. This is a case that happens frequently with dbench for example, where a new file that has the name/parent of another file that was deleted in the current transaction, is fsynced. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
e09d94c9e4 |
btrfs: log conflicting inodes without holding log mutex of the initial inode
When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflict inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes - if we go over the limit, simply fallback to a transaction commit. It is possible to have a very long list of conflicting inodes to be intentionally created by a user if he/she creates a very long succession of renames like this: (...) rename E to F rename D to E rename C to D rename B to C rename A to B touch A (create a new file named A) fsync A If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause other secondary effects like for example blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). However such cases are very uncommon to happen in practice, nevertheless it's better to be prepared for them and avoid chaos. Such long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
f6d86dbeba |
btrfs: move log_new_dir_dentries() above btrfs_log_inode()
The static function log_new_dir_dentries() is currently defined below btrfs_log_inode(), but in an upcoming patch a new function is introduced that is called by btrfs_log_inode() and this new function needs to call log_new_dir_dentries(). So move log_new_dir_dentries() to a location between btrfs_log_inode() and need_log_inode() (the later is called by log_new_dir_dentries()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
a375102426 |
btrfs: move need_log_inode() to above log_conflicting_inodes()
The static function need_log_inode() is defined below btrfs_log_inode() and log_conflicting_inodes(), but in the next patches in the series we will need to call need_log_inode() in a couple new functions that will be used by btrfs_log_inode(). So move its definition to a location above log_conflicting_inodes(). Also make its arguments 'const', since they are not supposed to be modified. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
193df62457 |
btrfs: search for last logged dir index if it's not cached in the inode
The key offset of the last dir index item that was logged is stored in the inode's last_dir_index_offset field. However that field is not persisted in the inode item or elsewhere, so if the inode gets evicted and reloaded, it gets a value of (u64)-1, so that when we are logging dir index items we check if they were logged before, to avoid attempts to insert duplicated keys and fallback to a transaction commit. Improve on this by searching for the last dir index that was logged when we start logging a directory if the inode's last_dir_index_offset is not set (has a value of (u64)-1) and it was logged before. This avoids checking if each dir index item we find was already logged before, and simplifies the logging of dir index items (process_dir_items_leaf()). This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
009d9bea49 |
btrfs: avoid memory allocation at log_new_dir_dentries() for common case
At log_new_dir_dentries() we always start by allocating a list element for the starting inode and then do a while loop with the condition being a list emptiness check. This however is not needed, we can avoid allocating this initial list element and then just check for the list emptiness at the end of the loop's body. So just do that to save one memory allocation from the kmalloc-32 slab. This allows for not doing any memory allocation when we don't have any subdirectory to log, which is a very common case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
4008481343 |
btrfs: free list element sooner at log_new_dir_dentries()
At log_new_dir_dentries(), there's no need to keep the current list element allocated while processing the leaves with directory items for the current directory, and while logging other inodes. Plus in case we find a subdirectory, we also end up allocating a new list element while the current one is still allocated, temporarily using more memory than necessary. So free the current list element early on, before processing leaves. Also make the removal and release of all list elements in case of an error more simple by eliminating the label and goto, adding an explicit loop to release all list elements in case an error happens. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
b96c552b99 |
btrfs: update stale comment for log_new_dir_dentries()
The comment refers to the function log_dir_items() in order to check why
the inodes of new directory entries need to be logged, but the relevant
comments are no longer at log_dir_items(), they were moved to the function
process_dir_items_leaf() in commit
|
||
![]() |
8786a6d740 |
btrfs: remove the root argument from log_new_dir_dentries()
There's no point in passing a root argument to log_new_dir_dentries() because it always corresponds to the root of the given inode. So remove it and extract the root from the given inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
04fc7d5123 |
btrfs: don't drop dir index range items when logging a directory
When logging a directory that was previously logged in the current
transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
type). This is because we will process all leaves in the subvolume's tree
that were changed in the current transaction and then add range items for
covering new dir index items and deleted dir index items, which could
cover now a larger range than before.
We used to fail if we tried to insert a range item key that already
exists, so we dropped all range items to avoid failing. However nowadays,
since commit
|
||
![]() |
d1f68ba069 |
btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent()
btrfs_insert_file_extent() is only ever used to insert holes, so rename it and remove the redundant parameters. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
769030e118 |
btrfs: fix warning during log replay when bumping inode link count
During log replay, at add_link(), we may increment the link count of another inode that has a reference that conflicts with a new reference for the inode currently being processed. During log replay, at add_link(), we may drop (unlink) a reference from some inode in the subvolume tree if that reference conflicts with a new reference found in the log for the inode we are currently processing. After the unlink, If the link count has decreased from 1 to 0, then we increment the link count to prevent the inode from being deleted if it's evicted by an iput() call, because we may have references to add to that inode later on (and we will fixup its link count later during log replay). However incrementing the link count from 0 to 1 triggers a warning: $ cat fs/inode.c (...) void inc_nlink(struct inode *inode) { if (unlikely(inode->i_nlink == 0)) { WARN_ON(!(inode->i_state & I_LINKABLE)); atomic_long_dec(&inode->i_sb->s_remove_count); } (...) The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's never set during log replay. Most of the time, the warning isn't triggered even if we dropped the last reference of the conflicting inode, and this is because: 1) The conflicting inode was previously marked for fixup, through a call to link_to_fixup_dir(), which increments the inode's link count; 2) And the last iput() on the inode has not triggered eviction of the inode, nor was eviction triggered after the iput(). So at add_link(), even if we unlink the last reference of the inode, its link count ends up being 1 and not 0. So this means that if eviction is triggered after link_to_fixup_dir() is called, at add_link() we will read the inode back from the subvolume tree and have it with a correct link count, matching the number of references it has on the subvolume tree. So if when we are at add_link() the inode has exactly one reference only, its link count is 1, and after the unlink its link count becomes 0. So fix this by using set_nlink() instead of inc_nlink(), as the former accepts a transition from 0 to 1 and it's what we use in other similar contexts (like at link_to_fixup_dir(). Also make add_inode_ref() use set_nlink() instead of inc_nlink() to bump the link count from 0 to 1. The warning is actually harmless, but it may scare users. Josef also ran into it recently. CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
7a6b75b799 |
btrfs: fix lost error handling when looking up extended ref on log replay
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return an error it gets from the extended
reference lookup.
Fixes:
|
||
![]() |
723df2bcc9 |
btrfs: join running log transaction when logging new name
When logging a new name, in case of a rename, we pin the log before changing it. We then either delete a directory entry from the log or insert a key range item to mark the old name for deletion on log replay. However when doing one of those log changes we may have another task that started writing out the log (at btrfs_sync_log()) and it started before we pinned the log root. So we may end up changing a log tree while its writeback is being started by another task syncing the log. This can lead to inconsistencies in a log tree and other unexpected results during log replay, because we can get some committed node pointing to a node/leaf that ends up not getting written to disk before the next log commit. The problem, conceptually, started to happen in commit |
||
![]() |
f31f09f6be |
btrfs: tree-log: make the return value for log syncing consistent
Currently we will return 1 or -EAGAIN if we decide we need to commit the transaction rather than sync the log. In practice this doesn't really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to commit the transaction. However this makes it hard to figure out what the correct thing to do is. Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the places where we want to force the transaction to be committed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
143823cf4d |
btrfs: fix typos in comments
Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8aa1e49ea1 |
btrfs: remove unnecessary check of iput argument
iput() already handles NULL and non-NULL parameter, so it is not needed to check that. This unifies all iput calls. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
6a2e9dc46f |
btrfs: remove trivial wrapper btrfs_read_buffer()
The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
750ee45490 |
btrfs: fix assertion failure when logging directory key range item
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
a directory, we don't expect the insertion to fail with -EEXIST, because
we are holding the directory's log_mutex and we have dropped all existing
BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
the directory. However it's possible that during the logging we attempt
to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
happen we need to race with insertions of items from other inodes in the
subvolume's tree while we are logging a directory. Here's how this can
happen:
1) We are logging a directory with inode number 1000 that has its items
spread across 3 leaves in the subvolume's tree:
leaf A - has index keys from the range 2 to 20 for example. The last
item in the leaf corresponds to a dir item for index number 20. All
these dir items were created in a past transaction.
leaf B - has index keys from the range 22 to 100 for example. It has
no keys from other inodes, all its keys are dir index keys for our
directory inode number 1000. Its first key is for the dir item with
a sequence number of 22. All these dir items were also created in a
past transaction.
leaf C - has index keys for our directory for the range 101 to 120 for
example. This leaf also has items from other inodes, and its first
item corresponds to the dir item for index number 101 for our directory
with inode number 1000;
2) When we finish processing the items from leaf A at log_dir_items(),
we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
offset of 21, meaning the log is authoritative for the index range
from 21 to 21 (a single sequence number). At this point leaf B was
not yet modified in the current transaction;
3) When we return from log_dir_items() we have released our read lock on
leaf B, and have set *last_offset_ret to 21 (index number of the first
item on leaf B minus 1);
4) Some other task inserts an item for other inode (inode number 1001 for
example) into leaf C. That resulted in pushing some items from leaf C
into leaf B, in order to make room for the new item, so now leaf B
has dir index keys for the sequence number range from 22 to 102 and
leaf C has the dir items for the sequence number range 103 to 120;
5) At log_directory_changes() we call log_dir_items() again, passing it
a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
of leaf B, since leaf B was modified in the current transaction.
We have also initialized 'last_old_dentry_offset' to 20 after calling
btrfs_previous_item() at log_dir_items(), as it left us at the last
item of leaf A, which refers to the dir item with sequence number 20;
6) We then call process_dir_items_leaf() to process the dir items of
leaf B, and when we process the first item, corresponding to slot 0,
sequence number 22, we notice the dir item was created in a past
transaction and its sequence number is greater than the value of
*last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
error from insert_dir_log_key(), as we have already inserted that
key at step 2, triggering the assertion at process_dir_items_leaf().
The trace produced in dmesg is like the following:
assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
[198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
[198255.989465][ T7460] Code: 8b 4c 89 (...)
[198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
[198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
[198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
[198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
[198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
[198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
[198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
[198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
[198256.015239][ T7460] Call Trace:
[198256.015674][ T7460] <TASK>
[198256.016313][ T7460] log_dir_items.cold+0x16/0x2c
[198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0
[198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270
[198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.031114][ T7460] ? lock_acquired+0xbe/0x660
[198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.034386][ T7460] ? lock_release+0xcf/0x8a0
[198256.036152][ T7460] log_directory_changes+0xf9/0x170
[198256.036993][ T7460] ? log_dir_items+0xba0/0xba0
[198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0
[198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0
[198256.041294][ T7460] ? log_directory_changes+0x170/0x170
[198256.042864][ T7460] ? btrfs_attach_transaction_barrier+0x60/0x60
[198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.046568][ T7460] ? lock_release+0xcf/0x8a0
[198256.047504][ T7460] ? lock_downgrade+0x420/0x420
[198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0
[198256.049747][ T7460] ? lock_downgrade+0x420/0x420
[198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.051618][ T7460] ? __might_resched+0x128/0x1c0
[198256.052511][ T7460] ? __might_sleep+0x66/0xc0
[198256.053442][ T7460] ? __kasan_check_read+0x11/0x20
[198256.054251][ T7460] ? iget5_locked+0xbd/0x150
[198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110
[198256.055929][ T7460] ? btrfs_iget+0xc7/0x150
[198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0
[198256.057502][ T7460] ? free_extent_buffer+0x13/0x20
[198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0
[198256.059137][ T7460] ? log_directory_changes+0x170/0x170
[198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.061905][ T7460] ? lock_contended+0x770/0x770
[198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750
[198256.063582][ T7460] ? lock_downgrade+0x420/0x420
[198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0
[198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0
[198256.066654][ T7460] ? stack_trace_save+0x94/0xc0
[198256.068008][ T7460] ? __kasan_check_write+0x14/0x20
[198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430
[198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0
[198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20
[198256.075345][ T7460] ? lock_downgrade+0x420/0x420
[198256.076142][ T7460] ? lock_contended+0x770/0x770
[198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40
[198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750
[198256.081991][ T7460] ? wait_current_trans+0xc8/0x240
[198256.083320][ T7460] ? lock_downgrade+0x420/0x420
[198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70
[198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.087544][ T7460] ? lock_release+0xcf/0x8a0
[198256.088305][ T7460] ? lock_downgrade+0x420/0x420
[198256.090375][ T7460] ? dget_parent+0x8e/0x300
[198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.094918][ T7460] ? lock_downgrade+0x420/0x420
[198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.101822][ T7460] ? dget_parent+0xb7/0x300
[198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60
[198256.105052][ T7460] btrfs_sync_file+0x629/0xa40
[198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.109655][ T7460] ? __fget_files+0x161/0x230
[198256.110760][ T7460] vfs_fsync_range+0x6d/0x110
[198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.113556][ T7460] __x64_sys_fsync+0x45/0x70
[198256.114323][ T7460] do_syscall_64+0x5c/0xc0
[198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50
[198256.116030][ T7460] ? do_syscall_64+0x69/0xc0
[198256.116768][ T7460] ? do_syscall_64+0x69/0xc0
[198256.117555][ T7460] ? do_syscall_64+0x69/0xc0
[198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0
[198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20
[198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae
[198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
[198256.122067][ T7460] Code: 0f 05 48 (...)
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460] </TASK>
Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
it update the item with an end offset corresponding to the maximum between
the previously logged end offset and the new requested end offset. The end
offsets may be different due to dir index key deletions that happened as
part of unlink operations while we are logging a directory (triggered when
fsyncing some other inode parented by the directory) or during renames
which always attempt to log a single dir index deletion.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes:
|
||
![]() |
d0e64a981f |
btrfs: always log symlinks in full mode
On Linux, empty symlinks are invalid, and attempting to create one with the system call symlink(2) results in an -ENOENT error and this is explicitly documented in the man page. If we rename a symlink that was created in the current transaction and its parent directory was logged before, we actually end up logging the symlink without logging its content, which is stored in an inline extent. That means that after a power failure we can end up with an empty symlink, having no content and an i_size of 0 bytes. It can be easily reproduced like this: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ sync # Create a file inside the directory and fsync the directory. $ touch /mnt/testdir/foo $ xfs_io -c "fsync" /mnt/testdir # Create a symlink inside the directory and then rename the symlink. $ ln -s /mnt/testdir/foo /mnt/testdir/bar $ mv /mnt/testdir/bar /mnt/testdir/baz # Now fsync again the directory, this persist the log tree. $ xfs_io -c "fsync" /mnt/testdir <power failure> $ mount /dev/sdc /mnt $ stat -c %s /mnt/testdir/baz 0 $ readlink /mnt/testdir/baz $ Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that their content is also logged. A test case for fstests will follow. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
50ff57888d |
btrfs: fix leaked plug after failure syncing log on zoned filesystems
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fallback to a full transaction commit.
So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.
Fixes:
|
||
![]() |
313ab75399 |
btrfs: add and use helper for unlinking inode during log replay
During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
23e3337faf |
btrfs: reset last_reflink_trans after fsyncing inode
When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit |
||
![]() |
96acb3753e |
btrfs: voluntarily relinquish cpu when doing a full fsync
Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
5b7ce5e287 |
btrfs: hold on to less memory when logging checksums during full fsync
When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
7f30c07288 |
btrfs: stop copying old file extents when doing a full fsync
When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit |
||
![]() |
e1f53ed874 |
btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path to the leaf that will hold the file extent item either through a deletion search, via btrfs_drop_extents(), or through an insertion search using btrfs_insert_empty_item(). After that we fill the file extent item's fields one by one directly on the leaf. Instead of doing that, we could prepare the file extent item before obtaining a btree path, and then copy the prepared extent item with a single operation once we get the path. This helps avoid some contention on the log tree, since we are holding write locks for longer than necessary, especially in the case where the path is obtained via btrfs_drop_extents() through a deletion search, which always keeps a write lock on the nodes at levels 1 and 2 (besides the leaf). This change does that, we prepare the file extent item that is going to be inserted before acquiring a path, and then copy it into a leaf using a single copy operation once we get a path. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The following test was run to measure the impact of the whole patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=8 FILE_SIZE=128M RUN_TIME=200 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=1 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=sync filesize=$FILE_SIZE runtime=$RUN_TIME time_based directory=$MNT numjobs=$NUM_JOBS thread EOF echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test ran inside a VM (8 cores, 32G of RAM) with the target disk mapping to a raw NVMe device, and using a non-debug kernel config (Debian's default config). Before the patchset: WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec After the patchset: WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec A 7.8% gain on throughput and +7.0% more IO done in the same period of time (200 seconds). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
d845753170 |
btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop that logs the modified extents, since log_one_extent() returns with the path released. In case the list of extents is empty, the path is already released, so there's no need for that case as well. So just remove that unnecessary btrfs_release_path() call. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
65faced5b9 |
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the return value of the function, named 'ret' and 'err'. In some places we use 'ret' and if gets a non-zero value we assign its value to 'err' and then jump to the 'out' label, while in other places we use 'err' directly without 'ret' as an intermediary. This is inconsistent, error prone and not necessary. So change that to use only the 'ret' variable, making this consistent with most functions in btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
0f8ce49821 |
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was previously logged or not, and if it was, do some update to the logged inode. We used to rely exclusively on the logged_trans field of struct btrfs_inode to determine that, but that was not reliable because the value of that field is not persisted in the inode item, so it's lost when an inode is evicted and loaded back again. That led to several issues in the past, such as not persisting deletions (such as the case fixed by commit |
||
![]() |
259c4b96d7 |
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order to remove an inode reference and a directory entry from the log. These are necessary when __btrfs_unlink_inode() is called from the unlink path, but not necessary when it's called from a rename context, because: 1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the inode reference related to the old name, because later in the rename path we call btrfs_log_new_name(), which will drop all inode references from the log and copy all inode references from the subvolume tree to the log tree. So we are doing one unnecessary btree operation which adds additional latency and lock contention in case there are other tasks accessing the log tree; 2) For the btrfs_del_dir_entries_in_log() call, we are now doing the equivalent at btrfs_log_new_name() since the previous patch in the series, that has the subject "btrfs: avoid logging all directory changes during renames". In fact, having __btrfs_unlink_inode() call this function not only adds additional latency and lock contention due to the extra btree operation, but also can make btrfs_log_new_name() unnecessarily log a range item to track the deletion of the old name, since it has no way to known that the directory entry related to the old name was previously logged and already deleted by __btrfs_unlink_inode() through its call to btrfs_del_dir_entries_in_log(). So skip those calls at __btrfs_unlink_inode() when we are doing a rename. Skipping them also allows us now to reduce the duration of time we are pinning a log transaction during renames, which is always beneficial as it's not delaying so much other tasks trying to sync the log tree, in particular we end up not holding the log transaction pinned while adding the new name (adding inode ref, directory entry, etc). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible Just like the previous patch in the series, "btrfs: avoid logging all directory changes during renames", the following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box a using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27399 ms NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%) NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%) NUM_FILES=5000, before patchset: 9241 ms NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%) NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%) NUM_FILES=2000, before patchset: 2550 ms NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%) NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%) NUM_FILES=1000, before patchset: 1088 ms NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%) NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%) The next patch in the series (5/5), also contains dbench results after applying to whole patchset. Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
88d2beec7e |
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory were logged before, we log the new name of the file and then make sure we log the old parent directory, to ensure that after a log replay the old name of the file is deleted and the new name added. The logging of the old parent directory can take some time, because it will scan all leaves modified in the current transaction, check which directory entries were already logged, copy the ones that were not logged before, etc. In this rename context all we need to do is make sure that the old name of the file is deleted on log replay, so instead of triggering a directory log operation, we can just delete the old directory entry from the log if it's there, or in case it isn't there, just log a range item to signal log replay that the old name must be deleted. So change btrfs_log_new_name() to do that. This scenario is actually not uncommon to trigger, and recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete, much more than usual. With strace it could be observed that zypper was spending over 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently and causing high latencies for the rename operations. Taking into account that installation/upgrade of some of these packages needed about a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure, in an efficient way, if an inode was previously logged in the current transaction, so we are pessimistic and assume it was, because in case it was we need to update the logged inode. More details on that in one of the patches in the same series (subject "btrfs: avoid inode logging during rename and link when possible"). Either way, in case the parent directory was logged before, we currently do more work then necessary during a rename, and this change minimizes that amount of work. The following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before this patch: 27399 ms NUM_FILES=10000, after this patch: 9093 ms (-66.8%) NUM_FILES=5000, before this patch: 9241 ms NUM_FILES=5000, after this patch: 4642 ms (-49.8%) NUM_FILES=2000, before this patch: 2550 ms NUM_FILES=2000, after this patch: 1788 ms (-29.9%) NUM_FILES=1000, before this patch: 1088 ms NUM_FILES=1000, after this patch: 905 ms (-16.9%) Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
d5f5bd5465 |
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old name, and its length, of an inode when logging the inode during a rename. So instead of passing the inode to btrfs_log_new_name() pass the dentry, because from the dentry we can get the inode, the name and its length. This will avoid passing 3 new parameters to btrfs_log_new_name() in the next patch - the name, its length and an index number. This way we end up passing only 1 new parameter, the index number. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
839061fe88 |
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of btrfs_del_dir_entries_in_log() into a helper function. This new helper function will be used by another patch in the same series, and serves to avoid having duplicated logic. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
de6bc7f598 |
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit
|
||
![]() |
732d591a5d |
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that was changed in the current transaction and copy all its dir index keys to the log tree. That includes copying dir index keys created in past transactions. This is done mostly for simplicity, as after logging the keys we log an item that specifies the start and end ranges of the keys we logged. That item is then used during log replay to figure out which keys need to be deleted - every key in that range that we find in the subvolume tree and is not in the log tree, needs to be deleted. Now that we log only dir index keys, and not dir item keys anymore, when we remove dentries from a directory (due to unlink and rename operations), we can get entire leaves that we changed only for deleting old dir index keys, or that have few dir index keys that are new - this is due to the fact that the offset for new index keys comes from a monotonically increasing counter. We can avoid logging dir index keys from past transactions, and in order to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key type) when we find gaps between consecutive index keys. This massively reduces the amount of logged metadata when we have deleted directory entries, even if it's a small percentage of the total number of entries. The reduction comes from both less items that are logged and instead of logging many dir index items (struct btrfs_dir_item), which have a size of 30 bytes plus a file name, we typically log just a few range items (struct btrfs_dir_log_item), which take only 8 bytes each. Even if no entries were deleted from a directory and only new entries were added, we typically still get a reduction on the amount of logged metadata, because it's very likely the first leaf that got the new dir index entries also has several old dir index entries. So change the logging logic to not log dir index keys created in past transactions and log a range item for every gap it finds between each pair of consecutive index keys, to ensure deletions are tracked and replayed on log replay. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions The following test was run on a branch without this patchset and on a branch with the first three patches applied: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=1000000 NUM_FILE_DELETES=10000 MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The test was run on a non-debug kernel (Debian's default kernel config), and the results were the following for various values of NUM_FILES and NUM_FILE_DELETES: ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 585 ms after deleting 10000 files ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 34 ms after deleting 10000 files (-94.2%) ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 50 ms after deleting 1000 files ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 7 ms after deleting 1000 files (-86.0%) ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 9 ms after deleting 100 files ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 5 ms after deleting 100 files (-44.4%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
a450a4af74 |
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a previous index key, which normally is at the end of a leaf that was not changed in the current transaction. Then we log that key and set the start of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of that key. This is to ensure that if there were deleted index keys between that key and the first key we are going to log, those deletions are replayed in case we need to replay to the log after a power failure. However we really don't need to log that previous key, we can just set the start of the logged range to that key's offset plus 1. This achieves the same and avoids logging one dir index key. The same logic is performed when we finish logging the index keys of a leaf and we find that the next leaf has index keys and was not changed in the current transaction. We are logging the first key of that next leaf and use its offset as the end of range we log. This is just to ensure that if there were deleted index keys between the last index key we logged and the first key of that next leaf, those index keys are deleted if we end up replaying the log. However that is not necessary, we can avoid logging that first index key of the next leaf and instead set the end of the logged range to match the offset of that index key minus 1. So avoid logging those index keys at the boundaries and adjust the start and end offsets of the logged ranges as described above. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions Performance test results are listed in the changelog of patch 3/4. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
c816d705b9 |
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees, are not used since 2008, more specifically since commit |
||
![]() |
4751dc9962 |
btrfs: add missing run of delayed items after unlink during log replay
During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: |
||
![]() |
d994788743 |
btrfs: fix lost prealloc extents beyond eof after full fsync
When doing a full fsync, if we have prealloc extents beyond (or at) eof, and the leaves that contain them were not modified in the current transaction, we end up not logging them. This results in losing those extents when we replay the log after a power failure, since the inode is truncated to the current value of the logged i_size. Just like for the fast fsync path, we need to always log all prealloc extents starting at or beyond i_size. The fast fsync case was fixed in commit |
||
![]() |
40cdc50987 |
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit |
||
![]() |
8697b8f88e |
btrfs: do not check -EAGAIN when truncating inodes in the log root
We only throttle the btrfs_truncate_inode_items if the root is SHAREABLE, which isn't set on the log root, which means this loop is unnecessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
71d18b5354 |
btrfs: add inode to truncate control
In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
487e81d2a4 |
btrfs: pass the ino via truncate control
In the future we are going to want to truncate inode items without needing to have an btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
5caa490ed8 |
btrfs: control extent reference updates with a control flag for truncate
We've had weird bugs in the past where we forgot to adjust the truncate path to deal with the fact that we can be called by the tree log path. Instead of checking if our root is a LOG_ROOT use a flag on the btrfs_truncate_control to indicate that we don't want to do extent reference updates during this truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
d9ac19c380 |
btrfs: add truncate control struct
I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
26c2c4540d |
btrfs: add an inode-item.h
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
fc28b25e1f |
btrfs: stop accessing ->csum_root directly
We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
3212fa14e7 |
btrfs: drop the _nr from the item helpers
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
ccae4a19c9 |
btrfs: remove no longer needed logic for replaying directory deletes
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items. So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
339d035424 |
btrfs: only copy dir index keys when logging a directory
Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
1b2e5e5c7f |
btrfs: fix missing last dir item offset update when logging directory
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.
However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:
Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel: dir oid 704793 type 1
Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707849 type 2
Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707901 type 2
Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704694 type 2
Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707841 type 1
Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167736 type 1
Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704696 type 1
Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel: dir oid 704704 type 2
Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704738 type 1
Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel: dir oid 2167729 type 1
Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel: dir oid 704745 type 2
Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel: dir oid 2167734 type 1
Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167730 type 1
Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel: dir oid 704688 type 1
Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707892 type 1
Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected
So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes:
|
||
![]() |
f35838a693 |
btrfs: fix memory leak in __add_inode_ref()
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).
We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).
1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 struct btrfs_root *root,
1059 struct btrfs_path *path,
1060 struct btrfs_root *log_root,
1061 struct btrfs_inode *dir,
1062 struct btrfs_inode *inode,
1063 u64 inode_objectid, u64 parent_objectid,
1064 u64 ref_index, char *name, int namelen,
1065 int *search_done)
1066 {
1104 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #1: kmalloc (victim_name-1)
1105 if (!victim_name)
1106 return -ENOMEM;
1112 ret = backref_in_log(log_root, &search_key,
1113 parent_objectid, victim_name,
1114 victim_name_len);
1115 if (ret < 0) {
1116 kfree(victim_name); // #2: kfree (victim_name-1)
1117 return ret;
1118 } else if (!ret) {
1169 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #3: kmalloc (victim_name-2)
1170 if (!victim_name)
1171 return -ENOMEM;
1180 ret = backref_in_log(log_root, &search_key,
1181 parent_objectid, victim_name,
1182 victim_name_len);
1183 if (ret < 0) {
1184 return ret; // #4: missing kfree (victim_name-2)
1185 } else if (!ret) {
1241 return 0;
1242 }
Fixes:
|
||
![]() |
84c2544892 |
btrfs: fix re-dirty process of tree-log nodes
There is a report of a transaction abort of -EAGAIN with the following
script.
#!/bin/sh
for d in sda sdb; do
mkfs.btrfs -d single -m single -f /dev/\${d}
done
mount /dev/sda /mnt/test
mount /dev/sdb /mnt/scratch
for dir in test scratch; do
echo 3 >/proc/sys/vm/drop_caches
fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \
--numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
--group_reporting |& tee /dev/shm/fio.\${dir}
echo 3 >/proc/sys/vm/drop_caches
done
for d in sda sdb; do
umount /dev/\${d}
done
The stack trace is shown in below.
[3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
[3310.968060] BTRFS info (device sda): forced readonly
[3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
[3310.968065] ------------[ cut here ]------------
[3310.968066] BTRFS: Transaction aborted (error -11)
[3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
[3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
[3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
[3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
[3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
[3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
[3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
[3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
[3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
[3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
[3310.968160] PKRU: 55555554
[3310.968161] Call Trace:
[3310.968167] ? dput+0xd4/0x300
[3310.968174] btrfs_sync_file+0x3f1/0x490
[3310.968180] __x64_sys_fsync+0x33/0x60
[3310.968185] do_syscall_64+0x3b/0x90
[3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3310.968194] RIP: 0033:0x7efe6557329b
[3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
[3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
[3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
[3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
[3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
[3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
[3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
The abort occurs because of a write hole while writing out freeing tree
nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
node to ensure btrfs can write the region and does not leave a hole on
write on a zoned device. The current code fails to re-dirty a node
when the tree-log tree's depth is greater or equal to 2. That leads to
a transaction abort with -EAGAIN.
Fix the issue by properly re-dirtying a node on walking up the tree.
Fixes:
|
||
![]() |
d1ed82f355 |
btrfs: remove root argument from check_item_in_log()
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
6d9cc07215 |
btrfs: remove root argument from add_link()
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
4467af8809 |
btrfs: remove root argument from btrfs_unlink_inode()
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
9798ba24cb |
btrfs: remove root argument from drop_one_dir_item()
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
10adb1152d |
btrfs: fix lost error handling when replaying directory deletes
At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
f42c5da6c1 |
btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
In order to make 'real_root' used only in ref-verify it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root' which will contain the root on behalf of which a delayed ref was created and a 'skip_group' parameter which will contain callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8496153945 |
btrfs: add a BTRFS_FS_ERROR helper
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of |
||
![]() |
9a35fc9542 |
btrfs: change error handling for btrfs_delete_*_in_log
Currently we will abort the transaction if we get a random error (like -EIO) while trying to remove the directory entries from the root log during rename. However since these are simply log tree related errors, we can mark the trans as needing a full commit. Then if the error was truly catastrophic we'll hit it during the normal commit and abort as appropriate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
ba51e2a11e |
btrfs: change handle_fs_error in recover_log_trees to aborts
During inspection of the return path for replay I noticed that we don't actually abort the transaction if we get a failure during replay. This isn't a problem necessarily, as we properly return the error and will fail to mount. However we still leave this dangling transaction that could conceivably be committed without thinking there was an error. We were using btrfs_handle_fs_error() here, but that pre-dates the transaction abort code. Simply replace the btrfs_handle_fs_error() calls with transaction aborts, so we still know where exactly things went wrong, and add a few in some other un-handled error cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
da1b811fcd |
btrfs: use single bulk copy operations when logging directories
When logging a directory and inserting a batch of directory items, we are copying the data of each item from a leaf in the fs/subvolume tree to a leaf in a log tree, separately. This is not really needed, since we are copying from a contiguous memory area into another one, so we can use a single copy operation to copy all items at once. This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 3/3. The following test was used to compare performance of a branch without the patchset versus one branch that has the whole patchset applied: $ cat dir-fsync-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=1000 LEAF_SIZE=16K mkfs.btrfs -f -n $LEAF_SIZE $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # Fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT The tests were run on a non-debug kernel (Debian's default kernel config) and were the following: *** with a leaf size of 16K, before patchset *** dir fsync took 8482 ms after adding 1000000 files dir fsync took 166 ms after deleting 1000 files *** with a leaf size of 16K, after patchset *** dir fsync took 8196 ms after adding 1000000 files (-3.4%) dir fsync took 143 ms after deleting 1000 files (-14.9%) *** with a leaf size of 64K, before patchset *** dir fsync took 12851 ms after adding 1000000 files dir fsync took 466 ms after deleting 1000 files *** with a leaf size of 64K, after patchset *** dir fsync took 12287 ms after adding 1000000 files (-4.5%) dir fsync took 414 ms after deleting 1000 files (-11.8%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
b7ef5f3a6f |
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
dc2872247e |
btrfs: keep track of the last logged keys when logging a directory
After the first time we log a directory in the current transaction, for each directory item in a changed leaf of the subvolume tree, we have to check if we previously logged the item, in order to overwrite it in case its data changed or skip it in case its data hasn't changed. Checking if we have logged each item before not only wastes times, but it also adds lock contention on the log tree. So in order to minimize the number of times we do such checks, keep track of the offset of the last key we logged for a directory and, on the next time we log the directory, skip the checks for any new keys that have an offset greater than the offset we have previously saved. This is specially effective for index keys, because the offset for these keys comes from a monotonically increasing counter. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 5/5. The following test was used on a non-debug kernel to measure the impact it has on a directory fsync: $ cat test-dir-fsync.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=100000 NUM_FILE_DELETES=1000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT Test results with NUM_NEW_FILES set to 100 000 and 1 000 000: **** before patchset, 100 000 files, 1000 deletes **** dir fsync took 848 ms after adding 100000 files dir fsync took 175 ms after deleting 1000 files **** after patchset, 100 000 files, 1000 deletes **** dir fsync took 758 ms after adding 100000 files (-11.2%) dir fsync took 63 ms after deleting 1000 files (-94.1%) **** before patchset, 1 000 000 files, 1000 deletes **** dir fsync took 9945 ms after adding 1000000 files dir fsync took 473 ms after deleting 1000 files **** after patchset, 1 000 000 files, 1000 deletes **** dir fsync took 8677 ms after adding 1000000 files (-13.6%) dir fsync took 146 ms after deleting 1000 files (-105.6%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
086dcbfa50 |
btrfs: insert items in batches when logging a directory when possible
When logging a directory, we scan its directory items from the subvolume tree and then copy one by one into the log tree. This is not efficient since we generally are able to insert several items in a batch, using a single btree operation for adding several items at once. The reason we copy items one by one is that we must check if each item was previously logged in the current transaction, and if it was we either overwrite it or skip it in case its content did not change in the subvolume tree (this can happen only for dir item keys, but not for dir index keys), and doing such check makes it a bit cumbersome to attempt batch insertions. However the chances for doing batch insertions are very frequent and always happen when: 1) Logging the directory for the first time in the current transaction, as none of the items exist in the log tree yet; 2) Logging new dir index keys, because the offset for new dir index keys comes from a monotonically increasing counter. This means if we keep adding dentries to a directory, through creation of new files and sub-directories or by adding new links or renaming from some other directory into the one we are logging, all the new dir index keys have a new offset that is greater than the offset of any previously logged index keys, so we can insert them in batches into the log tree. For dir item keys, since their offset depends on the result of an hash function against the dentry's name, unless the directory is being logged for the first time in the current transaction, the chances being able to insert the items in the log using batches is pretty much random and not predictable, as it depends on the names of the dentries, but still happens often enough. So change directory logging to keep track of consecutive directory items that don't exist yet in the log and batch insert them. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 4/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
eb10d85ee7 |
btrfs: factor out the copying loop of dir items from log_dir_items()
In preparation for the next change, move the loop that processes a leaf and copies its directory items to the log, into a separate helper function. This makes the next change simpler and it also helps making log_dir_items() a bit shorter (specially after the next change). This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 3/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
d46fb845af |
btrfs: remove redundant log root assignment from log_dir_items()
At log_dir_items() we are assigning the exact same value to the local variable 'log', once when it's declared and once again shortly after. Remove the later assignment as it's pointless. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 2/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
90d04510a7 |
btrfs: remove root argument from btrfs_log_inode() and its callees
The root argument passed to btrfs_log_inode() is unncessary, as it is always the root of the inode we are going to log. This root also gets unnecessarily propagated to several functions called by btrfs_log_inode(), and all of them take the inode as an argument as well. So just remove the root argument from these functions and have them get the root from the inode where needed. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 1/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
f6df27dd27 |
btrfs: do not commit delayed inode when logging a file in full sync mode
When logging a regular file in full sync mode, we are currently committing its delayed inode item. This is to ensure that we never miss copying the inode item, with its most up to date data, into the log tree. However that is not necessary since commit |
||
![]() |
5328b2a7ff |
btrfs: avoid attempt to drop extents when logging inode for the first time
When logging an extent, in the fast fsync path, we always attempt do drop or trim any existing extents with a range that match or overlap the range of the extent we are about to log. We do that through a call to btrfs_drop_extents(). However this is not needed when we are logging the inode for the first time in the current transaction, since we have no inode items of the inode in the log tree. Calling btrfs_drop_extents() does a deletion search on the log tree, which is expensive when we have concurrent tasks accessing the log tree because a deletion search always acquires a write lock on the extent buffers at levels 2, 1 and 0, adding significant lock contention, specially taking into account the height of a log tree rarely (if ever) goes beyond 2 or 3, due to its short life. So skip the call to btrfs_drop_extents() when the inode was not previously logged in the current transaction. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 9/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
a5c733a4b6 |
btrfs: avoid search for logged i_size when logging inode if possible
If we are logging that an inode exists and the inode was not logged before, we can avoid searching in the log tree for the inode item since we know it does not exists. That wastes time and adds more lock contention on the extent buffers of the log tree when there are other tasks that are logging other inodes. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 8/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
4934a81502 |
btrfs: avoid expensive search when truncating inode items from the log
Whenever we are logging a file inode in full sync mode we call btrfs_truncate_inode_items() to delete items of the inode we may have previously logged. That results in doing a btree search for deletion, which is expensive because it always acquires write locks for extent buffers at levels 2, 1 and 0, and it balances any node that is less than half full. Acquiring the write locks can block the task if the extent buffers are already locked by another task or block other tasks attempting to lock them, which is specially bad in case of log trees since they are small due to their short life, with a root node at a level typically not greater than level 2. If we know that we are logging the inode for the first time in the current transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding the deletion search. This change does that. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 7/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8a2b3da191 |
btrfs: add helper to truncate inode items when logging inode
Move the call to btrfs_truncate_inode_items(), and the surrounding retry loop, into a local helper function. This avoids some repetition and avoids making the next change a bit awkward due to a bit of too much indentation. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 6/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
88e221cdac |
btrfs: avoid expensive search when dropping inode items from log
Whenever we are logging a directory inode, logging that an inode exists or logging an inode that has changes in its references or xattrs, we attempt to delete items of this inode we may have previously logged (through calls to drop_objectid_items()). That attempt does a btree search for deletion, which is expensive because it always acquires write locks for extent buffers at levels 2, 1 and 0, and it balances any node that is less than half full. Acquiring the write locks can block the task if the extent buffers are already locked or block other tasks attempting to lock them, which is specially bad in case of log trees since they are small due to their short life, with a root node at a level typically not greater than level 2. If we know that we are logging the inode for the first time in the current transaction, we can skip the search. This change does that. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 5/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
130341be7f |
btrfs: always update the logged transaction when logging new names
When we are logging a new name for an inode, due to a link or rename
operation, if the inode has ancestor inodes that are new, created in the
current transaction, we need to log that these inodes exist. To ensure
that a subsequent explicit fsync on one of these ancestor inodes does
sync the log, we don't set the logged_trans field of these inodes.
This was done in commit
|
||
![]() |
c48792c6ee |
btrfs: do not log new dentries when logging that a new name exists
When logging a new name for an inode, due to a link or rename operation, we don't need to log all new dentries of the parent directories and their subdirectories. We only want to log the names of the inode and that any new parent directories exist. So in this case don't trigger logging of the new dentries, that is only need when doing an explicit fsync on a directory or on a file which requires logging its parent directories. This avoids unnecessary work and reduces contention on the extent buffers of a log tree. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 3/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
289cffcb03 |
btrfs: remove no longer needed checks for NULL log context
Since commit
|
||
![]() |
1e0860f3b3 |
btrfs: check if a log tree exists at inode_logged()
In case an inode was never logged since it was loaded from disk and was modified in the current transaction (its ->last_trans matches the ID of the current transaction), inode_logged() returns true even if there's no existing log tree. In this case we can simply check if a log tree exists and return false if it does not. This avoids a caller of inode_logged() doing some unnecessary, but harmless, work. For btrfs_log_new_name() it avoids it logging an inode in case it was never logged since it was loaded from disk and there is currently no log tree for the inode's root. For the remaining callers of inode_logged(), btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has no effect since they already check if a log tree exists through their calls to join_running_log_trans(). So just add a check to inode_logged() to verify if a log tree exists, and return false if it does not. This patch is part of a patch set comprised of the following patches: btrfs: check if a log tree exists at inode_logged() btrfs: remove no longer needed checks for NULL log context btrfs: do not log new dentries when logging that a new name exists btrfs: always update the logged transaction when logging new names btrfs: avoid expensive search when dropping inode items from log btrfs: add helper to truncate inode items when logging inode btrfs: avoid expensive search when truncating inode items from the log btrfs: avoid search for logged i_size when logging inode if possible btrfs: avoid attempt to drop extents when logging inode for the first time btrfs: do not commit delayed inode when logging a file in full sync mode This is patch 1/10 and test results are listed in the change log of the last patch in the set. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
cfd312695b |
btrfs: check for error when looking up inode during dir entry replay
At replay_one_name(), we are treating any error from btrfs_lookup_inode() as if the inode does not exists. Fix this by checking for an error and returning it to the caller. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8dcbc26194 |
btrfs: unify lookup return value when dir entry is missing
btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir entries and both are used during log replay or when updating a log tree during an unlink. However when the dir item does not exists, btrfs_lookup_dir_item() returns NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if the dir item exists but there is no matching entry for a given name or index, both return NULL. This makes the call sites during log replay to be more verbose than necessary and it makes it easy to miss this slight difference. Since we don't need to distinguish between those two cases, make btrfs_lookup_dir_index_item() always return NULL when there is no matching directory entry - either because there isn't any dir entry or because there is one but it does not match the given name and index. Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to 'index' since it is supposed to match an index number, and the name 'objectid' is not very good because it can easily be confused with an inode number (like the inode number a dir entry points to). CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
52db77791f |
btrfs: deal with errors when adding inode reference during log replay
At __inode_add_ref(), we treating any error returned from btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning that there is no existing directory entry in the fs/subvolume tree. This is not correct since we can get errors such as, for example, -EIO when reading extent buffers while searching the fs/subvolume's btree. So fix that and return the error to the caller when it is not -ENOENT. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
e15ac64137 |
btrfs: deal with errors when replaying dir entry during log replay
At replay_one_one(), we are treating any error returned from btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning that there is no existing directory entry in the fs/subvolume tree. This is not correct since we can get errors such as, for example, -EIO when reading extent buffers while searching the fs/subvolume's btree. So fix that and return the error to the caller when it is not -ENOENT. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
77a5b9e3d1 |
btrfs: deal with errors when checking if a dir entry exists during log replay
Currently inode_in_dir() ignores errors returned from btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating any errors as if the directory entry does not exists in the fs/subvolume tree, which is obviously not correct, as we can get errors such as -EIO when reading extent buffers while searching the fs/subvolume's tree. Fix that by making inode_in_dir() return the errors and making its only caller, add_inode_ref(), deal with returned errors as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
3736127a3a |
btrfs: tree-log: check btrfs_lookup_data_extent return value
Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return values bellow zero if an error happened. Function replay_one_extent currently checks if the search found something (0 returned) and increments the reference, and if not, it seems to evaluate as 'not found'. Fix the condition by checking if the value was bellow zero and return early. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
8be2ba2e0e |
btrfs: avoid unnecessarily logging directories that had no changes
There are several cases where when logging an inode we need to log its parent directories or logging subdirectories when logging a directory. There are cases however where we end up logging a directory even if it was not changed in the current transaction, no dentries added or removed since the last transaction. While this is harmless from a functional point of view, it is a waste time as it brings no advantage. One example where this is triggered is the following: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ touch /mnt/A/foo $ ln /mnt/A/foo /mnt/B/bar $ ln /mnt/A/foo /mnt/C/baz $ sync $ rm -f /mnt/A/foo $ xfs_io -c "fsync" /mnt/B/bar This last fsync ends up logging directories A, B and C, however we only need to log directory A, as B and C were not changed since the last transaction commit. So fix this by changing need_log_inode(), to return false in case the given inode is a directory and has a ->last_trans value smaller than the current transaction's ID. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
1f29537302 |
btrfs: update comment at log_conflicting_inodes()
A comment at log_conflicting_inodes() mentions that we check the inode's
logged_trans field instead of using btrfs_inode_in_log() because the field
last_log_commit is not updated when we log that an inode exists and the
inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
about the full sync flag is not true anymore since commit
|
||
![]() |
d135a53396 |
btrfs: remove no longer needed full sync flag check at inode_logged()
Now that we are checking if the inode's logged_trans is 0 to detect the possibility of the inode having been evicted and reloaded, the test for the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at tree-log.c:inode_logged(). Its purpose was to detect the possibility of a previous eviction as well, since when an inode is loaded the full sync flag is always set on it (and only cleared after the inode is logged). So just remove the check and update the comment. The check for the inode's logged_trans being 0 was added recently by the patch with the subject "btrfs: eliminate some false positives when checking if inode was logged". Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
77eea05e78 |
btrfs: add ro compat flags to inodes
Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
6e8e777deb |
btrfs: eliminate some false positives when checking if inode was logged
When checking if an inode was previously logged in the current transaction through the helper inode_logged(), we can return some false positives that can be easily eliminated. These correspond to the cases where an inode has a ->logged_trans value that is not zero and its value is smaller then the ID of the current transaction. This means we know exactly that the inode was never logged before in the current transaction, so we can return false and avoid the callers to do extra work: 1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() unnecessarily join a log transaction and do deletion searches in a log tree that will not find anything. This just adds unnecessary contention on extent buffer locks; 2) Having btrfs_log_new_name() unnecessarily log an inode when it is not needed. If the inode was not logged before, we don't need to log it in LOG_INODE_EXISTS mode. So just make sure that any false positive only happens when ->logged_trans has a value of 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
214cc18432 |
btrfs: constify and cleanup variables in comparators
Comparators just read the data and thus get const parameters. This should be also preserved by the local variables, update all comparators passed to sort or bsearch. Cleanups: - unnecessary casts are dropped - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern and 'inline' is dropped as the function address is taken Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
2ac691d8b3 |
btrfs: avoid unnecessary lock and leaf splits when updating inode in the log
During a fast fsync, if we have already fsynced the file before and in the current transaction, we can make the inode item update more efficient and avoid acquiring a write lock on the leaf's parent. To update the inode item we are always using btrfs_insert_empty_item() to get a path pointing to the inode item, which calls btrfs_search_slot() with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) + sizeof(struct btrfs_item)', and that always results in the search taking a write lock on the level 1 node that is the parent of the leaf that contains the inode item. This adds unnecessary lock contention on log trees when we have multiple fsyncs in parallel against inodes in the same subvolume, which has a very significant impact due to the fact that log trees are short lived and their height very rarely goes beyond level 2. Also, by using btrfs_insert_empty_item() when we need to update the inode item, we also end up splitting the leaf of the existing inode item when the leaf has an amount of free space smaller than the size of an inode item. Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument, when we know the inode item already exists in the log. This avoids these two inefficiencies. The following script, using fio, was used to perform the tests: $ cat fio-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 4 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 BLOCK_SIZE=$4 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=$BLOCK_SIZE ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo echo "mount options: $MOUNT_OPTIONS" echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were done on a physical machine, with 12 cores, 64G of RAM, using a NVMEe device and using a non-debug kernel config (the default one from Debian). The summary line from fio is provided below for each test run. With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec After: WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec +1.4% throughput, -1.2% runtime With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec After: WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec +2.2% throughput, -2.1% runtime The changes are small but it's possible to be better on faster hardware as in the test machine used disk utilization was pretty much 100% during the whole time the tests were running (observed with 'iostat -xz 1'). The tests also included the previous patch with the subject of: "btrfs: avoid unnecessary log mutex contention when syncing log". So they compared a branch without that patch and without this patch versus a branch with these two patches applied. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
e68107e51f |
btrfs: remove unnecessary list head initialization when syncing log
One of the last steps of syncing the log is to remove all log contexts from the root's list of contexts, done at btrfs_remove_all_log_ctxs(). There we iterate over all the contexts in the list and delete each one from the list, and after that we call INIT_LIST_HEAD() on the list. That is unnecessary since at that point the list is empty. So just remove the INIT_LIST_HEAD() call. It's not needed, increases code size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after this change) and increases two critical sections delimited by log mutexes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
e1a6d26483 |
btrfs: avoid unnecessary log mutex contention when syncing log
When syncing the log we acquire the root's log mutex just to update the root's last_log_commit. This is unnecessary because: 1) At this point there can only be one task updating this value, which is the task committing the current log transaction. Any task that enters btrfs_sync_log() has to wait for the previous log transaction to commit and wait for the current log transaction to commit if someone else already started it (in this case it never reaches to the point of updating last_log_commit, as that is done by the committing task); 2) All readers of the root's last_log_commit don't acquire the root's log mutex. This is to avoid blocking the readers, potentially for too long and because getting a stale value of last_log_commit does not cause any functional problem, in the worst case getting a stale value results in logging an inode unnecessarily. Plus it's actually very rare to get a stale value that results in unnecessarily logging the inode. So in order to avoid unnecessary contention on the root's log mutex, which is used for several different purposes, like starting/joining a log transaction and starting writeback of a log transaction, stop acquiring the log mutex for updating the root's last_log_commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
ecc64fab7d |
btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
When checking if we need to log the new name of a renamed inode, we are checking if the inode and its parent inode have been logged before, and if not we don't log the new name. The check however is buggy, as it directly compares the logged_trans field of the inodes versus the ID of the current transaction. The problem is that logged_trans is a transient field, only stored in memory and never persisted in the inode item, so if an inode was logged before, evicted and reloaded, its logged_trans field is set to a value of 0, meaning the check will return false and the new name of the renamed inode is not logged. If the old parent directory was previously fsynced and we deleted the logged directory entries corresponding to the old name, we end up with a log that when replayed will delete the renamed inode. The following example triggers the problem: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ echo -n "hello world" > /mnt/A/foo $ sync # Add some new file to A and fsync directory A. $ touch /mnt/A/bar $ xfs_io -c "fsync" /mnt/A # Now trigger inode eviction. We are only interested in triggering # eviction for the inode of directory A. $ echo 2 > /proc/sys/vm/drop_caches # Move foo from directory A to directory B. # This deletes the directory entries for foo in A from the log, and # does not add the new name for foo in directory B to the log, because # logged_trans of A is 0, which is less than the current transaction ID. $ mv /mnt/A/foo /mnt/B/foo # Now make an fsync to anything except A, B or any file inside them, # like for example create a file at the root directory and fsync this # new file. This syncs the log that contains all the changes done by # previous rename operation. $ touch /mnt/baz $ xfs_io -c "fsync" /mnt/baz <power fail> # Mount the filesystem and replay the log. $ mount /dev/sdc /mnt # Check the filesystem content. $ ls -1R /mnt /mnt/: A B baz /mnt/A: bar /mnt/B: $ # File foo is gone, it's neither in A/ nor in B/. Fix this by using the inode_logged() helper at btrfs_log_new_name(), which safely checks if an inode was logged before in the current transaction. A test case for fstests will follow soon. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
![]() |
9acc8103ab |
btrfs: fix unpersisted i_size on fsync after expanding truncate
If we have an inode that does not have the full sync flag set, was changed in the current transaction, then it is logged while logging some other inode (like its parent directory for example), its i_size is increased by a truncate operation, the log is synced through an fsync of some other inode and then finally we explicitly call fsync on our inode, the new i_size is not persisted. The following example shows how to trigger it, with comments explaining how and why the issue happens: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/foo $ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar $ sync # Fsync bar, this will be a noop since the file has not yet been # modified in the current transaction. The goal here is to clear # BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags. $ xfs_io -c "fsync" /mnt/bar # Now rename both files, without changing their parent directory. $ mv /mnt/bar /mnt/bar2 $ mv /mnt/foo /mnt/foo2 # Increase the size of bar2 with a truncate operation. $ xfs_io -c "truncate 2M" /mnt/bar2 # Now fsync foo2, this results in logging its parent inode (the root # directory), and logging the parent results in logging the inode of # file bar2 (its inode item and the new name). The inode of file bar2 # is logged with an i_size of 0 bytes since it's logged in # LOG_INODE_EXISTS mode, meaning we are only logging its names (and # xattrs if it had any) and the i_size of the inode will not be changed # when the log is replayed. $ xfs_io -c "fsync" /mnt/foo2 # Now explicitly fsync bar2. This resulted in doing nothing, not # logging the inode with the new i_size of 2M and the hole from file # offset 1M to 2M. Because the inode did not have the flag # BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the # fsync of file foo2, its last_log_commit field was updated, # resulting in this explicit of file bar2 not doing anything. $ xfs_io -c "fsync" /mnt/bar2 # File bar2 content and size before a power failure. $ od -A d -t x1 /mnt/bar2 |
||
![]() |
ea32af47f0 |
btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
When syncing the log, if we fail to allocate the root node for the log
root tree:
1) We are unlocking fs_info->tree_log_mutex, but at this point we have
not yet locked this mutex;
2) We have locked fs_info->tree_root->log_mutex, but we end up not
unlocking it;
So fix this by unlocking fs_info->tree_root->log_mutex instead of
fs_info->tree_log_mutex.
Fixes:
|