Commit Graph

1506 Commits

Author SHA1 Message Date
Linus Torvalds
701c09c988 for-5.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBctBgACgkQxWXV+ddt
 WDu1nA//bzuPwW3nO+enE+ipi4t6UJTJpHLeDgdMshWwhBIHVt+oFxTUIt4Zd0kT
 0hJ+mbNrZHzmDmzpb6ifQn0D6k+wq6zbsEgLtwgmPmBszaXIw46FvnYnxd9FtCde
 9SQzBKa86i/KMkRtaIvpUcunniIo5Aj0Hvu0oPgTKObqiB4HP2nV6rKody+mP9JW
 RanWbBi0JvI4UE/J2Ud1sNWFdDtVpXpcktj1dsI8gbsYNR05HpM08SEUgeF/ts3I
 yB/L18I5CUeFHyo/yogbj7kkikugPGsmOj/A86UZ6x3NxWoC+m7UXoGrO2/qlFem
 qd3ioXZKlnPqeX29kAy/REa3xjE61istlDVC/vckqmXBfYc6WK/KAJvFAGI+/3VI
 9HvIbBokUQzekhFlA02RTqGcasStXX7VSeJyzyAbXjGhZQKfFTHR8ZBtrREiVBC9
 58K+g8SSqIb/9iJqYV4h82lSBRSdf9kHx7CSB2gOBuifihY+chVr4Xzhq12IlXbK
 TNlue0BTwYLJStwx2dnY2beLbLG34/4FNRsuAR/9JsCio7Bfj0qN8htIyvfsiMxr
 mkrH7+Ykd10FqC8uu6MHiW9k428871Era3B97TgyQ0V17ehh4IN0v9V7kckk9EWw
 3omaPwuF2FGfFOoTR7ipKO0nDx0/y2knnDSTsWknNG09Ciwa+Ww=
 =SuJv
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes for issues that have some user visibility and are simple enough
  for this time of development cycle:

   - a few fixes for rescue= mount option, adding more checks for
     missing trees

   - fix sleeping in atomic context on qgroup deletion

   - fix subvolume deletion on mount

   - fix build with M= syntax

   - fix checksum mismatch error message for direct io"

* tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix check_data_csum() error message for direct I/O
  btrfs: fix sleep while in non-sleep context during qgroup removal
  btrfs: fix subvolume/snapshot deletion not triggered on mount
  btrfs: fix build when using M=fs/btrfs
  btrfs: do not initialize dev replace for bad dev root
  btrfs: initialize device::fs_info always
  btrfs: do not initialize dev stats if we have no dev_root
  btrfs: zoned: remove outdated WARN_ON in direct IO
2021-03-25 15:38:22 -07:00
Filipe Manana
8d488a8c7b btrfs: fix subvolume/snapshot deletion not triggered on mount
During the mount procedure we are calling btrfs_orphan_cleanup() against
the root tree, which will find all orphans items in this tree. When an
orphan item corresponds to a deleted subvolume/snapshot (instead of an
inode space cache), it must not delete the orphan item, because that will
cause btrfs_find_orphan_roots() to not find the orphan item and therefore
not add the corresponding subvolume root to the list of dead roots, which
results in the subvolume's tree never being deleted by the cleanup thread.

The same applies to the remount from RO to RW path.

Fix this by making btrfs_find_orphan_roots() run before calling
btrfs_orphan_cleanup() against the root tree.

A test case for fstests will follow soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
Fixes: 638331fa56 ("btrfs: fix transaction leak and crash after cleaning up orphans on RO mount")
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:22 +01:00
Josef Bacik
820a49dafc btrfs: initialize device::fs_info always
Neal reported a panic trying to use -o rescue=all

  BUG: kernel NULL pointer dereference, address: 0000000000000030
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 696 Comm: mount Tainted: G        W         5.12.0-rc2+ #296
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200
  RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070
  RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400
  RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150
  R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000
  R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150
  FS:  00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0
  Call Trace:
   ? btrfs_init_dev_stats+0x1f/0xf0
   ? kmem_cache_alloc+0xef/0x1f0
   btrfs_init_dev_stats+0x5f/0xf0
   open_ctree+0x10cb/0x1720
   btrfs_mount_root.cold+0x12/0xea
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   legacy_get_tree+0x27/0x40
   vfs_get_tree+0x25/0xb0
   path_mount+0x433/0xa00
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens because when we call btrfs_init_dev_stats we do
device->fs_info->dev_root.  However device->fs_info isn't initialized
because we were only calling btrfs_init_devices_late() if we properly
read the device root.  However we don't actually need the device root to
init the devices, this function simply assigns the devices their
->fs_info pointer properly, so this needs to be done unconditionally
always so that we can properly dereference device->fs_info in rescue
cases.

Reported-by: Neal Gompa <ngompa13@gmail.com>
CC: stable@vger.kernel.org # 5.11+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-17 19:42:12 +01:00
Linus Torvalds
6f3952cbe0 for-5.12-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
 WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
 arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
 qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
 ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
 hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
 2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
 3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
 bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
 wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
 8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
 fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
 =0O6r
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings updates of space handling, performance improvements or bug
  fixes. The subpage block size and zoned mode features have reached
  state where they're usable but with limitations.

  Performance or related:

   - do not block on deleted block group mutex in the cleaner, avoids
     some long stalls

   - improved flushing: make it work better with ticket space
     reservations and avoid excessive transaction commits in some
     scenarios, slightly improves throughput for random write load

   - preemptive background flushing: separate the logic from ticket
     reservations, improve the accounting and decisions when to flush in
     low space conditions

   - less lock contention related to running delayed refs, let just one
     thread do the flushing when there are many inside transaction
     commit

   - dbench workload improvements: avoid unnecessary work when logging
     inodes, fewer fallbacks to transaction commit and thus less waiting
     for it (+7% throughput, -20% latency)

  Core:

   - subpage block size
      - currently read-only support
      - refactor and generalize code where sectorsize is assumed to be
        page size, add the subpage handling everywhere
      - the read-write support is on the way, page sizes are still
        limited to 4K or 64K

   - zoned mode, first working version but with limitations
      - SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
        location for structures" and chunked allocation
      - superblock as the only fixed data structure needs special
        handling, uses 2 consecutive zones as a ring buffer
      - tree-log support with a dedicated block group to avoid unordered
        writes
      - emulated zones on non-zoned devices
      - not yet working
      - all non-single block group profiles, requires more zone write
        pointer synchronization between the multiple block groups
      - fitrim due to dependency on space cache, can be implemented

  Fixes:

   - ref-verify: proper tree owner and node level tracking

   - fix pinned byte accounting, causing some early ENOSPC now more
     likely due to other changes in delayed refs

  Other:

   - error handling fixes and improvements

   - more error injection points

   - more function documentation

   - more and updated tracepoints

   - subset of W=1 checked by default

   - update comments to allow more automatic kdoc parameter checks"

* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
  btrfs: zoned: enable to mount ZONED incompat flag
  btrfs: zoned: deal with holes writing out tree-log pages
  btrfs: zoned: reorder log node allocation on zoned filesystem
  btrfs: zoned: serialize log transaction on zoned filesystems
  btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
  btrfs: split alloc_log_tree()
  btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
  btrfs: zoned: enable relocation on a zoned filesystem
  btrfs: zoned: support dev-replace in zoned filesystems
  btrfs: zoned: implement copying for zoned device-replace
  btrfs: zoned: implement cloning for zoned device-replace
  btrfs: zoned: mark block groups to copy for device-replace
  btrfs: zoned: do not use async metadata checksum on zoned filesystems
  btrfs: zoned: wait for existing extents before truncating
  btrfs: zoned: serialize metadata IO
  btrfs: zoned: introduce dedicated data write path for zoned filesystems
  btrfs: zoned: enable zone append writing for direct IO
  btrfs: zoned: use ZONE_APPEND write for zoned mode
  btrfs: save irq flags when looking up an ordered extent
  btrfs: zoned: cache if block group is on a sequential zone
  ...
2021-02-21 10:00:39 -08:00
Su Yue
83c68bbcb6 btrfs: initialize fs_info::csum_size earlier in open_ctree
User reported that btrfs-progs misc-tests/028-superblock-recover fails:

      [TEST/misc]   028-superblock-recover
  unexpected success: mounted fs with corrupted superblock
  test failed for case 028-superblock-recover

The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.

Commit 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size
everywhere") replaces all calls to btrfs_super_csum_size by
fs_info::csum_size. The calls include the place where fs_info->csum_size
is not initialized. So btrfs_check_super_csum() passes because memcmp()
with len 0 always returns 0.

Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
csum type in superblock is valid in open_ctree().

Link: https://github.com/kdave/btrfs-progs/issues/250
Fixes: 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size everywhere")
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-12 14:48:24 +01:00
Naohiro Aota
3ddebf27fc btrfs: zoned: reorder log node allocation on zoned filesystem
This is the 3/3 patch to enable tree-log on zoned filesystems.

The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the writing order of them. So, the
writing causes unaligned write errors.

Reorder the allocation of them by delaying allocation of the root node of
"fs_info->log_root_tree," so that the node buffers can go out sequentially
to devices.

Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:48:41 +01:00
Naohiro Aota
40ab3be102 btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems.

The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.

Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams.  As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:08 +01:00
Naohiro Aota
6ab6ebb760 btrfs: split alloc_log_tree()
This is a preparation patch for the next patch. Split alloc_log_tree()
into two parts. The first one allocating the tree structure, remains in
alloc_log_tree() and the second part allocating the tree node, which is
moved into btrfs_alloc_log_tree_node().

Also export the latter part is to be used in the next patch.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota
4eef29ef63 btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.

Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.

To preserve write bio ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as
a single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple
zones are being written simultaneously, HDD seek overhead lowers the
achievable maximum bandwidth, resulting again in a per zone checksum
serialization not affecting the performance.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota
0bc09ca129 btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.

We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.

Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.

Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.

A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:07 +01:00
Naohiro Aota
cfe94440d1 btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.

Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:05 +01:00
Naohiro Aota
d3575156f6 btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.

Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.

Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:04 +01:00
Johannes Thumshirn
b53429bad3 btrfs: zoned: do not load fs_info::zoned from incompat flag
Don't set the zoned flag in fs_info as soon as we're encountering the
incompat filesystem flag for a zoned filesystem on mount. The zoned flag
in fs_info is in a union together with the zone_size, so setting it too
early will result in setting an incorrect zone_size as well.

Once the correct zone_size is read from the device, we can rely on the
zoned flag in fs_info as well to determine if the filesystem is zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:20 +01:00
Naohiro Aota
7365104236 btrfs: zoned: defer loading zone info after opening trees
This is a preparation patch to implement zone emulation on a regular
device.

To emulate a zoned filesystem on a regular (non-zoned) device, we need to
decide an emulated zone size. Instead of making it a compile-time static
value, we'll make it configurable at mkfs time. Since we have one zone ==
one device extent restriction, we can determine the emulated zone size
from the size of a device extent. We can extend btrfs_get_dev_zone_info()
to show a regular device filled with conventional zones once the zone size
is decided.

The current call site of btrfs_get_dev_zone_info() during the mount process
is earlier than loading the file system trees so that we don't know the
size of a device extent at this point. Thus we can't slice a regular device
to conventional zones.

This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone
info for all the devices. And, it places this function in open_ctree()
after loading the trees.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:32:16 +01:00
Qu Wenruo
0bb3eb3ee8 btrfs: allow read-only mount of 4K sector size fs on 64K page system
This adds the basic RO mount ability for 4K sector size on 64K page
system.

Currently we only plan to support 4K and 64K page system.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo
371cdc0700 btrfs: introduce subpage metadata validation check
For subpage metadata validation check, there are some differences:

- Read must finish in one bvec
  Since we're just reading one subpage range in one page, it should
  never be split into two bios nor two bvecs.

- How to grab the existing eb
  Instead of grabbing eb using page->private, we have to go search radix
  tree as we don't have any direct pointer at hand.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Josef Bacik
576fa34830 btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.

However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.

When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing.  However this meant that we essentially
never preemptively flushed.  This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.

The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction.  This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer.  With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on.  Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing.  Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
5deb17e18e btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO.  The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.

However this is true of all ordered extents.  If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents.  We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Nikolay Borisov
23125104d8 btrfs: make btrfs_root::free_objectid hold the next available objectid
Adjust the way free_objectid is being initialized, it now stores
BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary,
BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit
that now it becomes unnecessary to explicitly initialize free_objectid
for a newly create fs root.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:50 +01:00
Nikolay Borisov
6b8fad576a btrfs: rename btrfs_root::highest_objectid to free_objectid
This reflects the true purpose of the member as it's being used solely
in context where a new objectid is being allocated. Future changes will
also change the way it's being used to closely follow this semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov
543068a217 btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid
This better reflects the semantics of the function i.e no search is
performed whatsoever.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Nikolay Borisov
453e487386 btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid
This function is used to initialize the in-memory
btrfs_root::highest_objectid member, which is used to get an available
objectid. Rename it to better reflect its semantics.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:49 +01:00
Josef Bacik
71008734d2 btrfs: print the actual offset in btrfs_root_name
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid.  Fix this helper to take the key
so we have access to the offset when we need it.

Fixes: 457f1864b5 ("btrfs: pretty print leaked root name")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Filipe Manana
0a31daa4b6 btrfs: add assertion for empty list of transactions at late stage of umount
Add an assertion to close_ctree(), after destroying all the work queues,
to verify we do not have any transaction still open or committing at that
at that point. If we have any, it means something is seriously wrong and
that can cause memory leaks and use-after-free problems. This is motivated
by the previous patches that fixed bugs where we ended up leaking an open
transaction after unmounting the filesystem.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 15:00:06 +01:00
Filipe Manana
a0a1db70df btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.

The following sequence of steps explains how the race happens.

1) The filesystem is mounted in RW mode and the cleaner task is running.
   This means that currently BTRFS_FS_CLEANER_RUNNING is set at
   fs_info->flags;

2) The cleaner task is currently running delayed iputs for example;

3) A filesystem RO remount operation starts;

4) The RO remount task calls btrfs_commit_super(), which commits any
   currently open transaction, and it finishes;

5) At this point the cleaner task is still running and it creates a new
   transaction by doing one of the following things:

   * When running the delayed iput() for an inode with a 0 link count,
     in which case at btrfs_evict_inode() we start a transaction through
     the call to evict_refill_and_join(), use it and then release its
     handle through btrfs_end_transaction();

   * When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
     a transaction is started at btrfs_drop_snapshot() and then its handle
     is released through a call to btrfs_end_transaction_throttle();

   * When the remount task was still running, and before the remount task
     called btrfs_delete_unused_bgs(), the cleaner task also called
     btrfs_delete_unused_bgs() and it picked and removed one block group
     from the list of unused block groups. Before the cleaner task started
     a transaction, through btrfs_start_trans_remove_block_group() at
     btrfs_delete_unused_bgs(), the remount task had already called
     btrfs_commit_super();

6) So at this point the filesystem is in RO mode and we have an open
   transaction that was started by the cleaner task;

7) Shortly after a filesystem unmount operation starts. At close_ctree()
   we stop the transaction kthread before it had a chance to commit the
   transaction, since less than 30 seconds (the default commit interval)
   have elapsed since the last transaction was committed;

8) We end up calling iput() against the btree inode at close_ctree() while
   there is an open transaction, and since that transaction was used to
   update btrees by the cleaner, we have dirty pages in the btree inode
   due to COW operations on metadata extents, and therefore writeback is
   triggered for the btree inode.

   So btree_write_cache_pages() is invoked to flush those dirty pages
   during the final iput() on the btree inode. This results in creating a
   bio and submitting it, which makes us end up at
   btrfs_submit_metadata_bio();

9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
   that calls btrfs_wq_submit_bio(), because check_async_write() returned
   a value of 1. This value of 1 is because we did not have hardware
   acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
   set in fs_info->flags;

10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

  ------------[ cut here ]------------
  WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
  Code: f0 01 00 00 48 39 c2 75 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
  RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
  RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 48 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c6 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
  Code: 48 83 bb b0 03 00 00 00 (...)
  RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
  RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
  RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c7 ]---
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
  Code: ad de 49 be 22 01 00 (...)
  RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
  RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
  RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
  R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
  FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x2ba/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f15ee221ee7
  Code: ff 0b 00 f7 d8 64 89 (...)
  RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
  RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
  R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace dd74718fef1ed5c8 ]---
  BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
  BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
  BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
  BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

  stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
  Code: 54 55 53 48 89 f3 (...)
  RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
  RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
  RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
  R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
  FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
   btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
   submit_one_bio+0x61/0x70 [btrfs]
   btree_write_cache_pages+0x414/0x450 [btrfs]
   ? kobject_put+0x9a/0x1d0
   ? trace_hardirqs_on+0x1b/0xf0
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   ? free_debug_processing+0x1e1/0x2b0
   do_writepages+0x43/0xe0
   ? lock_acquired+0x199/0x490
   __writeback_single_inode+0x59/0x650
   writeback_single_inode+0xaf/0x120
   write_inode_now+0x94/0xd0
   iput+0x187/0x2b0
   close_ctree+0x2c6/0x2fa [btrfs]
   generic_shutdown_super+0x6c/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x31/0x70
   cleanup_mnt+0x100/0x160
   task_work_run+0x68/0xb0
   exit_to_user_mode_prepare+0x1bb/0x1c0
   syscall_exit_to_user_mode+0x4b/0x260
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f3cfebabee7
  Code: ff 0b 00 f7 d8 64 89 01 (...)
  RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
  RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
  RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
  R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
  R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
  Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
  ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

  =============================================================================
  BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x0000000050cbdd61 @offset=12104
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        sync_filesystem+0x74/0x90
        generic_shutdown_super+0x22/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x0000000086e9b0ff @offset=12776
  INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
        commit_cowonly_roots+0x248/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------

  INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? lock_release+0x20e/0x4c0
   kmem_cache_destroy+0x55/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000001a340018 @offset=4408
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_commit_transaction+0x60/0xc40 [btrfs]
        create_subvol+0x56a/0x990 [btrfs]
        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        btrfs_ioctl+0x1a92/0x36f0 [btrfs]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  INFO: Object 0x000000002b46292a @offset=13648
  INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
  INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  =============================================================================
  BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------
  INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
  CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   slab_err+0xb7/0xdc
   ? lock_acquired+0x199/0x490
   __kmem_cache_shutdown+0x1ac/0x3c0
   ? __mutex_unlock_slowpath+0x45/0x2a0
   kmem_cache_destroy+0x55/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 f5 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  INFO: Object 0x000000004cf95ea8 @offset=6264
  INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
  INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
  CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8d/0xb5
   kmem_cache_destroy+0x119/0x120
   exit_btrfs_fs+0xa/0x59 [btrfs]
   __x64_sys_delete_module+0x194/0x260
   ? fpregs_assert_state_consistent+0x1e/0x40
   ? exit_to_user_mode_prepare+0x55/0x1c0
   ? trace_hardirqs_on+0x1b/0xf0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f693e305897
  Code: 73 01 c3 48 8b 0d f9 (...)
  RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
  RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
  RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
  RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
  R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
  R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
  BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.

This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 15:00:02 +01:00
Filipe Manana
638331fa56 btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
When we delete a root (subvolume or snapshot), at the very end of the
operation, we attempt to remove the root's orphan item from the root tree,
at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any
error from btrfs_del_orphan_item() since it is not a serious problem and
the next time the filesystem is mounted we remove such stray orphan items
at btrfs_find_orphan_roots().

However if the filesystem is mounted RO and we have stray orphan items for
any previously deleted root, we can end up leaking a transaction and other
data structures when unmounting the filesystem, as well as crashing if we
do not have hardware acceleration for crc32c available.

The steps that lead to the transaction leak are the following:

1) The filesystem is mounted in RW mode;

2) A subvolume is deleted;

3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root,
   it gets a failure at btrfs_del_orphan_item(), which is ignored, due to
   an ENOMEM when allocating a path for example. So the orphan item for
   the root remains in the root tree;

4) The filesystem is unmounted;

5) The filesystem is mounted RO (-o ro). During the mount path we call
   btrfs_find_orphan_roots(), which iterates the root tree searching for
   orphan items. It finds the orphan item for our deleted root, and since
   it can not find the root, it starts a transaction to delete the orphan
   item (by calling btrfs_del_orphan_item());

6) The RO mount completes;

7) Before the transaction kthread commits the transaction created for
   deleting the orphan item (i.e. less than 30 seconds elapsed since the
   mount, the default commit interval), a filesystem unmount operation is
   started;

8) At close_ctree(), we stop the transaction kthread, but we still have a
   transaction open with at least one dirty extent buffer, a leaf for the
   tree root which was COWed when deleting the orphan item;

9) We then proceed to destroy the work queues, free the roots and block
   groups, etc. After that we drop the last reference on the btree inode by
   calling iput() on it. Since there are dirty pages for the btree inode,
   corresponding to the COWed extent buffer, btree_write_cache_pages() is
   invoked to flush those dirty pages. This results in creating a bio and
   submitting it, which makes us end up at btrfs_submit_metadata_bio();

10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
    that calls btrfs_wq_submit_bio(), because check_async_write() returned
    a value of 1. This value of 1 is because we did not have hardware
    acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
    set in fs_info->flags;

11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

 ------------[ cut here ]------------
 WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
 Code: f0 01 00 00 48 39 c2 75 (...)
 RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 01 48 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c6 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
 Code: 48 83 bb b0 03 00 00 00 (...)
 RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
 RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 01 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c7 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
 Code: ad de 49 be 22 01 00 (...)
 RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
 FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  close_ctree+0x2ba/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f15ee221ee7
 Code: ff 0b 00 f7 d8 64 89 (...)
 RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace dd74718fef1ed5c8 ]---
 BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
 BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

 stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
 CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
 Code: 54 55 53 48 89 f3 (...)
 RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
 FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
  btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
  submit_one_bio+0x61/0x70 [btrfs]
  btree_write_cache_pages+0x414/0x450 [btrfs]
  ? kobject_put+0x9a/0x1d0
  ? trace_hardirqs_on+0x1b/0xf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? free_debug_processing+0x1e1/0x2b0
  do_writepages+0x43/0xe0
  ? lock_acquired+0x199/0x490
  __writeback_single_inode+0x59/0x650
  writeback_single_inode+0xaf/0x120
  write_inode_now+0x94/0xd0
  iput+0x187/0x2b0
  close_ctree+0x2c6/0x2fa [btrfs]
  generic_shutdown_super+0x6c/0x100
  kill_anon_super+0x14/0x30
  btrfs_kill_super+0x12/0x20 [btrfs]
  deactivate_locked_super+0x31/0x70
  cleanup_mnt+0x100/0x160
  task_work_run+0x68/0xb0
  exit_to_user_mode_prepare+0x1bb/0x1c0
  syscall_exit_to_user_mode+0x4b/0x260
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f3cfebabee7
 Code: ff 0b 00 f7 d8 64 89 01 (...)
 RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
 Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
 ---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
 =============================================================================
 BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? lock_release+0x20e/0x4c0
  kmem_cache_destroy+0x55/0x120
  btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x0000000050cbdd61 @offset=12104
 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        sync_filesystem+0x74/0x90
        generic_shutdown_super+0x22/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 INFO: Object 0x0000000086e9b0ff @offset=12776
 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
        commit_cowonly_roots+0x248/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 =============================================================================
 BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? lock_release+0x20e/0x4c0
  kmem_cache_destroy+0x55/0x120
  btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x000000001a340018 @offset=4408
 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_free_tree_block+0x128/0x360 [btrfs]
        __btrfs_cow_block+0x489/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        btrfs_commit_transaction+0x60/0xc40 [btrfs]
        create_subvol+0x56a/0x990 [btrfs]
        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        btrfs_ioctl+0x1a92/0x36f0 [btrfs]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 INFO: Object 0x000000002b46292a @offset=13648
 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
        btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 =============================================================================
 BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
 -----------------------------------------------------------------------------

 INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  slab_err+0xb7/0xdc
  ? lock_acquired+0x199/0x490
  __kmem_cache_shutdown+0x1ac/0x3c0
  ? __mutex_unlock_slowpath+0x45/0x2a0
  kmem_cache_destroy+0x55/0x120
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 f5 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 INFO: Object 0x000000004cf95ea8 @offset=6264
 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
        __slab_alloc.isra.0+0x109/0x1c0
        kmem_cache_alloc+0x7bb/0x830
        btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
        __btrfs_cow_block+0x12d/0x5f0 [btrfs]
        btrfs_cow_block+0xf7/0x220 [btrfs]
        btrfs_search_slot+0x62a/0xc40 [btrfs]
        btrfs_del_orphan_item+0x65/0xd0 [btrfs]
        btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
        open_ctree+0x125a/0x18a0 [btrfs]
        btrfs_mount_root.cold+0x13/0xed [btrfs]
        legacy_get_tree+0x30/0x60
        vfs_get_tree+0x28/0xe0
        fc_mount+0xe/0x40
        vfs_kern_mount.part.0+0x71/0x90
        btrfs_mount+0x13b/0x3e0 [btrfs]
 INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
        kmem_cache_free+0x34c/0x3c0
        __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
        btrfs_run_delayed_refs+0x81/0x210 [btrfs]
        commit_cowonly_roots+0xfb/0x300 [btrfs]
        btrfs_commit_transaction+0x367/0xc40 [btrfs]
        close_ctree+0x113/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x8d/0xb5
  kmem_cache_destroy+0x119/0x120
  exit_btrfs_fs+0xa/0x59 [btrfs]
  __x64_sys_delete_module+0x194/0x260
  ? fpregs_assert_state_consistent+0x1e/0x40
  ? exit_to_user_mode_prepare+0x55/0x1c0
  ? trace_hardirqs_on+0x1b/0xf0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f693e305897
 Code: 73 01 c3 48 8b 0d f9 (...)
 RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

So fix this by calling btrfs_find_orphan_roots() in the mount path only if
we are mounting the filesystem in RW mode. It's pointless to have it called
for RO mounts anyway, since despite adding any deleted roots to the list of
dead roots, we will never have the roots deleted until the filesystem is
remounted in RW mode, as the cleaner kthread does nothing when we are
mounted in RO - btrfs_need_cleaner_sleep() always returns true and the
cleaner spends all time sleeping, never cleaning dead roots.

This is accomplished by moving the call to btrfs_find_orphan_roots() from
open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that
if later the filesystem is remounted RW, we populate the list of dead
roots and have the cleaner task delete the dead roots.

Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-18 14:59:59 +01:00
Qu Wenruo
1941b64b08 btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
The parameter bio_offset of extent_submit_bio_start_t is very confusing.
If it's really bio_offset (offset to bio), then it should be u32.  But
in fact, it's only utilized by dio read, and that member is used as file
offset, which must be u64.

Rename it to dio_file_offset since the only user uses it as file offset,
and add comment for who is using it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov
8a6a87cd44 btrfs: fix lockdep warning when creating free space tree
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.

The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.

The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.

Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.

The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.

  [30.480136] ======================================================
  [30.480830] WARNING: possible circular locking dependency detected
  [30.481457] 5.9.0-rc8+ #76 Not tainted
  [30.481897] ------------------------------------------------------
  [30.482500] mount/520 is trying to acquire lock:
  [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.484054]
	      but task is already holding lock:
  [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.485581]
	      which lock already depends on the new lock.

  [30.486397]
	      the existing dependency chain (in reverse order) is:
  [30.487205]
	      -> #2 (btrfs-extent-01#2){++++}-{3:3}:
  [30.487825]        down_read_nested+0x43/0x150
  [30.488306]        __btrfs_tree_read_lock+0x39/0x180
  [30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.489477]        btrfs_search_slot+0x464/0x9b0
  [30.490009]        check_committed_ref+0x59/0x1d0
  [30.490603]        btrfs_cross_ref_exist+0x65/0xb0
  [30.491108]        run_delalloc_nocow+0x405/0x930
  [30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
  [30.492203]        writepage_delalloc+0xd4/0x150
  [30.492688]        __extent_writepage+0x18d/0x3a0
  [30.493199]        extent_write_cache_pages+0x2af/0x450
  [30.493743]        extent_writepages+0x34/0x70
  [30.494231]        do_writepages+0x31/0xd0
  [30.494642]        __filemap_fdatawrite_range+0xad/0xe0
  [30.495194]        btrfs_fdatawrite_range+0x1b/0x50
  [30.495677]        __btrfs_write_out_cache+0x40d/0x460
  [30.496227]        btrfs_write_out_cache+0x8b/0x110
  [30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
  [30.497317]        btrfs_commit_transaction+0xc0/0xba0
  [30.497861]        sync_filesystem+0x71/0x90
  [30.498303]        btrfs_remount+0x81/0x433
  [30.498767]        reconfigure_super+0x9f/0x210
  [30.499261]        path_mount+0x9d1/0xa30
  [30.499722]        do_mount+0x55/0x70
  [30.500158]        __x64_sys_mount+0xc4/0xe0
  [30.500616]        do_syscall_64+0x33/0x40
  [30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.501629]
	      -> #1 (btrfs-root-00){++++}-{3:3}:
  [30.502241]        down_read_nested+0x43/0x150
  [30.502727]        __btrfs_tree_read_lock+0x39/0x180
  [30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.503903]        btrfs_search_slot+0x464/0x9b0
  [30.504405]        btrfs_insert_empty_items+0x60/0xa0
  [30.504973]        btrfs_insert_item+0x60/0xd0
  [30.505412]        btrfs_create_tree+0x1b6/0x210
  [30.505913]        btrfs_create_free_space_tree+0x54/0x110
  [30.506460]        btrfs_mount_rw+0x15d/0x20f
  [30.506937]        btrfs_remount+0x356/0x433
  [30.507369]        reconfigure_super+0x9f/0x210
  [30.507868]        path_mount+0x9d1/0xa30
  [30.508264]        do_mount+0x55/0x70
  [30.508668]        __x64_sys_mount+0xc4/0xe0
  [30.509186]        do_syscall_64+0x33/0x40
  [30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.510271]
	      -> #0 (btrfs-free-space-00){++++}-{3:3}:
  [30.510972]        __lock_acquire+0x11ad/0x1b60
  [30.511432]        lock_acquire+0xa2/0x360
  [30.511917]        down_read_nested+0x43/0x150
  [30.512383]        __btrfs_tree_read_lock+0x39/0x180
  [30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
  [30.513455]        btrfs_search_slot+0x464/0x9b0
  [30.513947]        search_free_space_info+0x45/0x90
  [30.514465]        __add_to_free_space_tree+0x92/0x39d
  [30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
  [30.515639]        btrfs_mount_rw+0x15d/0x20f
  [30.516142]        btrfs_remount+0x356/0x433
  [30.516538]        reconfigure_super+0x9f/0x210
  [30.517065]        path_mount+0x9d1/0xa30
  [30.517438]        do_mount+0x55/0x70
  [30.517824]        __x64_sys_mount+0xc4/0xe0
  [30.518293]        do_syscall_64+0x33/0x40
  [30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.519335]
	      other info that might help us debug this:

  [30.520210] Chain exists of:
		btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2

  [30.521407]  Possible unsafe locking scenario:

  [30.522037]        CPU0                    CPU1
  [30.522456]        ----                    ----
  [30.522941]   lock(btrfs-extent-01#2);
  [30.523311]                                lock(btrfs-root-00);
  [30.523952]                                lock(btrfs-extent-01#2);
  [30.524620]   lock(btrfs-free-space-00);
  [30.525068]
	       *** DEADLOCK ***

  [30.525669] 5 locks held by mount/520:
  [30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
  [30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
  [30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
  [30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
  [30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
  [30.531124]
	      stack backtrace:
  [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
  [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
  [30.533215] Call Trace:
  [30.533452]  dump_stack+0x8d/0xc0
  [30.533797]  check_noncircular+0x13c/0x150
  [30.534233]  __lock_acquire+0x11ad/0x1b60
  [30.534667]  lock_acquire+0xa2/0x360
  [30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
  [30.535525]  down_read_nested+0x43/0x150
  [30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
  [30.536400]  __btrfs_tree_read_lock+0x39/0x180
  [30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
  [30.537304]  btrfs_search_slot+0x464/0x9b0
  [30.537713]  ? trace_hardirqs_on+0x1c/0xf0
  [30.538148]  search_free_space_info+0x45/0x90
  [30.538572]  __add_to_free_space_tree+0x92/0x39d
  [30.539071]  ? printk+0x48/0x4a
  [30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
  [30.539972]  btrfs_mount_rw+0x15d/0x20f
  [30.540350]  btrfs_remount+0x356/0x433
  [30.540773]  ? shrink_dcache_sb+0xd9/0x100
  [30.541203]  reconfigure_super+0x9f/0x210
  [30.541642]  path_mount+0x9d1/0xa30
  [30.542040]  do_mount+0x55/0x70
  [30.542366]  __x64_sys_mount+0xc4/0xe0
  [30.542822]  do_syscall_64+0x33/0x40
  [30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [30.543691] RIP: 0033:0x7f109f7ab93a
  [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
  [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
  [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
  [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
  [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:09 +01:00
Boris Burkov
9484622945 btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.

In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.

We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).

Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov
8b228324a8 btrfs: clear free space tree on ro->rw remount
A user might want to revert to v1 or nospace_cache on a root filesystem,
and much like turning on the free space tree, that can only be done
remounting from ro->rw. Support clearing the free space tree on such
mounts by moving it into the shared remount logic.

Since the CLEAR_CACHE option sticks around across remounts, this change
would result in clearing the tree for ever on every remount, which is
not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
clear at mount end, which has the other bonus of not cluttering the
/proc/mounts output with clear_cache.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov
8cd2908846 btrfs: clear oneshot options on mount and remount
Some options only apply during mount time and are cleared at the end
of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
fits the bill, and this is a preparation patch for also clearing that
option.

One subtlety is that the current code only resets USEBACKUPROOT on rw
mounts, but the option is meaningfully "consumed" by a ro mount, so it
feels appropriate to clear in that case as well. A subsequent read-write
remount would not go through open_ctree, which is the only place that
checks the option, so the change should be benign.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:08 +01:00
Boris Burkov
5011139a47 btrfs: create free space tree on ro->rw remount
When a user attempts to remount a btrfs filesystem with
'mount -o remount,space_cache=v2', that operation silently succeeds.
Unfortunately, this is misleading, because the remount does not create
the free space tree. /proc/mounts will incorrectly show space_cache=v2,
but on the next mount, the file system will revert to the old
space_cache.

For now, we handle only the easier case, where the existing mount is
read-only and the new mount is read-write. In that case, we can create
the free space tree without contending with the block groups changing
as we go.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Boris Burkov
8f1c21d749 btrfs: start orphan cleanup on ro->rw remount
When we mount a rw filesystem, we start the orphan cleanup process in
tree root and filesystem tree. However, when we remount a ro file system
rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
on tree_root and fs_root to the shared rw mount routine to effectively
add them on ro->rw remount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Boris Burkov
44c0ca211a btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and
functionality which result in a correctly setup rw filesystem. Luckily,
there is even a strong unity in the code which implements them. In
mount's open_ctree, these operations mostly happen after an early return
for ro file systems, and in remount, they happen in a section devoted to
remounting ro->rw, after some remount specific validation passes.

However, there are unfortunately a few differences. There are small
deviations in the order of some of the operations, remount does not
start orphan cleanup in root_tree or fs_tree, remount does not create
the free space tree, and remount does not handle "one-shot" mount
options like clear_cache and uuid tree rescan.

Since we want to add building the free space tree to remount, and also
to start the same orphan cleanup process on a filesystem mounted as ro
then remounted rw, we would benefit from unifying the logic between the
two code paths.

This patch only lifts the existing common functionality, and leaves a
natural path for fixing the discrepancies.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:07 +01:00
Nikolay Borisov
5297199a8b btrfs: remove inode number cache feature
It's been deprecated since commit b547a88ea5 ("btrfs: start
deprecation of mount option inode_cache") which enumerates the reasons.

A filesystem that uses the feature (mount -o inode_cache) tracks the
inode numbers in bitmaps, that data stay on the filesystem after this
patch. The size is roughly 5MiB for 1M inodes [1], which is considered
small enough to be left there. Removal of the change can be implemented
in btrfs-progs if needed.

[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Nikolay Borisov
ec7d6dfd73 btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c
Those functions are going to be used even after inode cache is removed
so moved them to a more appropriate place.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
Naohiro Aota
12659251ca btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite in a
sequential write required zone, we cannot place superblock in the zone.
One easy solution is limiting superblock and copies to be placed only in
conventional zones.  However, this method has two downsides: one is
reduced number of superblock copies. The location of the second copy of
superblock is 256GB, which is in a sequential write required zone on
typical devices in the market today.  So, the number of superblock and
copies is limited to be two.  Second downside is that we cannot support
devices which have no conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses
two adjacent zones as a circular buffer to write updated superblocks.
Once the first zone is filled up, start writing into the second one.
Then, when both zones are filled up and before starting to write to the
first zone again, it reset the first zone.

We can determine the position of the latest superblock by reading write
pointer information from a device. One corner case is when both zones
are full. For this situation, we read out the last superblock of each
zone, and compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs.

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zones 1024 or zone at 256GB which is minimum, and
  next to it

If these reserved zones are conventional, superblock is written fixed at
the start of the zone without logging.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:04 +01:00
Naohiro Aota
b70f509774 btrfs: check and enable ZONED mode
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is
enabled on the file system and if the file system consists of zoned
devices with equal zone size.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:03 +01:00
Qu Wenruo
8e1dc982ed btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.

But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.

Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.

But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:14 +01:00
Anand Jain
bacce86ae8 btrfs: drop unused argument step from btrfs_free_extra_devids
Commit cf89af146b ("btrfs: dev-replace: fail mount if we don't have
replace item with target device") dropped the multi stage operation of
btrfs_free_extra_devids() that does not need to check replace target
anymore and we can remove the 'step' argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:08 +01:00
Josef Bacik
e114c545bb btrfs: set the lockdep class for extent buffers on creation
Both Filipe and Fedora QA recently hit the following lockdep splat:

  WARNING: possible recursive locking detected
  5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
  --------------------------------------------
  rsync/2610 is trying to acquire lock:
  ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  but task is already holding lock:
  ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  other info that might help us debug this:
   Possible unsafe locking scenario:
	 CPU0
	 ----
    lock(&eb->lock);
    lock(&eb->lock);

   *** DEADLOCK ***
   May be due to missing lock nesting notation
  2 locks held by rsync/2610:
   #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
   #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140

  stack backtrace:
  CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack+0x8b/0xb0
   __lock_acquire.cold+0x12d/0x2a4
   ? kvm_sched_clock_read+0x14/0x30
   ? sched_clock+0x5/0x10
   lock_acquire+0xc8/0x400
   ? btrfs_tree_read_lock_atomic+0x34/0x140
   ? read_block_for_search.isra.0+0xdd/0x320
   _raw_read_lock+0x3d/0xa0
   ? btrfs_tree_read_lock_atomic+0x34/0x140
   btrfs_tree_read_lock_atomic+0x34/0x140
   btrfs_search_slot+0x616/0x9a0
   btrfs_lookup_dir_item+0x6c/0xb0
   btrfs_lookup_dentry+0xa8/0x520
   ? lockdep_init_map_waits+0x4c/0x210
   btrfs_lookup+0xe/0x30
   __lookup_slow+0x10f/0x1e0
   walk_component+0x11b/0x190
   path_lookupat+0x72/0x1c0
   filename_lookup+0x97/0x180
   ? strncpy_from_user+0x96/0x1e0
   ? getname_flags.part.0+0x45/0x1a0
   vfs_statx+0x64/0x100
   ? lockdep_hardirqs_on_prepare+0xff/0x180
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   __do_sys_newlstat+0x26/0x40
   ? lockdep_hardirqs_on_prepare+0xff/0x180
   ? syscall_enter_from_user_mode+0x27/0x80
   ? syscall_enter_from_user_mode+0x27/0x80
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.

These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly.  This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.

If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.

There was a good reason for this, we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.

However now all refcounted roots have the same class name, so we no
longer need to worry about this.  For non-refcounted trees we know
which root we're on based on the parent.

Fix this by setting the lockdep class on the eb at creation time instead
of read time.  This will fix the splat and the weirdness where the class
changes in the middle of locking the block.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
3fbaf25817 btrfs: pass the owner_root and level to alloc_extent_buffer
Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
1b7ec85ef4 btrfs: pass root owner to read_tree_block
In order to properly set the lockdep class of a newly allocated block we
need to know the owner of the block.  For non-refcounted trees this is
straightforward, we always know in advance what tree we're reading from.
For refcounted trees we don't necessarily know, however all refcounted
trees share the same lockdep class name, tree-<level>.

Fix all the callers of read_tree_block() to pass in the root objectid
we're using.  In places like relocation and backref we could probably
unconditionally use 0, but just in case use the root when we have it,
otherwise use 0 in the cases we don't have the root as it's going to be
a refcounted tree anyway.

This is a preparation patch for further changes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:07 +01:00
Josef Bacik
bfb484d922 btrfs: cleanup extent buffer readahead
We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead.  Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.

Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have it's arguments
adjusted.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Josef Bacik
416e3445ef btrfs: remove lockdep classes for the fs tree
We have this weird problem where our lockdep class is set after we
read a tree block, which can race with concurrent readers and result in
erroneous lockdep errors.  We want to set the lockdep class at
allocation time if possible, but in certain cases we may not have the
actual root owner, such as with relocation or any backref lookups.  This
is only really a problem for reference counted trees, because all other
trees have their root reference set in their extent reference.  Remove
the fs tree specific lock class.  We need to still keep the reloc tree
one, it's still reference counted, because replace_path will lock the
reloc tree and the destination tree, and if they're both set to
tree-<level> we'll have issues.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:05 +01:00
Qu Wenruo
ac303b6987 btrfs: pass bvec to csum_dirty_buffer instead of page
Currently csum_dirty_buffer() uses page to grab extent buffer, but that
only works for sector size == PAGE_SIZE case.

For subpage we need page + page_offset to grab extent buffer.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
77bf40a2ba btrfs: extract extent buffer verification from btrfs_validate_metadata_buffer()
Currently btrfs_validate_metadata_buffer() only needs to handle one
extent buffer as currently one page maps to at most one extent buffer.

For incoming subpage support, we need to extend the support where one
page could contain multiple extent buffers.

Split the function so we can call validate_extent_buffer on extent
buffers independently.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
a26663e7a2 btrfs: make csum_tree_block() handle node smaller than page
For subpage size support, metadata blocks of nodesize are smaller than
one page and this needs to be handled when calculating the checksum.

The checksummed start and length need to be adjusted but only for the
first page:

- start is simply offset in the page

- length is nodesize (subpage) or PAGE_SIZE for all other cases

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Qu Wenruo
2f4d60dfae btrfs: grab fs_info from extent_buffer in btrfs_mark_buffer_dirty
Since commit f28491e0a6 ("Btrfs: move the extent buffer radix tree into
the fs_info"), fs_info can be grabbed from extent_buffer directly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:04 +01:00
Josef Bacik
ac5887c8e0 btrfs: locking: remove all the blocking helpers
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning.  Remove these helpers and all the places they are
called.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:01 +01:00
David Sterba
713cebfb98 btrfs: remove unnecessary local variables for checksum size
Remove local variable that is then used just once and replace it with
fs_info::csum_size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:00 +01:00
David Sterba
223486c27b btrfs: switch cached fs_info::csum_size from u16 to u32
The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.

This simple change will shave some bytes on x86_64 and release config:

   text    data     bss     dec     hex filename
1090000   17980   14912 1122892  11224c pre/btrfs.ko
1089794   17980   14912 1122686  11217e post/btrfs.ko

DELTA: -206

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
55fc29bed8 btrfs: use cached value of fs_info::csum_size everywhere
btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba
fe5ecbe818 btrfs: precalculate checksums per leaf once
btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a
candidate for optimizations. After the 64bit division has been replaced
by shift, there's still a calculation done each time the function is
called: checksums per leaf.

As this is a constant value for the entire filesystem lifetime, we
can calculate it once at mount time and reuse. This also allows to
reduce the division to 64bit/32bit as we know the constant will always
fit the 32bit type.

Replace the open-coded rounding up with a macro that internally handles
the 64bit division and as it's now a short function, make it static
inline (slight code increase, slight stack usage reduction).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
22b6331d96 btrfs: store precalculated csum_size in fs_info
In many places we need the checksum size and it is inefficient to read
it from the raw superblock. Store the value into fs_info, actual use
will be in followup patches.  The size is u32 as it allows to generate
better assembly than with u16.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:58 +01:00
David Sterba
ab108d992b btrfs: use precalculated sectorsize_bits from fs_info
We do a lot of calculations where we divide or multiply by sectorsize.
We also know and make sure that sectorsize is a power of two, so this
means all divisions can be turned to shifts and avoid eg. expensive
u64/u32 divisions.

The type is u32 as it's more register friendly on x86_64 compared to u8
and the resulting assembly is smaller (movzbl vs movl).

There's also superblock s_blocksize_bits but it's usually one more
pointer dereference farther than fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:57 +01:00
Qu Wenruo
8896a08d8e btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data,
so there is no need for it to be (void *), replace it with "struct inode
*inode".

While we can extract fs_info from struct inode, also remove the @fs_info
parameter.

Since we're here, also replace all the (void *private_data) into (struct
inode *inode).

Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:54 +01:00
David Sterba
c842268458 btrfs: add set/get accessors for root_item::drop_level
The drop_level member is used directly unlike all the other int types in
root_item. Add the definition and use it everywhere. The type is u8 so
there's no conversion necessary and the helpers are properly inlined,
this is for consistency.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:52 +01:00
David Sterba
f944d2cb20 btrfs: use root_item helpers for limit and flags in btrfs_create_tree
For consistency use the available helpers to set flags and limit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:51 +01:00
David Sterba
ab1405aa25 btrfs: generate lockdep keyset names at compile time
The names in btrfs_lockdep_keysets are generated from a simple pattern
using snprintf but we can generate them directly with some macro magic
and remove the helpers.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
David Sterba
387824afd7 btrfs: use the right number of levels for lockdep keysets
BTRFS_MAX_LEVEL is 8 and the keyset table is supposed to have a key for
each level, but we'll never have more than 8 levels.  The values passed
to btrfs_set_buffer_lockdep_class are always derived from a valid extent
buffer.  Set the array sizes to the right value.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:50 +01:00
Josef Bacik
882dbe0cec btrfs: introduce mount option rescue=ignoredatacsums
There are cases where you can end up with bad data csums because of
misbehaving applications.  This happens when an application modifies a
buffer in-flight when doing an O_DIRECT write.  In order to recover the
file we need a way to turn off data checksums so you can copy the file
off, and then you can delete the file and restore it properly later.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:42 +01:00
Josef Bacik
42437a6386 btrfs: introduce mount option rescue=ignorebadroots
In the face of extent root corruption, or any other core fs wide root
corruption we will fail to mount the file system.  This makes recovery
kind of a pain, because you need to fall back to userspace tools to
scrape off data.  Instead provide a mechanism to gracefully handle bad
roots, so we can at least mount read-only and possibly recover data from
the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:41 +01:00
Josef Bacik
68319c18cb btrfs: show rescue=usebackuproot in /proc/mounts
The standalone option usebackuproot was intended as one-time use and it
was not necessary to keep it in the option list. Now that we're going to
have more rescue options, it's desirable to keep them intact as it could
be confusing why the option disappears.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ remove the btrfs_clear_opt part from open_ctree ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:41 +01:00
Nikolay Borisov
fb8a7e941b btrfs: calculate more accurate remaining time to sleep in transaction_kthread
If transaction_kthread is woken up before btrfs_fs_info::commit_interval
seconds have elapsed it will sleep for a fixed period of 5 seconds. This
is not a problem per-se but is not accurate. Instead the code should
sleep for an interval which guarantees on next wakeup commit_interval
would have passed. Since time tracking is not precise subtract 1 second
from delta to ensure the delay we end up waiting will be longer than
than the wake up period.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:36 +01:00
Nikolay Borisov
643900bee4 btrfs: record delta directly in transaction_kthread
Rename 'now' to 'delta' and store there the delta between transaction
start time and current time. This is in preparation for optimising the
sleep logic in the next patch. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:36 +01:00
Nikolay Borisov
e4e4288161 btrfs: remove redundant time check in transaction kthread loop
The value obtained from ktime_get_seconds() is guaranteed to be
monotonically increasing since it's taken from CLOCK_MONOTONIC. As
transaction_kthread obtains a reference to the currently running
transaction under holding btrfs_fs_info::trans_lock it's guaranteed to:

a) see an initialized 'cur', whose start_time is guaranteed to be smaller
   than 'now'

or

b) not obtain a 'cur' and simply go to sleep.

Given this remove the unnecessary check, if it sees
now < cur->start_time this would imply there are far greater problems on
the machine.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:19 +01:00
Nikolay Borisov
ba1bc00f35 btrfs: use helpers to convert from seconds to jiffies in transaction_kthread
The kernel provides easy to understand helpers to convert from human
understandable units to the kernel-friendly 'jiffies'. So let's use
those to make the code easier to understand. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-07 16:24:00 +01:00
Linus Torvalds
f5d808567a for-5.10-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl+cNwcACgkQxWXV+ddt
 WDth+g//esuxzGaUenIuMgnT8ofnte4I9Kst8ShBAl1Asglq1p1WBJfYtHbMMqeE
 CmJ32hGs6JBoaB/Wdta41F840BxRbaOCyo10oB6jx5QV2rCh99PvPhwmsmaUGQHG
 02umRPRJcILReB53LwJvLQTQMVfuUqfBV2TyyLHrY8R8pIGsG61p1d3cgg+NWtKQ
 c3RC2eH/uIeQTDaZX0ZOpa6TBPOs+MNDiF5d3UxvpiBXWum3yijdXEfhpfiOom4A
 eCH+lj+iQQ4EtoKjXi0q7ziU1eAKWkQ3A4rMo9fr7iQkQIVkvZc2d9WALsF+znXi
 f2ochi3msemX19I5g0RQ1s5XExCKBSbr6v934BDlpAZ8Pc4IFz9rkLZwlMYp/SVJ
 9aYZOG9Rm0P3DaiYPvKZBcxsTRvxXlXlVWfMCGUB4KKaGyawJgSZxnegxP8nWb2C
 +VrvVw2NJsoioQTX+2OqUc2FCuDib7ehjH80q9IXLlozfoKA4Lfj1G6qCUEsuffI
 NbW5Ndkaza/qw3mOTEn/sU9+kzr1P8CVtSFWcI/GJqp/kisTAYtyfU/GWD9JLi8s
 uaHGAZXdCEVNJ2opgnLiW0ZPNMm4oSernn8JckhsUUGUJwJ4c1pGEHSzIHEF+d9E
 HBmjQN6qcW4DDUb1rSQEBsFoKXeebp7P6usNNrUqau1wWKBworA=
 =bR/o
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - lockdep fixes:
     - drop path locks before manipulating sysfs objects or qgroups
     - preliminary fixes before tree locks get switched to rwsem
     - use annotated seqlock

 - build warning fixes (printk format)

 - fix relocation vs fallocate race

 - tree checker properly validates number of stripes and parity

 - readahead vs device replace fixes

 - iomap dio fix for unnecessary buffered io fallback

* tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: convert data_seqcount to seqcount_mutex_t
  btrfs: don't fallback to buffered read if we don't need to
  btrfs: add a helper to read the tree_root commit root for backref lookup
  btrfs: drop the path before adding qgroup items when enabling qgroups
  btrfs: fix readahead hang and use-after-free after removing a device
  btrfs: fix use-after-free on readahead extent after failure to create it
  btrfs: tree-checker: validate number of chunk stripes and parity
  btrfs: tree-checker: fix incorrect printk format
  btrfs: drop the path before adding block group sysfs files
  btrfs: fix relocation failure due to race with fallocate
2020-10-30 13:29:49 -07:00
Josef Bacik
49d11bead7 btrfs: add a helper to read the tree_root commit root for backref lookup
I got the following lockdep splat with tree locks converted to rwsem
patches on btrfs/104:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0+ #102 Not tainted
  ------------------------------------------------------
  btrfs-cleaner/903 is trying to acquire lock:
  ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

  but task is already holding lock:
  ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 caching_thread+0x53/0x5a0
	 btrfs_work_helper+0xfa/0x520
	 process_one_work+0x238/0x540
	 worker_thread+0x55/0x3c0
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7b0
	 btrfs_cache_block_group+0x1e0/0x510
	 find_free_extent+0xb6e/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&space_info->groups_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 find_free_extent+0x2ed/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0xb9/0x3d0
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x32/0x170
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x614/0x9d0
	 btrfs_find_root+0x35/0x1b0
	 btrfs_read_tree_root+0x61/0x120
	 btrfs_get_root_ref+0x14b/0x600
	 find_parent_nodes+0x3e6/0x1b30
	 btrfs_find_all_roots_safe+0xb4/0x130
	 btrfs_find_all_roots+0x60/0x80
	 btrfs_qgroup_trace_extent_post+0x27/0x40
	 btrfs_add_delayed_data_ref+0x3fd/0x460
	 btrfs_free_extent+0x42/0x100
	 __btrfs_mod_ref+0x1d7/0x2f0
	 walk_up_proc+0x11c/0x400
	 walk_up_tree+0xf0/0x180
	 btrfs_drop_snapshot+0x1c7/0x780
	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
	 cleaner_kthread+0xd4/0x140
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->commit_root_sem);
				 lock(&caching_ctl->mutex);
				 lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  3 locks held by btrfs-cleaner/903:
   #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
   #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
   #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  stack backtrace:
  CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   __lock_acquire+0x1167/0x2150
   ? __bfs+0x42/0x210
   lock_acquire+0xb9/0x3d0
   ? __btrfs_tree_read_lock+0x32/0x170
   down_read_nested+0x43/0x130
   ? __btrfs_tree_read_lock+0x32/0x170
   __btrfs_tree_read_lock+0x32/0x170
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x614/0x9d0
   ? find_held_lock+0x2b/0x80
   btrfs_find_root+0x35/0x1b0
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_read_tree_root+0x61/0x120
   btrfs_get_root_ref+0x14b/0x600
   find_parent_nodes+0x3e6/0x1b30
   btrfs_find_all_roots_safe+0xb4/0x130
   btrfs_find_all_roots+0x60/0x80
   btrfs_qgroup_trace_extent_post+0x27/0x40
   btrfs_add_delayed_data_ref+0x3fd/0x460
   btrfs_free_extent+0x42/0x100
   __btrfs_mod_ref+0x1d7/0x2f0
   walk_up_proc+0x11c/0x400
   walk_up_tree+0xf0/0x180
   btrfs_drop_snapshot+0x1c7/0x780
   ? btrfs_clean_one_deleted_snapshot+0x73/0x110
   btrfs_clean_one_deleted_snapshot+0xfb/0x110
   cleaner_kthread+0xd4/0x140
   ? btrfs_alloc_root+0x50/0x50
   kthread+0x13a/0x150
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30
  BTRFS info (device sdb): disk space caching is enabled
  BTRFS info (device sdb): has skinny extents

This happens because qgroups does a backref lookup when we create a
delayed ref.  From here it may have to look up a root from an indirect
ref, which does a normal lookup on the tree_root, which takes the read
lock on the tree_root nodes.

To fix this we need to add a variant for looking up roots that searches
the commit root of the tree_root.  Then when we do the backref search
using the commit root we are sure to not take any locks on the tree_root
nodes.  This gets rid of the lockdep splat when running btrfs/104.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-26 15:04:57 +01:00
Linus Torvalds
3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Anand Jain
96c2e067ed btrfs: skip devices without magic signature when mounting
Many things can happen after the device is scanned and before the device
is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
If it happens we still won't free that device from the memory and cause
the userland confusion.

For example: As the BTRFS_IOC_DEV_INFO still carries the device path
which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
device which does not belong to the filesystem anymore:

  $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
  $ wipefs -a /dev/sdb
  # /dev/sdb does not contain magic signature
  $ mount -o degraded /dev/sda /btrfs
  $ btrfs fi show -m
  Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
	  Total devices 2 FS bytes used 128.00KiB
	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb

We need to distinguish the missing signature and invalid superblock, so
add a specific error code ENODATA for that. This also fixes failure of
fstest btrfs/198.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Nikolay Borisov
905eb88bce btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.

Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:25 +02:00
Nikolay Borisov
1b36294a6c btrfs: call submit_bio_hook directly for metadata pages
No need to go through a function pointer indirection simply call
submit_bio_hook directly by exporting and renaming the helper to
btrfs_submit_metadata_bio. This makes the code more readable and should
result in somewhat faster code due to no longer paying the price for
specualtive attack mitigations that come with indirect function calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:25 +02:00
Nikolay Borisov
1f03d9cfda btrfs: remove extent_io_ops::readpage_end_io_hook
It's no longer used so let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov
9a446d6a9f btrfs: replace readpage_end_io_hook with direct calls
Don't call readpage_end_io_hook for the btree inode.  Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.

This is an improvement in 2 directions:

1. We aren't paying the penalty of indirect calls in a post-speculation
   attacks world.

2. The function is now named more explicitly so it's obvious what's
   going on

This is in preparation to removing struct extent_io_ops altogether.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Qu Wenruo
2c53a14dd3 btrfs: use own btree inode io_tree owner id
Btree inode is special compared to all other inode extent io_trees,
although it has a btrfs inode, it doesn't have the track_uptodate bit at
all.

This means a lot of things like extent locking doesn't even need to be
applied to btree io tree.

Since it's so special, adds a new owner value for it to make debuging a
little easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Nikolay Borisov
208d6341e8 btrfs: remove btree_get_extent
The sole purpose of this function was to satisfy the requirements of
__do_readpage. Since that function is no longer used to read metadata
pages the need to keep btree_get_extent around has also disappeared.
Simply remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Nikolay Borisov
2f1d3e4b93 btrfs: remove btree_readpage
There is no way for this function to be called as ->readpage() since
it's called from
generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead
code. BTRFS doesn't utilize the first 3 for the btree inode and
implements it's owon readhead mechanism. So simply remove the function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Josef Bacik
457f1864b5 btrfs: pretty print leaked root name
I'm a actual human being so am incapable of converting u64 to s64 in my
head, so add a helper to get the pretty name of a root objectid and use
that helper to spit out the name for any special roots for leaked roots,
so I don't have to scratch my head and figure out which root I messed up
the refs for.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Josef Bacik
9631e4cc1a btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
When we COW a block we are holding a lock on the original block, and
then we lock the new COW block.  Because our lockdep maps are based on
root + level, this will make lockdep complain.  We need a way to
indicate a subclass for locking the COW'ed block, so plumb through our
btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.

The reason I've added all this extra infrastructure is because there
will be need of different nesting classes in follow up patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Nikolay Borisov
217f5004fe btrfs: rework error detection in init_tree_roots
To avoid duplicating 3 lines of code the error detection logic in
init_tree_roots is somewhat quirky. It first checks for the presence of
any error condition, then checks for the specific condition to perform
any specific actions. That's spurious because directly checking for
each respective error condition and doing the necessary steps is more
obvious. While at it change the -EUCLEAN to -EIO in case the extent
buffer is not read correctly, this is in line with other sites which
return -EIO when the eb couldn't be read.

Additionally it results in smaller code and the code reads
more linearly:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
Function                                     old     new   delta
open_ctree                                 17243   17148     -95
Total: Before=113104, After=113009, chg -0.08%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Nikolay Borisov
944d3f9fac btrfs: switch seed device to list api
While this patch touches a bunch of files the conversion is
straighforward. Instead of using the implicit linked list anchored at
btrfs_fs_devices::seed the code is switched to using
list_for_each_entry.

Previous patches in the series already factored out code that processed
both main and seed devices so in those cases the factored out functions
are called on the main fs_devices and then on every seed dev inside
list_for_each_entry.

Using list api also allows to simplify deletion from the seed dev list
performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
substituting a while() loop with a simple list_del_init.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:58 +02:00
Josef Bacik
5705674081 btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data
reservations to use an async reclaim helper to satisfy tickets.  Before
we could have multiple tasks race in and both allocate chunks, resulting
in more data chunks than we would necessarily need.  Serializing these
allocations and making a single thread responsible for flushing will
only allocate chunks as needed, as well as cut down on transaction
commits and other flush related activities.

Priority reservations will still work as they have before, simply
trying to allocate a chunk until they can make their reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Randy Dunlap
260db43cd2 btrfs: delete duplicated words + other fixes in comments
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Christoph Hellwig
ed7b6b4f6e bdi: remove BDI_CAP_CGROUP_WRITEBACK
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
55b2598e84 bdi: initialize ->ra_pages and ->io_pages in bdi_init
Set up a readahead size by default, as very few users have a good
reason to change it.  This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Linus Torvalds
bffac4b543 for-5.9-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9q8PUACgkQxWXV+ddt
 WDsHZg//YF3Rfeo7/zaRsfUPvNoKDcM69TW+HROJXu4+rYlOukyuh5T+wboRU1Ft
 7ymiR18idPYbtOczmH1Pqw+3wyOr39WafcvAnndoUguXJHsUrriBNqkthQICt0CG
 hUUiofedaB+j+ti7AYGhF/tkqjd8LkCj8SGEz4cSUFCheIHR+ajFwFmx1Sw6NGJV
 h9SdKfbBpIqIpoExFhprNFlxdaKN9rlhYY+zXZYeCBdU6r89CkuLqxZ79GzaU0N7
 PG7FxuuJXvyHhta2a6p8hnEp7perOG22OTXJhzXd5JXiNCfZ/w4SfhH/aPO/3t5V
 x42hO+FvloVSLS3woZqkBsCgCIe0a3QOT0YxZiM+1cwSgg8mVw4UBEB3PIgkfOVT
 LawMbcgSh1evsSazru8gujm4f8RVxpSxxWfhhRwjXtyB8K89e22yBa9Lwfj04SH7
 O5O7VrLDDnHsQWinsEf4Rl6byA13jUCgI5eUxZ5B7Au0Pm9uMexDh3lvgE0W0ucY
 UvD8qAetu2NNZD68gZp597uHPrwu+Lr+VumIh4wF6doeShlIkbf/d+ntOgW9ey1S
 WFSh7sUdKg5pVf6KJQ4yc3aBA6un5lv9LnvPJOwc9HyMUj/cYuxywxWf1YMr5umv
 7/6CkufYjTAmEERQeqE1I6UIgUiWkS9nIisB8BLbYkrMR9Wi1bs=
 =tFUE
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "syzkaller started to hit us with reports, here's a fix for one type
  (stack overflow when printing checksums on read error).

  The other patch is a fix for sysfs object, we have a test for that and
  it leads to a crash."

* tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix put of uninitialized kobject after seed device delete
  btrfs: fix overflow when copying corrupt csums for a message
2020-09-23 14:32:23 -07:00
Johannes Thumshirn
35be8851d1 btrfs: fix overflow when copying corrupt csums for a message
Syzkaller reported a buffer overflow in btree_readpage_end_io_hook()
when loop mounting a crafted image:

  detected buffer overflow in memcpy
  ------------[ cut here ]------------
  kernel BUG at lib/string.c:1129!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: btrfs-endio-meta btrfs_work_helper
  RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
  RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
  RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
  RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
  RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
  R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
  R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
  FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   memcpy include/linux/string.h:405 [inline]
   btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642
   end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854
   bio_endio+0x3cf/0x7f0 block/bio.c:1449
   end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695
   btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318
   process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
   worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
   kthread+0x3b5/0x4a0 kernel/kthread.c:292
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
  Modules linked in:
  ---[ end trace b68924293169feef ]---
  RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
  RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
  RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
  RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
  RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
  R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
  R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
  FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

The overflow happens, because in btree_readpage_end_io_hook() we assume
that we have found a 4 byte checksum instead of the real possible 32
bytes we have for the checksums.

With the fix applied:

[   35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215)
[   35.738994] BTRFS info (device loop0): disk space caching is enabled
[   35.738998] BTRFS info (device loop0): has skinny extents
[   35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0
[   35.743420] BTRFS error (device loop0): failed to read chunk root
[   35.745899] BTRFS error (device loop0): open_ctree failed

Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com
Fixes: d5178578bc ("btrfs: directly call into crypto framework for checksumming")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-21 12:39:21 +02:00
Linus Torvalds
edf6b0e1e4 for-5.9-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9bj/UACgkQxWXV+ddt
 WDs1HhAAgAvJVM2WJuMCjQhQiKljFjRT1a0Kbsp+9ayw5Q225t5S5kCMWrsA6mXF
 /9bGRmmELm/Nr5pSH9hp5Bhbke0vNV+Y9XiRQXpegla4LLMF4MulVgADRIL3WoxO
 ZAtNmZUokkjvB0CkzDuI7PqrF67TXLqV2hlctZo0p5SAFFgLaELyIYC6uAaO9Qo/
 +EAAK+7oJyzWcUp44APu90wBbF79umwNVKEEkDfc6bwiA2Cut1JGzvPWgGvvQnta
 fAd114LFViKg05GXcbnx4NxHYtf9tKHjDk9yYWssR+uV6vo/pWwAkDwYxXm/LzA4
 Zv8QK5uvng1fW4eq9QkN3KflIDn+YhaH1jgwNcgyS+ZCdqZR1Mi949f+6Nj1fXt2
 NeXOx3nhtqgNthKQNvHSMVJZrPjV3bdzOz+bULA+hMvTkr5gJy+ToAs30SLxGF5Y
 BCJEE6b5M5Jnb+UHEBMuoxubBfmPHkY8LxfDzVWDLESsKcW2eYyeJyJXx4DNe/v9
 O7Z5pcku+7R9LOlYQEzKeSuiYMqYLtmQtcNXyFBysksikjFJBWNgENna1LmgvmRH
 j6fC5S9h4sIxzyKQkJgihIDt/a3f9WnhsoHw8EIn62tfdOIvMcT/xWq9YYgWaOjZ
 H9040WXvEAFVcDn4cQ22DNgV+toJMpe0pLg6UXe7VtESUtbwMFM=
 =JTfF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes:

   - regression fix for a crash after failed snapshot creation

   - one more lockep fix: use nofs allocation when allocating missing
     device

   - fix reloc tree leak on degraded mount

   - make some extent buffer alignment checks less strict to mount
     filesystems created by btrfs-convert"

* tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix NULL pointer dereference after failure to create snapshot
  btrfs: free data reloc tree on failed mount
  btrfs: require only sector size alignment for parent eb bytenr
  btrfs: fix lockdep splat in add_missing_dev
2020-09-12 12:28:39 -07:00
Josef Bacik
9e3aa80544 btrfs: free data reloc tree on failed mount
While testing a weird problem with -o degraded, I noticed I was getting
leaked root errors

  BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
  BTRFS error (device loop0): open_ctree failed
  BTRFS error (device loop0): leaked root -9-0 refcount 1

This is the DATA_RELOC root, which gets read before the other fs roots,
but is included in the fs roots radix tree.  Handle this by adding a
btrfs_drop_and_free_fs_root() on the data reloc root if it exists.  This
is ok to do here if we fail further up because we will only drop the ref
if we delete the root from the radix tree, and all other cleanup won't
be duplicated.

CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-07 14:51:15 +02:00
Linus Torvalds
9907ab3714 for-5.9-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9D5EkACgkQxWXV+ddt
 WDto+g/6A/2QzxhgOmqqHTiDvn3DkL60XfjB6lmq3NEvinrST+VH20EoX/EuX2Kn
 u2+gMiWrgBUwlvERkoSasxdJf/6dCCc+9zYDjjKkAxCckENT85Np71o3iEc7Z5z+
 LFgS26mt6aYlCCHyIsHutzHtK2MKiUz7/oaUYZMJBHHkKS/5hL1mzIbwiWAqfU2H
 q0iMz9L2mjp1kZnpwa/yhg/NJ/oGZsKm3UPGDhdc0RlCWHBbDXHFk1wvNRo/yKQW
 l+yy0dh6PAZ45pRL0/WZwvOzcAglb+uSmwa64UOvwio4Na9P7oAcBzTFmtbBtvP4
 WBrOUPCTzkvgQcmoAsWFpD4nrzgW4oS71EICTOIRlPx7A86TP3wYpFEygUlLCoZC
 Pd4e9mPClmW78hcRT12eJeGcJIzgoKWhR8597jNUEYz3R5T2wKHOcNnq9a1E1PLv
 zR+5MFShsylUHd7HbMC1O86XnfXe5esegNQMvx36kTS+cR9Dyt5EWMNIAYK4BPM3
 /tXWZRqlZPOh3T7DZ4QR5oSSDDNq7ROTdv9jmsleno+woG0MNDYsA7jCbeJnGTmI
 CtTUP+p41otyM2lFZjV8PG/XyXDKb3UfU5gcsDOZdGP5S0tkyBiKSA6eqhz6DaTi
 fQOLGZdkNpNN/burbMq7d7YEHr3F6LC17U3L4k5V4MTAm2lp7ZQ=
 =ONgI
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix swapfile activation on subvolumes with deleted snapshots

 - error value mixup when removing directory entries from tree log

 - fix lzo compression level reset after previous level setting

 - fix space cache memory leak after transaction abort

 - fix const function attribute

 - more error handling improvements

* tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: detect nocow for swap after snapshot delete
  btrfs: check the right error variable in btrfs_del_dir_entries_in_log
  btrfs: fix space cache memory leak after transaction abort
  btrfs: use the correct const function attribute for btrfs_get_num_csums
  btrfs: reset compression level for lzo on remount
  btrfs: handle errors from async submission
2020-08-24 12:01:20 -07:00
Filipe Manana
bbc37d6e47 btrfs: fix space cache memory leak after transaction abort
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:

1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
   and it's about to start writing out dirty extent buffers;

2) Transaction N + 1 already started and another task, task A, just called
   btrfs_commit_transaction() on it;

3) Block group B was dirtied (extents allocated from it) by transaction
   N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
   very beginning of the transaction commit, it starts writeback for the
   block group's space cache by calling btrfs_write_out_cache(), which
   allocates the pages array for the block group's io_ctl with a call to
   io_ctl_init(). Block group A is added to the io_list of transaction
   N + 1 by btrfs_start_dirty_block_groups();

4) While transaction N's commit is writing out the extent buffers, it gets
   an IO error and aborts transaction N, also setting the file system to
   RO mode;

5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
   btrfs_commit_transaction() and has set transaction N + 1 state to
   TRANS_STATE_COMMIT_START. Immediately after that it checks that the
   filesystem was turned to RO mode, due to transaction N's abort, and
   jumps to the "cleanup_transaction" label. After that we end up at
   btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
   That helper finds block group B in the transaction's io_list but it
   never releases the pages array of the block group's io_ctl, resulting in
   a memory leak.

In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().

We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.

So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().

This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:

unreferenced object 0xffff9bbf009fa600 (size 512):
  comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
  hex dump (first 32 bytes):
    00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
    80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
  backtrace:
    [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
    [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
    [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
    [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
    [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
    [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
    [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
    [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
    [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
    [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
    [<0000000043ace2c9>] do_syscall_64+0x33/0x80
    [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 18:39:46 +02:00
Linus Torvalds
382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Qu Wenruo
adca4d945c btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT
commit a514d63882 ("btrfs: qgroup: Commit transaction in advance to
reduce early EDQUOT") tries to reduce the early EDQUOT problems by
checking the qgroup free against threshold and tries to wake up commit
kthread to free some space.

The problem of that mechanism is, it can only free qgroup per-trans
metadata space, can't do anything to data, nor prealloc qgroup space.

Now since we have the ability to flush qgroup space, and implemented
retry-after-EDQUOT behavior, such mechanism can be completely replaced.

So this patch will cleanup such mechanism in favor of
retry-after-EDQUOT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:43 +02:00
Qu Wenruo
c53e965360 btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space.  One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.

  btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
      --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
      +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
      @@ -1,2 +1,5 @@
       QA output created by 153
      +pwrite: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
       Silence is golden
      ...
      (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)

[CAUSE]
Since commit c6887cd111 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space no matter if it's COW or not.

Such behavior change is mostly for performance, and reverting it is not
a good idea anyway.

For preallcoated extent, we reserve qgroup data space for it already,
and since we also reserve data space for qgroup at buffered write time,
it needs twice the space for us to write into preallocated space.

This leads to the -EDQUOT in buffered write routine.

And we can't follow the same solution, unlike data/meta space check,
qgroup reserved space is shared between data/metadata.
The EDQUOT can happen at the metadata reservation, so doing NODATACOW
check after qgroup reservation failure is not a solution.

[FIX]
To solve the problem, we don't return -EDQUOT directly, but every time
we got a -EDQUOT, we try to flush qgroup space:

- Flush all inodes of the root
  NODATACOW writes will free the qgroup reserved at run_dealloc_range().
  However we don't have the infrastructure to only flush NODATACOW
  inodes, here we flush all inodes anyway.

- Wait for ordered extents
  This would convert the preallocated metadata space into per-trans
  metadata, which can be freed in later transaction commit.

- Commit transaction
  This will free all per-trans metadata space.

Also we don't want to trigger flush multiple times, so here we introduce
a per-root wait list and a new root status, to ensure only one thread
starts the flushing.

Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:42 +02:00
Qu Wenruo
2dfb1e43f5 btrfs: preallocate anon block device at first phase of snapshot creation
[BUG]
When the anonymous block device pool is exhausted, subvolume/snapshot
creation fails with EMFILE (Too many files open). This has been reported
by a user. The allocation happens in the second phase during transaction
commit where it's only way out is to abort the transaction

  BTRFS: Transaction aborted (error -24)
  WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
  RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
  Call Trace:
   create_pending_snapshots+0x82/0xa0 [btrfs]
   btrfs_commit_transaction+0x275/0x8c0 [btrfs]
   btrfs_mksubvol+0x4b9/0x500 [btrfs]
   btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
   btrfs_ioctl+0x11a4/0x2da0 [btrfs]
   do_vfs_ioctl+0xa9/0x640
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x5a/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace 33f2f83f3d5250e9 ]---
  BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
  BTRFS info (device sda1): forced readonly
  BTRFS warning (device sda1): Skipping commit of aborted transaction.
  BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

[CAUSE]
When the global anonymous block device pool is exhausted, the following
call chain will fail, and lead to transaction abort:

 btrfs_ioctl_snap_create_v2()
 |- btrfs_ioctl_snap_create_transid()
    |- btrfs_mksubvol()
       |- btrfs_commit_transaction()
          |- create_pending_snapshot()
             |- btrfs_get_fs_root()
                |- btrfs_init_fs_root()
                   |- get_anon_bdev()

[FIX]
Although we can't enlarge the anonymous block device pool, at least we
can preallocate anon_dev for subvolume/snapshot in the first phase,
outside of transaction context and exactly at the moment the user calls
the creation ioctl.

Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:38 +02:00
Qu Wenruo
851fd730a7 btrfs: don't allocate anonymous block device for user invisible roots
[BUG]
When a lot of subvolumes are created, there is a user report about
transaction aborted:

  BTRFS: Transaction aborted (error -24)
  WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs]
  RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs]
  Call Trace:
   create_pending_snapshots+0x82/0xa0 [btrfs]
   btrfs_commit_transaction+0x275/0x8c0 [btrfs]
   btrfs_mksubvol+0x4b9/0x500 [btrfs]
   btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
   btrfs_ioctl+0x11a4/0x2da0 [btrfs]
   do_vfs_ioctl+0xa9/0x640
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x5a/0x110
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace 33f2f83f3d5250e9 ]---
  BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown
  BTRFS info (device sda1): forced readonly
  BTRFS warning (device sda1): Skipping commit of aborted transaction.
  BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown

[CAUSE]
The error is EMFILE (Too many files open) and comes from the anonymous
block device allocation. The ids are in a shared pool of size 1<<20.

The ids are assigned to live subvolumes, ie. the root structure exists
in memory (eg. after creation or after the root appears in some path).
The pool could be exhausted if the numbers are not reclaimed fast
enough, after subvolume deletion or if other system component uses the
anon block devices.

[WORKAROUND]
Since it's not possible to completely solve the problem, we can only
minimize the time the id is allocated to a subvolume root.

Firstly, we can reduce the use of anon_dev by trees that are not
subvolume roots, like data reloc tree.

This patch will do extra check on root objectid, to skip roots that
don't need anon_dev.  Currently it's only data reloc tree and orphan
roots.

Reported-by: Greed Rong <greedrong@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:38 +02:00
David Sterba
a2570ef330 btrfs: remove unused btrfs_root::defrag_trans_start
Last touched in 2013 by commit de78b51a28 ("btrfs: remove cache only
arguments from defrag path") that was the only code that used the value.
Now it's only set but never used for anything, so we can remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:28 +02:00
Johannes Thumshirn
923eb52365 btrfs: use free_root_extent_buffer to free root
In btrfs_put_root() we're freeing a btrfs_root's 'node' and 'commit_root'
extent buffers manually via kfree(), while we're using
free_root_extent_buffers() in the free_root_pointers() function above.

free_root_extent_buffers() also NULLs the pointers after freeing, which
mitigates potential double frees.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:27 +02:00