linux-yocto/block
Yu Kuai fa6a3dd139 block: fix ordering of recursive split IO
[ Upstream commit b2f5974079 ]

Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order is messed up. On the one hand, this causes performance
degradation, especially for mdraid with large IO sizes; on the other
hand, it causes write errors for zoned block devices[1].

For example, in raid456 IO will first be split by max_sectors from
md_submit_bio(), and then later be split again by chunksize for internal
handling.

Assume max_sectors is 1M and chunksize is 512k:

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sectors:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after the first bio is issued, __submit_bio_noacct() will continue
issuing the next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes large IO reads on raid456 to end up as small
discontinuous IO on the underlying disks. Fix this problem by placing
the split bio at the head of current->bio_list.
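
The walkthrough above can be reproduced with a small userspace model. The
sketch below is a simplified illustration in Python, not the kernel code:
issue_order, pending, and head_insert are hypothetical names, the deque
stands in for current->bio_list, and each limit simply cuts a bio at a
fixed size (real splitting is alignment-aware and the real submission
loop is more involved).

```python
from collections import deque

KIB = 1024
MAX_SECTORS = 1024 * KIB  # 1M limit, as assumed in the example above
CHUNK = 512 * KIB         # 512k chunksize, as assumed in the example above


def issue_order(start, length, limits, head_insert):
    """Return the (offset, size) order in which pieces get issued.

    pending models current->bio_list. Whenever the bio being handled
    exceeds a limit, the remainder is queued either at the tail (old
    behaviour) or at the head (behaviour after the fix).
    """
    pending = deque([(start, length)])
    issued = []
    while pending:
        off, size = pending.popleft()
        for limit in limits:
            if size > limit:
                rest = (off + limit, size - limit)
                if head_insert:
                    pending.appendleft(rest)
                else:
                    pending.append(rest)
                size = limit
        issued.append((off, size))
    return issued


for head_insert in (False, True):
    order = issue_order(0, 2048 * KIB, [MAX_SECTORS, CHUNK], head_insert)
    print(head_insert, [off // KIB for off, _ in order])
# False [0, 1024, 512, 1536]
# True [0, 512, 1024, 1536]
```

With tail insertion the model yields the same 0 -> 1M -> 512k -> 1536k
order as step 6; with head insertion the pieces come out sequentially.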

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
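
As a quick sanity check on the test geometry, the arithmetic below shows
how the dd block size relates to the array layout. It is only a sketch:
the variable names are illustrative, and it assumes the usual raid5
layout of one parity chunk per stripe.

```python
DISKS = 8          # 8 disk raid5, from the test description
CHUNK_KIB = 64     # 64k chunksize
BS_KIB = 4480      # dd bs=4480k

data_disks = DISKS - 1                  # assumption: one parity chunk per stripe
stripe_kib = data_disks * CHUNK_KIB     # data carried by one full stripe
full_stripes, remainder = divmod(BS_KIB, stripe_kib)
chunks_per_read = BS_KIB // CHUNK_KIB   # chunk-sized pieces per dd read

print(stripe_kib, full_stripes, remainder, chunks_per_read)  # 448 10 0 70
```

Under that assumption each dd read covers exactly 10 full stripes, or 70
chunk-sized pieces, so every read is split by chunksize many times.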

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Note: io seek is computed from the blktrace D stage, as the statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |
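
The seek statistic defined in the note above can be sketched as follows.
This is an illustration only: seek_distances is a hypothetical helper,
and the input is the 2M example from earlier in this message (offsets
and lengths in KB), not the blktrace data behind the histogram.

```python
def seek_distances(ios):
    """ios: list of (offset, length) in issue order, both in the same unit.

    Returns ABS((offset of next IO) - (offset + len of previous IO))
    for each consecutive pair.
    """
    return [
        abs(next_off - (prev_off + prev_len))
        for (prev_off, prev_len), (next_off, _) in zip(ios, ios[1:])
    ]


# Issue order from the 2M example, before and after the fix.
before = [(0, 512), (1024, 512), (512, 512), (1536, 512)]
after = [(0, 512), (512, 512), (1024, 512), (1536, 512)]

print(seek_distances(before))  # [512, 1024, 512]
print(seek_distances(after))   # [0, 0, 0]
```

A distance of zero means the next IO starts exactly where the previous
one ended, which is what the "zero cnt" figure counts.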

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long-term problem that has existed from day one,
and large sequential IO reads are a pretty common case, such as video
playback.

Even with this patch, IO in this test case is merged to at most 128k
due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing this
limit can give even better performance. However, we'll figure out how
to do this properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965d ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15 12:03:28 +02:00
..
partitions for-6.15/block-20250322 2025-03-26 18:08:55 -07:00
badblocks.c badblocks: Fix a nonsense WARN_ON() which checks whether a u64 variable < 0 2025-03-10 07:41:58 -06:00
bdev.c xfs: New code for 6.16 2025-05-26 12:56:01 -07:00
bfq-cgroup.c Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()" 2024-11-19 19:05:32 -07:00
bfq-iosched.c blk-mq: fix elevator depth_updated method 2025-10-15 12:03:24 +02:00
bfq-iosched.h lib/sbitmap: convert shallow_depth from one word to the whole sbitmap 2025-08-07 06:30:17 -06:00
bfq-wf2q.c block, bfq: inject I/O to underutilized actuators 2023-01-29 15:18:33 -07:00
bio-integrity-auto.c block: rename tuple_size field in blk_integrity to metadata_size 2025-07-01 14:00:14 +02:00
bio-integrity.c block: don't merge different kinds of P2P transfers in a single bio 2025-06-30 15:50:32 -06:00
bio.c block: cleanup bio_issue 2025-10-15 12:03:28 +02:00
blk-cgroup-fc-appid.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
blk-cgroup-rwstat.c blk-cgroup: use group allocation/free of per-cpu counters API 2024-04-03 09:10:17 -06:00
blk-cgroup-rwstat.h blk-cgroup: rwstat: fix kernel-doc warnings in header file 2025-01-13 07:47:09 -07:00
blk-cgroup.c blk-throttle: fix access race during throttle policy activation 2025-10-15 12:03:26 +02:00
blk-cgroup.h block: initialize bio issue time in blk_mq_submit_bio() 2025-10-15 12:03:28 +02:00
blk-core.c block: fix ordering of recursive split IO 2025-10-15 12:03:28 +02:00
blk-crypto-fallback.c block: add a bi_write_stream field 2025-05-06 07:46:43 -06:00
blk-crypto-internal.h blk-crypto: add ioctls to create and prepare hardware-wrapped keys 2025-02-10 09:54:19 -07:00
blk-crypto-profile.c blk-crypto: export wrapped key functions 2025-05-06 19:08:08 +02:00
blk-crypto-sysfs.c blk-crypto: show supported key types in sysfs 2025-02-10 09:54:19 -07:00
blk-crypto.c blk-crypto: add ioctls to create and prepare hardware-wrapped keys 2025-02-10 09:54:19 -07:00
blk-flush.c block: remove unused parameter 2025-03-12 08:25:28 -06:00
blk-ia-ranges.c block: get rid of request queue ->sysfs_dir_lock 2025-01-29 07:16:47 -07:00
blk-integrity.c block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP 2025-07-23 14:55:51 +02:00
blk-ioc.c blk-ioc: don't hold queue_lock for ioc_lookup_icq() 2025-07-29 06:26:34 -06:00
blk-iocost.c for-6.15/block-20250322 2025-03-26 18:08:55 -07:00
blk-iolatency.c block: cleanup bio_issue 2025-10-15 12:03:28 +02:00
blk-ioprio.c blk-cgroup: Simplify policy files registration 2025-03-11 09:22:55 -10:00
blk-ioprio.h blk-ioprio: remove per-disk structure 2024-07-28 16:47:51 -06:00
blk-lib.c block: fix detection of unsupported WRITE SAME in blkdev_issue_write_zeroes 2024-08-28 08:49:25 -06:00
blk-map.c block: simplify bio_map_kern 2025-05-07 07:31:07 -06:00
blk-merge.c block: fix ordering of recursive split IO 2025-10-15 12:03:28 +02:00
blk-mq-cpumap.c blk-mq: add number of queue calc helper 2025-07-01 10:24:19 -06:00
blk-mq-debugfs.c block: avoid cpu_hotplug_lock depedency on freeze_lock 2025-08-21 07:11:11 -06:00
blk-mq-debugfs.h block: Replace zone_wlock debugfs entry with zone_wplugs entry 2024-04-17 08:44:03 -06:00
blk-mq-dma.c block: add scatterlist-less DMA mapping helpers 2025-06-30 15:50:32 -06:00
blk-mq-sched.c block: fix potential deadlock while running nr_hw_queue update 2025-07-30 06:20:51 -06:00
blk-mq-sched.h blk-mq: fix elevator depth_updated method 2025-10-15 12:03:24 +02:00
blk-mq-sysfs.c blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx 2025-10-15 12:03:22 +02:00
blk-mq-tag.c blk-mq: fix blk_mq_tags double free while nr_requests grown 2025-10-06 11:20:04 +02:00
blk-mq.c block: initialize bio issue time in blk_mq_submit_bio() 2025-10-15 12:03:28 +02:00
blk-mq.h block: clean up blk_mq_in_flight_rw() 2025-05-10 16:11:21 +08:00
blk-pm.c block: force noio scope in blk_mq_freeze_queue 2025-01-31 07:20:08 -07:00
blk-pm.h
blk-rq-qos.c block: avoid cpu_hotplug_lock depedency on freeze_lock 2025-08-21 07:11:11 -06:00
blk-rq-qos.h block: validate QoS before calling __rq_qos_done_bio() 2025-08-26 10:34:08 -06:00
blk-settings.c block: use int to store blk_stack_limits() return value 2025-10-15 12:03:23 +02:00
blk-stat.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
blk-stat.h treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
blk-sysfs.c block: restore default wbt enablement 2025-08-13 05:33:48 -06:00
blk-throttle.c block: fix ordering of recursive split IO 2025-10-15 12:03:28 +02:00
blk-throttle.h blk-throttle: fix access race during throttle policy activation 2025-10-15 12:03:26 +02:00
blk-timeout.c
blk-wbt.c blk-wbt: Eliminate ambiguity in the comments of struct rq_wb 2025-08-11 10:21:38 -06:00
blk-wbt.h blk-wbt: remove the separate write cache tracking 2023-12-26 09:28:10 -07:00
blk-zoned.c blk-zoned: Fix a lockdep complaint about recursive locking 2025-08-26 08:27:24 -06:00
blk.h block: fix ordering of recursive split IO 2025-10-15 12:03:28 +02:00
bsg-lib.c block: remove unused parameter 'q' parameter in __blk_rq_map_sg() 2025-03-13 05:46:19 -06:00
bsg.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
disk-events.c block: move bdev_mark_dead out of disk_check_media_change 2023-10-28 13:29:23 +02:00
early-lookup.c wrapper for access to ->bd_partno 2024-05-02 17:48:09 -04:00
elevator.c block: fix potential deadlock while running nr_hw_queue update 2025-07-30 06:20:51 -06:00
elevator.h blk-mq: fix elevator depth_updated method 2025-10-15 12:03:24 +02:00
fops.c block: don't silently ignore metadata for sync read/write 2025-08-20 11:13:01 +02:00
genhd.c block: fix kobject double initialization in add_disk 2025-08-11 08:00:49 -06:00
holder.c block: fix deadlock between bd_link_disk_holder and partition scan 2024-02-23 07:44:19 -07:00
ioctl.c vfs-6.17-rc1.integrity 2025-07-28 15:12:00 -07:00
ioprio.c block: remove test of incorrect io priority level 2025-05-08 09:04:12 -06:00
Kconfig block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} 2025-05-14 05:43:56 -06:00
Kconfig.iosched block: Default to use cgroup support for BFQ 2023-01-30 09:42:42 -07:00
kyber-iosched.c blk-mq: fix elevator depth_updated method 2025-10-15 12:03:24 +02:00
Makefile blk-mq: move the DMA mapping code to a separate file 2025-05-16 08:43:41 -06:00
mq-deadline.c blk-mq: fix elevator depth_updated method 2025-10-15 12:03:24 +02:00
opal_proto.h block: sed-opal: handle empty atoms when parsing response 2024-02-16 15:52:45 -07:00
sed-opal.c block: sed-opal: add ioctl IOC_OPAL_SET_SID_PW 2024-10-22 08:16:40 -06:00
t10-pi.c block: rename tuple_size field in blk_integrity to metadata_size 2025-07-01 14:00:14 +02:00