linux-yocto/block
Nilay Shroff aea08fc350 block: restore two stage elevator switch while running nr_hw_queue update
[ Upstream commit 5989bfe6ac ]

kmemleak reports memory leaks related to elevator resources that
were originally allocated in the ->init_hctx() method. The following
leak traces are observed after running blktests block/040:

unreferenced object 0xffff8881b82f7400 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f6000 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f5800 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

The issue arises during an nr_hw_queues update. Specifically, we first
reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
then later invoke elevator_switch() (assuming q->elevator is not NULL).
The elevator switch code first exits the old elevator (elevator_exit())
and then switches to the new elevator. elevator_exit() loops through
each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
which releases resources allocated during ->init_hctx().

This memory leak manifests when we reduce the number of hardware queues,
for example when the initial update sets the number of queues to X and a
later update reduces it to Y, where Y < X. In this case, we lose access to
the old hctxs by the time we reach the elevator exit code, because
__blk_mq_realloc_hw_ctxs() has already released them. As we no longer hold
any reference to the old hctxs, we have no way to free the scheduler
resources (which are allocated in ->init_hctx()), and kmemleak complains
about it.

This issue was introduced by commit 596dce110b ("block: simplify
elevator reattachment for updating nr_hw_queues"). That change unified
the two-stage elevator teardown and reattachment into a single call that
occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.

This patch restores the previous two-stage elevator switch logic during
nr_hw_queues updates. First, the elevator is switched to 'none', which
ensures all scheduler resources are properly freed. Then, the hardware
contexts (hctxs) are reallocated, and the software-to-hardware queue
mappings are updated. Finally, the original elevator is reattached. This
sequence prevents loss of references to old hctxs and avoids the scheduler
resource leaks reported by kmemleak.
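
The ordering problem described above can be sketched with a small
user-space model. This is only an illustration under stated assumptions:
every toy_* name is hypothetical and is not a kernel API, a boolean per
hctx stands in for the resources that ->init_hctx() allocates, and a
counter stands in for what kmemleak would report. It is not the code
changed by this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_HCTX 16	/* hypothetical upper bound for this model */

/* Toy stand-in for a request queue; not the kernel's struct request_queue. */
struct toy_queue {
	bool hctx_has_sched[TOY_MAX_HCTX]; /* hctx i holds scheduler resources */
	int nr_hw_queues;
	int leaked;	/* scheduler resources that became unreachable */
};

/* Models the per-hctx ->init_hctx() calls made when attaching an elevator. */
static void toy_sched_init(struct toy_queue *q)
{
	for (int i = 0; i < q->nr_hw_queues; i++)
		q->hctx_has_sched[i] = true;
}

/* Models the per-hctx ->exit_hctx() calls: only current hctxs are visited. */
static void toy_sched_exit(struct toy_queue *q)
{
	for (int i = 0; i < q->nr_hw_queues; i++)
		q->hctx_has_sched[i] = false;
}

/* Models hctx reallocation: surplus hctxs are dropped without calling the
 * scheduler, so any resources they still hold are counted as leaked. */
static void toy_realloc_hctxs(struct toy_queue *q, int new_nr)
{
	for (int i = new_nr; i < q->nr_hw_queues; i++) {
		if (q->hctx_has_sched[i]) {
			q->leaked++;
			q->hctx_has_sched[i] = false;
		}
	}
	q->nr_hw_queues = new_nr;
}

static void toy_setup(struct toy_queue *q, int nr)
{
	assert(nr > 0 && nr <= TOY_MAX_HCTX);
	*q = (struct toy_queue){ .nr_hw_queues = nr };
	toy_sched_init(q);	/* start with an elevator attached */
}

/* Single-stage order: reallocate first, then exit and re-init the elevator. */
int toy_update_single_stage(int old_nr, int new_nr)
{
	struct toy_queue q;

	toy_setup(&q, old_nr);
	toy_realloc_hctxs(&q, new_nr);
	toy_sched_exit(&q);
	toy_sched_init(&q);
	return q.leaked;
}

/* Two-stage order: switch to 'none', reallocate, then reattach. */
int toy_update_two_stage(int old_nr, int new_nr)
{
	struct toy_queue q;

	toy_setup(&q, old_nr);
	toy_sched_exit(&q);		/* stage 1: free all scheduler resources */
	toy_realloc_hctxs(&q, new_nr);
	toy_sched_init(&q);		/* stage 2: reattach the elevator */
	return q.leaked;
}
```

In this model, shrinking from 4 to 2 queues leaves two per-hctx
allocations unreachable with the single-stage order and none with the
two-stage order; growing the queue count leaks nothing in either case.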

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 596dce110b ("block: simplify elevator reattachment for updating nr_hw_queues")
Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250724102540.1366308-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-15 16:38:24 +02:00
partitions for-6.15/block-20250322 2025-03-26 18:08:55 -07:00
badblocks.c
bdev.c xfs: New code for 6.16 2025-05-26 12:56:01 -07:00
bfq-cgroup.c
bfq-iosched.c block: move wbt_enable_default() out of queue freezing from sched ->exit() 2025-05-06 07:43:43 -06:00
bfq-iosched.h
bfq-wf2q.c
bio-integrity-auto.c block: always allocate integrity buffer when required 2025-05-12 07:14:03 -06:00
bio-integrity.c block: drop direction param from bio_integrity_copy_user() 2025-06-03 12:45:45 -06:00
bio.c for-6.16/block-20250523 2025-05-26 11:39:36 -07:00
blk-cgroup-fc-appid.c
blk-cgroup-rwstat.c
blk-cgroup-rwstat.h
blk-cgroup.c cgroup: Changes for v6.16 2025-05-27 20:59:53 -07:00
blk-cgroup.h block: correct locking order for protecting blk-wbt parameters 2025-03-19 11:35:45 -06:00
blk-core.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
blk-crypto-fallback.c block: add a bi_write_stream field 2025-05-06 07:46:43 -06:00
blk-crypto-internal.h
blk-crypto-profile.c blk-crypto: export wrapped key functions 2025-05-06 19:08:08 +02:00
blk-crypto-sysfs.c
blk-crypto.c
blk-flush.c
blk-ia-ranges.c
blk-integrity.c block: flip iter directions in blk_rq_integrity_map_user() 2025-06-03 17:24:59 -06:00
blk-ioc.c
blk-iocost.c for-6.15/block-20250322 2025-03-26 18:08:55 -07:00
blk-iolatency.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
blk-ioprio.c
blk-ioprio.h
blk-lib.c
blk-map.c block: simplify bio_map_kern 2025-05-07 07:31:07 -06:00
blk-merge.c block: use plug request list tail for one-shot backmerge attempt 2025-06-11 08:48:46 -06:00
blk-mq-cpumap.c
blk-mq-debugfs.c block: add new helper for disabling elevator switch when deleting disk 2025-05-06 07:43:43 -06:00
blk-mq-debugfs.h
blk-mq-dma.c blk-mq: add a copyright notice to blk-mq-dma.c 2025-05-16 08:43:41 -06:00
blk-mq-sched.c block: fail to show/store elevator sysfs attribute if elevator is dying 2025-05-06 07:43:43 -06:00
blk-mq-sched.h
blk-mq-sysfs.c
blk-mq-tag.c
blk-mq.c block: restore two stage elevator switch while running nr_hw_queue update 2025-08-15 16:38:24 +02:00
blk-mq.h block: clean up blk_mq_in_flight_rw() 2025-05-10 16:11:21 +08:00
blk-pm.c
blk-pm.h
blk-rq-qos.c block: blk-rq-qos: guard rq-qos helpers by static key 2025-04-21 05:07:03 -06:00
blk-rq-qos.h block: blk-rq-qos: guard rq-qos helpers by static key 2025-04-21 05:07:03 -06:00
blk-settings.c block: sanitize chunk_sectors for atomic write limits 2025-08-15 16:38:23 +02:00
blk-stat.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
blk-stat.h treewide: Switch/rename to timer_delete[_sync]() 2025-04-05 10:30:12 +02:00
blk-sysfs.c block: fix kobject leak in blk_unregister_queue 2025-07-11 20:39:23 -06:00
blk-throttle.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
blk-throttle.h blk-throttle: Prevents the bps restricted io from entering the bps queue again 2025-05-13 12:08:27 -06:00
blk-timeout.c
blk-wbt.c for-6.16/block-20250523 2025-05-26 11:39:36 -07:00
blk-wbt.h
blk-zoned.c block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work 2025-06-11 06:42:27 -06:00
blk.h block: restore two stage elevator switch while running nr_hw_queue update 2025-08-15 16:38:24 +02:00
bsg-lib.c
bsg.c
disk-events.c
early-lookup.c
elevator.c block: restore two stage elevator switch while running nr_hw_queue update 2025-08-15 16:38:24 +02:00
elevator.h block: move wbt_enable_default() out of queue freezing from sched ->exit() 2025-05-06 07:43:43 -06:00
fops.c block: expose write streams for block device nodes 2025-05-06 07:46:43 -06:00
genhd.c block: fix false warning in bdev_count_inflight_rw() 2025-06-26 07:34:11 -06:00
holder.c
ioctl.c block: fix race between set_blocksize and read paths 2025-04-23 13:58:06 -06:00
ioprio.c block: remove test of incorrect io priority level 2025-05-08 09:04:12 -06:00
Kconfig block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} 2025-05-14 05:43:56 -06:00
Kconfig.iosched
kyber-iosched.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
Makefile blk-mq: move the DMA mapping code to a separate file 2025-05-16 08:43:41 -06:00
mq-deadline.c block: take rq_list instead of plug in dispatch functions 2025-05-02 09:21:08 -06:00
opal_proto.h
sed-opal.c
t10-pi.c for-6.15/block-20250322 2025-03-26 18:08:55 -07:00