linux-yocto/mm
Kairui Song 28e98022b3 mm/list_lru: simplify reparenting and initial allocation
Currently, there is a lot of code for detecting reparent racing using
kmemcg_id as the synchronization flag.  And an intermediate table is
required to record and compare the kmemcg_id.

We can simplify this by just checking the cgroup css status, skip if
cgroup is being offlined.  On the reparenting side, ensure no more
allocation is on going and no further allocation will occur by using the
XArray lock as barrier.

Combined with a O(n^2) top-down walk for the allocation, we get rid of the
intermediate table allocation completely.  Despite being O(n^2), it should
be actually faster because it's not practical to have a very deep cgroup
level, and in most cases the parent cgroup should have been allocated
already.

This also avoided changing kmemcg_id before reparenting, making cgroups
have a stable index for list_lru_memcg.  After this change it's possible
that a dying cgroup will see a NULL value in XArray corresponding to the
kmemcg_id, because the kmemcg_id will point to an empty slot.  In such
case, just fallback to use its parent.

As a result the code is simpler, following test also showed a very slight
performance gain (12 test runs):

prepare() {
        mkdir /tmp/test-fs
        modprobe brd rd_nr=1 rd_size=16777216
        mkfs.xfs -f /dev/ram0
        mount -t xfs /dev/ram0 /tmp/test-fs
        for i in $(seq 10000); do
                seq 8000 > "/tmp/test-fs/$i"
        done
        mkdir -p /sys/fs/cgroup/system.slice/bench/test/1
        echo +memory > /sys/fs/cgroup/system.slice/bench/cgroup.subtree_control
        echo +memory > /sys/fs/cgroup/system.slice/bench/test/cgroup.subtree_control
        echo +memory > /sys/fs/cgroup/system.slice/bench/test/1/cgroup.subtree_control
        echo 768M > /sys/fs/cgroup/system.slice/bench/memory.max
}

do_test() {
        read_worker() {
                mkdir -p "/sys/fs/cgroup/system.slice/bench/test/1/$1"
                echo $BASHPID > "/sys/fs/cgroup/system.slice/bench/test/1/$1/cgroup.procs"
                read -r __TMP < "/tmp/test-fs/$1";
        }
        read_in_all() {
                for i in $(seq 10000); do
                        read_worker "$i" &
                done; wait
        }
        echo 3 > /proc/sys/vm/drop_caches
        time read_in_all
        for i in $(seq 1 10000); do
                rmdir "/sys/fs/cgroup/system.slice/bench/test/1/$i" &>/dev/null
        done
}

Before:
real    0m3.498s   user    0m11.037s  sys     0m35.872s
real    1m33.860s  user    0m11.593s  sys     3m1.169s
real    1m31.883s  user    0m11.265s  sys     2m59.198s
real    1m32.394s  user    0m11.294s  sys     3m1.616s
real    1m31.017s  user    0m11.379s  sys     3m1.349s
real    1m31.931s  user    0m11.295s  sys     2m59.863s
real    1m32.758s  user    0m11.254s  sys     2m59.538s
real    1m35.198s  user    0m11.145s  sys     3m1.123s
real    1m30.531s  user    0m11.393s  sys     2m58.089s
real    1m31.142s  user    0m11.333s  sys     3m0.549s

After:
real    0m3.489s   user    0m10.943s  sys     0m36.036s
real    1m10.893s  user    0m11.495s  sys     2m38.545s
real    1m29.129s  user    0m11.382s  sys     3m1.601s
real    1m29.944s  user    0m11.494s  sys     3m1.575s
real    1m31.208s  user    0m11.451s  sys     2m59.693s
real    1m25.944s  user    0m11.327s  sys     2m56.394s
real    1m28.599s  user    0m11.312s  sys     3m0.162s
real    1m26.746s  user    0m11.538s  sys     2m55.462s
real    1m30.668s  user    0m11.475s  sys     3m2.075s
real    1m29.258s  user    0m11.292s  sys     3m0.780s

Which is slightly faster in real time.

Link: https://lkml.kernel.org/r/20241104175257.60853-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:22:25 -08:00
..
damon Merge branch 'mm-hotfixes-stable' into mm-stable 2024-11-11 00:04:10 -08:00
kasan kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller 2024-11-11 17:22:25 -08:00
kfence mm: kfence: fix elapsed time for allocated/freed track 2024-09-26 14:01:44 -07:00
kmsan mm, kasan, kmsan: instrument copy_from/to_kernel_nofault 2024-11-06 20:11:14 -08:00
backing-dev.c
balloon_compaction.c
bootmem_info.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
cma_debug.c
cma_sysfs.c
cma.c mm/cma: fix useless return in void function 2024-11-05 16:56:30 -08:00
cma.h
compaction.c mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath() 2024-09-03 21:15:47 -07:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
debug.c mm: support only one page_type per page 2024-09-03 21:15:43 -07:00
dmapool_test.c
dmapool.c
early_ioremap.c
execmem.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
fadvise.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c memcg-v1: remove memcg move locking code 2024-11-06 20:11:19 -08:00
folio-compat.c mm: remove putback_lru_page() 2024-09-09 16:38:59 -07:00
gup_test.c
gup_test.h
gup.c gup: convert FOLL_TOUCH case in follow_page_pte() to folio 2024-11-06 20:11:08 -08:00
highmem.c
hmm.c
huge_memory.c mm: huge_memory: use strscpy() instead of strcpy() 2024-11-11 13:09:43 -08:00
hugetlb_cgroup.c mm: memcg: don't call propagate_protected_usage() needlessly 2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hugetlb.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
hwpoison-inject.c
init-mm.c
internal.h mm: move `get_order_from_str()` to internal.h 2024-11-11 13:09:43 -08:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig resource: remove dependency on SPARSEMEM from GET_FREE_REGION 2024-10-28 21:40:39 -07:00
Kconfig.debug slub: Introduce CONFIG_SLUB_RCU_DEBUG 2024-08-27 14:12:51 +02:00
khugepaged.c mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock() 2024-11-05 16:56:27 -08:00
kmemleak.c mm/kmemleak: fix typo in object_no_scan() comment 2024-11-06 20:11:12 -08:00
ksm.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
list_lru.c mm/list_lru: simplify reparenting and initial allocation 2024-11-11 17:22:25 -08:00
maccess.c kasan: migrate copy_user_test to kunit 2024-11-11 00:26:44 -08:00
madvise.c mm: madvise: implement lightweight guard page mechanism 2024-11-11 00:26:45 -08:00
Makefile mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
mapping_dirty_helpers.c
memblock.c memblock: updates for 6.12-rc1 2024-09-25 11:35:19 -07:00
memcontrol-v1.c memcg-v1: remove memcg move locking code 2024-11-06 20:11:19 -08:00
memcontrol-v1.h memcg-v1: remove charge move code 2024-11-06 20:11:18 -08:00
memcontrol.c mm/list_lru: code clean up for reparenting 2024-11-11 17:22:25 -08:00
memfd.c mm/hugetlb: simplify refs in memfd_alloc_folio 2024-09-26 14:01:44 -07:00
memory_hotplug.c kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end 2024-11-06 20:11:11 -08:00
memory-failure.c mm/memory-failure: replace sprintf() with sysfs_emit() 2024-11-11 00:26:46 -08:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
mempolicy.c mm: renovate page_address_in_vma() 2024-11-07 14:38:07 -08:00
mempool.c
memremap.c
memtest.c
migrate_device.c mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
migrate.c mm: remove redundant condition for THP folio 2024-11-06 20:11:16 -08:00
mincore.c
mlock.c mm/mlock: set the correct prev on failure 2024-11-07 14:14:58 -08:00
mm_init.c alloc_tag: support for page allocation tag compression 2024-11-07 14:25:16 -08:00
mm_slot.h
mmap_lock.c
mmap.c mm: remove unnecessary page_table_lock on stack expansion 2024-11-11 13:09:43 -08:00
mmu_gather.c
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
mremap.c mm/mremap: remove goto from mremap_to() 2024-11-06 20:11:16 -08:00
mseal.c mm: madvise: implement lightweight guard page mechanism 2024-11-11 00:26:45 -08:00
msync.c
nommu.c mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling 2024-11-05 16:49:55 -08:00
numa_emulation.c mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
numa_memblks.c mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id 2024-10-28 21:40:40 -07:00
numa.c mm: make range-to-target_node lookup facility a part of numa_memblks 2024-09-03 21:15:32 -07:00
oom_kill.c mm: move mm flags to mm_types.h 2024-11-05 16:56:26 -08:00
page_alloc.c Merge branch 'mm-hotfixes-stable' into mm-stable 2024-11-11 00:04:10 -08:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_idle.c
page_io.c mm: add per-order mTHP swpin counters 2024-11-11 00:26:43 -08:00
page_isolation.c mm: remove migration for HugePage in isolate_single_pageblock() 2024-09-03 21:15:40 -07:00
page_owner.c
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c
page_vma_mapped.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
page-writeback.c memcg-v1: no need for memcg locking for writeback tracking 2024-11-06 20:11:19 -08:00
pagewalk.c mm: pagewalk: add the ability to install PTEs 2024-11-11 00:26:44 -08:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: use page->private instead of page->index in percpu 2024-11-07 14:38:07 -08:00
pgalloc-track.h
pgtable-generic.c mm: pgtable: remove pte_offset_map_nolock() 2024-11-05 16:56:29 -08:00
process_vm_access.c mm: refactor mm_access() to not return NULL 2024-11-05 16:56:23 -08:00
ptdump.c
readahead.c mm: don't set readahead flag on a folio when lookahead_size > nr_to_read 2024-11-06 20:11:15 -08:00
rmap.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
rodata_test.c
secretmem.c secretmem: disable memfd_secret() if arch cannot set direct map 2024-10-09 12:47:19 -07:00
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
shmem.c mm: shmem: override mTHP shmem default with a kernel parameter 2024-11-11 13:09:43 -08:00
show_mem.c mm/show_mem: use str_yes_no() helper in show_free_areas() 2024-11-07 14:38:08 -08:00
shrinker_debug.c mm: shrinker: use min() to improve shrinker_debugfs_scan_write() 2024-09-03 21:15:40 -07:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shuffle.c
shuffle.h
slab_common.c mm: krealloc: Fix MTE false alarm in __do_krealloc 2024-10-29 10:40:53 +01:00
slab.h mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
slub.c mm, slab: suppress warnings in test_leak_destroy kunit test 2024-10-02 16:28:46 +02:00
sparse-vmemmap.c LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space 2024-10-21 22:11:19 +08:00
sparse.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
swap_cgroup.c mm: attempt to batch free swap entries for zap_pte_range() 2024-09-03 21:15:33 -07:00
swap_slots.c
swap_state.c mm: swap: use str_true_false() helper function 2024-11-06 20:11:14 -08:00
swap.c mm: delete the unused put_pages_list() 2024-11-11 00:26:45 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swapfile.c mm, swap: avoid over reclaim of full clusters 2024-10-30 20:14:11 -07:00
truncate.c mm/truncate: reset xa_has_values flag on each iteration 2024-11-06 20:11:09 -08:00
usercopy.c
userfaultfd.c mm: remove unused hugepage for vma_alloc_folio() 2024-11-06 20:11:12 -08:00
util.c mm: renovate page_address_in_vma() 2024-11-07 14:38:07 -08:00
vma_internal.h mm: isolate mmap internal logic to mm/vma.c 2024-11-06 20:11:19 -08:00
vma.c vma: detect infinite loop in vma tree 2024-11-11 13:09:42 -08:00
vma.h mm: isolate mmap internal logic to mm/vma.c 2024-11-06 20:11:19 -08:00
vmalloc.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
vmpressure.c
vmscan.c mm/vmscan: wake up flushers conditionally to avoid cgroup OOM 2024-11-07 14:38:07 -08:00
vmstat.c Merge branch 'mm-hotfixes-stable' into mm-stable 2024-11-11 00:04:10 -08:00
workingset.c mm/list_lru: don't pass unnecessary key parameters 2024-11-11 17:22:25 -08:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc: use memcpy_from/to_page whereever possible 2024-11-07 14:38:07 -08:00
zswap.c mm/list_lru: simplify reparenting and initial allocation 2024-11-11 17:22:25 -08:00