mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2025-07-06 13:55:22 +02:00
ed8fc80b19
1572 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
9a79822590 |
mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
commit
|
||
![]() |
44befc8939 |
mm, memcg: reconsider kmem.limit_in_bytes deprecation
commit |
||
![]() |
471ce44a1f |
memcg: drop kmem.limit_in_bytes
commit |
||
![]() |
7bc7cbf9ef |
mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
commit |
||
![]() |
1204b65e85 |
Revert "memcg: drop kmem.limit_in_bytes"
This reverts commit |
||
![]() |
b9d30c38ee |
memcontrol: ensure memcg acquired by id is properly set up
commit |
||
![]() |
75bb29b6c4 |
memcg: drop kmem.limit_in_bytes
commit |
||
![]() |
49b0638502 |
mm: enable page walking API to lock vmas during the walk
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
91f0dccef1 |
mm/memcontrol: do not tweak node in mem_cgroup_init()
mem_cgroup_init() request for allocations from each possible node, and
it's used to be a problem because NODE_DATA is not allocated for offline
node. Things have already changed since commit
|
||
![]() |
025b7799b3 |
mm/memcg: remove return value of mem_cgroup_scan_tasks()
No user checks the return value of mem_cgroup_scan_tasks(). Make the return value void. Link: https://lkml.kernel.org/r/20230616063030.977586-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6c77b607ee |
mm: kill lock|unlock_page_memcg()
Since commit
|
||
![]() |
c33c794828 |
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
04dee9e85c |
mm/various: give up if pte_offset_map[_lock]() fails
Following the examples of nearby code, various functions can just give up if pte_offset_map() or pte_offset_map_lock() fails. And there's no need for a preliminary pmd_trans_unstable() or other such check, since such cases are now safely handled inside. Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
396faf8898 |
memcg: use helper macro FLUSH_TIME
Use helper macro FLUSH_TIME to indicate the flush time to improve the readability a bit. No functional change intended. Link: https://lkml.kernel.org/r/20230603072116.1101690-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
e0e0b4126c |
mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit
|
||
![]() |
5c7e7a0d79 |
mm: multi-gen LRU: cleanup lru_gen_soft_reclaim()
lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a cleaner interface on the caller side. Link: https://lkml.kernel.org/r/20230522112058.2965866-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Reviewed-by: Yuanchu Xie <yuanchu@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
08e0f49e99 |
mm/memcontrol: fix typo in comment
Replace 'then' with 'than'. Link: https://lkml.kernel.org/r/20230522095233.4246-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
18b1d18bc2 |
memcg, oom: remove explicit wakeup in mem_cgroup_oom_synchronize()
Before commit
|
||
![]() |
857f21397f |
memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize()
mem_cgroup_oom_synchronize() is only used when the memcg oom handling is
handed over to the edge of the #PF path. Since commit
|
||
![]() |
35822fdae3 |
memcg: remove mem_cgroup_flush_stats_atomic()
Previous patches removed all callers of mem_cgroup_flush_stats_atomic(). Remove the function and simplify the code. Link: https://lkml.kernel.org/r/20230421174020.2994750-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f82a7a86db |
memcg: calculate root usage from global state
Currently, we approximate the root usage by adding the memcg stats for anon, file, and conditionally swap (for memsw). To read the memcg stats we need to invoke an rstat flush. rstat flushes can be expensive, they scale with the number of cpus and cgroups on the system. mem_cgroup_usage() is called by memcg_events()->mem_cgroup_threshold() with irqs disabled, so such an expensive operation with irqs disabled can cause problems. Instead, approximate the root usage from global state. This is not 100% accurate, but the root usage has always been ill-defined anyway. Link: https://lkml.kernel.org/r/20230421174020.2994750-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
190409caaf |
memcg: flush stats non-atomically in mem_cgroup_wb_stats()
The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats() code path in wb_writeback() outside the lock section. We no longer need to flush the stats atomically. Flush the stats non-atomically. Link: https://lkml.kernel.org/r/20230421174020.2994750-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
dddb44ffa0 |
memcg: dump memory.stat during cgroup OOM for v1
Patch series "memcg: OOM log improvements", v2. This short patch series brings back some cgroup v1 stats in OOM logs that were unnecessarily changed before. It also makes memcg OOM logs less reliant on printk() internals. This patch (of 2): Commit |
||
![]() |
5b42360c73 |
memcg: use seq_buf_do_printk() with mem_cgroup_print_oom_meminfo()
Currently, we format all the memcg stats into a buffer in mem_cgroup_print_oom_meminfo() and use pr_info() to dump it to the logs. However, this buffer is large in size. Although it is currently working as intended, ther is a dependency between the memcg stats buffer and the printk record size limit. If we add more stats in the future and the buffer becomes larger than the printk record size limit, or if the prink record size limit is reduced, the logs may be truncated. It is safer to use seq_buf_do_printk(), which will automatically break up the buffer at line breaks and issue small printk() calls. Refactor the code to move the seq_buf from memory_stat_format() to its callers, and use seq_buf_do_printk() to print the seq_buf in mem_cgroup_print_oom_meminfo(). Link: https://lkml.kernel.org/r/20230428132406.2540811-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f785a8f21a |
mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->cached
A memcg pointer in the percpu stock can be accessed by drain_all_stock() from another cpu in a lockless way. In theory it might lead to an issue, similar to the one which has been discovered with stock->cached_objcg, where the pointer was zeroed between the check for being NULL and dereferencing. In this case the issue is unlikely a real problem, but to make it bulletproof and similar to stock->cached_objcg, let's annotate all accesses to stock->cached with READ_ONCE()/WTRITE_ONCE(). Link: https://lkml.kernel.org/r/20230502160839.361544-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3b8abb3239 |
mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required()
KCSAN found an issue in obj_stock_flush_required():
stock->cached_objcg can be reset between the check and dereference:
==================================================================
BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock
write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0:
drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306
refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340
obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408
memcg_slab_free_hook mm/slab.h:587 [inline]
__cache_free mm/slab.c:3373 [inline]
__do_kmem_cache_free mm/slab.c:3577 [inline]
kmem_cache_free+0x105/0x280 mm/slab.c:3602
__d_free fs/dcache.c:298 [inline]
dentry_free fs/dcache.c:375 [inline]
__dentry_kill+0x422/0x4a0 fs/dcache.c:621
dentry_kill+0x8d/0x1e0
dput+0x118/0x1f0 fs/dcache.c:913
__fput+0x3bf/0x570 fs/file_table.c:329
____fput+0x15/0x20 fs/file_table.c:349
task_work_run+0x123/0x160 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171
exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203
__syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1:
obj_stock_flush_required mm/memcontrol.c:3319 [inline]
drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361
try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703
try_charge mm/memcontrol.c:2837 [inline]
mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290
sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025
sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525
udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692
udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668
__sys_setsockopt+0x1c3/0x230 net/socket.c:2271
__do_sys_setsockopt net/socket.c:2282 [inline]
__se_sys_setsockopt net/socket.c:2279 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffff8881382d52c0 -> 0xffff888138893740
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g534293368afa #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to
stock->cached_objcg.
Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev
Fixes:
|
||
![]() |
ec342603e6 |
memcg: page_cgroup_ino() get memcg from the page's folio
In a kernel with added WARN_ON_ONCE(PageTail) in page_memcg_check(), we observed a warning from page_cgroup_ino() when reading /proc/kpagecgroup. This warning was added to catch fragile reads of a page memcg. Make page_cgroup_ino() get memcg from the page's folio using folio_memcg_check(): that gives it the correct memcg for each page of a folio, so is the right fix. Note that page_folio() is racy, the page's folio can change from under us, but the entire function is racy and documented as such. I dithered between the right fix and the safer "fix": it's unlikely but conceivable that some userspace has learnt that /proc/kpagecgroup gives no memcg on tail pages, and compensates for that in some (racy) way: so continuing to give no memcg on tails, without warning, might be safer. But hwpoison_filter_task(), the only other user of page_cgroup_ino(), persuaded me. It looks as if it currently leaves out tail pages of the selected memcg, by mistake: whereas hwpoison_inject() uses compound_head() and expects the tails to be included. So hwpoison testing coverage has probably been restricted by the wrong output from page_cgroup_ino() (if that memcg filter is used at all): in the short term, it might be safer not to enable wider coverage there, but long term we would regret that. This is based on a patch originally written by Hugh Dickins and retains most of the original commit log [1] The patch was changed to use folio_memcg_check(page_folio(page)) instead of page_memcg_check(compound_head(page)) based on discussions with Matthew Wilcox; where he stated that callers of page_memcg_check() should stop using it due to the ambiguity around tail pages -- instead they should use folio_memcg_check() and handle tail pages themselves. Link: https://lkml.kernel.org/r/20230412003451.4018887-1-yosryahmed@google.com Link: https://lore.kernel.org/linux-mm/20230313083452.1319968-1-yosryahmed@google.com/ [1] Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6b0ba2abbe |
memcg v1: provide read access to memory.pressure_level
cgroups v1 has a unique way of setting up memory pressure notifications: the user opens "memory.pressure_level" of the cgroup they want to monitor for pressure, then open "cgroup.event_control" and write the fd (among other things) to that file. memory.pressure_level has no other use, specifically it does not support any read or write operations. Consequently, no handlers are provided, and cgroup_file_mode() sets the permissions to 000. However, to actually use the mechanism, the subscribing user must have read access to the file and open the fd for reading, see memcg_write_event_control(). This is all fine as long as the subscribing process runs as root and is otherwise unconfined by further restrictions. However, if you add strict access controls such as selinux, the permission bits will be enforced, and opening memory.pressure_level for reading will fail, preventing the process from subscribing, even as root. To work around this issue, introduce a dummy read handler. When memory.pressure_level is created, cgroup_file_mode() will notice the existence of a handler, and therefore add read permissions to the file. Link: https://lkml.kernel.org/r/20230404105900.2005-1-flosch@nutanix.com Signed-off-by: Florian Schmidt <flosch@nutanix.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f9d911ca49 |
memcg: do not modify rstat tree for zero updates
In some situations, we may end up calling memcg_rstat_updated() with a value of 0, which means the stat was not actually updated. An example is if we fail to reclaim any pages in shrink_folio_list(). Do not add the cgroup to the rstat updated tree in this case, to avoid unnecessarily flushing it. Link: https://lkml.kernel.org/r/20230330191801.1967435-9-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
4009b2f188 |
workingset: memcg: sleep when flushing stats in workingset_refault()
In workingset_refault(), we call mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats within an RCU read section and with sleeping disallowed. Move the call above the RCU read section to make it non-atomic. Flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system, so avoid doing it atomically where possible. Since workingset_refault() is the only caller of mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic, and rename it to mem_cgroup_flush_stats_ratelimited(). Link: https://lkml.kernel.org/r/20230330191801.1967435-7-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9fad9aee1f |
memcg: sleep during flushing stats in safe contexts
Currently, all contexts that flush memcg stats do so with sleeping not allowed. Some of these contexts are perfectly safe to sleep in, such as reading cgroup files from userspace or the background periodic flusher. Flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system, so avoid doing it atomically where possible. Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka sleepable), and provide a separate atomic version. The atomic version is used in reclaim, refault, writeback, and in mem_cgroup_usage(). All other code paths are left to use the non-atomic version. This includes callbacks for userspace reads and the periodic flusher. Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(), change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and refault code paths are modified to do non-atomic flushing in separate later patches -- so it will eventually be changed back to mem_cgroup_flush_stats_ratelimited(). Link: https://lkml.kernel.org/r/20230330191801.1967435-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3cd9992b93 |
memcg: replace stats_flush_lock with an atomic
As Johannes notes in [1], stats_flush_lock is currently used to: (a) Protect updated to stats_flush_threshold. (b) Protect updates to flush_next_time. (c) Serializes calls to cgroup_rstat_flush() based on those ratelimits. However: 1. stats_flush_threshold is already an atomic 2. flush_next_time is not atomic. The writer is locked, but the reader is lockless. If the reader races with a flush, you could see this: if (time_after(jiffies, flush_next_time)) spin_trylock() flush_next_time = now + delay flush() spin_unlock() spin_trylock() flush_next_time = now + delay flush() spin_unlock() which means we already can get flushes at a higher frequency than FLUSH_TIME during races. But it isn't really a problem. The reader could also see garbled partial updates if the compiler decides to split the write, so it needs at least READ_ONCE and WRITE_ONCE protection. 3. Serializing cgroup_rstat_flush() calls against the ratelimit factors is currently broken because of the race in 2. But the race is actually harmless, all we might get is the occasional earlier flush. If there is no delta, the flush won't do much. And if there is, the flush is justified. So the lock can be removed all together. However, the lock also served the purpose of preventing a thundering herd problem for concurrent flushers, see [2]. Use an atomic instead to serve the purpose of unifying concurrent flushers. [1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/ [2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/ Link: https://lkml.kernel.org/r/20230330191801.1967435-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
a2174e95cc |
memcg: do not flush stats in irq context
Currently, the only context in which we can invoke an rstat flush from irq context is through mem_cgroup_usage() on the root memcg when called from memcg_check_events(). An rstat flush is an expensive operation that should not be done in irq context, so do not flush stats and use the stale stats in this case. Arguably, usage threshold events are not reliable on the root memcg anyway since its usage is ill-defined. Link: https://lkml.kernel.org/r/20230330191801.1967435-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Shakeel Butt <shakeelb@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
92fbbc7202 |
memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
mem_cgroup_flush_stats_delayed() suggests his is using a delayed_work, but this is actually sometimes flushing directly from the callsite. What it's doing is ratelimited calls. A better name would be mem_cgroup_flush_stats_ratelimited(). Link: https://lkml.kernel.org/r/20230330191801.1967435-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
8bff9a04ca |
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
Patch series "memcg: avoid flushing stats atomically where possible", v3. rstat flushing is an expensive operation that scales with the number of cpus and the number of cgroups in the system. The purpose of this series is to minimize the contexts where we flush stats atomically. Patches 1 and 2 are cleanups requested during reviews of prior versions of this series. Patch 3 makes sure we never try to flush from within an irq context. Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for atomic and non-atomic flushing, and make sure we only flush the stats atomically when necessary. Patch 8 is a slightly tangential optimization that limits the work done by rstat flushing in some scenarios. This patch (of 8): cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as "irqs are disabled throughout", which is what the current implementation does (currently under discussion [1]), but is not the intention. The intention is that this function is safe to call from atomic contexts. Name it as such. Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
6a792697a5 |
memcg: do not drain charge pcp caches on remote isolated cpus
Leonardo Bras has noticed that pcp charge cache draining might be disruptive on workloads relying on 'isolated cpus', a feature commonly used on workloads that are sensitive to interruption and context switching such as vRAN and Industrial Control Systems. There are essentially two ways how to approach the issue. We can either allow the pcp cache to be drained on a different rather than a local cpu or avoid remote flushing on isolated cpus. The current pcp charge cache is really optimized for high performance and it always relies to stick with its cpu. That means it only requires local_lock (preempt_disable on !RT) and draining is handed over to pcp WQ to drain locally again. The former solution (remote draining) would require to add an additional locking to prevent local charges from racing with the draining. This adds an atomic operation to otherwise simple arithmetic fast path in the try_charge path. Another concern is that the remote draining can cause a lock contention for the isolated workloads and therefore interfere with it indirectly via user space interfaces. Another option is to avoid draining scheduling on isolated cpus altogether. That means that those remote cpus would keep their charges even after drain_all_stock returns. This is certainly not optimal either but it shouldn't really cause any major problems. In the worst case (many isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH i.e 64 page) the memory consumption of a memcg would be artificially higher than can be immediately used from other cpus. Theoretically a memcg OOM killer could be triggered pre-maturely. Currently it is not really clear whether this is a practical problem though. Tight memcg limit would be really counter productive to cpu isolated workloads pretty much by definition because any memory reclaimed induced by memcg limit could break user space timing expectations as those usually expect execution in the userspace most of the time. Also charges could be left behind on memcg removal. Any future charge on those isolated cpus will drain that pcp cache so this won't be a permanent leak. Considering cons and pros of both approaches this patch is implementing the second option and simply do not schedule remote draining if the target cpu is isolated. This solution is much more simpler. It doesn't add any new locking and it is more more predictable from the user space POV. Should the pre-mature memcg OOM become a real life problem, we can revisit this decision. [akpm@linux-foundation.org: memcontrol.c needs sched/isolation.h] Link: https://lore.kernel.org/oe-kbuild-all/202303180617.7E3aIlHf-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230317134448.11082-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: Leonardo Bras <leobras@redhat.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
66dabbb65d |
mm: return an ERR_PTR from __filemap_get_folio
Instead of returning NULL for all errors, distinguish between: - no entry found and not asked to allocated (-ENOENT) - failed to allocate memory (-ENOMEM) - would block (-EAGAIN) so that callers don't have to guess the error based on the passed in flags. Also pass through the error through the direct callers: filemap_get_folio, filemap_lock_folio filemap_grab_folio and filemap_get_incore_folio. [hch@lst.de: fix null-pointer deref] Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004 Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs2] Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
2178e20c24 |
mm, memcg: Prevent memory.soft_limit_in_bytes load/store tearing
The knob for cgroup v1 memory controller: memory.soft_limit_in_bytes is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->soft_limit is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->soft_limit are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-5-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-5-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
17c56de6a8 |
mm, memcg: Prevent memory.oom_control load/store tearing
The knob for cgroup v1 memory controller: memory.oom_control is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->oom_kill_disable is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->oom_kill_disable are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-4-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.377-4-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
82b3aa2681 |
mm, memcg: Prevent memory.swappiness load/store tearing
The knob for cgroup v1 memory controller: memory.swappiness is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->swappiness and vm_swappiness is lockless, so both of them can be concurrently set at the same time as we are trying to read them. All occurrences of memcg->swappiness and vm_swappiness are updated with [READ|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-3-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-3-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
eaf7b66b76 |
mm, memcg: Prevent memory.oom.group load/store tearing
Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing fixes", v2. This patch series helps to prevent load/store tearing in several cgroup knobs. As kindly pointed out by Michal Hocko and Roman Gushchin , the changelog has been rephrased. Besides, more knobs were checked, according to kind suggestions from Shakeel Butt and Muchun Song. This patch (of 4): The knob for cgroup v2 memory controller: memory.oom.group is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely (the knob is usually configured long before any workloads hits actual memcg oom) but it is better to use READ_ONCE/WRITE_ONCE to prevent compiler from doing anything funky. The access of memcg->oom_group is lockless, so it can be concurrently set at the same time as we are trying to read it. Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
3822a7c409 |
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K DmxHkn0LAitGgJRS/W9w81yrgig9tAQ= =MlGs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ... |
||
![]() |
f7f9c00dfa |
mm: change to return bool for isolate_lru_page()
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f7a449f779 |
mm: memcontrol: rename memcg_kmem_enabled()
Currently there are two kmem-related helper functions with a confusing semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled(). The problem is that an obvious expectation memcg_kmem_enabled() == !mem_cgroup_kmem_disabled(), can be false. mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is disabled using a boot time kernel option "cgroup.memory=nokmem". It never changes the value dynamically. memcg_kmem_enabled() is different: it always returns false until the first non-root memory cgroup will get online (assuming the kernel memory accounting is enabled). It's goal is to improve the performance on systems without the cgroupfs mounted/memory controller enabled or on the systems with only the root memory cgroup. To make things more obvious and avoid potential bugs, let's rename memcg_kmem_enabled() to memcg_kmem_online(). Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b6c1a8af5b |
mm: memcontrol: add new kernel parameter cgroup.memory=nobpf
Add new kernel parameter cgroup.memory=nobpf to allow user disable bpf memory accounting. This is a preparation for the followup patch. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20230210154734.4416-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
36c7b4db7c |
mm: multi-gen LRU: section for memcg LRU
Move memcg LRU code into a dedicated section. Improve the design doc to outline its architecture. Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
75376c6fb9 |
mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()
Only one caller doesn't have a folio, so move the page_folio() call to that one caller from mem_cgroup_css_from_folio(). Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5ab0fc155d |
Sync mm-stable with mm-hotfixes-stable to pick up dependent patches
Merge branch 'mm-hotfixes-stable' into mm-stable |
||
![]() |
55ab834a86 |
Revert "mm: add nodes= arg to memory.reclaim"
This reverts commit |
||
![]() |
becacb04fd |
mm: memcg: add folio_memcg_check()
Patch series "mm: convert page_idle/damon to use folios", v4. This patch (of 8): Convert page_memcg_check() into folio_memcg_check() and add a page_memcg_check() wrapper. The behaviour of page_memcg_check() is unchanged; tail pages always had a NULL ->memcg_data. Link: https://lkml.kernel.org/r/20221230070849.63358-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20221230070849.63358-2-wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |