mirror of
https://github.com/nxp-imx/linux-imx.git
synced 2025-07-16 22:29:37 +02:00
ef901b38d3
363 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
586b222d74 |
Scheduler changes for v6.4:
- Allow unprivileged PSI poll()ing - Fix performance regression introduced by mm_cid - Improve livepatch stalls by adding livepatch task switching to cond_resched(), this resolves livepatching busy-loop stalls with certain CPU-bound kthreads. - Improve sched_move_task() performance on autogroup configs. - On core-scheduling CPUs, avoid selecting throttled tasks to run - Misc cleanups, fixes and improvements. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK39cRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hXPhAAk2WqOV2cW4BjSCHjWWE05IfTb0HMn8si mFGBAnr1GIkJRvICAusAwDU3FcmP5mWyXA+LK110d3x4fKJP15vCD5ru5lHnBfX7 fSD+Ml8uM4Xlp8iUoQspilbQwmWkQSwhudbDs3Nj7XGUzJCvNgm1sM3xPRDlqSJ5 6zumfVOPTfzSGcZY3a8sMuJnCepZHLRR6NkLzo/DuI1NMy2Jw1dK43dh77AO1mBF M53PF2IQgm6Wu/67p2k5eDq4c0AKL4PyIb4dRTGOPyljWMf41n28jwMv1tjlvu+Y uT0JD8MJSrFiylyT41x7Asr7orAGXj3cPhShK5R0vrutx/SbqBiaaE1MO9U3aC3B 7xVXEORHWD6KIDqTvzmWGrMBkIdyWB6CLk6EJKr3MqM9hUtP2ift7bkAgIad9h+4 G9DdVePGoCyh/TQtJ9EPIULAYeu9mmDZe8rTQ8C5MCSg//05/CTMgBbb0NiFWhnd 0JQl1B0nNUA87whVUxK8Hfu4DLh7m9jrzgQr9Ww8/FwQ6tQHBOKWgDdbv45ckkaG cJIQt/+vLilddazc8u8E+BGaD5w2uIYF0uL7kvG6Q5oARX06AZ5dj1m06vhZe/Ym laOVZEpJsbQnxviY6jwj1n+CSB9aK7feiQfDePBPbpJGGUHyZoKrnLN6wmW2se+H VCHtdgsEl5I= =Hgci -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Allow unprivileged PSI poll()ing - Fix performance regression introduced by mm_cid - Improve livepatch stalls by adding livepatch task switching to cond_resched(). This resolves livepatching busy-loop stalls with certain CPU-bound kthreads - Improve sched_move_task() performance on autogroup configs - On core-scheduling CPUs, avoid selecting throttled tasks to run - Misc cleanups, fixes and improvements * tag 'sched-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/clock: Fix local_clock() before sched_clock_init() sched/rt: Fix bad task migration for rt tasks sched: Fix performance regression introduced by mm_cid sched/core: Make sched_dynamic_mutex static sched/psi: Allow unprivileged polling of N*2s period sched/psi: Extract update_triggers side effect sched/psi: Rename existing poll members in preparation sched/psi: Rearrange polling code in preparation sched/fair: Fix inaccurate tally of ttwu_move_affine vhost: Fix livepatch timeouts in vhost_worker() livepatch,sched: Add livepatch task switching to cond_resched() livepatch: Skip task_call_func() for current task livepatch: Convert stack entries array to percpu sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization sched/core: Reduce cost of sched_move_task when config autogroup sched/core: Avoid selecting the task that is throttled to run when core-sched enable sched/topology: Make sched_energy_mutex,update static |
||
![]() |
6e98b09da9 |
Networking changes for 6.4.
Core ---- - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances. - Reduce compound page head access for zero-copy data transfers. - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible. - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance. - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking. - Add lockless accesses annotation to sk_err[_soft]. - Optimize again the skb struct layout. - Extends the skb drop reasons to make it usable by multiple subsystems. - Better const qualifier awareness for socket casts. BPF --- - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses. - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward. - Add more precise memory usage reporting for all BPF map types. - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params. - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton. - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities. - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc. - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps. - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps. - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree. - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them. - Add ARM32 USDT support to libbpf. - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations. Protocols --------- - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address. - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition. - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf. - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures. - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers. - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction. - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore. - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter --------- - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged. - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support. - The iptables 32bit compat interface isn't compiled in by default anymore. - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used. - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device. Driver API ---------- - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time. - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them. - Allow the page_pool to directly recycle the pages from safely localized NAPI. - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization. - Add YNL support for user headers and struct attrs. - Add partial YNL specification for devlink. - Add partial YNL specification for ethtool. - Add tc-mqprio and tc-taprio support for preemptible traffic classes. - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device. - Add basic LED support for switch/phy. - Add NAPI documentation, stop relaying on external links. - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space. - Add transceiver support and improve the error messages for CAN-FD controllers. New hardware / drivers ---------------------- - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers ------- - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors. - add support for configuring max SDU for each Tx queue. - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll. - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates. - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ vP7ugFnttneg =ITVa -----END PGP SIGNATURE----- Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the default value allows for better BIG TCP performances - Reduce compound page head access for zero-copy data transfers - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible - Threaded NAPI improvements, adding defer skb free support and unneeded softirq avoidance - Address dst_entry reference count scalability issues, via false sharing avoidance and optimize refcount tracking - Add lockless accesses annotation to sk_err[_soft] - Optimize again the skb struct layout - Extends the skb drop reasons to make it usable by multiple subsystems - Better const qualifier awareness for socket casts BPF: - Add skb and XDP typed dynptrs which allow BPF programs for more ergonomic and less brittle iteration through data and variable-sized accesses - Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward - Add more precise memory usage reporting for all BPF map types - Adds support for using {FOU,GUE} encap with an ipip device operating in collect_md mode and add a set of BPF kfuncs for controlling encap params - Allow BPF programs to detect at load time whether a particular kfunc exists or not, and also add support for this in light skeleton - Bigger batch of BPF verifier improvements to prepare for upcoming BPF open-coded iterators allowing for less restrictive looping capabilities - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF programs to NULL-check before passing such pointers into kfunc - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in local storage maps - Enable RCU semantics for task BPF kptrs and allow referenced kptr tasks to be stored in BPF maps - Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree - Add BPF verifier support for ST instructions in convert_ctx_access() which will help new -mcpu=v4 clang flag to start emitting them - Add ARM32 USDT support to libbpf - Improve bpftool's visual program dump which produces the control flow graph in a DOT format by adding C source inline annotations Protocols: - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value indicates the provenance of the IP address - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition - Add the handshake upcall mechanism, allowing the user-space to implement generic TLS handshake on kernel's behalf - Bridge: support per-{Port, VLAN} neighbor suppression, increasing resilience to nodes failures - SCTP: add support for Fair Capacity and Weighted Fair Queueing schedulers - MPTCP: delay first subflow allocation up to its first usage. This will allow for later better LSM interaction - xfrm: Remove inner/outer modes from input/output path. These are not needed anymore - WiFi: - reduced neighbor report (RNR) handling for AP mode - HW timestamping support - support for randomized auth/deauth TA for PASN privacy - per-link debugfs for multi-link - TC offload support for mac80211 drivers - mac80211 mesh fast-xmit and fast-rx support - enable Wi-Fi 7 (EHT) mesh support Netfilter: - Add nf_tables 'brouting' support, to force a packet to be routed instead of being bridged - Update bridge netfilter and ovs conntrack helpers to handle IPv6 Jumbo packets properly, i.e. fetch the packet length from hop-by-hop extension header. This is needed for BIT TCP support - The iptables 32bit compat interface isn't compiled in by default anymore - Move ip(6)tables builtin icmp matches to the udptcp one. This has the advantage that icmp/icmpv6 match doesn't load the iptables/ip6tables modules anymore when iptables-nft is used - Extended netlink error report for netdevice in flowtables and netdev/chains. Allow for incrementally add/delete devices to netdev basechain. Allow to create netdev chain without device Driver API: - Remove redundant Device Control Error Reporting Enable, as PCI core has already error reporting enabled at enumeration time - Move Multicast DB netlink handlers to core, allowing devices other then bridge to use them - Allow the page_pool to directly recycle the pages from safely localized NAPI - Implement lockless TX queue stop/wake combo macros, allowing for further code de-duplication and sanitization - Add YNL support for user headers and struct attrs - Add partial YNL specification for devlink - Add partial YNL specification for ethtool - Add tc-mqprio and tc-taprio support for preemptible traffic classes - Add tx push buf len param to ethtool, specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device - Add basic LED support for switch/phy - Add NAPI documentation, stop relaying on external links - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory work to make the hardware timestamping layer selectable by user space - Add transceiver support and improve the error messages for CAN-FD controllers New hardware / drivers: - Ethernet: - AMD/Pensando core device support - MediaTek MT7981 SoC - MediaTek MT7988 SoC - Broadcom BCM53134 embedded switch - Texas Instruments CPSW9G ethernet switch - Qualcomm EMAC3 DWMAC ethernet - StarFive JH7110 SoC - NXP CBTX ethernet PHY - WiFi: - Apple M1 Pro/Max devices - RealTek rtl8710bu/rtl8188gu - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset - Bluetooth: - Realtek RTL8821CS, RTL8851B, RTL8852BS - Mediatek MT7663, MT7922 - NXP w8997 - Actions Semi ATS2851 - QTI WCN6855 - Marvell 88W8997 - Can: - STMicroelectronics bxcan stm32f429 Drivers: - Ethernet NICs: - Intel (1G, icg): - add tracking and reporting of QBV config errors - add support for configuring max SDU for each Tx queue - Intel (100G, ice): - refactor mailbox overflow detection to support Scalable IOV - GNSS interface optimization - Intel (i40e): - support XDP multi-buffer - nVidia/Mellanox: - add the support for linux bridge multicast offload - enable TC offload for egress and engress MACVLAN over bond - add support for VxLAN GBP encap/decap flows offload - extend packet offload to fully support libreswan - support tunnel mode in mlx5 IPsec packet offload - extend XDP multi-buffer support - support MACsec VLAN offload - add support for dynamic msix vectors allocation - drop RX page_cache and fully use page_pool - implement thermal zone to report NIC temperature - Netronome/Corigine: - add support for multi-zone conntrack offload - Solarflare/Xilinx: - support offloading TC VLAN push/pop actions to the MAE - support TC decap rules - support unicast PTP - Other NICs: - Broadcom (bnxt): enforce software based freq adjustments only on shared PHC NIC - RealTek (r8169): refactor to addess ASPM issues during NAPI poll - Micrel (lan8841): add support for PTP_PF_PEROUT - Cadence (macb): enable PTP unicast - Engleder (tsnep): add XDP socket zero-copy support - virtio-net: implement exact header length guest feature - veth: add page_pool support for page recycling - vxlan: add MDB data path support - gve: add XDP support for GQI-QPL format - geneve: accept every ethertype - macvlan: allow some packets to bypass broadcast queue - mana: add support for jumbo frame - Ethernet high-speed switches: - Microchip (sparx5): Add support for TC flower templates - Ethernet embedded switches: - Broadcom (b54): - configure 6318 and 63268 RGMII ports - Marvell (mv88e6xxx): - faster C45 bus scan - Microchip: - lan966x: - add support for IS1 VCAP - better TX/RX from/to CPU performances - ksz9477: add ETS Qdisc support - ksz8: enhance static MAC table operations and error handling - sama7g5: add PTP capability - NXP (ocelot): - add support for external ports - add support for preemptible traffic classes - Texas Instruments: - add CPSWxG SGMII support for J7200 and J721E - Intel WiFi (iwlwifi): - preparation for Wi-Fi 7 EHT and multi-link support - EHT (Wi-Fi 7) sniffer support - hardware timestamping support for some devices/firwmares - TX beacon protection on newer hardware - Qualcomm 802.11ax WiFi (ath11k): - MU-MIMO parameters support - ack signal support for management packets - RealTek WiFi (rtw88): - SDIO bus support - better support for some SDIO devices (e.g. MAC address from efuse) - RealTek WiFi (rtw89): - HW scan support for 8852b - better support for 6 GHz scanning - support for various newer firmware APIs - framework firmware backwards compatibility - MediaTek WiFi (mt76): - P2P support - mesh A-MSDU support - EHT (Wi-Fi 7) support - coredump support" * tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits) net: phy: hide the PHYLIB_LEDS knob net: phy: marvell-88x2222: remove unnecessary (void*) conversions tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. net: amd: Fix link leak when verifying config failed net: phy: marvell: Fix inconsistent indenting in led_blink_set lan966x: Don't use xdp_frame when action is XDP_TX tsnep: Add XDP socket zero-copy TX support tsnep: Add XDP socket zero-copy RX support tsnep: Move skb receive action to separate function tsnep: Add functions for queue enable/disable tsnep: Rework TX/RX queue initialization tsnep: Replace modulo operation with mask net: phy: dp83867: Add led_brightness_set support net: phy: Fix reading LED reg property drivers: nfc: nfcsim: remove return value check of `dev_dir` net: phy: dp83867: Remove unnecessary (void*) conversions net: ethtool: coalesce: try to make user settings stick twice net: mana: Check if netdev/napi_alloc_frag returns single page net: mana: Rename mana_refill_rxoob and remove some empty lines net: veth: add page_pool stats ... |
||
![]() |
2f31fa029d |
cgroup_get_from_fd(): switch to fdget_raw()
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
![]() |
d82caa2735 |
sched/psi: Allow unprivileged polling of N*2s period
PSI offers 2 mechanisms to get information about a specific resource pressure. One is reading from /proc/pressure/<resource>, which gives average pressures aggregated every 2s. The other is creating a pollable fd for a specific resource and cgroup. The trigger creation requires CAP_SYS_RESOURCE, and gives the possibility to pick specific time window and threshold, spawing an RT thread to aggregate the data. Systemd would like to provide containers the option to monitor pressure on their own cgroup and sub-cgroups. For example, if systemd launches a container that itself then launches services, the container should have the ability to poll() for pressure in individual services. But neither the container nor the services are privileged. This patch implements a mechanism to allow unprivileged users to create pressure triggers. The difference with privileged triggers creation is that unprivileged ones must have a time window that's a multiple of 2s. This is so that we can avoid unrestricted spawning of rt threads, and use instead the same aggregation mechanism done for the averages, which runs independently of any triggers. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20230330105418.77061-5-cerasuolodomenico@gmail.com |
||
![]() |
4cdb91b0de |
cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
b8a2e3f93d |
cgroup: Make current_cgns_cgroup_dfl() safe to call after exit_task_namespace()
The commit |
||
![]() |
4609e1f18e
|
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
![]() |
7e68dd7d07 |
Networking changes for 6.2.
Core ---- - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations. - Add inet drop monitor support. - A few GRO performance improvements. - Add infrastructure for atomic dev stats, addressing long standing data races. - De-duplicate common code between OVS and conntrack offloading infrastructure. - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements. - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs. - Add the helper support for connection-tracking OVS offload. BPF --- - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF. - Make cgroup local storage available to non-cgroup attached BPF programs. - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers. - A relevant bunch of BPF verifier fixes and improvements. - Veristat tool improvements to support custom filtering, sorting, and replay of results. - Add LLVM disassembler as default library for dumping JITed code. - Lots of new BPF documentation for various BPF maps. - Add bpf_rcu_read_{,un}lock() support for sleepable programs. - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs. - Add support storing struct task_struct objects as kptrs in maps. - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values. - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions. Protocols --------- - TCP: implement Protective Load Balancing across switch links. - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path. - UDP: Introduce optional per-netns hash lookup table. - IPv6: simplify and cleanup sockets disposal. - Netlink: support different type policies for each generic netlink operation. - MPTCP: add MSG_FASTOPEN and FastOpen listener side support. - MPTCP: add netlink notification support for listener sockets events. - SCTP: add VRF support, allowing sctp sockets binding to VRF devices. - Add bridging MAC Authentication Bypass (MAB) support. - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios. - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage. - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading. - IPSec: extended ack support for more descriptive XFRM error reporting. - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking. - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks. - Tun: bump the link speed from 10Mbps to 10Gbps. - Tun/VirtioNet: implement UDP segmentation offload support. Driver API ---------- - PHY/SFP: improve power level switching between standard level 1 and the higher power levels. - New API for netdev <-> devlink_port linkage. - PTP: convert existing drivers to new frequency adjustment implementation. - DSA: add support for rx offloading. - Autoload DSA tagging driver when dynamically changing protocol. - Add new PCP and APPTRUST attributes to Data Center Bridging. - Add configuration support for 800Gbps link speed. - Add devlink port function attribute to enable/disable RoCE and migratable. - Extend devlink-rate to support strict prioriry and weighted fair queuing. - Add devlink support to directly reading from region memory. - New device tree helper to fetch MAC address from nvmem. - New big TCP helper to simplify temporary header stripping. New hardware / drivers ---------------------- - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches. - Marvel Prestera AC5X Ethernet Switch. - WangXun 10 Gigabit NIC. - Motorcomm yt8521 Gigabit Ethernet. - Microchip ksz9563 Gigabit Ethernet Switch. - Microsoft Azure Network Adapter. - Linux Automation 10Base-T1L adapter. - PHY: - Aquantia AQR112 and AQR412. - Motorcomm YT8531S. - PTP: - Orolia ART-CARD. - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices. - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices. - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets. - Realtek RTL8852BE and RTL8723DS. - Cypress.CYW4373A0 WiFi + Bluetooth combo device. Drivers ------- - CAN: - gs_usb: bus error reporting support. - kvaser_usb: listen only and bus error reporting support. - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping. - implement devlink-rate support. - support direct read from memory. - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate. - Support for enhanced events compression. - extend H/W offload packet manipulation capabilities. - implement IPSec packet offload mode. - nVidia/Mellanox (mlx4): - better big TCP support. - Netronome Ethernet NICs (nfp): - IPsec offload support. - add support for multicast filter. - Broadcom: - RSS and PTP support improvements. - AMD/SolarFlare: - netlink extened ack improvements. - add basic flower matches to offload, and related stats. - Virtual NICs: - ibmvnic: introduce affinity hint support. - small / embedded: - FreeScale fec: add initial XDP support. - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood. - TI am65-cpsw: add suspend/resume support. - Mediatek MT7986: add RX wireless wthernet dispatch support. - Realtek 8169: enable GRO software interrupt coalescing per default. - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP. - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support. - add ip6gre support. - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support. - enable flow offload support. - Renesas: - add rswitch R-Car Gen4 gPTP support. - Microchip (lan966x): - add full XDP support. - add TC H/W offload via VCAP. - enable PTP on bridge interfaces. - Microchip (ksz8): - add MTU support for KSZ8 series. - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan. - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support. - add ack signal support. - enable coredump support. - remain_on_channel support. - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities. - 320 MHz channels support. - RealTek WiFi (rtw89): - new dynamic header firmware format support. - wake-over-WLAN support. Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf 5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0 QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm V8gnt1EL+0cc =CbJC -----END PGP SIGNATURE----- Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: Introduce optional per-netns hash lookup table - IPv6: simplify and cleanup sockets disposal - Netlink: support different type policies for each generic netlink operation - MPTCP: add MSG_FASTOPEN and FastOpen listener side support - MPTCP: add netlink notification support for listener sockets events - SCTP: add VRF support, allowing sctp sockets binding to VRF devices - Add bridging MAC Authentication Bypass (MAB) support - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading - IPSec: extended ack support for more descriptive XFRM error reporting - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks - Tun: bump the link speed from 10Mbps to 10Gbps - Tun/VirtioNet: implement UDP segmentation offload support Driver API: - PHY/SFP: improve power level switching between standard level 1 and the higher power levels - New API for netdev <-> devlink_port linkage - PTP: convert existing drivers to new frequency adjustment implementation - DSA: add support for rx offloading - Autoload DSA tagging driver when dynamically changing protocol - Add new PCP and APPTRUST attributes to Data Center Bridging - Add configuration support for 800Gbps link speed - Add devlink port function attribute to enable/disable RoCE and migratable - Extend devlink-rate to support strict prioriry and weighted fair queuing - Add devlink support to directly reading from region memory - New device tree helper to fetch MAC address from nvmem - New big TCP helper to simplify temporary header stripping New hardware / drivers: - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches - Marvel Prestera AC5X Ethernet Switch - WangXun 10 Gigabit NIC - Motorcomm yt8521 Gigabit Ethernet - Microchip ksz9563 Gigabit Ethernet Switch - Microsoft Azure Network Adapter - Linux Automation 10Base-T1L adapter - PHY: - Aquantia AQR112 and AQR412 - Motorcomm YT8531S - PTP: - Orolia ART-CARD - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets - Realtek RTL8852BE and RTL8723DS - Cypress.CYW4373A0 WiFi + Bluetooth combo device Drivers: - CAN: - gs_usb: bus error reporting support - kvaser_usb: listen only and bus error reporting support - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping - implement devlink-rate support - support direct read from memory - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate - Support for enhanced events compression - extend H/W offload packet manipulation capabilities - implement IPSec packet offload mode - nVidia/Mellanox (mlx4): - better big TCP support - Netronome Ethernet NICs (nfp): - IPsec offload support - add support for multicast filter - Broadcom: - RSS and PTP support improvements - AMD/SolarFlare: - netlink extened ack improvements - add basic flower matches to offload, and related stats - Virtual NICs: - ibmvnic: introduce affinity hint support - small / embedded: - FreeScale fec: add initial XDP support - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood - TI am65-cpsw: add suspend/resume support - Mediatek MT7986: add RX wireless wthernet dispatch support - Realtek 8169: enable GRO software interrupt coalescing per default - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support - add ip6gre support - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support - enable flow offload support - Renesas: - add rswitch R-Car Gen4 gPTP support - Microchip (lan966x): - add full XDP support - add TC H/W offload via VCAP - enable PTP on bridge interfaces - Microchip (ksz8): - add MTU support for KSZ8 series - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support - add ack signal support - enable coredump support - remain_on_channel support - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities - 320 MHz channels support - RealTek WiFi (rtw89): - new dynamic header firmware format support - wake-over-WLAN support" * tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits) ipvs: fix type warning in do_div() on 32 bit net: lan966x: Remove a useless test in lan966x_ptp_add_trap() net: ipa: add IPA v4.7 support dt-bindings: net: qcom,ipa: Add SM6350 compatible bnxt: Use generic HBH removal helper in tx path IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver selftests: forwarding: Add bridge MDB test selftests: forwarding: Rename bridge_mdb test bridge: mcast: Support replacement of MDB port group entries bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add a flag for user installed source entries bridge: mcast: Expose __br_multicast_del_group_src() bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a centralized error path bridge: mcast: Place netlink policy before validation functions bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Do not derive entry type from its filter mode ... |
||
![]() |
a312a8cc3c |
cgroup changes for v6.2-rc1
Nothing too interesting. * Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations kprobable. * A couple cpuset optimizations. * Other misc changes including doc and test updates. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY5bHvg4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGcYrAQCfrlzrbWw6gTQ7fmr0Avxjy5FxLjsdzEGPcmGY ByEMhgD/VdUf3zI/Khr91Gsi5JXQxQf7a5caD369xupRWUWjqA8= =Nf+E -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting: - Add CONFIG_DEBUG_GROUP_REF which makes cgroup refcnt operations kprobable - A couple cpuset optimizations - Other misc changes including doc and test updates" * tag 'cgroup-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq() cgroup/cpuset: Improve cpuset_css_alloc() description kselftest/cgroup: Add cleanup() to test_cpuset_prs.sh cgroup/cpuset: Optimize cpuset_attach() on v2 cgroup/cpuset: Skip spread flags update on v2 kselftest/cgroup: Fix gathering number of CPUs cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF cgroup: Implement DEBUG_CGROUP_REF |
||
![]() |
674b745e22 |
cgroup: remove rcu_read_lock()/rcu_read_unlock() in critical section of spin_lock_irq()
spin_lock_irq() already disable preempt, so remove rcu_read_lock(). Signed-off-by: Ran Tian <tianran_trtr@163.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
b54a0d4094 |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2GuKgAKCRDbK58LschI gy32AP9PI0e/bUGDExKJ8g97PeeEtnpj4TTI6g+XKILtYnyXlgD/Rk4j2D/f3IBF Ha9TmqYvAUim+U/g50vUrNuoNLNJ5w8= =OKC1 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2022-11-02 We've added 70 non-merge commits during the last 14 day(s) which contain a total of 96 files changed, 3203 insertions(+), 640 deletions(-). The main changes are: 1) Make cgroup local storage available to non-cgroup attached BPF programs such as tc BPF ones, from Yonghong Song. 2) Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers, from Martin KaFai Lau. 3) Add LLVM disassembler as default library for dumping JITed code in bpftool, from Quentin Monnet. 4) Various kprobe_multi_link fixes related to kernel modules, from Jiri Olsa. 5) Optimize x86-64 JIT with emitting BMI2-based shift instructions, from Jie Meng. 6) Improve BPF verifier's memory type compatibility for map key/value arguments, from Dave Marchevsky. 7) Only create mmap-able data section maps in libbpf when data is exposed via skeletons, from Andrii Nakryiko. 8) Add an autoattach option for bpftool to load all object assets, from Wang Yufen. 9) Various memory handling fixes for libbpf and BPF selftests, from Xu Kuohai. 10) Initial support for BPF selftest's vmtest.sh on arm64, from Manu Bretelle. 11) Improve libbpf's BTF handling to dedup identical structs, from Alan Maguire. 12) Add BPF CI and denylist documentation for BPF selftests, from Daniel Müller. 13) Check BPF cpumap max_entries before doing allocation work, from Florian Lehner. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits) samples/bpf: Fix typo in README bpf: Remove the obsolte u64_stats_fetch_*_irq() users. bpf: check max_entries before allocating memory bpf: Fix a typo in comment for DFS algorithm bpftool: Fix spelling mistake "disasembler" -> "disassembler" selftests/bpf: Fix bpftool synctypes checking failure selftests/bpf: Panic on hard/soft lockup docs/bpf: Add documentation for new cgroup local storage selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x selftests/bpf: Add selftests for new cgroup local storage selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str bpftool: Support new cgroup local storage libbpf: Support new cgroup local storage bpf: Implement cgroup storage available to non-cgroup-attached bpf progs bpf: Refactor some inode/task/sk storage functions for reuse bpf: Make struct cgroup btf id global selftests/bpf: Tracing prog can still do lookup under busy lock selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection bpf: Add new bpf_task_storage_delete proto with no deadlock detection bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check ... ==================== Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
79a7f41f7f |
cgroup: cgroup refcnt functions should be exported when CONFIG_DEBUG_CGROUP_REF
|
||
![]() |
6ab428604f |
cgroup: Implement DEBUG_CGROUP_REF
It's really difficult to debug when cgroup or css refs leak. Let's add a debug option to force the refcnt function to not be inlined so that they can be kprobed for debugging. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
c4bcfb38a9 |
bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage. There already exists a local storage implementation for cgroup-attached bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper bpf_get_local_storage(). But there are use cases such that non-cgroup attached bpf progs wants to access cgroup local storage data. For example, tc egress prog has access to sk and cgroup. It is possible to use sk local storage to emulate cgroup local storage by storing data in socket. But this is a waste as it could be lots of sockets belonging to a particular cgroup. Alternatively, a separate map can be created with cgroup id as the key. But this will introduce additional overhead to manipulate the new map. A cgroup local storage, similar to existing sk/inode/task storage, should help for this use case. The life-cycle of storage is managed with the life-cycle of the cgroup struct. i.e. the storage is destroyed along with the owning cgroup with a call to bpf_cgrp_storage_free() when cgroup itself is deleted. The userspace map operations can be done by using a cgroup fd as a key passed to the lookup, update and delete operations. Typically, the following code is used to get the current cgroup: struct task_struct *task = bpf_get_current_task_btf(); ... task->cgroups->dfl_cgrp ... and in structure task_struct definition: struct task_struct { .... struct css_set __rcu *cgroups; .... } With sleepable program, accessing task->cgroups is not protected by rcu_read_lock. So the current implementation only supports non-sleepable program and supporting sleepable program will be the next step together with adding rcu_read_lock protection for rcu tagged structures. Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The old cgroup storage supports bpf_get_local_storage() helper to get the cgroup data. The new cgroup storage helper bpf_cgrp_storage_get() can provide similar functionality. While old cgroup storage pre-allocates storage memory, the new mechanism can also pre-allocate with a user space bpf_map_update_elem() call to avoid potential run-time memory allocation failure. Therefore, the new cgroup storage can provide all functionality w.r.t. the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can be deprecated since the new one can provide the same functionality. Acked-by: David Vernet <void@manifault.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
bb1a114646 |
cgroup fixes for v6.1-rc1
* Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held. * Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file(). Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. * Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. * Add Josef Bacik as a blkcg maintainer. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY03MlA4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGTkUAQD7fNcSPuc2m/BvW+gySKQkp9PZMA6E6yOIqirc QKmIgwEAwWECW7RR1alhOGD50RtYkuYVsLD1+6Ka4eMHe+EhwA4= =kGLI -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file() Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. - Add Josef Bacik as a blkcg maintainer. * tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Update MAINTAINERS entry mm: cgroup: fix comments for get from fd/file helpers perf stat: Support old kernels for bperf cgroup counting bpf: cgroup_iter: support cgroup1 using cgroup fd cgroup: add cgroup_v1v2_get_from_[fd/file]() Revert "cgroup: enable cgroup_get_from_file() on cgroup1" cgroup: Reorganize css_set_lock and kernfs path processing |
||
![]() |
bd9a3dba18 |
PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW SIZD/S/hzdI= =91tB -----END PGP SIGNATURE----- Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull PSI updates from Ingo Molnar: - Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark. - New interface to turn PSI on/off on a per cgroup level. * tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Per-cgroup PSI accounting disable/re-enable interface sched/psi: Cache parent psi_group to speed up group iteration sched/psi: Consolidate cgroup_psi() sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure sched/psi: Remove NR_ONCPU task accounting sched/psi: Optimize task switch inside shared cgroups again sched/psi: Move private helpers to sched/stats.h sched/psi: Save percpu memory when !psi_cgroups_enabled sched/psi: Don't create cgroup PSI files when psi_disabled sched/psi: Fix periodic aggregation shut off |
||
![]() |
b675d4bdfe |
mm: cgroup: fix comments for get from fd/file helpers
Fix the documentation comments for cgroup_[v1v2_]get_from_[fd/file](). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
a6d1ce5951 |
cgroup: add cgroup_v1v2_get_from_[fd/file]()
Add cgroup_v1v2_get_from_fd() and cgroup_v1v2_get_from_file() that support both cgroup1 and cgroup2. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
03db771615 |
Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
This reverts commit
|
||
![]() |
46307fd6e2 |
cgroup: Reorganize css_set_lock and kernfs path processing
The commit |
||
![]() |
adf4bfc4a9 |
cgroup changes for v6.1-rc1.
* cpuset now support isolated cpus.partition type, which will enable dynamic
CPU isolation.
* pids.peak added to remember the max number of pids used.
* Holes in cgroup namespace plugged.
* Internal cleanups.
Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
follow-up cleanups and each merge commit has details.
Also,
|
||
![]() |
e8bc52cb8d |
Driver core changes for 6.1-rc1
Here is the big set of driver core and debug printk changes for 6.1-rc1. Included in here is: - dynamic debug updates for the core and the drm subsystem. The drm changes have all been acked by the relevant maintainers. - kernfs fixes for syzbot reported problems - kernfs refactors and updates for cgroup requirements - magic number cleanups and removals from the kernel tree (they were not being used and they really did not actually do anything.) - other tiny cleanups All of these have been in linux-next for a while with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY0BYUA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylozwCdFRlcghaf7XBUyNgRZRwMC+oQI8EAn1G/nEDE 6aFd2er41uK0IGQnSmYO =OK0k -----END PGP SIGNATURE----- Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core and debug printk changes for 6.1-rc1. Included in here is: - dynamic debug updates for the core and the drm subsystem. The drm changes have all been acked by the relevant maintainers - kernfs fixes for syzbot reported problems - kernfs refactors and updates for cgroup requirements - magic number cleanups and removals from the kernel tree (they were not being used and they really did not actually do anything) - other tiny cleanups All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits) docs: filesystems: sysfs: Make text and code for ->show() consistent Documentation: NBD_REQUEST_MAGIC isn't a magic number a.out: restore CMAGIC device property: Add const qualifier to device_get_match_data() parameter drm_print: add _ddebug descriptor to drm_*dbg prototypes drm_print: prefer bare printk KERN_DEBUG on generic fn drm_print: optimize drm_debug_enabled for jump-label drm-print: add drm_dbg_driver to improve namespace symmetry drm-print.h: include dyndbg header drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro drm_print: interpose drm_*dbg with forwarding macros drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers. drm_print: condense enum drm_debug_category debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs() Documentation: ENI155_MAGIC isn't a magic number Documentation: NBD_REPLY_MAGIC isn't a magic number nbd: remove define-only NBD_MAGIC, previously magic number Documentation: FW_HEADER_MAGIC isn't a magic number Documentation: EEPROM_MAGIC_VALUE isn't a magic number ... |
||
![]() |
accc3b4a57 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
![]() |
8619e94d3b |
cgroup: use strscpy() is more robust and safer
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
61c41711b1 |
cgroup: simplify code in cgroup_apply_control
It could directly return 'cgroup_update_dfl_csses' to simplify code. Signed-off-by: William Dean <williamsukatube@163.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
7e1eb5437d |
cgroup: Make cgroup_get_from_id() prettier
After merging 836ac87d ("cgroup: fix cgroup_get_from_id") into for-6.1, its combination with two commits in for-6.1 - |
||
![]() |
026e14a276 |
Merge branch 'for-6.0-fixes' into for-6.1
for-6.0 has the following fix for cgroup_get_from_id(). 836ac87d ("cgroup: fix cgroup_get_from_id") which conflicts with the following two commits in for-6.1. |
||
![]() |
df02452f3d |
cgroup: cgroup_get_from_id() must check the looked-up kn is a directory
cgroup has to be one kernfs dir, otherwise kernel panic is caused,
especially cgroup id is provide from userspace.
Reported-by: Marco Patalano <mpatalan@redhat.com>
Fixes:
|
||
![]() |
a791dc1353 |
Linux 6.0-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmMeQ2keHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYRMH+gLNHiGirGZlm2GQ tKaZQUy7MiXuIP0hGDonDIIIAmIVhnjm9MDG8KT4W8AvEd7ukncyYqJfwWeWQPhP 4mZcf6l3Z8Ke+qiaFpXpMPCxTyWcln1ox0EoNx2g9gdPxZntaRuuaTQVljUfTiey aVPHxve8ip3G7jDoJnuLSxESOqWxkb8v/SshBP1E5bF5BZ+cgZRqq7FNigFqxjbk wF29K09BVOPjdgkSvY/b0/SnL5KlSdMAv+FrPcJNGivcdIPgf/qJks5cI2HRUo7o CpKgbcLorCVyD+d+zLonJBwIy3arbmKD8JqYnfdTSIqVOUqHXWUDfeydsH32u1Gu lPSI2Hw= =7LTL -----END PGP SIGNATURE----- Merge 6.0-rc5 into driver-core-next We need the driver core and debugfs changes in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
34f26a1561 |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit
|
||
![]() |
57899a6610 |
sched/psi: Consolidate cgroup_psi()
cgroup_psi() can't return psi_group for root cgroup, so we have many open code "psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi". This patch move cgroup_psi() definition to <linux/psi.h>, in which we can return psi_system for root cgroup, so can handle all cgroups. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-9-zhouchengming@bytedance.com |
||
![]() |
52b1364ba0 |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com |
||
![]() |
58d8c2586c |
sched/psi: Don't create cgroup PSI files when psi_disabled
commit
|
||
![]() |
0fb7b6f9d3 |
Merge branch 'driver-core/driver-core-next'
Pull in dependent cgroup patches Signed-off-by: Peter Zijlstra <peterz@infradead.org> |
||
![]() |
8a693f7766 |
cgroup: Remove CFTYPE_PRESSURE
CFTYPE_PRESSURE is used to flag PSI related files so that they are not created if PSI is disabled during boot. It's a bit weird to use a generic flag to mark a specific file type. Let's instead move the PSI files into its own cftypes array and add/rm them conditionally. This is a bit more code but cleaner. No userland visible changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> |
||
![]() |
0083d27b21 |
cgroup: Improve cftype add/rm error handling
Let's track whether a cftype is currently added or not using a new flag __CFTYPE_ADDED so that duplicate operations can be failed safely and consistently allow using empty cftypes. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
dc79ec1b23 |
cgroup: Remove data-race around cgrp_dfl_visible
There's a seemingly harmless data-race around cgrp_dfl_visible detected by kernel concurrency sanitizer. Let's remove it by throwing WRITE/READ_ONCE at it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/netdev/20220819072256.fn7ctciefy4fc4cu@wittgenstein/ |
||
![]() |
e2691f6b44 |
cgroup: Implement cgroup_file_show()
Add cgroup_file_show() which allows toggling visibility of a cgroup file using the new kernfs_show(). This will be used to hide psi interface files on cgroups where it's disabled. Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220828050440.734579-10-tj@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
075b593f54 |
cgroup: Use cgroup_attach_{lock,unlock}() from cgroup_attach_task_all()
No behavior changes; preparing for potential locking changes in future. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by:Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
265efc941f |
Merge branch 'for-6.0-fixes' into for-6.1
Pulling to receive
|
||
![]() |
fa7e439cf9 |
cgroup: Homogenize cgroup_get_from_id() return value
Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()). This change also fixes commit |
||
![]() |
4534dee941 |
cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id
Cgroup ids are resolved in the global scope. That may be needed sometime
(in future) but currently it violates virtual view provided through
cgroup namespaces.
There are currently following users of the resolution:
- fc_appid_store
- bpf_iter_attach_cgroup
- mem_cgroup_get_from_ino
None of the is a called on behalf of kernel but the resolution is made
with proper userspace context, hence the default to current->nsproxy
makes sens. (This doesn't rule out cgroup_get_from_id with cgroup NS
parameter in the future.)
Since cgroup ids are defined on v2 hierarchy only, we simply check
existence in the cgroup namespace by looking at ancestry on the default
hierarchy.
Fixes:
|
||
![]() |
74e4b956eb |
cgroup: Honor caller's cgroup NS when resolving path
cgroup_get_from_path() is not widely used function. Its callers presume
the path is resolved under cgroup namespace. (There is one caller
currently and resolving in init NS won't make harm (netfilter). However,
future users may be subject to different effects when resolving
globally.)
Since, there's currently no use for the global resolution, modify the
existing function to take cgroup NS into account.
Fixes:
|
||
![]() |
880b0dd94f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c |
||
![]() |
763f4fb76e |
cgroup: Fix race condition at rebind_subsystems()
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().
Solution:
Add grace period before invalidating the removed rstat_css_node.
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes:
|
||
![]() |
4f7e723643 |
cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.
Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes:
|
||
![]() |
76b079ef4c |
sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
7f203bc89e |
cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers directly and this makes the array useless for finding an actual ancestor cgroup forcing cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up from cgroup_ancestor(). While at it, improve comments around cgroup_root->cgrp_ancestor_storage. This patch shouldn't cause user-visible behavior differences. v2: Update cgroup_ancestor() to use ->ancestors[]. v3: cgroup_root->cgrp_ancestor_storage's type is updated to match cgroup->ancestors[]. Better comments. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> |
||
![]() |
f3a2aebdd6 |
cgroup: enable cgroup_get_from_file() on cgroup1
cgroup_get_from_file() currently fails with -EBADF if called on cgroup v1. However, the current implementation works on cgroup v1 as well, so the restriction is unnecessary. This enabled cgroup_get_from_fd() to work on cgroup v1, which would be the only thing stopping bpf cgroup_iter from supporting cgroup v1. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220805214821.1058337-3-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
![]() |
b6bb70f9ab |
Several core optimizations:
* threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). * threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. * psi no longer allocates memory when disabled. along with some code cleanups. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYugHIQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGd+oAP9lfD3fTRdNo4qWV2VsZsYzoOxzNIuJSwN/dnYx IEbQOwD/cd2YMfeo6zcb427U/VfTFqjJjFK04OeljYtJU8fFywo= =sucy -----END PGP SIGNATURE----- Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several core optimizations: - threadgroup_rwsem write locking is skipped when configuring controllers in empty subtrees. Combined with CLONE_INTO_CGROUP, this allows the common static usage pattern to not grab threadgroup_rwsem at all (glibc still doesn't seem ready for CLONE_INTO_CGROUP unfortunately). - threadgroup_rwsem used to be put into non-percpu mode by default due to latency concerns in specific use cases. There's no reason for everyone else to pay for it. Make the behavior optional. - psi no longer allocates memory when disabled. ... along with some code cleanups" * tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Skip subtree root in cgroup_update_dfl_csses() cgroup: remove "no" prefixed mount options cgroup: Make !percpu threadgroup_rwsem operations optional cgroup: Add "no" prefixed mount options cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes psi: dont alloc memory for psi by default |
||
![]() |
265792d0de |
cgroup: Skip subtree root in cgroup_update_dfl_csses()
The cgroup_update_dfl_csses() function updates css associations when a cgroup's subtree_control file is modified. Any changes made to a cgroup's subtree_control file, however, will only affect its descendants but not the cgroup itself. So there is no point in migrating csses associated with that cgroup. We can skip them instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
c808f46323 |
cgroup: remove "no" prefixed mount options
|
||
![]() |
6a010a49b6 |
cgroup: Make !percpu threadgroup_rwsem operations optional
|
||
![]() |
30312730bd |
cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed variants so that they can be turned off too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> |
||
![]() |
671c11f061 |
cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup_update_dfl_csses() write-lock the threadgroup_rwsem as updating the csses can trigger process migrations. However, if the subtree doesn't contain any tasks, there aren't gonna be any cgroup migrations. This condition can be trivially detected by testing whether mgctx.preloaded_src_csets is empty. Elide write-locking threadgroup_rwsem if the subtree is empty. After this optimization, the usage pattern of creating a cgroup, enabling the necessary controllers, and then seeding it with CLONE_INTO_CGROUP and then removing the cgroup after it becomes empty doesn't need to write-lock threadgroup_rwsem at all. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> |
||
![]() |
d75cd55ae2 |
cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
We have: int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp) { ... /* mixables don't care */ if (cgroup_is_mixable(dst_cgrp)) return 0; /* * If @dst_cgrp is already or can become a thread root or is * threaded, it doesn't matter. */ if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp)) return 0; ... } but in fact the entry of cgroup_can_be_thread_root() covers case that checking cgroup_is_mixable() as following: static bool cgroup_can_be_thread_root(struct cgroup *cgrp) { /* mixables don't care */ if (cgroup_is_mixable(cgrp)) return true; ... } so explicitly checking in cgroup_migrate_vet_dst is unnecessary. Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
07fd5b6cdf |
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes:
|
||
![]() |
e210a89f5b |
cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
No funtionality change, but save us some lines. Signed-off-by: Lin Feng <linf@wangsu.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
5f69a6577b |
psi: dont alloc memory for psi by default
Memory about struct psi_group is allocated by default for each cgroup even if psi_disabled is true, in this case, these allocated memory is waste, so alloc memory for struct psi_group only when psi_disabled is false. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
b154a017c9 |
cgroup: remove the superfluous judgment
Remove the superfluous judgment since the function is never called for a root cgroup, as suggested by Tejun. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
29ed17389c |
cgroup: Make cgroup_debug static
Make cgroup_debug static since it's only used in cgroup.c Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
266d17a8c0 |
Driver core changes for 5.18-rc1
Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). There will be a merge conflict in drivers/power/supply/ab8500_chargalg.c with your tree, the merge conflict should be easy (take all the changes). All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYkG6PA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylMFwCfSIyAU4oLEgj+/Rfmx4o45cAVIWMAnit3zbdU wUUCGqKcOnTJEcW6dMPh =1VVi -----END PGP SIGNATURE----- Merge tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ... |
||
![]() |
2fce7ea0e0 |
Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "All trivial cleanups without meaningful behavior changes" * 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: cleanup comments cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment cgroup: rstat: retrieve current bstat to delta directly cgroup: rstat: use same convention to assign cgroup_base_stat |
||
![]() |
4a248f85b3 |
Merge 5.17-rc6 into driver-core-next
We need the driver core fix in here as well for future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
f2eb478f2f |
kernfs: move struct kernfs_root out of the public view.
There is no need to have struct kernfs_root be part of kernfs.h for the whole kernel to see and poke around it. Move it internal to kernfs code and provide a helper function, kernfs_root_to_node(), to handle the one field that kernfs users were directly accessing from the structure. Cc: Imran Khan <imran.f.khan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
![]() |
5c1ee56966 |
Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo: - Fix for a subtle bug in the recent release_agent permission check update - Fix for a long-standing race condition between cpuset and cpu hotplug - Comment updates * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix kernel-doc cgroup-v1: Correct privileges check in release_agent writes cgroup: clarify cgroup_css_set_fork() cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug |
||
![]() |
6d3971dab2 |
cgroup: clarify cgroup_css_set_fork()
With recent fixes for the permission checking when moving a task into a cgroup using a file descriptor to a cgroup's cgroup.procs file and calling write() it seems a good idea to clarify CLONE_INTO_CGROUP permission checking with a comment. Cc: Tejun Heo <tj@kernel.org> Cc: <cgroups@vger.kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
a06247c680 |
psi: Fix uaf issue when psi trigger is destroyed while being polled
With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger->event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.
Fixes:
|
||
![]() |
ffacbd11e2 |
cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment
Add the description of @kargs in cgroup_can_fork() and cgroup_post_fork() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. kernel/cgroup/cgroup.c:6235: warning: Function parameter or member 'kargs' not described in 'cgroup_can_fork' kernel/cgroup/cgroup.c:6296: warning: Function parameter or member 'kargs' not described in 'cgroup_post_fork' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
ea1ca66d3c |
Merge branch 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Nothing too interesting. The only two noticeable changes are a subtle cpuset behavior fix and trace event id field being expanded to u64 from int. Most others are code cleanups" * 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean cgroup/rstat: check updated_next only for root cgroup: rstat: explicitly put loop variant in while cgroup: return early if it is already on preloaded list cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy cgroup: Trace event cgroup id fields should be u64 cgroup: fix a typo in comment cgroup: get the wrong css for css_alloc() during cgroup_init_subsys() cgroup: rstat: Mark benign data race to silence KCSAN |
||
![]() |
8efd0d9c31 |
Networking changes for 5.17.
Core ---- - Defer freeing TCP skbs to the BH handler, whenever possible, or at least perform the freeing outside of the socket lock section to decrease cross-CPU allocator work and improve latency. - Add netdevice refcount tracking to locate sources of netdevice and net namespace refcount leaks. - Make Tx watchdog less intrusive - avoid pausing Tx and restarting all queues from a single CPU removing latency spikes. - Various small optimizations throughout the stack from Eric Dumazet. - Make netdev->dev_addr[] constant, force modifications to go via appropriate helpers to allow us to keep addresses in ordered data structures. - Replace unix_table_lock with per-hash locks, improving performance of bind() calls. - Extend skb drop tracepoint with a drop reason. - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW. BPF --- - New helpers: - bpf_find_vma(), find and inspect VMAs for profiling use cases - bpf_loop(), runtime-bounded loop helper trading some execution time for much faster (if at all converging) verification - bpf_strncmp(), improve performance, avoid compiler flakiness - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt() for tracing programs, all inlined by the verifier - Support BPF relocations (CO-RE) in the kernel loader. - Further the support for BTF_TYPE_TAG annotations. - Allow access to local storage in sleepable helpers. - Convert verifier argument types to a composable form with different attributes which can be shared across types (ro, maybe-null). - Prepare libbpf for upcoming v1.0 release by cleaning up APIs, creating new, extensible ones where missing and deprecating those to be removed. Protocols --------- - WiFi (mac80211/cfg80211): - notify user space about long "come back in N" AP responses, allow it to react to such temporary rejections - allow non-standard VHT MCS 10/11 rates - use coarse time in airtime fairness code to save CPU cycles - Bluetooth: - rework of HCI command execution serialization to use a common queue and work struct, and improve handling errors reported in the middle of a batch of commands - rework HCI event handling to use skb_pull_data, avoiding packet parsing pitfalls - support AOSP Bluetooth Quality Report - SMC: - support net namespaces, following the RDMA model - improve connection establishment latency by pre-clearing buffers - introduce TCP ULP for automatic redirection to SMC - Multi-Path TCP: - support ioctls: SIOCINQ, OUTQ, and OUTQNSD - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT, IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY - support cmsgs: TCP_INQ - improvements in the data scheduler (assigning data to subflows) - support fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP) - MCTP (Management Component Transport) over serial, as defined by DMTF spec DSP0253 - "MCTP Serial Transport Binding". Driver API ---------- - Support timestamping on bond interfaces in active/passive mode. - Introduce generic phylink link mode validation for drivers which don't have any quirks and where MAC capability bits fully express what's supported. Allow PCS layer to participate in the validation. Convert a number of drivers. - Add support to set/get size of buffers on the Rx rings and size of the tx copybreak buffer via ethtool. - Support offloading TC actions as first-class citizens rather than only as attributes of filters, improve sharing and device resource utilization. - WiFi (mac80211/cfg80211): - support forwarding offload (ndo_fill_forward_path) - support for background radar detection hardware - SA Query Procedures offload on the AP side New hardware / drivers ---------------------- - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with real-time requirements for isochronous communication with protocols like OPC UA Pub/Sub. - Qualcomm BAM-DMUX WWAN - driver for data channels of modems integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974 (qcom_bam_dmux). - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver with support for bridging, VLANs and multicast forwarding (lan966x). - iwlmei driver for co-operating between Intel's WiFi driver and Intel's Active Management Technology (AMT) devices. - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips - Bluetooth: - MediaTek MT7921 SDIO devices - Foxconn MT7922A - Realtek RTL8852AE Drivers ------- - Significantly improve performance in the datapaths of: lan78xx, ax88179_178a, lantiq_xrx200, bnxt. - Intel Ethernet NICs: - igb: support PTP/time PEROUT and EXTTS SDP functions on 82580/i354/i350 adapters - ixgbevf: new PF -> VF mailbox API which avoids the risk of mailbox corruption with ESXi - iavf: support configuration of VLAN features of finer granularity, stacked tags and filtering - ice: PTP support for new E822 devices with sub-ns precision - ice: support firmware activation without reboot - Mellanox Ethernet NICs (mlx5): - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - support TC forwarding when tunnel encap and decap happen between two ports of the same NIC - dynamically size and allow disabling various features to save resources for running in embedded / SmartNIC scenarios - Broadcom Ethernet NICs (bnxt): - use page frag allocator to improve Rx performance - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - Other Ethernet NICs: - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support - Microsoft cloud/virtual NIC (mana): - add XDP support (PASS, DROP, TX) - Mellanox Ethernet switches (mlxsw): - initial support for Spectrum-4 ASICs - VxLAN with IPv6 underlay - Marvell Ethernet switches (prestera): - support flower flow templates - add basic IP forwarding support - NXP embedded Ethernet switches (ocelot & felix): - support Per-Stream Filtering and Policing (PSFP) - enable cut-through forwarding between ports by default - support FDMA to improve packet Rx/Tx to CPU - Other embedded switches: - hellcreek: improve trapping management (STP and PTP) packets - qca8k: support link aggregation and port mirroring - Qualcomm 802.11ax WiFi (ath11k): - qca6390, wcn6855: enable 802.11 power save mode in station mode - BSS color change support - WCN6855 hw2.1 support - 11d scan offload support - scan MAC address randomization support - full monitor mode, only supported on QCN9074 - qca6390/wcn6855: report signal and tx bitrate - qca6390: rfkill support - qca6390/wcn6855: regdb.bin support - Intel WiFi (iwlwifi): - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS) in cooperation with the BIOS - support for Optimized Connectivity Experience (OCE) scan - support firmware API version 68 - lots of preparatory work for the upcoming Bz device family - MediaTek WiFi (mt76): - Specific Absorption Rate (SAR) support - mt7921: 160 MHz channel support - RealTek WiFi (rtw88): - Specific Absorption Rate (SAR) support - scan offload - Other WiFi NICs - ath10k: support fetching (pre-)calibration data from nvmem - brcmfmac: configure keep-alive packet on suspend - wcn36xx: beacon filter support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ 0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo= =KLm3 -----END PGP SIGNATURE----- Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Defer freeing TCP skbs to the BH handler, whenever possible, or at least perform the freeing outside of the socket lock section to decrease cross-CPU allocator work and improve latency. - Add netdevice refcount tracking to locate sources of netdevice and net namespace refcount leaks. - Make Tx watchdog less intrusive - avoid pausing Tx and restarting all queues from a single CPU removing latency spikes. - Various small optimizations throughout the stack from Eric Dumazet. - Make netdev->dev_addr[] constant, force modifications to go via appropriate helpers to allow us to keep addresses in ordered data structures. - Replace unix_table_lock with per-hash locks, improving performance of bind() calls. - Extend skb drop tracepoint with a drop reason. - Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW. BPF --- - New helpers: - bpf_find_vma(), find and inspect VMAs for profiling use cases - bpf_loop(), runtime-bounded loop helper trading some execution time for much faster (if at all converging) verification - bpf_strncmp(), improve performance, avoid compiler flakiness - bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt() for tracing programs, all inlined by the verifier - Support BPF relocations (CO-RE) in the kernel loader. - Further the support for BTF_TYPE_TAG annotations. - Allow access to local storage in sleepable helpers. - Convert verifier argument types to a composable form with different attributes which can be shared across types (ro, maybe-null). - Prepare libbpf for upcoming v1.0 release by cleaning up APIs, creating new, extensible ones where missing and deprecating those to be removed. Protocols --------- - WiFi (mac80211/cfg80211): - notify user space about long "come back in N" AP responses, allow it to react to such temporary rejections - allow non-standard VHT MCS 10/11 rates - use coarse time in airtime fairness code to save CPU cycles - Bluetooth: - rework of HCI command execution serialization to use a common queue and work struct, and improve handling errors reported in the middle of a batch of commands - rework HCI event handling to use skb_pull_data, avoiding packet parsing pitfalls - support AOSP Bluetooth Quality Report - SMC: - support net namespaces, following the RDMA model - improve connection establishment latency by pre-clearing buffers - introduce TCP ULP for automatic redirection to SMC - Multi-Path TCP: - support ioctls: SIOCINQ, OUTQ, and OUTQNSD - support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT, IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY - support cmsgs: TCP_INQ - improvements in the data scheduler (assigning data to subflows) - support fastclose option (quick shutdown of the full MPTCP connection, similar to TCP RST in regular TCP) - MCTP (Management Component Transport) over serial, as defined by DMTF spec DSP0253 - "MCTP Serial Transport Binding". Driver API ---------- - Support timestamping on bond interfaces in active/passive mode. - Introduce generic phylink link mode validation for drivers which don't have any quirks and where MAC capability bits fully express what's supported. Allow PCS layer to participate in the validation. Convert a number of drivers. - Add support to set/get size of buffers on the Rx rings and size of the tx copybreak buffer via ethtool. - Support offloading TC actions as first-class citizens rather than only as attributes of filters, improve sharing and device resource utilization. - WiFi (mac80211/cfg80211): - support forwarding offload (ndo_fill_forward_path) - support for background radar detection hardware - SA Query Procedures offload on the AP side New hardware / drivers ---------------------- - tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with real-time requirements for isochronous communication with protocols like OPC UA Pub/Sub. - Qualcomm BAM-DMUX WWAN - driver for data channels of modems integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974 (qcom_bam_dmux). - Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver with support for bridging, VLANs and multicast forwarding (lan966x). - iwlmei driver for co-operating between Intel's WiFi driver and Intel's Active Management Technology (AMT) devices. - mse102x - Vertexcom MSE102x Homeplug GreenPHY chips - Bluetooth: - MediaTek MT7921 SDIO devices - Foxconn MT7922A - Realtek RTL8852AE Drivers ------- - Significantly improve performance in the datapaths of: lan78xx, ax88179_178a, lantiq_xrx200, bnxt. - Intel Ethernet NICs: - igb: support PTP/time PEROUT and EXTTS SDP functions on 82580/i354/i350 adapters - ixgbevf: new PF -> VF mailbox API which avoids the risk of mailbox corruption with ESXi - iavf: support configuration of VLAN features of finer granularity, stacked tags and filtering - ice: PTP support for new E822 devices with sub-ns precision - ice: support firmware activation without reboot - Mellanox Ethernet NICs (mlx5): - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - support TC forwarding when tunnel encap and decap happen between two ports of the same NIC - dynamically size and allow disabling various features to save resources for running in embedded / SmartNIC scenarios - Broadcom Ethernet NICs (bnxt): - use page frag allocator to improve Rx performance - expose control over IRQ coalescing mode (CQE vs EQE) via ethtool - Other Ethernet NICs: - amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support - Microsoft cloud/virtual NIC (mana): - add XDP support (PASS, DROP, TX) - Mellanox Ethernet switches (mlxsw): - initial support for Spectrum-4 ASICs - VxLAN with IPv6 underlay - Marvell Ethernet switches (prestera): - support flower flow templates - add basic IP forwarding support - NXP embedded Ethernet switches (ocelot & felix): - support Per-Stream Filtering and Policing (PSFP) - enable cut-through forwarding between ports by default - support FDMA to improve packet Rx/Tx to CPU - Other embedded switches: - hellcreek: improve trapping management (STP and PTP) packets - qca8k: support link aggregation and port mirroring - Qualcomm 802.11ax WiFi (ath11k): - qca6390, wcn6855: enable 802.11 power save mode in station mode - BSS color change support - WCN6855 hw2.1 support - 11d scan offload support - scan MAC address randomization support - full monitor mode, only supported on QCN9074 - qca6390/wcn6855: report signal and tx bitrate - qca6390: rfkill support - qca6390/wcn6855: regdb.bin support - Intel WiFi (iwlwifi): - support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS) in cooperation with the BIOS - support for Optimized Connectivity Experience (OCE) scan - support firmware API version 68 - lots of preparatory work for the upcoming Bz device family - MediaTek WiFi (mt76): - Specific Absorption Rate (SAR) support - mt7921: 160 MHz channel support - RealTek WiFi (rtw88): - Specific Absorption Rate (SAR) support - scan offload - Other WiFi NICs - ath10k: support fetching (pre-)calibration data from nvmem - brcmfmac: configure keep-alive packet on suspend - wcn36xx: beacon filter support" * tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits) tcp: tcp_send_challenge_ack delete useless param `skb` net/qla3xxx: Remove useless DMA-32 fallback configuration rocker: Remove useless DMA-32 fallback configuration hinic: Remove useless DMA-32 fallback configuration lan743x: Remove useless DMA-32 fallback configuration net: enetc: Remove useless DMA-32 fallback configuration cxgb4vf: Remove useless DMA-32 fallback configuration cxgb4: Remove useless DMA-32 fallback configuration cxgb3: Remove useless DMA-32 fallback configuration bnx2x: Remove useless DMA-32 fallback configuration et131x: Remove useless DMA-32 fallback configuration be2net: Remove useless DMA-32 fallback configuration vmxnet3: Remove useless DMA-32 fallback configuration bna: Simplify DMA setting net: alteon: Simplify DMA setting myri10ge: Simplify DMA setting qlcnic: Simplify DMA setting net: allwinner: Fix print format page_pool: remove spinlock in page_pool_refill_alloc_cache() amt: fix wrong return type of amt_send_membership_update() ... |
||
![]() |
e574576416 |
cgroup: Use open-time cgroup namespace for process migration perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's cgroup namespace which is
a potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes cgroup remember the cgroup namespace at the time of open
and uses it for migration permission checks instad of current's. Note that
this only applies to cgroup2 as cgroup1 doesn't have namespace support.
This also fixes a use-after-free bug on cgroupns reported in
https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Note that backporting this fix also requires the preceding patch.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com
Fixes:
|
||
![]() |
0d2b5955b3 |
cgroup: Allocate cgroup_file_ctx for kernfs_open_file->priv
of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> |
||
![]() |
1756d7994a |
cgroup: Use open-time credentials for process migraton perm checks
cgroup process migration permission checks are performed at write time as
whether a given operation is allowed or not is dependent on the content of
the write - the PID. This currently uses current's credentials which is a
potential security weakness as it may allow scenarios where a less
privileged process tricks a more privileged one into writing into a fd that
it created.
This patch makes both cgroup2 and cgroup1 process migration interfaces to
use the credentials saved at the time of open (file->f_cred) instead of
current's.
Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes:
|
||
![]() |
aef2feda97 |
add missing bpf-cgroup.h includes
We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency, make sure those who actually need more than the definition of struct cgroup_bpf include bpf-cgroup.h explicitly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20211216025538.1649516-3-kuba@kernel.org |
||
![]() |
1815775e74 |
cgroup: return early if it is already on preloaded list
If a cset is already on preloaded list, this means we have already setup this cset properly for migration. This patch just relocates the root cgrp lookup which isn't used anyway when the cset is already on the preloaded list. [tj@kernel.org: rephrase the commit log] Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
8291471ea5 |
cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()
css_alloc() needs the parent css, while cgroup_css() gets current cgropu's css. So we are getting the wrong css during cgroup_init_subsys(). Fortunately, cgrp_dfl_root.cgrp's css is not set yet, so the value we pass to css_alloc() is NULL anyway. Let's pass NULL directly during init, since we know there is no parent yet. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
a85373fe44 |
Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - The misc controller now reports allocation rejections through misc.events instead of printking - cgroup_mutex usage is reduced to improve scalability of some operations - vhost helper threads are now assigned to the right cgroup on cgroup2 - Bug fixes * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c cgroup: Fix rootcg cpu.stat guest double counting cgroup: no need for cgroup_mutex for /proc/cgroups cgroup: remove cgroup_mutex from cgroupstats_build cgroup: reduce dependency on cgroup_mutex cgroup: cgroup-v1: do not exclude cgrp_dfl_root cgroup: Make rebind_subsystems() disable v2 controllers all at once docs/cgroup: add entry for misc.events misc_cgroup: remove error log to avoid log flood misc_cgroup: introduce misc.events to count failures |
||
![]() |
588e5d8766 |
cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
In commit 324bda9e6c5a("bpf: multi program support for cgroup+bpf") cgroup_bpf_*() called from kernel/bpf/syscall.c, but now they are only used in kernel/bpf/cgroup.c, so move these function to kernel/bpf/cgroup.c, like cgroup_bpf_replace(). Signed-off-by: He Fengqing <hefengqing@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
be28816971 |
cgroup: reduce dependency on cgroup_mutex
Currently cgroup_get_from_path() and cgroup_get_from_id() grab cgroup_mutex before traversing the default hierarchy to find the kernfs_node corresponding to the path/id and then extract the linked cgroup. Since cgroup_mutex is still held, it is guaranteed that the cgroup will be alive and the reference can be taken on it. However similar guarantee can be provided without depending on the cgroup_mutex and potentially reducing avenues of cgroup_mutex contentions. The kernfs_node's priv pointer is RCU protected pointer and with just rcu read lock we can grab the reference on the cgroup without cgroup_mutex. So, remove cgroup_mutex from them. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
04f8ef5643 |
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running the command as below: $mount -t cgroup -o none,name=foo cgroup cgroup/ $umount cgroup/ unreferenced object 0xc3585c40 (size 64): comm "mount", pid 425, jiffies 4294959825 (age 31.990s) hex dump (first 32 bytes): 01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(......... 00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC...... backtrace: [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c [<1f03679c>] cgroup_setup_root+0x174/0x37c [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0 [<f85b12fd>] vfs_get_tree+0x24/0x108 [<f55aec5c>] path_mount+0x384/0x988 [<e2d5e9cd>] do_mount+0x64/0x9c [<208c9cfe>] sys_mount+0xfc/0x1f4 [<06dd06e0>] ret_fast_syscall+0x0/0x48 [<a8308cb3>] 0xbeb4daa8 This is because that since the commit |
||
![]() |
78cc316e95 |
bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interrupt
If cgroup_sk_alloc() is called from interrupt context, then just assign the root cgroup to skcd->cgroup. Prior to commit |
||
![]() |
7ee285395b |
cgroup: Make rebind_subsystems() disable v2 controllers all at once
It was found that the following warning was displayed when remounting
controllers from cgroup v2 to v1:
[ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190
:
[ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190
[ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3
[ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202
[ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a
[ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000
[ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004
[ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000
[ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20
[ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000
[ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0
[ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8043.191804] PKRU: 55555554
[ 8043.194517] Call Trace:
[ 8043.196970] rebind_subsystems+0x18c/0x470
[ 8043.201070] cgroup_setup_root+0x16c/0x2f0
[ 8043.205177] cgroup1_root_to_use+0x204/0x2a0
[ 8043.209456] cgroup1_get_tree+0x3e/0x120
[ 8043.213384] vfs_get_tree+0x22/0xb0
[ 8043.216883] do_new_mount+0x176/0x2d0
[ 8043.220550] __x64_sys_mount+0x103/0x140
[ 8043.224474] do_syscall_64+0x38/0x90
[ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae
It was caused by the fact that rebind_subsystem() disables
controllers to be rebound one by one. If more than one disabled
controllers are originally from the default hierarchy, it means that
cgroup_apply_control_disable() will be called multiple times for the
same default hierarchy. A controller may be killed by css_kill() in
the first round. In the second round, the killed controller may not be
completely dead yet leading to the warning.
To avoid this problem, we collect all the ssid's of controllers that
needed to be disabled from the default hierarchy and then disable them
in one go instead of one by one.
Fixes:
|
||
![]() |
8520e224f5 |
bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used. Back in the days, commit |
||
![]() |
d20d30ebb1 |
cgroup: Avoid compiler warnings with no subsystems
As done before in commit
|
||
![]() |
1f8c543f14 |
cgroup: remove cgroup_mount from comments
Git rid of an outdated comment. Since cgroup was fully switched to fs_context, cgroup_mount() is gone and it's confusing to mention in comments of cgroup_kill_sb(). Delete it. Signed-off-by: zhaoxiaoqiang11 <zhaoxiaoqiang11@jd.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
bd31b9efbf |
SCSI misc on 20210702
This series consists of the usual driver updates (ufs, ibmvfc, megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with elx and mpi3mr being new drivers. The major core change is a rework to drop the status byte handling macros and the old bit shifted definitions and the rest of the updates are minor fixes. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYN7I6iYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishXpRAQCkngYZ 35yQrqOxgOk2pfrysE95tHrV1MfJm2U49NFTwAEAuZutEvBUTfBF+sbcJ06r6q7i H0hkJN/Io7enFs5v3WA= =zwIa -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, ibmvfc, megaraid_sas, lpfc, elx, mpi3mr, qedi, iscsi, storvsc, mpt3sas) with elx and mpi3mr being new drivers. The major core change is a rework to drop the status byte handling macros and the old bit shifted definitions and the rest of the updates are minor fixes" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (287 commits) scsi: aha1740: Avoid over-read of sense buffer scsi: arcmsr: Avoid over-read of sense buffer scsi: ips: Avoid over-read of sense buffer scsi: ufs: ufs-mediatek: Add missing of_node_put() in ufs_mtk_probe() scsi: elx: libefc: Fix IRQ restore in efc_domain_dispatch_frame() scsi: elx: libefc: Fix less than zero comparison of a unsigned int scsi: elx: efct: Fix pointer error checking in debugfs init scsi: elx: efct: Fix is_originator return code type scsi: elx: efct: Fix link error for _bad_cmpxchg scsi: elx: efct: Eliminate unnecessary boolean check in efct_hw_command_cancel() scsi: elx: efct: Do not use id uninitialized in efct_lio_setup_session() scsi: elx: efct: Fix error handling in efct_hw_init() scsi: elx: efct: Remove redundant initialization of variable lun scsi: elx: efct: Fix spelling mistake "Unexected" -> "Unexpected" scsi: lpfc: Fix build error in lpfc_scsi.c scsi: target: iscsi: Remove redundant continue statement scsi: qla4xxx: Remove redundant continue statement scsi: ppa: Switch to use module_parport_driver() scsi: imm: Switch to use module_parport_driver() scsi: mpt3sas: Fix error return value in _scsih_expander_add() ... |
||
![]() |
3dbdb38e28 |
Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - cgroup.kill is added which implements atomic killing of the whole subtree. Down the line, this should be able to replace the multiple userland implementations of "keep killing till empty". - PSI can now be turned off at boot time to avoid overhead for configurations which don't care about PSI. * 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: make per-cgroup pressure stall tracking configurable cgroup: Fix kernel-doc cgroup: inline cgroup_task_freeze() tests/cgroup: test cgroup.kill tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() tests/cgroup: use cgroup.kill in cg_killall() docs/cgroup: add entry for cgroup.kill cgroup: introduce cgroup.kill |
||
![]() |
c74d40e8b5 |
loop: charge i/o to mem and blk cg
The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
6b658c4863 |
scsi: cgroup: Add cgroup_get_from_id()
Add a new function, cgroup_get_from_id(), to retrieve the cgroup associated with a cgroup id. Also export the function cgroup_get_e_css() as this is needed in blk-cgroup.h. Link: https://lore.kernel.org/r/20210608043556.274139-2-muneendra.kumar@broadcom.com Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> |
||
![]() |
3958e2d0c3 |
cgroup: make per-cgroup pressure stall tracking configurable
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This causes additional overhead with psi_avgs_work being called for each cgroup in the hierarchy. psi_avgs_work has been highly optimized, however on systems with large number of cgroups the overhead becomes noticeable. Systems which use PSI only at the system level could avoid this overhead if PSI can be configured to skip per-cgroup stall accounting. Add "cgroup_disable=pressure" kernel command-line option to allow requesting system-wide only pressure stall accounting. When set, it keeps system-wide accounting under /proc/pressure/ but skips accounting for individual cgroups and does not expose PSI nodes in cgroup hierarchy. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
2ca11b0e04 |
cgroup: Fix kernel-doc
Fix function name in cgroup.c and rstat.c kernel-doc comment
to remove these warnings found by clang_w1.
kernel/cgroup/cgroup.c:2401: warning: expecting prototype for
cgroup_taskset_migrate(). Prototype was for cgroup_migrate_execute()
instead.
kernel/cgroup/rstat.c:233: warning: expecting prototype for
cgroup_rstat_flush_begin(). Prototype was for cgroup_rstat_flush_hold()
instead.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 'commit
|
||
![]() |
c2a1197154 | Merge branch 'for-5.13-fixes' into for-5.14 | ||
![]() |
08b2b6fdf6 |
cgroup: fix spelling mistakes
Fix some spelling mistakes in comments: hierarhcy ==> hierarchy automtically ==> automatically overriden ==> overridden In absense of .. or ==> In absence of .. and assocaited ==> associated taget ==> target initate ==> initiate succeded ==> succeeded curremt ==> current udpated ==> updated Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
45e1ba4083 |
cgroup: disable controllers at parse time
This patch effectively reverts the commit |
||
![]() |
f4f809f66b |
cgroup: inline cgroup_task_freeze()
After the introduction of the cgroup.kill there is only one call site of cgroup_task_freeze() left: cgroup_exit(). cgroup_task_freeze() is currently taking rcu_read_lock() to read task's cgroup flags, but because it's always called with css_set_lock locked, the rcu protection is excessive. Simplify the code by inlining cgroup_task_freeze(). v2: fix build Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
661ee62809 |
cgroup: introduce cgroup.kill
Introduce the cgroup.kill file. It does what it says on the tin and allows a caller to kill a cgroup by writing "1" into cgroup.kill. The file is available in non-root cgroups. Killing cgroups is a process directed operation, i.e. the whole thread-group is affected. Consequently trying to write to cgroup.kill in threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior aligns with cgroup.procs where reads in threaded-cgroups are rejected with EOPNOTSUPP. The cgroup.kill file is write-only since killing a cgroup is an event not which makes it different from e.g. freezer where a cgroup transitions between the two states. As with all new cgroup features cgroup.kill is recursive by default. Killing a cgroup is protected against concurrent migrations through the cgroup mutex. To protect against forkbombs and to mitigate the effect of racing forks a new CGRP_KILL css set lock protected flag is introduced that is set prior to killing a cgroup and unset after the cgroup has been killed. We can then check in cgroup_post_fork() where we hold the css set lock already whether the cgroup is currently being killed. If so we send the child a SIGKILL signal immediately taking it down as soon as it returns to userspace. To make the killing of the child semantically clean it is killed after all cgroup attachment operations have been finalized. There are various use-cases of this interface: - Containers usually have a conservative layout where each container usually has a delegated cgroup. For such layouts there is a 1:1 mapping between container and cgroup. If the container in addition uses a separate pid namespace then killing a container usually becomes a simple kill -9 <container-init-pid> from an ancestor pid namespace. However, there are quite a few scenarios where that isn't true. For example, there are containers that share the cgroup with other processes on purpose that are supposed to be bound to the lifetime of the container but are not in the same pidns of the container. Containers that are in a delegated cgroup but share the pid namespace with the host or other containers. - Service managers such as systemd use cgroups to group and organize processes belonging to a service. They usually rely on a recursive algorithm now to kill a service. With cgroup.kill this becomes a simple write to cgroup.kill. - Userspace OOM implementations can make good use of this feature to efficiently take down whole cgroups quickly. - The kill program can gain a new kill --cgroup /sys/fs/cgroup/delegated flag to take down cgroups. A few observations about the semantics: - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is not specified we are not taking cgroup mutex meaning the cgroup can be killed while a process in that cgroup is forking. If the kill request happens right before cgroup_can_fork() and before the parent grabs its siglock the parent is guaranteed to see the pending SIGKILL. In addition we perform another check in cgroup_post_fork() whether the cgroup is being killed and is so take down the child (see above). This is robust enough and protects gainst forkbombs. If userspace really really wants to have stricter protection the simple solution would be to grab the write side of the cgroup threadgroup rwsem which will force all ongoing forks to complete before killing starts. We concluded that this is not necessary as the semantics for concurrent forking should simply align with freezer where a similar check as cgroup_post_fork() is performed. For all other cases CLONE_INTO_CGROUP is required. In this case we will grab the cgroup mutex so the cgroup can't be killed while we fork. Once we're done with the fork and have dropped cgroup mutex we are visible and will be found by any subsequent kill request. - We obviously don't kill kthreads. This means a cgroup that has a kthread will not become empty after killing and consequently no unpopulated event will be generated. The assumption is that kthreads should be in the root cgroup only anyway so this is not an issue. - We skip killing tasks that already have pending fatal signals. - Freezer doesn't care about tasks in different pid namespaces, i.e. if you have two tasks in different pid namespaces the cgroup would still be frozen. The cgroup.kill mechanism consequently behaves the same way, i.e. we kill all processes and ignore in which pid namespace they exist. - If the caller is located in a cgroup that is killed the caller will obviously be killed as well. Link: https://lore.kernel.org/r/20210503143922.3093755-1-brauner@kernel.org Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: cgroups@vger.kernel.org Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
![]() |
a7df69b81a |
cgroup: rstat: support cgroup1
Rstat currently only supports the default hierarchy in cgroup2. In order to replace memcg's private stats infrastructure - used in both cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1. The initialization and destruction callbacks for regular cgroups are already in place. Remove the cgroup_on_dfl() guards to handle cgroup1. The initialization of the root cgroup is currently hardcoded to only handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root() and cgroup_destroy_root() to handle the default root as well as the various cgroup1 roots we may set up during mounting. The linking of css to cgroups happens in code shared between cgroup1 and cgroup2 as well. Simply remove the cgroup_on_dfl() guard. Linkage of the root css to the root cgroup is a bit trickier: per default, the root css of a subsystem controller belongs to the default hierarchy (i.e. the cgroup2 root). When a controller is mounted in its cgroup1 version, the root css is stolen and moved to the cgroup1 root; on unmount, the css moves back to the default hierarchy. Annotate rebind_subsystems() to move the root css linkage along between roots. Link: https://lkml.kernel.org/r/20210209163304.77088-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
7d6beb71da |
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
|
||
![]() |
4b3bd22b12 |
Merge branch 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Nothing interesting. Just two minor patches" * 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix typos in comments cgroup: cgroup.{procs,threads} factor out common parts |