mirror of git://git.yoctoproject.org/linux-yocto.git synced 2025-08-21 16:31:14 +02:00

Daniel Borkmann e420bed025 bpf: Add fd-based tcx multi-prog infra with link support

This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard-to-debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]

- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve the mentioned issue of a safe ownership model, as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
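
The replacement rules above reduce to a single predicate. The following is only a sketch of that rule as worded here, not kernel code: `attach_kind` and `may_replace()` are hypothetical names, and "non-BPF link" is read as a classic, non-link attachment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of an attachment; not a kernel type. */
enum attach_kind {
	ATTACH_NON_LINK,	/* classic attachment without a BPF link */
	ATTACH_LINK,		/* attachment owned by a BPF link */
};

/*
 * Returns true when 'incoming' may replace 'existing' under the rules
 * stated in the text: only a non-link attachment may replace another
 * non-link attachment; every combination involving a link is refused.
 */
static bool may_replace(enum attach_kind existing, enum attach_kind incoming)
{
	assert(existing == ATTACH_NON_LINK || existing == ATTACH_LINK);
	assert(incoming == ATTACH_NON_LINK || incoming == ATTACH_LINK);

	return existing == ATTACH_NON_LINK && incoming == ATTACH_NON_LINK;
}
```

Under this reading, a link-unaware 3rd party application that tries to replace a link-owned program hits the `ATTACH_LINK` case and is refused.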

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code; only the BPF loader mechanics for attaching need to change.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
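
The return-code handling and the a/b pair swap can be modelled in plain userspace C. This is a simplified, single-threaded sketch and not the kernel implementation: every type and function name below is hypothetical, the fixed capacity is arbitrary, the numeric values of the TCX_* codes are assumed, and the synchronization a real data path needs between readers and updaters is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes named in the text; the numeric values are assumed here. */
enum tcx_ret { TCX_NEXT = -1, TCX_PASS = 0, TCX_DROP = 2, TCX_REDIRECT = 7 };

#define SKETCH_MAX_PROGS 8		/* hypothetical fixed capacity */

typedef int (*prog_fn)(void *pkt);	/* stand-in for a BPF program */

struct prog_array {
	prog_fn progs[SKETCH_MAX_PROGS];
	size_t cnt;
};

/* Two preallocated arrays: one active, one spare for building updates. */
struct entry_pair {
	struct prog_array slot[2];
	int active;
};

/* Sample programs for experimenting with the sketch. */
static int prog_next(void *pkt) { (void)pkt; return TCX_NEXT; }
static int prog_drop(void *pkt) { (void)pkt; return TCX_DROP; }

/* Run programs in order; the first verdict other than TCX_NEXT ends the run. */
static int sketch_run(const struct entry_pair *p, void *pkt)
{
	const struct prog_array *arr = &p->slot[p->active];

	assert(arr->cnt <= SKETCH_MAX_PROGS);
	for (size_t i = 0; i < arr->cnt; i++) {
		int ret = arr->progs[i](pkt);

		if (ret != TCX_NEXT)
			return ret;
	}
	return TCX_NEXT;	/* no program gave a terminating verdict */
}

/* Append a program: build the new array in the spare slot, then flip. */
static int sketch_attach(struct entry_pair *p, prog_fn fn)
{
	const struct prog_array *cur = &p->slot[p->active];
	struct prog_array *spare = &p->slot[!p->active];

	if (cur->cnt == SKETCH_MAX_PROGS)
		return -1;
	*spare = *cur;
	spare->progs[spare->cnt++] = fn;
	p->active = !p->active;
	return 0;
}

/* Remove a program without allocating: copy survivors to the spare, flip. */
static int sketch_detach(struct entry_pair *p, prog_fn fn)
{
	const struct prog_array *cur = &p->slot[p->active];
	struct prog_array *spare = &p->slot[!p->active];
	int found = 0;

	spare->cnt = 0;
	for (size_t i = 0; i < cur->cnt; i++) {
		if (!found && cur->progs[i] == fn) {
			found = 1;
			continue;
		}
		spare->progs[spare->cnt++] = cur->progs[i];
	}
	if (!found)
		return -1;
	p->active = !p->active;
	return 0;
}
```

Because both slots exist up front, `sketch_detach()` can only fail when the program is not attached, never for lack of memory, which is the property the a/b design is after.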

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2023-07-19 10:07:27 -07:00


# SPDX-License-Identifier: GPL-2.0-only
#
# Network configuration
#

menuconfig NET
	bool "Networking support"
	select NLATTR
	select GENERIC_NET_UTILS
	select BPF
	help
	  Unless you really know what you are doing, you should say Y here.
	  The reason is that some programs need kernel networking support even
	  when running on a stand-alone machine that isn't connected to any
	  other computer.

	  If you are upgrading from an older kernel, you
	  should consider updating your networking tools too because changes
	  in the kernel and the tools often go hand in hand. The tools are
	  contained in the package net-tools, the location and version number
	  of which are given in <file:Documentation/Changes>.

	  For a general introduction to Linux networking, it is highly
	  recommended to read the NET-HOWTO, available from
	  <http://www.tldp.org/docs.html#howto>.

if NET

config WANT_COMPAT_NETLINK_MESSAGES
	bool
	help
	  This option can be selected by other options that need compat
	  netlink messages.

config COMPAT_NETLINK_MESSAGES
	def_bool y
	depends on COMPAT
	depends on WEXT_CORE || WANT_COMPAT_NETLINK_MESSAGES
	help
	  This option makes it possible to send different netlink messages
	  to tasks depending on whether the task is a compat task or not. To
	  achieve this, you need to set skb_shinfo(skb)->frag_list to the
	  compat skb before sending the skb, the netlink code will sort out
	  which message to actually pass to the task.

	  Newly written code should NEVER need this option but do
	  compat-independent messages instead!

config NET_INGRESS
	bool

config NET_EGRESS
	bool

config NET_XGRESS
	select NET_INGRESS
	select NET_EGRESS
	bool

config NET_REDIRECT
	bool

config SKB_EXTENSIONS
	bool

menu "Networking options"

source "net/packet/Kconfig"
source "net/unix/Kconfig"
source "net/tls/Kconfig"
source "net/xfrm/Kconfig"
source "net/iucv/Kconfig"
source "net/smc/Kconfig"
source "net/xdp/Kconfig"

config NET_HANDSHAKE
	bool
	depends on SUNRPC || NVME_TARGET_TCP || NVME_TCP
	default y

config NET_HANDSHAKE_KUNIT_TEST
	tristate "KUnit tests for the handshake upcall mechanism" if !KUNIT_ALL_TESTS
	default KUNIT_ALL_TESTS
	depends on KUNIT
	help
	  This builds the KUnit tests for the handshake upcall mechanism.

	  KUnit tests run during boot and output the results to the debug
	  log in TAP format (https://testanything.org/). Only useful for
	  kernel devs running KUnit test harness and are not for inclusion
	  into a production build.

	  For more information on KUnit and unit tests in general, refer
	  to the KUnit documentation in Documentation/dev-tools/kunit/.

config INET
	bool "TCP/IP networking"
	help
	  These are the protocols used on the Internet and on most local
	  Ethernets. It is highly recommended to say Y here (this will enlarge
	  your kernel by about 400 KB), since some programs (e.g. the X window
	  system) use TCP/IP even if your machine is not connected to any
	  other computer. You will get the so-called loopback device which
	  allows you to ping yourself (great fun, that!).

	  For an excellent introduction to Linux networking, please read the
	  Linux Networking HOWTO, available from
	  <http://www.tldp.org/docs.html#howto>.

	  If you say Y here and also to "/proc file system support" and
	  "Sysctl support" below, you can change various aspects of the
	  behavior of the TCP/IP code by writing to the (virtual) files in
	  /proc/sys/net/ipv4/*; the options are explained in the file
	  <file:Documentation/networking/ip-sysctl.rst>.

	  Short answer: say Y.

if INET
source "net/ipv4/Kconfig"
source "net/ipv6/Kconfig"
source "net/netlabel/Kconfig"
source "net/mptcp/Kconfig"

endif # if INET

config NETWORK_SECMARK
	bool "Security Marking"
	help
	  This enables security marking of network packets, similar
	  to nfmark, but designated for security purposes.
	  If you are unsure how to answer this question, answer N.

config NET_PTP_CLASSIFY
	def_bool n

config NETWORK_PHY_TIMESTAMPING
	bool "Timestamping in PHY devices"
	select NET_PTP_CLASSIFY
	help
	  This allows timestamping of network packets by PHYs (or
	  other MII bus snooping devices) with hardware timestamping
	  capabilities. This option adds some overhead in the transmit
	  and receive paths.

	  If you are unsure how to answer this question, answer N.

menuconfig NETFILTER
	bool "Network packet filtering framework (Netfilter)"
	help
	  Netfilter is a framework for filtering and mangling network packets
	  that pass through your Linux box.

	  The most common use of packet filtering is to run your Linux box as
	  a firewall protecting a local network from the Internet. The type of
	  firewall provided by this kernel support is called a "packet
	  filter", which means that it can reject individual network packets
	  based on type, source, destination etc. The other kind of firewall,
	  a "proxy-based" one, is more secure but more intrusive and more
	  bothersome to set up; it inspects the network traffic much more
	  closely, modifies it and has knowledge about the higher level
	  protocols, which a packet filter lacks. Moreover, proxy-based
	  firewalls often require changes to the programs running on the local
	  clients. Proxy-based firewalls don't need support by the kernel, but
	  they are often combined with a packet filter, which only works if
	  you say Y here.

	  You should also say Y here if you intend to use your Linux box as
	  the gateway to the Internet for a local network of machines without
	  globally valid IP addresses. This is called "masquerading": if one
	  of the computers on your local network wants to send something to
	  the outside, your box can "masquerade" as that computer, i.e. it
	  forwards the traffic to the intended outside destination, but
	  modifies the packets to make it look like they came from the
	  firewall box itself. It works both ways: if the outside host
	  replies, the Linux box will silently forward the traffic to the
	  correct local computer. This way, the computers on your local net
	  are completely invisible to the outside world, even though they can
	  reach the outside and can receive replies. It is even possible to
	  run globally visible servers from within a masqueraded local network
	  using a mechanism called portforwarding. Masquerading is also often
	  called NAT (Network Address Translation).

	  Another use of Netfilter is in transparent proxying: if a machine on
	  the local network tries to connect to an outside host, your Linux
	  box can transparently forward the traffic to a local server,
	  typically a caching proxy server.

	  Yet another use of Netfilter is building a bridging firewall. Using
	  a bridge with Network packet filtering enabled makes iptables "see"
	  the bridged traffic. For filtering on the lower network and Ethernet
	  protocols over the bridge, use ebtables (under bridge netfilter
	  configuration).

	  Various modules exist for netfilter which replace the previous
	  masquerading (ipmasqadm), packet filtering (ipchains), transparent
	  proxying, and portforwarding mechanisms. Please see
	  <file:Documentation/Changes> under "iptables" for the location of
	  these packages.

if NETFILTER

config NETFILTER_ADVANCED
	bool "Advanced netfilter configuration"
	depends on NETFILTER
	default y
	help
	  If you say Y here you can select between all the netfilter modules.
	  If you say N the more unusual ones will not be shown and the
	  basic ones needed by most people will default to 'M'.

	  If unsure, say Y.

config BRIDGE_NETFILTER
	tristate "Bridged IP/ARP packets filtering"
	depends on BRIDGE
	depends on NETFILTER && INET
	depends on NETFILTER_ADVANCED
	select NETFILTER_FAMILY_BRIDGE
	select SKB_EXTENSIONS
	help
	  Enabling this option will let arptables resp. iptables see bridged
	  ARP resp. IP traffic. If you want a bridging firewall, you probably
	  want this option enabled.
	  Enabling or disabling this option doesn't enable or disable
	  ebtables.

	  If unsure, say N.

source "net/netfilter/Kconfig"
source "net/ipv4/netfilter/Kconfig"
source "net/ipv6/netfilter/Kconfig"
source "net/bridge/netfilter/Kconfig"

endif

source "net/bpfilter/Kconfig"

source "net/dccp/Kconfig"
source "net/sctp/Kconfig"
source "net/rds/Kconfig"
source "net/tipc/Kconfig"
source "net/atm/Kconfig"
source "net/l2tp/Kconfig"
source "net/802/Kconfig"
source "net/bridge/Kconfig"
source "net/dsa/Kconfig"
source "net/8021q/Kconfig"
source "net/llc/Kconfig"
source "drivers/net/appletalk/Kconfig"
source "net/x25/Kconfig"
source "net/lapb/Kconfig"
source "net/phonet/Kconfig"
source "net/6lowpan/Kconfig"
source "net/ieee802154/Kconfig"
source "net/mac802154/Kconfig"
source "net/sched/Kconfig"
source "net/dcb/Kconfig"
source "net/dns_resolver/Kconfig"
source "net/batman-adv/Kconfig"
source "net/openvswitch/Kconfig"
source "net/vmw_vsock/Kconfig"
source "net/netlink/Kconfig"
source "net/mpls/Kconfig"
source "net/nsh/Kconfig"
source "net/hsr/Kconfig"
source "net/switchdev/Kconfig"
source "net/l3mdev/Kconfig"
source "net/qrtr/Kconfig"
source "net/ncsi/Kconfig"

config PCPU_DEV_REFCNT
	bool "Use percpu variables to maintain network device refcount"
	depends on SMP
	default y
	help
	  network device refcount are using per cpu variables if this option
	  is set. This can be forced to N to detect underflows (with a
	  performance drop).

config MAX_SKB_FRAGS
	int "Maximum number of fragments per skb_shared_info"
	range 17 45
	default 17
	help
	  Having more fragments per skb_shared_info can help GRO efficiency.
	  This helps BIG TCP workloads, but might expose bugs in some
	  legacy drivers.
	  This also increases memory overhead of small packets,
	  and in drivers using build_skb().
	  If unsure, say 17.

config RPS
	bool
	depends on SMP && SYSFS
	default y

config RFS_ACCEL
	bool
	depends on RPS
	select CPU_RMAP
	default y

config SOCK_RX_QUEUE_MAPPING
	bool

config XPS
	bool
	depends on SMP
	select SOCK_RX_QUEUE_MAPPING
	default y

config HWBM
	bool

config CGROUP_NET_PRIO
	bool "Network priority cgroup"
	depends on CGROUPS
	select SOCK_CGROUP_DATA
	help
	  Cgroup subsystem for use in assigning processes to network
	  priorities on a per-interface basis.

config CGROUP_NET_CLASSID
	bool "Network classid cgroup"
	depends on CGROUPS
	select SOCK_CGROUP_DATA
	help
	  Cgroup subsystem for use as general purpose socket classid marker
	  that is being used in cls_cgroup and for netfilter matching.

config NET_RX_BUSY_POLL
	bool
	default y if !PREEMPT_RT || (PREEMPT_RT && !NETCONSOLE)

config BQL
	bool
	depends on SYSFS
	select DQL
	default y

config BPF_STREAM_PARSER
	bool "enable BPF STREAM_PARSER"
	depends on INET
	depends on BPF_SYSCALL
	depends on CGROUP_BPF
	select STREAM_PARSER
	select NET_SOCK_MSG
	help
	  Enabling this allows a TCP stream parser to be used with
	  BPF_MAP_TYPE_SOCKMAP.

config NET_FLOW_LIMIT
	bool
	depends on RPS
	default y
	help
	  The network stack has to drop packets when a receive processing
	  CPU's backlog reaches netdev_max_backlog. If a few out of many
	  active flows generate the vast majority of load, drop their
	  traffic earlier to maintain capacity for the other flows. This
	  feature provides servers with many clients some protection
	  against DoS by a single (spoofed) flow that greatly exceeds
	  average workload.

menu "Network testing"

config NET_PKTGEN
	tristate "Packet Generator (USE WITH CAUTION)"
	depends on INET && PROC_FS
	help
	  This module will inject preconfigured packets, at a configurable
	  rate, out of a given interface. It is used for network interface
	  stress testing and performance analysis. If you don't understand
	  what was just said, you don't need it: say N.

	  Documentation on how to use the packet generator can be found
	  at <file:Documentation/networking/pktgen.rst>.

	  To compile this code as a module, choose M here: the
	  module will be called pktgen.

config NET_DROP_MONITOR
	tristate "Network packet drop alerting service"
	depends on INET && TRACEPOINTS
	help
	  This feature provides an alerting service to userspace in the
	  event that packets are discarded in the network stack. Alerts
	  are broadcast via netlink socket to any listening user space
	  process. If you don't need network drop alerts, or if you are ok
	  just checking the various proc files and other utilities for
	  drop statistics, say N here.

endmenu

source "net/ax25/Kconfig"
source "net/can/Kconfig"
source "net/bluetooth/Kconfig"
source "net/rxrpc/Kconfig"
source "net/kcm/Kconfig"
source "net/strparser/Kconfig"
source "net/mctp/Kconfig"

config FIB_RULES
	bool

menuconfig WIRELESS
	bool "Wireless"
	depends on !S390
	default y

if WIRELESS

source "net/wireless/Kconfig"
source "net/mac80211/Kconfig"

endif # WIRELESS

source "net/rfkill/Kconfig"
source "net/9p/Kconfig"
source "net/caif/Kconfig"
source "net/ceph/Kconfig"
source "net/nfc/Kconfig"
source "net/psample/Kconfig"
source "net/ife/Kconfig"

config LWTUNNEL
	bool "Network light weight tunnels"
	help
	  This feature provides an infrastructure to support light weight
	  tunnels like mpls. There is no netdevice associated with a light
	  weight tunnel endpoint. Tunnel encapsulation parameters are stored
	  with light weight tunnel state associated with fib routes.

config LWTUNNEL_BPF
	bool "Execute BPF program as route nexthop action"
	depends on LWTUNNEL && INET
	default y if LWTUNNEL=y
	help
	  Allows to run BPF programs as a nexthop action following a route
	  lookup for incoming and outgoing packets.

config DST_CACHE
	bool
	default n

config GRO_CELLS
	bool
	default n

config SOCK_VALIDATE_XMIT
	bool

config NET_SELFTESTS
	def_tristate PHYLIB
	depends on PHYLIB && INET

config NET_SOCK_MSG
	bool
	default n
	help
	  The NET_SOCK_MSG provides a framework for plain sockets (e.g. TCP)
	  or ULPs (upper layer modules, e.g. TLS) to process L7 application
	  data with the help of BPF programs.

config NET_DEVLINK
	bool
	default n

config PAGE_POOL
	bool

config PAGE_POOL_STATS
	default n
	bool "Page pool stats"
	depends on PAGE_POOL
	help
	  Enable page pool statistics to track page allocation and recycling
	  in page pools. This option incurs additional CPU cost in allocation
	  and recycle paths and additional memory cost to store the
	  statistics. These statistics are only available if this option is
	  enabled and if the driver using the page pool supports exporting
	  this data.

	  If unsure, say N.

config FAILOVER
	tristate "Generic failover module"
	help
	  The failover module provides a generic interface for paravirtual
	  drivers to register a netdev and a set of ops with a failover
	  instance. The ops are used as event handlers that get called to
	  handle netdev register/unregister/link change/name change events
	  on slave pci ethernet devices with the same mac address as the
	  failover netdev. This enables paravirtual drivers to use a VF as
	  an accelerated low latency datapath. It also allows live migration
	  of VMs with direct attached VFs by failing over to the paravirtual
	  datapath when the VF is unplugged.

config ETHTOOL_NETLINK
	bool "Netlink interface for ethtool"
	default y
	help
	  An alternative userspace interface for ethtool based on generic
	  netlink. It provides better extensibility and some new features,
	  e.g. notification messages.

config NETDEV_ADDR_LIST_TEST
	tristate "Unit tests for device address list"
	default KUNIT_ALL_TESTS
	depends on KUNIT

endif # if NET
