Commit Graph

1149 Commits

Author SHA1 Message Date
Jamal Hadi Salim
26cc8714fc net/sched: Remove uapi support for ATM qdisc
Commit fb38306ceb ("net/sched: Retire ATM qdisc") retired the ATM qdisc.
Remove UAPI for it. Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02 14:25:51 +00:00
Jamal Hadi Salim
fe3b739a54 net/sched: Remove uapi support for dsmark qdisc
Commit bbe77c14ee ("net/sched: Retire dsmark qdisc") retired the dsmark
classifier. Remove UAPI support for it.
Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02 14:25:51 +00:00
Jamal Hadi Salim
82b2545ed9 net/sched: Remove uapi support for tcindex classifier
commit 8c710f7525 ("net/sched: Retire tcindex classifier") retired the TC
tcindex classifier.
Remove UAPI for it.  Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02 14:25:51 +00:00
Jamal Hadi Salim
41bc3e8fc1 net/sched: Remove uapi support for rsvp classifier
commit 265b4da82d ("net/sched: Retire rsvp classifier") retired the TC RSVP
classifier.
Remove UAPI for it. Iproute2 will sync by equally removing it from user space.

Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02 14:25:51 +00:00
Andrii Nakryiko
d17aff807f Revert BPF token-related functionality
This patch includes the following revert (one  conflicting BPF FS
patch and three token patch sets, represented by merge commits):
  - revert 0f5d5454c7 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
  - revert 750e785796 "bpf: Support uid and gid when mounting bpffs";
  - revert 733763285a "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
  - revert c35919dcce "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".

Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2023-12-19 08:23:03 -08:00
Jakub Kicinski
c49b292d03 netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
 bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
 oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
 uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
 TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
 NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
 VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
 BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
 SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
 xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
 CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
 l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
 =stU2
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-12-18

This PR is larger than usual and contains changes in various parts
of the kernel.

The main changes are:

1) Fix kCFI bugs in BPF, from Peter Zijlstra.

End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.

2) Introduce BPF token object, from Andrii Nakryiko.

It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.

Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
             -o delegate_cmds=prog_load:MAP_CREATE \
             -o delegate_progs=kprobe \
             -o delegate_attachs=xdp

3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.

 - Complete precision tracking support for register spills
 - Fix verification of possibly-zero-sized stack accesses
 - Fix access to uninit stack slots
 - Track aligned STACK_ZERO cases as imprecise spilled registers.
   It improves the verifier "instructions processed" metric from single
   digit to 50-60% for some programs.
 - Fix verifier retval logic

4) Support for VLAN tag in XDP hints, from Larysa Zaremba.

5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.

End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.

6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.

7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.

It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.

8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.

9) BPF file verification via fsverity, from Song Liu.

It allows BPF progs get fsverity digest.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
  bpf: Ensure precise is reset to false in __mark_reg_const_zero()
  selftests/bpf: Add more uprobe multi fail tests
  bpf: Fail uprobe multi link with negative offset
  selftests/bpf: Test the release of map btf
  s390/bpf: Fix indirect trampoline generation
  selftests/bpf: Temporarily disable dummy_struct_ops test on s390
  x86/cfi,bpf: Fix bpf_exception_cb() signature
  bpf: Fix dtor CFI
  cfi: Add CFI_NOSEAL()
  x86/cfi,bpf: Fix bpf_struct_ops CFI
  x86/cfi,bpf: Fix bpf_callback_t CFI
  x86/cfi,bpf: Fix BPF JIT call
  cfi: Flip headers
  selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
  selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
  selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
  bpf: Limit the number of kprobes when attaching program to multiple kprobes
  bpf: Limit the number of uprobes when attaching program to multiple uprobes
  bpf: xdp: Register generic_kfunc_set with XDP programs
  selftests/bpf: utilize string values for delegate_xxx mount options
  ...
====================

Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-18 16:46:08 -08:00
Arnaldo Carvalho de Melo
ab1c247094 Merge remote-tracking branch 'torvalds/master' into perf-tools-next
To pick up fixes that went thru perf-tools for v6.7 and to get in sync
with upstream to check for drift in the copies of headers, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-18 21:37:07 -03:00
Larysa Zaremba
e6795330f8 xdp: Add VLAN tag hint
Implement functionality that enables drivers to expose VLAN tag
to XDP code.

VLAN tag is represented by 2 variables:
- protocol ID, which is passed to bpf code in BE
- VLAN TCI, in host byte order

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20231205210847.28460-10-larysa.zaremba@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-13 16:16:40 -08:00
Andrei Vagin
e6a9a2cbc1 fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
The PAGEMAP_SCAN ioctl returns information regarding page table entries. 
It is more efficient compared to reading pagemap files.  CRIU can start to
utilize this ioctl, but it needs info about soft-dirty bits to track
memory changes.

We are aware of a new method for tracking memory changes implemented in
the PAGEMAP_SCAN ioctl.  For CRIU, the primary advantage of this method is
its usability by unprivileged users.  However, it is not feasible to
transparently replace the soft-dirty tracker with the new one.  The main
problem here is userfault descriptors that have to be preserved between
pre-dump iterations.  It means criu continues supporting the soft-dirty
method to avoid breakage for current users.  The new method will be
implemented as a separate feature.

[avagin@google.com: update tools/include/uapi/linux/fs.h]
  Link: https://lkml.kernel.org/r/20231107164139.576046-1-avagin@google.com
Link: https://lkml.kernel.org/r/20231106220959.296568-1-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:35 -08:00
Jakub Kicinski
2483e7f04c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/stmicro/stmmac/dwmac5.c
drivers/net/ethernet/stmicro/stmmac/dwmac5.h
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
drivers/net/ethernet/stmicro/stmmac/hwif.h
  37e4b8df27 ("net: stmmac: fix FPE events losing")
  c3f3b97238 ("net: stmmac: Refactor EST implementation")
https://lore.kernel.org/all/20231206110306.01e91114@canb.auug.org.au/

Adjacent changes:

net/ipv4/tcp_ao.c
  9396c4ee93 ("net/tcp: Don't store TCP-AO maclen on reqsk")
  7b0f570f87 ("tcp: Move TCP-AO bits from cookie_v[46]_check() to tcp_ao_syncookie().")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 17:53:17 -08:00
Andrii Nakryiko
7065eefb38 bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency
To stay consistent with the naming pattern used for similar cases in BPF
UAPI (__MAX_BPF_ATTACH_TYPE, etc), rename MAX_BPF_LINK_TYPE into
__MAX_BPF_LINK_TYPE.

Also similar to MAX_BPF_ATTACH_TYPE and MAX_BPF_REG, add:

  #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE

Not all __MAX_xxx enums have such #define, so I'm not sure if we should
add it or not, but I figured I'll start with a completely backwards
compatible way, and we can drop that, if necessary.

Also adjust a selftest that used MAX_BPF_LINK_TYPE enum.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231206190920.1651226-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 14:41:16 -08:00
Andrii Nakryiko
e1cef620f5 bpf: add BPF token support to BPF_PROG_LOAD command
Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
ee54b1a910 bpf: add BPF token support to BPF_BTF_LOAD command
Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
through delegated BPF token. BTF loading is a pretty straightforward
operation, so as long as BPF token is created with allow_cmds granting
BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
BTF object.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
688b7270b3 bpf: add BPF token support to BPF_MAP_CREATE command
Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Andrii Nakryiko
4527358b76 bpf: introduce BPF token object
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-06 10:02:59 -08:00
Amritha Nambiar
8481a249a0 netdev-genl: spec: Add PID in netdev netlink YAML spec
Add support in netlink spec(netdev.yaml) for PID of the
NAPI thread. Add code generated from the spec.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147335301.5260.11872351477120434501.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04 18:04:06 -08:00
Amritha Nambiar
5a5131d66f netdev-genl: spec: Add irq in netdev netlink YAML spec
Add support in netlink spec(netdev.yaml) for interrupt number
among the NAPI attributes. Add code generated from the spec.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147334210.5260.18178387869057516983.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04 18:04:06 -08:00
Amritha Nambiar
ff9991499f netdev-genl: spec: Extend netdev netlink spec in YAML for NAPI
Add support in netlink spec(netdev.yaml) for napi related information.
Add code generated from the spec.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147333119.5260.7050639053080529108.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04 18:04:05 -08:00
Amritha Nambiar
bc87795627 netdev-genl: spec: Extend netdev netlink spec in YAML for queue
Add support in netlink spec(netdev.yaml) for queue information.
Add code generated from the spec.

Note: The "queue-type" attribute takes values 0 and 1 for rx
and tx queue type respectively.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://lore.kernel.org/r/170147330963.5260.2576294626647300472.stgit@anambiarhost.jf.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-04 18:04:05 -08:00
Jakub Kicinski
753c8608f3 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZWiCPAAKCRDbK58LschI
 g4djAQC1FdqCRIFkhbiIRNHTgHjnfQShELQbd9ofJqzylLqmmgD+JI1E7D9SXagm
 pIXQ26EGmq8/VcCT3VLncA8EsC76Gg4=
 =Xowm
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-30

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 58 files changed, 1598 insertions(+), 154 deletions(-).

The main changes are:

1) Add initial TX metadata implementation for AF_XDP with support in mlx5
   and stmmac drivers. Two types of offloads are supported right now, that
   is, TX timestamp and TX checksum offload, from Stanislav Fomichev with
   stmmac implementation from Song Yoong Siang.

2) Change BPF verifier logic to validate global subprograms lazily instead
   of unconditionally before the main program, so they can be guarded using
   BPF CO-RE techniques, from Andrii Nakryiko.

3) Add BPF link_info support for uprobe multi link along with bpftool
   integration for the latter, from Jiri Olsa.

4) Use pkg-config in BPF selftests to determine ld flags which is
   in particular needed for linking statically, from Akihiko Odaki.

5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
  bpf/tests: Remove duplicate JSGT tests
  selftests/bpf: Add TX side to xdp_hw_metadata
  selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add csum helpers
  selftests/xsk: Support tx_metadata_len
  xsk: Add option to calculate TX checksum in SW
  xsk: Validate xsk_tx_metadata flags
  xsk: Document tx_metadata_len layout
  net: stmmac: Add Tx HWTS support to XDP ZC
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  tools: ynl: Print xsk-features from the sample
  xsk: Add TX timestamp and TX checksum offload support
  xsk: Support tx_metadata_len
  selftests/bpf: Use pkg-config for libelf
  selftests/bpf: Override PKG_CONFIG for static builds
  selftests/bpf: Choose pkg-config for the target
  bpftool: Add support to display uprobe_multi links
  selftests/bpf: Add link_info test for uprobe_multi link
  selftests/bpf: Use bpf_link__destroy in fill_link_info tests
  ...
====================

Conflicts:

Documentation/netlink/specs/netdev.yaml:
  839ff60df3 ("net: page_pool: add nlspec for basic access to page pools")
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/

While at it also regen, tree is dirty after:
  48eb03dd26 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.

Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30 16:58:42 -08:00
Stanislav Fomichev
11614723af xsk: Add option to calculate TX checksum in SW
For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM
to call skb_checksum_help in transmit path. Might be useful
to debugging issues with real hardware. I also use this mode
in the selftests.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231127190319.1190813-9-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29 14:59:40 -08:00
Stanislav Fomichev
48eb03dd26 xsk: Add TX timestamp and TX checksum offload support
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).

The struct is forward-compatible and can be extended in the future
by appending more fields.

Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231127190319.1190813-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29 14:59:40 -08:00
Stanislav Fomichev
341ac980ea xsk: Support tx_metadata_len
For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
  timestamp in the completion)
- non-zero

This data is not interpreted in any way right now.

Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20231127190319.1190813-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-29 14:59:40 -08:00
Jiri Olsa
e56fdbfb06 bpf: Add link_info support for uprobe multi link
Adding support to get uprobe_link details through bpf_link_info
interface.

Adding new struct uprobe_multi to struct bpf_link_info to carry
the uprobe_multi link details.

The uprobe_multi.count is passed from user space to denote size
of array fields (offsets/ref_ctr_offsets/cookies). The actual
array size is stored back to uprobe_multi.count (allowing user
to find out the actual array size) and array fields are populated
up to the user passed size.

All the non-array fields (path/count/flags/pid) are always set.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
2023-11-28 21:50:09 -08:00
Jakub Kicinski
637567e4a3 tools: ynl: add sample for getting page-pool information
Regenerate the tools/ code after netdev spec changes.

Add sample to query page-pool info in a concise fashion:

$ ./page-pool
    eth0[2]	page pools: 10 (zombies: 0)
		refs: 41984 bytes: 171966464 (refs: 0 bytes: 0)
		recycling: 90.3% (alloc: 656:397681 recycle: 89652:270201)

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28 15:48:39 +01:00
Namhyung Kim
daa9751341 tools headers UAPI: Update tools's copy of vhost.h header
tldr; Just FYI, I'm carrying this on the perf tools tree.

Full explanation:

There used to be no copies, with tools/ code using kernel headers
directly. From time to time tools/perf/ broke due to legitimate kernel
hacking. At some point Linus complained about such direct usage. Then we
adopted the current model.

The way these headers are used in perf are not restricted to just
including them to compile something.

There are sometimes used in scripts that convert defines into string
tables, etc, so some change may break one of these scripts, or new MSRs
may use some different #define pattern, etc.

E.g.:

  $ ls -1 tools/perf/trace/beauty/*.sh | head -5
  tools/perf/trace/beauty/arch_errno_names.sh
  tools/perf/trace/beauty/drm_ioctl.sh
  tools/perf/trace/beauty/fadvise.sh
  tools/perf/trace/beauty/fsconfig.sh
  tools/perf/trace/beauty/fsmount.sh
  $
  $ tools/perf/trace/beauty/fadvise.sh
  static const char *fadvise_advices[] = {
        [0] = "NORMAL",
        [1] = "RANDOM",
        [2] = "SEQUENTIAL",
        [3] = "WILLNEED",
        [4] = "DONTNEED",
        [5] = "NOREUSE",
  };
  $

The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.

So its important not to touch the copies in tools/ when doing changes in
the original kernel headers, that will be done later, when
check-headers.sh inform about the change to the perf tools hackers.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Cc: netdev@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20231121225650.390246-5-namhyung@kernel.org
2023-11-22 10:57:46 -08:00
Namhyung Kim
fb3648a6a8 tools headers UAPI: Update tools's copy of mount.h header
tldr; Just FYI, I'm carrying this on the perf tools tree.

Full explanation:

There used to be no copies, with tools/ code using kernel headers
directly. From time to time tools/perf/ broke due to legitimate kernel
hacking. At some point Linus complained about such direct usage. Then we
adopted the current model.

The way these headers are used in perf are not restricted to just
including them to compile something.

There are sometimes used in scripts that convert defines into string
tables, etc, so some change may break one of these scripts, or new MSRs
may use some different #define pattern, etc.

E.g.:

  $ ls -1 tools/perf/trace/beauty/*.sh | head -5
  tools/perf/trace/beauty/arch_errno_names.sh
  tools/perf/trace/beauty/drm_ioctl.sh
  tools/perf/trace/beauty/fadvise.sh
  tools/perf/trace/beauty/fsconfig.sh
  tools/perf/trace/beauty/fsmount.sh
  $
  $ tools/perf/trace/beauty/fadvise.sh
  static const char *fadvise_advices[] = {
        [0] = "NORMAL",
        [1] = "RANDOM",
        [2] = "SEQUENTIAL",
        [3] = "WILLNEED",
        [4] = "DONTNEED",
        [5] = "NOREUSE",
  };
  $

The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.

So its important not to touch the copies in tools/ when doing changes in
the original kernel headers, that will be done later, when
check-headers.sh inform about the change to the perf tools hackers.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20231121225650.390246-4-namhyung@kernel.org
2023-11-22 10:57:46 -08:00
Namhyung Kim
5a9f95b670 tools headers UAPI: Update tools's copy of kvm.h header
tldr; Just FYI, I'm carrying this on the perf tools tree.

Full explanation:

There used to be no copies, with tools/ code using kernel headers
directly. From time to time tools/perf/ broke due to legitimate kernel
hacking. At some point Linus complained about such direct usage. Then we
adopted the current model.

The way these headers are used in perf are not restricted to just
including them to compile something.

There are sometimes used in scripts that convert defines into string
tables, etc, so some change may break one of these scripts, or new MSRs
may use some different #define pattern, etc.

E.g.:

  $ ls -1 tools/perf/trace/beauty/*.sh | head -5
  tools/perf/trace/beauty/arch_errno_names.sh
  tools/perf/trace/beauty/drm_ioctl.sh
  tools/perf/trace/beauty/fadvise.sh
  tools/perf/trace/beauty/fsconfig.sh
  tools/perf/trace/beauty/fsmount.sh
  $
  $ tools/perf/trace/beauty/fadvise.sh
  static const char *fadvise_advices[] = {
        [0] = "NORMAL",
        [1] = "RANDOM",
        [2] = "SEQUENTIAL",
        [3] = "WILLNEED",
        [4] = "DONTNEED",
        [5] = "NOREUSE",
  };
  $

The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.

So its important not to touch the copies in tools/ when doing changes in
the original kernel headers, that will be done later, when
check-headers.sh inform about the change to the perf tools hackers.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20231121225650.390246-3-namhyung@kernel.org
2023-11-22 10:57:46 -08:00
Namhyung Kim
1118446666 tools headers UAPI: Update tools's copy of fscrypt.h header
tldr; Just FYI, I'm carrying this on the perf tools tree.

Full explanation:

There used to be no copies, with tools/ code using kernel headers
directly. From time to time tools/perf/ broke due to legitimate kernel
hacking. At some point Linus complained about such direct usage. Then we
adopted the current model.

The way these headers are used in perf are not restricted to just
including them to compile something.

There are sometimes used in scripts that convert defines into string
tables, etc, so some change may break one of these scripts, or new MSRs
may use some different #define pattern, etc.

E.g.:

  $ ls -1 tools/perf/trace/beauty/*.sh | head -5
  tools/perf/trace/beauty/arch_errno_names.sh
  tools/perf/trace/beauty/drm_ioctl.sh
  tools/perf/trace/beauty/fadvise.sh
  tools/perf/trace/beauty/fsconfig.sh
  tools/perf/trace/beauty/fsmount.sh
  $
  $ tools/perf/trace/beauty/fadvise.sh
  static const char *fadvise_advices[] = {
        [0] = "NORMAL",
        [1] = "RANDOM",
        [2] = "SEQUENTIAL",
        [3] = "WILLNEED",
        [4] = "DONTNEED",
        [5] = "NOREUSE",
  };
  $

The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.

So its important not to touch the copies in tools/ when doing changes in
the original kernel headers, that will be done later, when
check-headers.sh inform about the change to the perf tools hackers.

Cc: Eric Biggers <ebiggers@kernel.org>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: linux-fscrypt@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20231121225650.390246-2-namhyung@kernel.org
2023-11-22 10:57:46 -08:00
Andrii Nakryiko
ff8867af01 bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].

A few selftests and veristat need to be adjusted in the same patch as
well.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17 10:30:02 -08:00
Andrii Nakryiko
5f99f312bd bpf: add register bounds sanity checks and sanitization
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.

These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.

Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.

By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15 12:03:42 -08:00
Jordan Rome
b8e3a87a62 bpf: Add crosstask check to __bpf_get_stack
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.

This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.

It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.

Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
2023-11-10 11:06:10 -08:00
Yonghong Song
155addf081 bpf: Use named fields for certain bpf uapi structs
Martin and Vadim reported a verifier failure with bpf_dynptr usage.
The issue is mentioned but Vadim workarounded the issue with source
change ([1]). The below describes what is the issue and why there
is a verification failure.

  int BPF_PROG(skb_crypto_setup) {
    struct bpf_dynptr algo, key;
    ...

    bpf_dynptr_from_mem(..., ..., 0, &algo);
    ...
  }

The bpf program is using vmlinux.h, so we have the following definition in
vmlinux.h:
  struct bpf_dynptr {
        long: 64;
        long: 64;
  };
Note that in uapi header bpf.h, we have
  struct bpf_dynptr {
        long: 64;
        long: 64;
} __attribute__((aligned(8)));

So we lost alignment information for struct bpf_dynptr by using vmlinux.h.
Let us take a look at a simple program below:
  $ cat align.c
  typedef unsigned long long __u64;
  struct bpf_dynptr_no_align {
        __u64 :64;
        __u64 :64;
  };
  struct bpf_dynptr_yes_align {
        __u64 :64;
        __u64 :64;
  } __attribute__((aligned(8)));

  void bar(void *, void *);
  int foo() {
    struct bpf_dynptr_no_align a;
    struct bpf_dynptr_yes_align b;
    bar(&a, &b);
    return 0;
  }
  $ clang --target=bpf -O2 -S -emit-llvm align.c

Look at the generated IR file align.ll:
  ...
  %a = alloca %struct.bpf_dynptr_no_align, align 1
  %b = alloca %struct.bpf_dynptr_yes_align, align 8
  ...

The compiler dictates the alignment for struct bpf_dynptr_no_align is 1 and
the alignment for struct bpf_dynptr_yes_align is 8. So theoretically compiler
could allocate variable %a with alignment 1 although in reallity the compiler
may choose a different alignment by considering other local variables.

In [1], the verification failure happens because variable 'algo' is allocated
on the stack with alignment 4 (fp-28). But the verifer wants its alignment
to be 8.

To fix the issue, the RFC patch ([1]) tried to add '__attribute__((aligned(8)))'
to struct bpf_dynptr plus other similar structs. Andrii suggested that
we could directly modify uapi struct with named fields like struct 'bpf_iter_num':
  struct bpf_iter_num {
        /* opaque iterator state; having __u64 here allows to preserve correct
         * alignment requirements in vmlinux.h, generated from BTF
         */
        __u64 __opaque[1];
  } __attribute__((aligned(8)));

Indeed, adding named fields for those affected structs in this patch can preserve
alignment when bpf program references them in vmlinux.h. With this patch,
the verification failure in [1] can also be resolved.

  [1] https://lore.kernel.org/bpf/1b100f73-7625-4c1f-3ae5-50ecf84d3ff0@linux.dev/
  [2] https://lore.kernel.org/bpf/20231103055218.2395034-1-yonghong.song@linux.dev/

Cc: Vadim Fedorenko <vadfed@meta.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231104024900.1539182-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09 19:07:52 -08:00
Kan Liang
76db7aab1f tools headers UAPI: Sync include/uapi/linux/perf_event.h header with the kernel
Sync the new sample type for the branch counters feature.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tinghao Zhang <tinghao.zhang@intel.com>
Link: https://lore.kernel.org/r/20231025201626.3000228-6-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-11-06 18:13:34 -03:00
Linus Torvalds
ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds
f5277ad1e9 for-6.7/io_uring-sockopt-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vdwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr2rD/0astIsj/AACVSPzHARg9lnhkIvUeweMSSl
 CjifLTzK3a9E3R2IrC4sflObUKIEL3fste0Lva141eNULZvBJ6cQJDvY7Bp72Bkc
 CTPEwEQiwDJKLhTzQh3gY0H0+nFMWwEm1uc4dyeNAft/R9bPP/qOq62ttCoCp9+S
 1UoFmTlJE3bhejyS7fytoGZvKqhkpdR7rtbR4ya7CXWPoAG+v9amo8fputbxm0dj
 WECpKdd65JHWwYV4rbPA69T7jZ9V0oUsLen9RJ9BmjMLOFggHYqQdvEwG0Htirhw
 t5uaXqSvc8pXsJhKXMS3tXCrLNtBha5nlWHBpSE+6ovcmKiRzFjUaRXkRbcIrOAx
 ljIm0HHto1+xv0pDrNl3/lIjv5dpNOEauqqgMeYytQJIHa0JpSWbYzvjwQ8EZXQv
 WWDiRfH5Z0/3BsFdOCVqd8mTt4Pbksp2VFcxGkojRtSqSr4CML3mPZSmqGcs3nE6
 Fc16XXw7oLEWoF1tQYMP6KG0cVLem4on28c8CcVMJ/pRvcun3jBCif2gmMHJkWyA
 a6Uq116amqQ61f1p+EQ3ChqyTA5uALrXPmovu6Ne3Y/btW5yG4+Vu7AsPLjPHdFN
 oGHjOPV77XQzEqzUWRXmXPecZ+QifkcCV/8kbqtEHQqk5n+HUKQZmpC8+014ms3V
 Af6LYI/vYg==
 =sk8+
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux

Pull io_uring {get,set}sockopt support from Jens Axboe:
 "This adds support for using getsockopt and setsockopt via io_uring.

  The main use cases for this is to enable use of direct descriptors,
  rather than first instantiating a normal file descriptor, doing the
  option tweaking needed, then turning it into a direct descriptor. With
  this support, we can avoid needing a regular file descriptor
  completely.

  The net and bpf bits have been signed off on their side"

* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
  selftests/bpf/sockopt: Add io_uring support
  io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
  io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
  io_uring/cmd: return -EOPNOTSUPP if net is disabled
  selftests/net: Extract uring helpers to be reusable
  tools headers: Grab copy of io_uring.h
  io_uring/cmd: Pass compat mode in issue_flags
  net/socket: Break down __sys_getsockopt
  net/socket: Break down __sys_setsockopt
  bpf: Add sockptr support for setsockopt
  bpf: Add sockptr support for getsockopt
2023-11-01 11:16:34 -10:00
Daniel Borkmann
5c1b994de4 tools: Sync if_link uapi header
Sync if_link uapi header to the latest version as we need the refresher
in tooling for netkit device. Given it's been a while since the last sync
and the diff is fairly big, it has been done as its own commit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24 16:06:43 -07:00
Daniel Borkmann
35dfaad718 netkit, bpf: Add bpf programmable net device
This work adds a new, minimal BPF-programmable device called "netkit"
(former PoC code-name "meta") we recently presented at LSF/MM/BPF. The
core idea is that BPF programs are executed within the drivers xmit routine
and therefore e.g. in case of containers/Pods moving BPF processing closer
to the source.

One of the goals was that in case of Pod egress traffic, this allows to
move BPF programs from hostns tcx ingress into the device itself, providing
earlier drop or forward mechanisms, for example, if the BPF program
determines that the skb must be sent out of the node, then a redirect to
the physical device can take place directly without going through per-CPU
backlog queue. This helps to shift processing for such traffic from softirq
to process context, leading to better scheduling decisions/performance (see
measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to
extend this further so it can also operate in single device mode. The pair
comes with a primary and a peer device. Only the primary device, typically
residing in hostns, can manage BPF programs for itself and its peer. The
peer device is designated for containers/Pods and cannot attach/detach
BPF programs. Upon the device creation, the user can set the default policy
to 'pass' or 'drop' for the case when no BPF program is attached.

Additionally, the device can be operated in L3 (default) or L2 mode. The
management of BPF programs is done via bpf_mprog, so that multi-attach is
supported right from the beginning with similar API and dependency controls
as tcx. For details on the latter see commit 053c8e1f23 ("bpf: Add generic
attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
so that existing programs can be easily migrated.

Going forward, we plan to use netkit devices in Cilium as the main device
type for connecting Pods. They will be operated in L3 mode in order to
simplify a Pod's neighbor management and the peer will operate in default
drop mode, so that no traffic is leaving between the time when a Pod is
brought up by the CNI plugin and programs attached by the agent.
Additionally, the programs we attach via tcx on the physical devices are
using bpf_redirect_peer() for inbound traffic into netkit device, hence the
latter is also supporting the ndo_get_peer_dev callback. Similarly, we use
bpf_redirect_neigh() for the way out, pushing from netkit peer to phys device
directly. Also, BIG TCP is supported on netkit device. For the follow-up
work in single device mode, we plan to convert Cilium's cilium_host/_net
devices into a single one.

An extensive test suite for checking device operations and the BPF program
and link management API comes as BPF selftests in this series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-24 16:06:03 -07:00
Breno Leitao
7746a6adfc tools headers: Grab copy of io_uring.h
This file will be used by mini_uring.h and allow tests to run without
the need of installing liburing to run the tests.

This is needed to run io_uring tests in BPF, such as
(tools/testing/selftests/bpf/prog_tests/sockopt.c).

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-7-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19 16:42:03 -06:00
Muhammad Usama Anjum
b58aa0f4fe tools headers UAPI: update linux/fs.h with the kernel sources
New IOCTL and macros has been added in the kernel sources. Update the
tools header file as well.

Link: https://lkml.kernel.org/r/20230821141518.870589-5-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Miroslaw <emmir@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yun Zhou <yun.zhou@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:34:13 -07:00
Jakub Kicinski
a3c2dd9648 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZS1d4wAKCRDbK58LschI
 g4DSAP441CdKh8fd+wNKUSKHFbpCQ6EvocR6Nf+Sj2DFUx/w/QEA7mfju7Abqjc3
 xwDEx0BuhrjMrjV5MmEpxc7lYl9XcQU=
 =vuWk
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-10-16

We've added 90 non-merge commits during the last 25 day(s) which contain
a total of 120 files changed, 3519 insertions(+), 895 deletions(-).

The main changes are:

1) Add missed stats for kprobes to retrieve the number of missed kprobe
   executions and subsequent executions of BPF programs, from Jiri Olsa.

2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is
   for systemd to reimplement the LogNamespace feature which allows
   running multiple instances of systemd-journald to process the logs
   of different services, from Daan De Meyer.

3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich.

4) Improve BPF verifier log output for scalar registers to better
   disambiguate their internal state wrt defaults vs min/max values
   matching, from Andrii Nakryiko.

5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving
   the source IP address with a new BPF_FIB_LOOKUP_SRC flag,
   from Martynas Pumputis.

6) Add support for open-coded task_vma iterator to help with symbolization
   for BPF-collected user stacks, from Dave Marchevsky.

7) Add libbpf getters for accessing individual BPF ring buffers which
   is useful for polling them individually, for example, from Martin Kelly.

8) Extend AF_XDP selftests to validate the SHARED_UMEM feature,
   from Tushar Vyavahare.

9) Improve BPF selftests cross-building support for riscv arch,
   from Björn Töpel.

10) Add the ability to pin a BPF timer to the same calling CPU,
   from David Vernet.

11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic
   implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments,
   from Alexandre Ghiti.

12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen.

13) Fix bpftool's skeleton code generation to guarantee that ELF data
    is 8 byte aligned, from Ian Rogers.

14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4
    security mitigations in BPF verifier, from Yafang Shao.

15) Annotate struct bpf_stack_map with __counted_by attribute to prepare
    BPF side for upcoming __counted_by compiler support, from Kees Cook.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits)
  bpf: Ensure proper register state printing for cond jumps
  bpf: Disambiguate SCALAR register state output in verifier logs
  selftests/bpf: Make align selftests more robust
  selftests/bpf: Improve missed_kprobe_recursion test robustness
  selftests/bpf: Improve percpu_alloc test robustness
  selftests/bpf: Add tests for open-coded task_vma iter
  bpf: Introduce task_vma open-coded iterator kfuncs
  selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c
  bpf: Don't explicitly emit BTF for struct btf_iter_num
  bpf: Change syscall_nr type to int in struct syscall_tp_t
  net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set
  bpf: Avoid unnecessary audit log for CPU security mitigations
  selftests/bpf: Add tests for cgroup unix socket address hooks
  selftests/bpf: Make sure mount directory exists
  documentation/bpf: Document cgroup unix socket address hooks
  bpftool: Add support for cgroup unix socket address hooks
  libbpf: Add support for cgroup unix socket address hooks
  bpf: Implement cgroup sockaddr hooks for unix sockets
  bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf
  bpf: Propagate modified uaddrlen from cgroup sockaddr programs
  ...
====================

Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16 21:05:33 -07:00
Daan De Meyer
859051dd16 bpf: Implement cgroup sockaddr hooks for unix sockets
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().

These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.

We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.

We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11 17:27:47 -07:00
Martynas Pumputis
dab4e1f06c bpf: Derive source IP addr via bpf_*_fib_lookup()
Extend the bpf_fib_lookup() helper by making it to return the source
IPv4/IPv6 address if the BPF_FIB_LOOKUP_SRC flag is set.

For example, the following snippet can be used to derive the desired
source IP address:

    struct bpf_fib_lookup p = { .ipv4_dst = ip4->daddr };

    ret = bpf_skb_fib_lookup(skb, p, sizeof(p),
            BPF_FIB_LOOKUP_SRC | BPF_FIB_LOOKUP_SKIP_NEIGH);
    if (ret != BPF_FIB_LKUP_RET_SUCCESS)
        return TC_ACT_SHOT;

    /* the p.ipv4_src now contains the source address */

The inability to derive the proper source address may cause malfunctions
in BPF-based dataplanes for hosts containing netdevs with more than one
routable IP address or for multi-homed hosts.

For example, Cilium implements packet masquerading in BPF. If an
egressing netdev to which the Cilium's BPF prog is attached has
multiple IP addresses, then only one [hardcoded] IP address can be used for
masquerading. This breaks connectivity if any other IP address should have
been selected instead, for example, when a public and private addresses
are attached to the same egress interface.

The change was tested with Cilium [1].

Nikolay Aleksandrov helped to figure out the IPv6 addr selection.

[1]: https://github.com/cilium/cilium/pull/28283

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Link: https://lore.kernel.org/r/20231007081415.33502-2-m@lambda.lt
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09 16:28:35 -07:00
David Vernet
d6247ecb6c bpf: Add ability to pin bpf timer to calling CPU
BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.

This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2023-10-09 16:28:49 +02:00
Florent Revest
24e41bf8a6 mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
the process doesn't want MDWE protection to propagate to children.

To implement this no-inherit mode, the tag in current->mm->flags must be
absent from MMF_INIT_MASK.  This means that the encoding for "MDWE but
without inherit" is different in the prctl than in the mm flags.  This
leads to a bit of bit-mangling in the prctl implementation.

Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org
Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ayush Jain <ayush.jain3@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 14:44:11 -07:00
Florent Revest
0da668333f mm: make PR_MDWE_REFUSE_EXEC_GAIN an unsigned long
Defining a prctl flag as an int is a footgun because on a 64 bit machine
and with a variadic implementation of prctl (like in musl and glibc), when
used directly as a prctl argument, it can get casted to long with garbage
upper bits which would result in unexpected behaviors.

This patch changes the constant to an unsigned long to eliminate that
possibilities.  This does not break UAPI.

I think that a stable backport would be "nice to have": to reduce the
chances that users build binaries that could end up with garbage bits in
their MDWE prctl arguments.  We are not aware of anyone having yet
encountered this corner case with MDWE prctls but a backport would reduce
the likelihood it happens, since this sort of issues has happened with
other prctls.  But If this is perceived as a backporting burden, I suppose
we could also live without a stable backport.

Link: https://lkml.kernel.org/r/20230828150858.393570-5-revest@chromium.org
Fixes: b507808ebc ("mm: implement memory-deny-write-execute as a prctl")
Signed-off-by: Florent Revest <revest@chromium.org>
Suggested-by: Alexey Izbyshev <izbyshev@ispras.ru>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ayush Jain <ayush.jain3@amd.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Cc: Topi Miettinen <toiwoton@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 14:44:11 -07:00
Jakub Kicinski
2606cf059c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts (or adjacent changes of note).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 13:16:47 -07:00
Linus Torvalds
5c519bc075 perf tools fixes for v6.6: 1st batch
Build:
 
  - Update header files in the tools/**/include directory to sync with
    the kernel sources as usual.
 
  - Remove unused bpf-prologue files.  While it's not strictly a fix,
    but the functionality was removed in this cycle so better to get
    rid of the code together.
 
  - Other minor build fixes.
 
 Misc:
 
  - Fix uninitialized memory access in PMU parsing code
 
  - Fix segfaults on software event
 
 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSo2x5BnqMqsoHtzsmMstVUGiXMgwUCZRIFKAAKCRCMstVUGiXM
 g/pXAP9HLB2s+beBTK5iQU4/NfqmAVSl303QCoR9xLByo38vfAEAlLiRIh061pTi
 PRlXVuY9bUQPyCSYsiBHv/fmLqdQdwU=
 =ti6G
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-fixes-for-v6.6-1-2023-09-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools fixes from Namhyung Kim:
 "Build:

   - Update header files in the tools/**/include directory to sync with
     the kernel sources as usual.

   - Remove unused bpf-prologue files. While it's not strictly a fix,
     but the functionality was removed in this cycle so better to get
     rid of the code together.

   - Other minor build fixes.

  Misc:

   - Fix uninitialized memory access in PMU parsing code

   - Fix segfaults on software event"

* tag 'perf-tools-fixes-for-v6.6-1-2023-09-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  perf jevent: fix core dump on software events on s390
  perf pmu: Ensure all alias variables are initialized
  perf jevents metric: Fix type of strcmp_cpuid_str
  perf trace: Avoid compile error wrt redefining bool
  perf bpf-prologue: Remove unused file
  tools headers UAPI: Update tools's copy of drm.h headers
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  perf bench sched-seccomp-notify: Use the tools copy of seccomp.h UAPI
  tools headers UAPI: Copy seccomp.h to be able to build 'perf bench' in older systems
  tools headers UAPI: Sync files changed by new fchmodat2 and map_shadow_stack syscalls with the kernel sources
  perf tools: Update copy of libbpf's hashmap.c
2023-09-26 08:41:26 -07:00
Jiri Olsa
3acf8ace68 bpf: Add missed value to kprobe perf link info
Add missed value to kprobe attached through perf link info to
hold the stats of missed kprobe handler execution.

The kprobe's missed counter gets incremented when kprobe handler
is not executed due to another kprobe running on the same cpu.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-25 16:37:44 -07:00
Jiri Olsa
e2b2cd592a bpf: Add missed value to kprobe_multi link info
Add missed value to kprobe_multi link info to hold the stats of missed
kprobe_multi probe.

The missed counter gets incremented when fprobe fails the recursion
check or there's no rethook available for return probe. In either
case the attached bpf program is not executed.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Song Liu <song@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230920213145.1941596-3-jolsa@kernel.org
2023-09-25 16:37:44 -07:00
Paolo Abeni
e9cbc89067 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-21 21:49:45 +02:00
Stanislav Fomichev
a9c2a60854 bpf: expose information about supported xdp metadata kfunc
Add new xdp-rx-metadata-features member to netdev netlink
which exports a bitmask of supported kfuncs. Most of the patch
is autogenerated (headers), the only relevant part is netdev.yaml
and the changes in netdev-genl.c to marshal into netlink.

Example output on veth:

$ ip link add veth0 type veth peer name veth1 # ifndex == 12
$ ./tools/net/ynl/samples/netdev 12

Select ifc ($ifindex; or 0 = dump; or -2 ntf check): 12
   veth1[12]    xdp-features (23): basic redirect rx-sg xdp-rx-metadata-features (3): timestamp hash xdp-zc-max-segs=0

Cc: netdev@vger.kernel.org
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230913171350.369987-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15 11:26:58 -07:00
Arnaldo Carvalho de Melo
417ecb614f tools headers UAPI: Copy seccomp.h to be able to build 'perf bench' in older systems
The new 'perf bench' for sched-seccomp-notify uses defines and types not
available in older systems where we want to have perf available, so grab
a copy of this UAPI from the kernel sources to allow that.

This will be checked in the future for drift from the original when we
build the perf tool, that will warn when that happens like:

  make: Entering directory '/var/home/acme/git/perf-tools/tools/perf'
    BUILD:   Doing 'make -j32' parallel build
  Warning: Kernel ABI header differences:

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrei Vagin <avagin@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kees Kook <keescook@chromium.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/ZQGhMXtwX7RvV3ya@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-09-13 08:48:48 -03:00
Stanislav Fomichev
7cb779a686 bpf: Clarify error expectations from bpf_clone_redirect
Commit 151e887d8f ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.

This is in the conflict with its UAPI doc which says the following:
"0 on success, or a negative error in case of failure."

Update the UAPI to reflect the fact that bpf_clone_redirect can
return positive error numbers, but don't explicitly define
their meaning.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com
2023-09-11 22:29:19 +02:00
Yonghong Song
9bc95a95ab bpf: Mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated
Now 'BPF_MAP_TYPE_CGRP_STORAGE + local percpu ptr'
can cover all BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE functionality
and more. So mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated.
Also make changes in selftests/bpf/test_bpftool_synctypes.py
and selftest libbpf_str to fix otherwise test errors.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152837.2003563-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:18 -07:00
Jiri Olsa
b733eeade4 bpf: Add pid filter support for uprobe_multi link
Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.

Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.

We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).

Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
0b779b61f6 bpf: Add cookies support for uprobe_multi link
Adding support to specify cookies array for uprobe_multi link.

The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).

The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
89ae89f53d bpf: Add multi uprobe link
Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.

Uprobes to attach are specified via new link_create uprobe_multi
union:

  struct {
    __aligned_u64   path;
    __aligned_u64   offsets;
    __aligned_u64   ref_ctr_offsets;
    __u32           cnt;
    __u32           flags;
  } uprobe_multi;

Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.

The 'flags' supports single bit for now that marks the uprobe as
return probe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
c5487f8d91 bpf: Switch BPF_F_KPROBE_MULTI_RETURN macro to enum
Switching BPF_F_KPROBE_MULTI_RETURN macro to anonymous enum,
so it'd show up in vmlinux.h. There's not functional change
compared to having this as macro.

Acked-by: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-08-21 15:51:25 -07:00
Jiri Olsa
a3c485a5d8 bpf: Add support for bpf_get_func_ip helper for uprobe program
Adding support for bpf_get_func_ip helper for uprobe program to return
probed address for both uprobe and return uprobe.

We discussed this in [1] and agreed that uprobe can have special use
of bpf_get_func_ip helper that differs from kprobe.

The kprobe bpf_get_func_ip returns:
  - address of the function if probe is attach on function entry
    for both kprobe and return kprobe
  - 0 if the probe is not attach on function entry

The uprobe bpf_get_func_ip returns:
  - address of the probe for both uprobe and return uprobe

The reason for this semantic change is that kernel can't really tell
if the probe user space address is function entry.

The uprobe program is actually kprobe type program attached as uprobe.
One of the consequences of this design is that uprobes do not have its
own set of helpers, but share them with kprobes.

As we need different functionality for bpf_get_func_ip helper for uprobe,
I'm adding the bool value to the bpf_trace_run_ctx, so the helper can
detect that it's executed in uprobe context and call specific code.

The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which
is currently used only for executing bpf programs in uprobe.

Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe
to address that it's only used for uprobes and that it sets the
run_ctx.is_uprobe as suggested by Yafang Shao.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
[1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-07 16:42:58 -07:00
Jakub Kicinski
d07b7b32da pull-request: bpf-next 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
 Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
 =/kVE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Daniel Xu
91721c2d02 netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
This commit adds support for enabling IP defrag using pre-existing
netfilter defrag support. Basically all the flag does is bump a refcnt
while the link the active. Checks are also added to ensure the prog
requesting defrag support is run _after_ netfilter defrag hooks.

We also take care to avoid any issues w.r.t. module unloading -- while
defrag is active on a link, the module is prevented from unloading.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28 16:52:08 -07:00
Stanislav Fomichev
25b5a2a190 ynl: regenerate all headers
Also add support to pass topdir to ynl-regen.sh (Jakub) and call
it from the makefile to update the UAPI headers.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230727163001.3952878-4-sdf@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 09:33:12 -07:00
Yonghong Song
1f9a1ea821 bpf: Support new sign-extension load insns
Add interpreter/jit support for new sign-extension load insns
which adds a new mode (BPF_MEMSX).
Also add verifier support to recognize these insns and to
do proper verification with new insns. In verifier, besides
to deduce proper bounds for the dst_reg, probed memory access
is also properly handled.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230728011156.3711870-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-27 18:52:33 -07:00
Lorenz Bauer
9c02bec959 bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
sockets. This means we can't use the helper to steer traffic to Envoy,
which configures SO_REUSEPORT on its sockets. In turn, we're blocked
from removing TPROXY from our setup.

The reason that bpf_sk_assign refuses such sockets is that the
bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead,
one of the reuseport sockets is selected by hash. This could cause
dispatch to the "wrong" socket:

    sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
    bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed

Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
helpers unfortunately. In the tc context, L2 headers are at the start
of the skb, while SK_REUSEPORT expects L3 headers instead.

Instead, we execute the SK_REUSEPORT program when the assigned socket
is pulled out of the skb, further up the stack. This creates some
trickiness with regards to refcounting as bpf_sk_assign will put both
refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
freed. We can infer that the sk_assigned socket is RCU freed if the
reuseport lookup succeeds, but convincing yourself of this fact isn't
straight forward. Therefore we defensively check refcounting on the
sk_assign sock even though it's probably not required in practice.

Fixes: 8e368dc72e ("bpf: Fix use of sk->sk_reuseport from sk_assign")
Fixes: cf7fbe660f ("bpf: Add socket assign support")
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@cilium.io>
Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-25 13:55:55 -07:00
Jakub Kicinski
59be3baa8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-20 15:52:55 -07:00
Alan Maguire
41ee0145a4 bpf: sync tools/ uapi header with
Seeing the following:

Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'

...so sync tools version missing some list_node/rb_tree fields.

Fixes: c3c510ce43 ("bpf: Add 'owner' field to bpf_{list,rb}_node")
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20230719162257.20818-1-alan.maguire@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:13:09 -07:00
Daniel Borkmann
e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Daniel Borkmann
053c8e1f23 bpf: Add generic attach/detach/query API for multi-progs
This adds a generic layer called bpf_mprog which can be reused by different
attachment layers to enable multi-program attachment and dependency resolution.
In-kernel users of the bpf_mprog don't need to care about the dependency
resolution internals, they can just consume it with few API calls.

The initial idea of having a generic API sparked out of discussion [0] from an
earlier revision of this work where tc's priority was reused and exposed via
BPF uapi as a way to coordinate dependencies among tc BPF programs, similar
as-is for classic tc BPF. The feedback was that priority provides a bad user
experience and is hard to use [1], e.g.:

  I cannot help but feel that priority logic copy-paste from old tc, netfilter
  and friends is done because "that's how things were done in the past". [...]
  Priority gets exposed everywhere in uapi all the way to bpftool when it's
  right there for users to understand. And that's the main problem with it.

  The user don't want to and don't need to be aware of it, but uapi forces them
  to pick the priority. [...] Your cover letter [0] example proves that in
  real life different service pick the same priority. They simply don't know
  any better. Priority is an unnecessary magic that apps _have_ to pick, so
  they just copy-paste and everyone ends up using the same.

The course of the discussion showed more and more the need for a generic,
reusable API where the "same look and feel" can be applied for various other
program types beyond just tc BPF, for example XDP today does not have multi-
program support in kernel, but also there was interest around this API for
improving management of cgroup program types. Such common multi-program
management concept is useful for BPF management daemons or user space BPF
applications coordinating internally about their attachments.

Both from Cilium and Meta side [2], we've collected the following requirements
for a generic attach/detach/query API for multi-progs which has been implemented
as part of this work:

  - Support prog-based attach/detach and link API
  - Dependency directives (can also be combined):
    - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none}
      - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user
        space application does not need CAP_SYS_ADMIN to retrieve foreign fds
        via bpf_*_get_fd_by_id()
      - BPF_F_LINK flag as {prog,link} toggle
      - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and
        BPF_F_AFTER will just append for attaching
      - Enforced only at attach time
    - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their
      own infra for replacing their internal prog
    - If no flags are set, then it's default append behavior for attaching
  - Internal revision counter and optionally being able to pass expected_revision
  - User space application can query current state with revision, and pass it
    along for attachment to assert current state before doing updates
  - Query also gets extension for link_ids array and link_attach_flags:
    - prog_ids are always filled with program IDs
    - link_ids are filled with link IDs when link was used, otherwise 0
    - {prog,link}_attach_flags for holding {prog,link}-specific flags
  - Must be easy to integrate/reuse for in-kernel users

The uapi-side changes needed for supporting bpf_mprog are rather minimal,
consisting of the additions of the attachment flags, revision counter, and
expanding existing union with relative_{fd,id} member.

The bpf_mprog framework consists of an bpf_mprog_entry object which holds
an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path
structure) is part of bpf_mprog_bundle. Both have been separated, so that
fast-path gets efficient packing of bpf_prog pointers for maximum cache
efficiency. Also, array has been chosen instead of linked list or other
structures to remove unnecessary indirections for a fast point-to-entry in
tc for BPF.

The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of
updates the peer bpf_mprog_entry is populated and then just swapped which
avoids additional allocations that could otherwise fail, for example, in
detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could
be converted to dynamic allocation if necessary at a point in future.
Locking is deferred to the in-kernel user of bpf_mprog, for example, in case
of tcx which uses this API in the next patch, it piggybacks on rtnl.

An extensive test suite for checking all aspects of this API for prog-based
attach/detach and link API comes as BPF selftests in this series.

Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog
management.

  [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net
  [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com
  [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Magnus Karlsson
f540d44e05 selftests/xsk: add basic multi-buffer test
Add the first basic multi-buffer test that sends a stream of 9K
packets and validates that they are received at the other end. In
order to enable sending and receiving multi-buffer packets, code that
sets the MTU is introduced as well as modifications to the XDP
programs so that they signal that they are multi-buffer enabled.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-20-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:56:50 -07:00
Magnus Karlsson
17f1034dd7 selftests/xsk: transmit and receive multi-buffer packets
Add the ability to send and receive packets that are larger than the
size of a umem frame, using the AF_XDP /XDP multi-buffer
support. There are three pieces of code that need to be changed to
achieve this: the Rx path, the Tx path, and the validation logic.

Both the Rx path and Tx could only deal with a single fragment per
packet. The Tx path is extended with a new function called
pkt_nb_frags() that can be used to retrieve the number of fragments a
packet will consume. We then create these many fragments in a loop and
fill the N-1 first ones to the max size limit to use the buffer space
efficiently, and the Nth one with whatever data that is left. This
goes on until we have filled in at the most BATCH_SIZE worth of
descriptors and fragments. If we detect that the next packet would
lead to BATCH_SIZE number of fragments sent being exceeded, we do not
send this packet and finish the batch. This packet is instead sent in
the next iteration of BATCH_SIZE fragments.

For Rx, we loop over all fragments we receive as usual, but for every
descriptor that we receive we call a new validation function called
is_frag_valid() to validate the consistency of this fragment. The code
then checks if the packet continues in the next frame. If so, it loops
over the next packet and performs the same validation. once we have
received the last fragment of the packet we also call the function
is_pkt_valid() to validate the packet as a whole. If we get to the end
of the batch and we are not at the end of the current packet, we back
out the partial packet and end the loop. Once we get into the receive
loop next time, we start over from the beginning of that packet. This
so the code becomes simpler at the cost of some performance.

The validation function is_frag_valid() checks that the sequence and
packet numbers are correct at the start and end of each fragment.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-19-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:56:50 -07:00
Maciej Fijalkowski
13ce2daa25 xsk: add new netlink attribute dedicated for ZC max frags
Introduce new netlink attribute NETDEV_A_DEV_XDP_ZC_MAX_SEGS that will
carry maximum fragments that underlying ZC driver is able to handle on
TX side. It is going to be included in netlink response only when driver
supports ZC. Any value higher than 1 implies multi-buffer ZC support on
underlying device.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:56:49 -07:00
Arnaldo Carvalho de Melo
7b86159355 tools include UAPI: Sync linux/vhost.h with the kernel sources
To get the changes in:

  228a27cf78 ("vhost: Allow worker switching while work is queueing")
  c1ecd8e950 ("vhost: allow userspace to create workers")

To pick up these changes and support them:

  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
  $ cp include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
  $ diff -u before after
  --- before	2023-07-14 09:58:14.268249807 -0300
  +++ after	2023-07-14 09:58:23.041493892 -0300
  @@ -10,6 +10,7 @@
   	[0x12] = "SET_VRING_BASE",
   	[0x13] = "SET_VRING_ENDIAN",
   	[0x14] = "GET_VRING_ENDIAN",
  +	[0x15] = "ATTACH_VRING_WORKER",
   	[0x20] = "SET_VRING_KICK",
   	[0x21] = "SET_VRING_CALL",
   	[0x22] = "SET_VRING_ERR",
  @@ -31,10 +32,12 @@
   	[0x7C] = "VDPA_SET_GROUP_ASID",
   	[0x7D] = "VDPA_SUSPEND",
   	[0x7E] = "VDPA_RESUME",
  +	[0x9] = "FREE_WORKER",
   };
   static const char *vhost_virtio_ioctl_read_cmds[] = {
   	[0x00] = "GET_FEATURES",
   	[0x12] = "GET_VRING_BASE",
  +	[0x16] = "GET_VRING_WORKER",
   	[0x26] = "GET_BACKEND_FEATURES",
   	[0x70] = "VDPA_GET_DEVICE_ID",
   	[0x71] = "VDPA_GET_STATUS",
  @@ -44,6 +47,7 @@
   	[0x79] = "VDPA_GET_CONFIG_SIZE",
   	[0x7A] = "VDPA_GET_AS_NUM",
   	[0x7B] = "VDPA_GET_VRING_GROUP",
  +	[0x8] = "NEW_WORKER",
   	[0x80] = "VDPA_GET_VQS_COUNT",
   	[0x81] = "VDPA_GET_GROUP_NUM",
   };
  $

For instance, see how those 'cmd' ioctl arguments get translated, now
ATTACH_VRING_WORKER, GET_VRING_WORKER and NEW_WORKER, will be as well:

  # perf trace -a -e ioctl --max-events=10
       0.000 ( 0.011 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                   = 0
      21.353 ( 0.014 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                   = 0
      25.766 ( 0.014 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)    = 0
      25.845 ( 0.034 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70) = 0
      25.916 ( 0.011 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)       = 0
      25.941 ( 0.025 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ATOMIC, arg: 0x7ffe4a22c840)       = 0
      32.915 ( 0.009 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_RMFB, arg: 0x7ffe4a22cf9c)         = 0
      42.522 ( 0.013 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)    = 0
      42.579 ( 0.031 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70) = 0
      42.644 ( 0.010 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)       = 0
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZLFJ%2FRsDGYiaH5nj@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-14 10:16:03 -03:00
Yafang Shao
1b715e1b0e bpf: Support ->fill_link_info for perf_event
By introducing support for ->fill_link_info to the perf_event link, users
gain the ability to inspect it using `bpftool link show`. While the current
approach involves accessing this information via `bpftool perf show`,
consolidating link information for all link types in one place offers
greater convenience. Additionally, this patch extends support to the
generic perf event, which is not currently accommodated by
`bpftool perf show`. While only the perf type and config are exposed to
userspace, other attributes such as sample_period and sample_freq are
ignored. It's important to note that if kptr_restrict is not permitted, the
probed address will not be exposed, maintaining security measures.

A new enum bpf_perf_event_type is introduced to help the user understand
which struct is relevant.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:51 -07:00
Yafang Shao
7ac8d0d261 bpf: Support ->fill_link_info for kprobe_multi
With the addition of support for fill_link_info to the kprobe_multi link,
users will gain the ability to inspect it conveniently using the
`bpftool link show`. This enhancement provides valuable information to the
user, including the count of probed functions and their respective
addresses. It's important to note that if the kptr_restrict setting is not
permitted, the probed address will not be exposed, ensuring security.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230709025630.3735-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11 20:07:50 -07:00
Arnaldo Carvalho de Melo
ad07149f34 tools headers UAPI: Sync linux/prctl.h with the kernel sources
To pick the changes in:

  1fd96a3e9d ("riscv: Add prctl controls for userspace vector management")

That adds some RISC-V specific prctl options:

  $ tools/perf/trace/beauty/prctl_option.sh > before
  $ cp include/uapi/linux/prctl.h tools/include/uapi/linux/prctl.h
  $ tools/perf/trace/beauty/prctl_option.sh > after
  $ diff -u before after
  --- before	2023-07-11 13:22:01.928705942 -0300
  +++ after	2023-07-11 13:22:36.342645970 -0300
  @@ -63,6 +63,8 @@
   	[66] = "GET_MDWE",
   	[67] = "SET_MEMORY_MERGE",
   	[68] = "GET_MEMORY_MERGE",
  +	[69] = "RISCV_V_SET_CONTROL",
  +	[70] = "RISCV_V_GET_CONTROL",
   };
   static const char *prctl_set_mm_options[] = {
   	[1] = "START_CODE",
  $

That now will be used to decode the syscall option and also to compose
filters, for instance:

  [root@five ~]# perf trace -e syscalls:sys_enter_prctl --filter option==SET_NAME
       0.000 Isolated Servi/3474327 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23f13b7aee)
       0.032 DOM Worker/3474327 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23deb25670)
       7.920 :3474328/3474328 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24fbb10)
       7.935 StreamT~s #374/3474328 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24fb970)
       8.400 Isolated Servi/3474329 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24bab10)
       8.418 StreamT~s #374/3474329 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24ba970)
  ^C[root@five ~]#

This addresses this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Chiu <andy.chiu@sifive.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/lkml/ZK2DhOB6JJKu2A7M@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-11 13:30:40 -03:00
Arnaldo Carvalho de Melo
920b91d927 tools include UAPI: Sync linux/mount.h copy with the kernel sources
To pick the changes from:

  6ac3928156 ("fs: allow to mount beneath top mount")

That, after a fix to the move_mount_flags.sh script, harvests the new
MOVE_MOUNT_BENEATH move_mount flag:

  $ tools/perf/trace/beauty/move_mount_flags.sh > before
  $ cp include/uapi/linux/mount.h tools/include/uapi/linux/mount.h
  $ tools/perf/trace/beauty/move_mount_flags.sh > after
  $
  $ diff -u before after
  --- before	2023-07-11 12:38:49.244886707 -0300
  +++ after	2023-07-11 12:51:15.125255940 -0300
  @@ -6,4 +6,5 @@
   	[ilog2(0x00000020) + 1] = "T_AUTOMOUNTS",
   	[ilog2(0x00000040) + 1] = "T_EMPTY_PATH",
   	[ilog2(0x00000100) + 1] = "SET_GROUP",
  +	[ilog2(0x00000200) + 1] = "BENEATH",
   };
  $

That will then be properly decoded when used in tools like:

  # perf trace -e move_mount

This addresses this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h

Cc: Christian Brauner <brauner@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZK17kifP%2FiYl+Hcc@kernel.org/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-11 13:01:23 -03:00
Arnaldo Carvalho de Melo
225bbf44bf tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:

  89d01306e3 ("RISC-V: KVM: Implement device interface for AIA irqchip")
  22725266bd ("KVM: Fix comment for KVM_ENABLE_CAP")
  2f440b72e8 ("KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE")

That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.

This addresses this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Anup Patel <apatel@ventanamicro.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Ricardo Koller <ricarkol@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/lkml/ZK12+virXMIXMysy@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-11 12:36:38 -03:00
Arnaldo Carvalho de Melo
48fa42c945 tools headers uapi: Sync linux/fcntl.h with the kernel sources
To get the changes in:

  96b2b072ee ("exportfs: allow exporting non-decodeable file handles to userspace")

That don't add anything that is handled by existing hard coded tables or
table generation scripts.

This silences this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZK11P5AwRBUxxutI@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-11 12:29:23 -03:00
Arnaldo Carvalho de Melo
9350a91791 tools headers UAPI: Sync files changed by new cachestat syscall with the kernel sources
To pick the changes in these csets:

  cf264e1329 ("cachestat: implement cachestat syscall")

That add support for this new syscall in tools such as 'perf trace'.

For instance, this is now possible:

  # perf trace -e cachestat
  ^C[root@five ~]#
  # perf trace -v -e cachestat
  Using CPUID AuthenticAMD-25-21-0
  event qualifier tracepoint filter: (common_pid != 3163687 && common_pid != 3147) && (id == 451)
  mmap size 528384B
  ^C[root@five ~]

  # perf trace -v -e *stat* --max-events=10
  Using CPUID AuthenticAMD-25-21-0
  event qualifier tracepoint filter: (common_pid != 3163713 && common_pid != 3147) && (id == 4 || id == 5 || id == 6 || id == 136 || id == 137 || id == 138 || id == 262 || id == 332 || id == 451)
  mmap size 528384B
       0.000 ( 0.009 ms): Cache2 I/O/4544 statfs(pathname: 0x45635288, buf: 0x7f8745725b60)                     = 0
       0.012 ( 0.003 ms): Cache2 I/O/4544 newfstatat(dfd: CWD, filename: 0x45635288, statbuf: 0x7f874569d250)   = 0
       0.036 ( 0.002 ms): Cache2 I/O/4544 newfstatat(dfd: 138, filename: 0x541b7093, statbuf: 0x7f87457256f0, flag: 4096) = 0
       0.372 ( 0.006 ms): Cache2 I/O/4544 statfs(pathname: 0x45635288, buf: 0x7f8745725b10)                     = 0
       0.379 ( 0.003 ms): Cache2 I/O/4544 newfstatat(dfd: CWD, filename: 0x45635288, statbuf: 0x7f874569d250)   = 0
       0.390 ( 0.002 ms): Cache2 I/O/4544 newfstatat(dfd: 138, filename: 0x541b7093, statbuf: 0x7f87457256a0, flag: 4096) = 0
       0.609 ( 0.005 ms): Cache2 I/O/4544 statfs(pathname: 0x45635288, buf: 0x7f8745725b60)                     = 0
       0.615 ( 0.003 ms): Cache2 I/O/4544 newfstatat(dfd: CWD, filename: 0x45635288, statbuf: 0x7f874569d250)   = 0
       0.625 ( 0.002 ms): Cache2 I/O/4544 newfstatat(dfd: 138, filename: 0x541b7093, statbuf: 0x7f87457256f0, flag: 4096) = 0
       0.826 ( 0.005 ms): Cache2 I/O/4544 statfs(pathname: 0x45635288, buf: 0x7f8745725b10)                     = 0
  #

That is the filter expression attached to the raw_syscalls:sys_{enter,exit}
tracepoints.

  $ find tools/perf/arch/ -name "syscall*tbl" | xargs grep -w sys_cachestat
  tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl:451	n64	cachestat			sys_cachestat
  tools/perf/arch/powerpc/entry/syscalls/syscall.tbl:451	common	cachestat			sys_cachestat
  tools/perf/arch/s390/entry/syscalls/syscall.tbl:451  common	cachestat		sys_cachestat			sys_cachestat
  tools/perf/arch/x86/entry/syscalls/syscall_64.tbl:451	common	cachestat		sys_cachestat
  $

  $ grep -w cachestat /tmp/build/perf-tools/arch/x86/include/generated/asm/syscalls_64.c
  	[451] = "cachestat",
  $

This addresses these perf build warnings:

Warning: Kernel ABI header differences:
  diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
  diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h
  diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
  diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
  diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
  diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Link: https://lore.kernel.org/lkml/ZK1pVBJpbjujJNJW@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-11 11:41:15 -03:00
Jakub Kicinski
a685d0df75 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
 g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
 TMaNUiHABijAFtChxoSI35j3OOSRrwE=
 =GJgN
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-06-23

We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).

The main changes are:

1) Extend bpf_fib_lookup helper to allow passing the route table ID,
   from Louis DeLosSantos.

2) Fix regsafe() in verifier to call check_ids() for scalar registers,
   from Eduard Zingerman.

3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
   and a rework of bpf_cpumask_any*() kfuncs. Additionally,
   add selftests, from David Vernet.

4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
   from Gilad Sever.

5) Change bpf_link_put() to use workqueue unconditionally to fix it
   under PREEMPT_RT, from Sebastian Andrzej Siewior.

6) Follow-ups to address issues in the bpf_refcount shared ownership
   implementation, from Dave Marchevsky.

7) A few general refactorings to BPF map and program creation permissions
   checks which were part of the BPF token series, from Andrii Nakryiko.

8) Various fixes for benchmark framework and add a new benchmark
   for BPF memory allocator to BPF selftests, from Hou Tao.

9) Documentation improvements around iterators and trusted pointers,
   from Anton Protopopov.

10) Small cleanup in verifier to improve allocated object check,
    from Daniel T. Lee.

11) Improve performance of bpf_xdp_pointer() by avoiding access
    to shared_info when XDP packet does not have frags,
    from Jesper Dangaard Brouer.

12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
    from Yonghong Song.

13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
    from Jarkko Sakkinen.

14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  bpf, docs: Document existing macros instead of deprecated
  bpf, docs: BPF Iterator Document
  selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
  selftests/bpf: Add vrf_socket_lookup tests
  bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
  bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
  bpf: Factor out socket lookup functions for the TC hookpoint.
  selftests/bpf: Set the default value of consumer_cnt as 0
  selftests/bpf: Ensure that next_cpu() returns a valid CPU number
  selftests/bpf: Output the correct error code for pthread APIs
  selftests/bpf: Use producer_cnt to allocate local counter array
  xsk: Remove unused inline function xsk_buff_discard()
  bpf: Keep BPF_PROG_LOAD permission checks clear of validations
  bpf: Centralize permissions checks for all BPF map types
  bpf: Inline map creation logic in map_create() function
  bpf: Move unprivileged checks into map_create() and bpf_prog_load()
  bpf: Remove in_atomic() from bpf_link_put().
  selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
  bpf: Verify scalar ids mapping in regsafe() using check_ids()
  selftests/bpf: Check if mark_chain_precision() follows scalar ids
  ...
====================

Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 14:52:28 -07:00
Jakub Kicinski
449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Jakub Kicinski
c9d99cfa66 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZIDxUwAKCRDbK58LschI
 g5hDAQD7ukrniCvMRNIm2yUZIGSxE4RvGiXptO4a0NfLck5R/wEAsfN2KUsPcPhW
 HS37lVfx7VVXfj42+REf7lWLu4TXpwk=
 =6mS/
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-06-07

We've added 7 non-merge commits during the last 7 day(s) which contain
a total of 12 files changed, 112 insertions(+), 7 deletions(-).

The main changes are:

1) Fix a use-after-free in BPF's task local storage, from KP Singh.

2) Make struct path handling more robust in bpf_d_path, from Jiri Olsa.

3) Fix a syzbot NULL-pointer dereference in sockmap, from Eric Dumazet.

4) UAPI fix for BPF_NETFILTER before final kernel ships,
   from Florian Westphal.

5) Fix map-in-map array_map_gen_lookup code generation where elem_size was
   not being set for inner maps, from Rhys Rustad-Elliott.

6) Fix sockopt_sk selftest's NETLINK_LIST_MEMBERSHIPS assertion,
   from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add extra path pointer check to d_path helper
  selftests/bpf: Fix sockopt_sk selftest
  bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
  selftests/bpf: Add access_inner_map selftest
  bpf: Fix elem_size not being set for inner maps
  bpf: Fix UAF in task local storage
  bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()
====================

Link: https://lore.kernel.org/r/20230607220514.29698-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-07 21:47:11 -07:00
Florian Westphal
132328e8e8 bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
Andrii Nakryiko writes:

 And we currently don't have an attach type for NETLINK BPF link.
 Thankfully it's not too late to add it. I see that link_create() in
 kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
 have done that. Instead we need to add BPF_NETLINK attach type to enum
 bpf_attach_type. And wire all that properly throughout the kernel and
 libbpf itself.

This adds BPF_NETFILTER and uses it.  This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.

v2: check link_attack prog type in link_create too

Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
2023-06-05 15:01:43 -07:00
Jakub Kicinski
a03a91bd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/sfc/tc.c
  622ab65634 ("sfc: fix error unwinds in TC offload")
  b6583d5e9e ("sfc: support TC decap rules matching on enc_src_port")

net/mptcp/protocol.c
  5b825727d0 ("mptcp: add annotations around msk->subflow accesses")
  e76c8ef5cc ("mptcp: refactor mptcp_stream_accept()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-01 15:38:26 -07:00
Louis DeLosSantos
8ad77e72ca bpf: Add table ID to bpf_fib_lookup BPF helper
Add ability to specify routing table ID to the `bpf_fib_lookup` BPF
helper.

A new field `tbid` is added to `struct bpf_fib_lookup` used as
parameters to the `bpf_fib_lookup` BPF helper.

When the helper is called with the `BPF_FIB_LOOKUP_DIRECT` and
`BPF_FIB_LOOKUP_TBID` flags the `tbid` field in `struct bpf_fib_lookup`
will be used as the table ID for the fib lookup.

If the `tbid` does not exist the fib lookup will fail with
`BPF_FIB_LKUP_RET_NOT_FWDED`.

The `tbid` field becomes a union over the vlan related output fields
in `struct bpf_fib_lookup` and will be zeroed immediately after usage.

This functionality is useful in containerized environments.

For instance, if a CNI wants to dictate the next-hop for traffic leaving
a container it can create a container-specific routing table and perform
a fib lookup against this table in a "host-net-namespace-side" TC program.

This functionality also allows `ip rule` like functionality at the TC
layer, allowing an eBPF program to pick a routing table based on some
aspect of the sk_buff.

As a concrete use case, this feature will be used in Cilium's SRv6 L3VPN
datapath.

When egress traffic leaves a Pod an eBPF program attached by Cilium will
determine which VRF the egress traffic should target, and then perform a
FIB lookup in a specific table representing this VRF's FIB.

Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230505-bpf-add-tbid-fib-lookup-v2-1-0a31c22c748c@gmail.com
2023-06-01 19:58:44 +02:00
Jakub Kicinski
75455b906d bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZHEm+wAKCRDbK58LschI
 gyIKAQCqO7B4sIu8hYVxBTwfHV2tIuXSMSCV4P9e78NUOPcO2QEAvLP/WVSjB0Bm
 vpyTKKM22SpZvPe/jSp52j6t20N+qAc=
 =HFxD
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-05-26

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 76 files changed, 2729 insertions(+), 1003 deletions(-).

The main changes are:

1) Add the capability to destroy sockets in BPF through a new kfunc,
   from Aditi Ghag.

2) Support O_PATH fds in BPF_OBJ_PIN and BPF_OBJ_GET commands,
   from Andrii Nakryiko.

3) Add capability for libbpf to resize datasec maps when backed via mmap,
   from JP Kobryn.

4) Move all the test kfuncs for CI out of the kernel and into bpf_testmod,
   from Jiri Olsa.

5) Big batch of xsk selftest improvements to prep for multi-buffer testing,
   from Magnus Karlsson.

6) Show the target_{obj,btf}_id in tracing link's fdinfo and dump it
   via bpftool, from Yafang Shao.

7) Various misc BPF selftest improvements to work with upcoming LLVM 17,
   from Yonghong Song.

8) Extend bpftool to specify netdevice for resolving XDP hints,
   from Larysa Zaremba.

9) Document masking in shift operations for the insn set document,
   from Dave Thaler.

10) Extend BPF selftests to check xdp_feature support for bond driver,
    from Lorenzo Bianconi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  bpf: Fix bad unlock balance on freeze_mutex
  libbpf: Ensure FD >= 3 during bpf_map__reuse_fd()
  libbpf: Ensure libbpf always opens files with O_CLOEXEC
  selftests/bpf: Check whether to run selftest
  libbpf: Change var type in datasec resize func
  bpf: drop unnecessary bpf_capable() check in BPF_MAP_FREEZE command
  libbpf: Selftests for resizing datasec maps
  libbpf: Add capability for resizing datasec maps
  selftests/bpf: Add path_fd-based BPF_OBJ_PIN and BPF_OBJ_GET tests
  libbpf: Add opts-based bpf_obj_pin() API and add support for path_fd
  bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
  libbpf: Start v1.3 development cycle
  bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
  bpftool: Specify XDP Hints ifname when loading program
  selftests/bpf: Add xdp_feature selftest for bond device
  selftests/bpf: Test bpf_sock_destroy
  selftests/bpf: Add helper to get port using getsockname
  bpf: Add bpf_sock_destroy kfunc
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: udp: Implement batching for sockets iterator
  ...
====================

Link: https://lore.kernel.org/r/20230526222747.17775-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-26 17:26:01 -07:00
Arnaldo Carvalho de Melo
c8268a9b91 tools headers UAPI: Sync the linux/in.h with the kernel sources
Picking the changes from:

  3632679d9e ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol")

That includes a define (IP_PROTOCOL) that isn't being used in generating
any id -> string table used by 'perf trace'.

Addresses this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/lkml/ZHD/Ms0DMq7viaq+@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-05-26 16:03:27 -03:00
Andrii Nakryiko
cb8edce280 bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
forces users to specify pinning location as a string-based absolute or
relative (to current working directory) path. This has various
implications related to security (e.g., symlink-based attacks), forces
BPF FS to be exposed in the file system, which can cause races with
other applications.

One of the feedbacks we got from folks working with containers heavily
was that inability to use purely FD-based location specification was an
unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
commands. This patch closes this oversight, adding path_fd field to
BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
*at() syscalls for dirfd + pathname combinations.

This now allows interesting possibilities like working with detached BPF
FS mount (e.g., to perform multiple pinnings without running a risk of
someone interfering with them), and generally making pinning/getting
more secure and not prone to any races and/or security attacks.

This is demonstrated by a selftest added in subsequent patch that takes
advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
creating detached BPF FS mount, pinning, and then getting BPF map out of
it, all while never exposing this private instance of BPF FS to outside
worlds.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-05-23 23:31:42 +02:00
Arnaldo Carvalho de Melo
71852cd882 tools headers UAPI: Sync linux/prctl.h with the kernel sources
ddc65971bb ("prctl: add PR_GET_AUXV to copy auxv to userspace")

To pick the changes in:

That don't result in any changes in tooling:

  $ tools/perf/trace/beauty/prctl_option.sh > before
  $ cp include/uapi/linux/prctl.h tools/include/uapi/linux/prctl.h
  $ tools/perf/trace/beauty/prctl_option.sh > after
  $ diff -u before after
  $

This actually adds a new prctl arg, but it has to be dealt with
differently, as it is not in sequence with the other arguments.

Just silences this perf tools build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/prctl.h' differs from latest version at 'include/uapi/linux/prctl.h'
  diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-05-16 15:23:47 -03:00
Yanteng Si
705049ca4f tools headers kvm: Sync uapi/{asm/linux} kvm.h headers with the kernel sources
Picking the changes from:

  e65733b5c5 ("KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL")
  30ec7997d1 ("KVM: arm64: timers: Allow userspace to set the global counter offset")
  821d935c87 ("KVM: arm64: Introduce support for userspace SMCCC filtering")
  81dc9504a7 ("KVM: arm64: nv: timers: Support hyp timer emulation")
  a8308b3fc9 ("KVM: arm64: Refactor hvc filtering to support different actions")
  0e5c9a9d65 ("KVM: arm64: Expose SMC/HVC width to userspace")

Silencing these perf build warnings:

 Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
 diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
 Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/kvm.h' differs from latest version at 'arch/x86/include/uapi/asm/kvm.h'
 diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
 Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
 diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: loongson-kernel@lists.loongnix.cn
Link: https://lore.kernel.org/r/ac5adb58411d23b3360d436a65038fefe91c32a8.1683712945.git.siyanteng@loongson.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-05-10 14:19:20 -03:00
Yanteng Si
92b8e61e88 tools headers UAPI: Sync the linux/const.h with the kernel headers
Picking the changes from:

  31088f6f79 ("uapi/linux/const.h: prefer ISO-friendly __typeof__")

Silencing these perf build warnings::

  Warning: Kernel ABI header at 'tools/include/uapi/linux/const.h' differs from latest version at 'include/uapi/linux/const.h'
  diff -u tools/include/uapi/linux/const.h include/uapi/linux/const.h

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: loongson-kernel@lists.loongnix.cn
Link: https://lore.kernel.org/r/33e963df304394f932d9108a1b0bb327f23a4eca.1683712945.git.siyanteng@loongson.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-05-10 14:19:20 -03:00
Yanteng Si
5d1ac59ff7 tools headers UAPI: Sync the linux/in.h with the kernel sources
Picking the changes from:

  91d0b78c51 ("inet: Add IP_LOCAL_PORT_RANGE socket option")

Silencing these perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: loongson-kernel@lists.loongnix.cn
Link: https://lore.kernel.org/r/23aabc69956ac94fbf388b05c8be08a64e8c7ccc.1683712945.git.siyanteng@loongson.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-05-10 14:19:20 -03:00
Linus Torvalds
f085df1be6 Disable building BPF based features by default for v6.4.
We need to better polish building with BPF skels, so revert back to
 making it an experimental feature that has to be explicitely enabled
 using BUILD_BPF_SKEL=1.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCZFbCXwAKCRCyPKLppCJ+
 J7cHAP97erKY4hBXArjpfzcvpFmboh/oqhbTLntyIpS6TEnOyQEAyervAPGIjQYC
 DCo4foyXmOWn3dhNtK9M+YiRl3o2SgQ=
 =7G78
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tool updates from Arnaldo Carvalho de Melo:
 "Third version of perf tool updates, with the build problems with with
  using a 'vmlinux.h' generated from the main build fixed, and the bpf
  skeleton build disabled by default.

  Build:

   - Require libtraceevent to build, one can disable it using
     NO_LIBTRACEEVENT=1.

     It is required for tools like 'perf sched', 'perf kvm', 'perf
     trace', etc.

     libtraceevent is available in most distros so installing
     'libtraceevent-devel' should be a one-time event to continue
     building perf as usual.

     Using NO_LIBTRACEEVENT=1 produces tooling that is functional and
     sufficient for lots of users not interested in those libtraceevent
     dependent features.

   - Allow Python support in 'perf script' when libtraceevent isn't
     linked, as not all features requires it, for instance Intel PT does
     not use tracepoints.

   - Error if the python interpreter needed for jevents to work isn't
     available and NO_JEVENTS=1 isn't set, preventing a build without
     support for JSON vendor events, which is a rare but possible
     condition. The two check error messages:

        $(error ERROR: No python interpreter needed for jevents generation. Install python or build with NO_JEVENTS=1.)
        $(error ERROR: Python interpreter needed for jevents generation too old (older than 3.6). Install a newer python or build with NO_JEVENTS=1.)

   - Make libbpf 1.0 the minimum required when building with out of
     tree, distro provided libbpf.

   - Use libsdtc++'s and LLVM's libcxx's __cxa_demangle, a portable C++
     demangler, add 'perf test' entry for it.

   - Make binutils libraries opt in, as distros disable building with it
     due to licensing, they were used for C++ demangling, for instance.

   - Switch libpfm4 to opt-out rather than opt-in, if libpfm-devel (or
     equivalent) isn't installed, we'll just have a build warning:

       Makefile.config:1144: libpfm4 not found, disables libpfm4 support. Please install libpfm4-dev

   - Add a feature test for scandirat(), that is not implemented so far
     in musl and uclibc, disabling features that need it, such as
     scanning for tracepoints in /sys/kernel/tracing/events.

  perf BPF filters:

   - New feature where BPF can be used to filter samples, for instance:

      $ sudo ./perf record -e cycles --filter 'period > 1000' true
      $ sudo ./perf script
           perf-exec 2273949 546850.708501:       5029 cycles:  ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
           perf-exec 2273949 546850.708508:      32409 cycles:  ffffffff826f9e25 finish_wait+0x5 ([kernel.kallsyms])
           perf-exec 2273949 546850.708526:     143369 cycles:  ffffffff82b4cdbf xas_start+0x5f ([kernel.kallsyms])
           perf-exec 2273949 546850.708600:     372650 cycles:  ffffffff8286b8f7 __pagevec_lru_add+0x117 ([kernel.kallsyms])
           perf-exec 2273949 546850.708791:     482953 cycles:  ffffffff829190de __mod_memcg_lruvec_state+0x4e ([kernel.kallsyms])
                true 2273949 546850.709036:     501985 cycles:  ffffffff828add7c tlb_gather_mmu+0x4c ([kernel.kallsyms])
                true 2273949 546850.709292:     503065 cycles:      7f2446d97c03 _dl_map_object_deps+0x973 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)

   - In addition to 'period' (PERF_SAMPLE_PERIOD), the other
     PERF_SAMPLE_ can be used for filtering, and also some other sample
     accessible values, from tools/perf/Documentation/perf-record.txt:

        Essentially the BPF filter expression is:

        <term> <operator> <value> (("," | "||") <term> <operator> <value>)*

     The <term> can be one of:
        ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr,
        code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat,
        p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock,
        mem_dtlb, mem_blk, mem_hops

     The <operator> can be one of:
        ==, !=, >, >=, <, <=, &

     The <value> can be one of:
        <number> (for any term)
        na, load, store, pfetch, exec (for mem_op)
        l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl)
        na, none, hit, miss, hitm, fwd, peer (for mem_snoop)
        remote (for mem_remote)
        na, locked (for mem_locked)
        na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb)
        na, by_data, by_addr (for mem_blk)
        hops0, hops1, hops2, hops3 (for mem_hops)

  perf lock contention:

   - Show lock type with address.

   - Track and show mmap_lock, siglock and per-cpu rq_lock with address.
     This is done for mmap_lock by following the current->mm pointer:

      $ sudo ./perf lock con -abl -- sleep 10
       contended   total wait     max wait     avg wait            address   symbol
       ...
           16344    312.30 ms      2.22 ms     19.11 us   ffff8cc702595640
           17686    310.08 ms      1.49 ms     17.53 us   ffff8cc7025952c0
               3     84.14 ms     45.79 ms     28.05 ms   ffff8cc78114c478   mmap_lock
            3557     76.80 ms     68.75 us     21.59 us   ffff8cc77ca3af58
               1     68.27 ms     68.27 ms     68.27 ms   ffff8cda745dfd70
               9     54.53 ms      7.96 ms      6.06 ms   ffff8cc7642a48b8   mmap_lock
           14629     44.01 ms     60.00 us      3.01 us   ffff8cc7625f9ca0
            3481     42.63 ms    140.71 us     12.24 us   ffffffff937906ac   vmap_area_lock
           16194     38.73 ms     42.15 us      2.39 us   ffff8cd397cbc560
              11     38.44 ms     10.39 ms      3.49 ms   ffff8ccd6d12fbb8   mmap_lock
               1      5.43 ms      5.43 ms      5.43 ms   ffff8cd70018f0d8
            1674      5.38 ms    422.93 us      3.21 us   ffffffff92e06080   tasklist_lock
             581      4.51 ms    130.68 us      7.75 us   ffff8cc9b1259058
               5      3.52 ms      1.27 ms    703.23 us   ffff8cc754510070
             112      3.47 ms     56.47 us     31.02 us   ffff8ccee38b3120
             381      3.31 ms     73.44 us      8.69 us   ffffffff93790690   purge_vmap_area_lock
             255      3.19 ms     36.35 us     12.49 us   ffff8d053ce30c80

   - Update default map size to 16384.

   - Allocate single letter option -M for --map-nr-entries, as it is
     proving being frequently used.

   - Fix struct rq lock access for older kernels with BPF's CO-RE
     (Compile once, run everywhere).

   - Fix problems found with MSAn.

  perf report/top:

   - Add inline information when using --call-graph=fp or lbr, as was
     already done to the --call-graph=dwarf callchain mode.

   - Improve the 'srcfile' sort key performance by really using an
     optimization introduced in 6.2 for the 'srcline' sort key that
     avoids calling addr2line for comparision with each sample.

  perf sched:

   - Make 'perf sched latency/map/replay' to use "sched:sched_waking"
     instead of "sched:sched_waking", consistent with 'perf record'
     since d566a9c2d4 ("perf sched: Prefer sched_waking event when it
     exists").

  perf ftrace:

   - Make system wide the default target for latency subcommand, run the
     following command then generate some network traffic and press
     control+C:

       # perf ftrace latency -T __kfree_skb
     ^C
         DURATION     |      COUNT | GRAPH                                          |
          0 - 1    us |         27 | #############                                  |
          1 - 2    us |         22 | ###########                                    |
          2 - 4    us |          8 | ####                                           |
          4 - 8    us |          5 | ##                                             |
          8 - 16   us |         24 | ############                                   |
         16 - 32   us |          2 | #                                              |
         32 - 64   us |          1 |                                                |
         64 - 128  us |          0 |                                                |
        128 - 256  us |          0 |                                                |
        256 - 512  us |          0 |                                                |
        512 - 1024 us |          0 |                                                |
          1 - 2    ms |          0 |                                                |
          2 - 4    ms |          0 |                                                |
          4 - 8    ms |          0 |                                                |
          8 - 16   ms |          0 |                                                |
         16 - 32   ms |          0 |                                                |
         32 - 64   ms |          0 |                                                |
         64 - 128  ms |          0 |                                                |
        128 - 256  ms |          0 |                                                |
        256 - 512  ms |          0 |                                                |
        512 - 1024 ms |          0 |                                                |
          1 - ...   s |          0 |                                                |
       #

  perf top:

   - Add --branch-history (LBR: Last Branch Record) option, just like
     already available for 'perf record'.

   - Fix segfault in thread__comm_len() where thread->comm was being
     used outside thread->comm_lock.

  perf annotate:

   - Allow configuring objdump and addr2line in ~/.perfconfig., so that
     you can use alternative binaries, such as llvm's.

  perf kvm:

   - Add TUI mode for 'perf kvm stat report'.

  Reference counting:

   - Add reference count checking infrastructure to check for use after
     free, done to the 'cpumap', 'namespaces', 'maps' and 'map' structs,
     more to come.

     To build with it use -DREFCNT_CHECKING=1 in the make command line
     to build tools/perf. Documented at:

       https://perf.wiki.kernel.org/index.php/Reference_Count_Checking

   - The above caught, for instance, fix, present in this series:

        - Fix maps use after put in 'perf test "Share thread maps"':

          'maps' is copied from leader, but the leader is put on line 79
          and then 'maps' is used to read the reference count below - so
          a use after put, with the put of maps happening within
          thread__put.

     Fixed by reversing the order of puts so that the leader is put
     last.

   - Also several fixes were made to places where reference counts were
     not being held.

   - Make this one of the tests in 'make -C tools/perf build-test' to
     regularly build test it and to make sure no direct access to the
     reference counted structs are made, doing that via accessors to
     check the validity of the struct pointer.

  ARM64:

   - Fix 'perf report' segfault when filtering coresight traces by
     sparse lists of CPUs.

   - Add support for 'simd' as a sort field for 'perf report', to show
     ARM's NEON SIMD's predicate flags: "partial" and "empty".

  arm64 vendor events:

   - Add N1 metrics.

  Intel vendor events:

   - Add graniterapids, grandridge and sierraforrest events.

   - Refresh events for: alderlake, aldernaken, broadwell, broadwellde,
     broadwellx, cascadelakx, haswell, haswellx, icelake, icelakex,
     jaketown, meteorlake, knightslanding, sandybridge, sapphirerapids,
     silvermont, skylake, tigerlake and westmereep-dp

   - Refresh metrics for alderlake-n, broadwell, broadwellde,
     broadwellx, haswell, haswellx, icelakex, ivybridge, ivytown and
     skylakex.

  perf stat:

   - Implement --topdown using JSON metrics.

   - Add TopdownL1 JSON metric as a default if present, but disable it
     for now for some Intel hybrid architectures, a series of patches
     addressing this is being reviewed and will be submitted for v6.5.

   - Use metrics for --smi-cost.

   - Update topdown documentation.

  Vendor events (JSON) infrastructure:

   - Add support for computing and printing metric threshold values. For
     instance, here is one found in thesapphirerapids json file:

       {
           "BriefDescription": "Percentage of cycles spent in System Management Interrupts.",
           "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)",
           "MetricGroup": "smi",
           "MetricName": "smi_cycles",
           "MetricThreshold": "smi_cycles > 0.1",
           "ScaleUnit": "100%"
       },

   - Test parsing metric thresholds with the fake PMU in 'perf test
     pmu-events'.

   - Support for printing metric thresholds in 'perf list'.

   - Add --metric-no-threshold option to 'perf stat'.

   - Add rand (reverse and) and has_pmem (optane memory) support to
     metrics.

   - Sort list of input files to avoid depending on the order from
     readdir() helping in obtaining reproducible builds.

  S/390:

   - Add common metrics: - CPI (cycles per instruction), prbstate (ratio
     of instructions executed in problem state compared to total number
     of instructions), l1mp (Level one instruction and data cache misses
     per 100 instructions).

   - Add cache metrics for z13, z14, z15 and z16.

   - Add metric for TLB and cache.

  ARM:

   - Add raw decoding for SPE (Statistical Profiling Extension) v1.3 MTE
     (Memory Tagging Extension) and MOPS (Memory Operations) load/store.

  Intel PT hardware tracing:

   - Add event type names UINTR (User interrupt delivered) and UIRET
     (Exiting from user interrupt routine), documented in table 32-50
     "CFE Packet Type and Vector Fields Details" in the Intel Processor
     Trace chapter of The Intel SDM Volume 3 version 078.

   - Add support for new branch instructions ERETS and ERETU.

   - Fix CYC timestamps after standalone CBR

  ARM CoreSight hardware tracing:

   - Allow user to override timestamp and contextid settings.

   - Fix segfault in dso lookup.

   - Fix timeless decode mode detection.

   - Add separate decode paths for timeless and per-thread modes.

  auxtrace:

   - Fix address filter entire kernel size.

  Miscellaneous:

   - Fix use-after-free and unaligned bugs in the PLT handling routines.

   - Use zfree() to reduce chances of use after free.

   - Add missing 0x prefix for addresses printed in hexadecimal in 'perf
     probe'.

   - Suppress massive unsupported target platform errors in the unwind
     code.

   - Fix return incorrect build_id size in elf_read_build_id().

   - Fix 'perf scripts intel-pt-events.py' IPC output for Python 2 .

   - Add missing new parameter in kfree_skb tracepoint to the python
     scripts using it.

   - Add 'perf bench syscall fork' benchmark.

   - Add support for printing PERF_MEM_LVLNUM_UNC (Uncached access) in
     'perf mem'.

   - Fix wrong size expectation for perf test 'Setup struct
     perf_event_attr' caused by the patch adding
     perf_event_attr::config3.

   - Fix some spelling mistakes"

* tag 'perf-tools-for-v6.4-3-2023-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (365 commits)
  Revert "perf build: Make BUILD_BPF_SKEL default, rename to NO_BPF_SKEL"
  Revert "perf build: Warn for BPF skeletons if endian mismatches"
  perf metrics: Fix SEGV with --for-each-cgroup
  perf bpf skels: Stop using vmlinux.h generated from BTF, use subset of used structs + CO-RE
  perf stat: Separate bperf from bpf_profiler
  perf test record+probe_libc_inet_pton: Fix call chain match on x86_64
  perf test record+probe_libc_inet_pton: Fix call chain match on s390
  perf tracepoint: Fix memory leak in is_valid_tracepoint()
  perf cs-etm: Add fix for coresight trace for any range of CPUs
  perf build: Fix unescaped # in perf build-test
  perf unwind: Suppress massive unsupported target platform errors
  perf script: Add new parameter in kfree_skb tracepoint to the python scripts using it
  perf script: Print raw ip instead of binary offset for callchain
  perf symbols: Fix return incorrect build_id size in elf_read_build_id()
  perf list: Modify the warning message about scandirat(3)
  perf list: Fix memory leaks in print_tracepoint_events()
  perf lock contention: Rework offset calculation with BPF CO-RE
  perf lock contention: Fix struct rq lock access
  perf stat: Disable TopdownL1 on hybrid
  perf stat: Avoid SEGV on counter->name
  ...
2023-05-07 11:32:18 -07:00
Linus Torvalds
c8c655c34e s390:
* More phys_to_virt conversions
 
 * Improvement of AP management for VSIE (nested virtualization)
 
 ARM64:
 
 * Numerous fixes for the pathological lock inversion issue that
   plagued KVM/arm64 since... forever.
 
 * New framework allowing SMCCC-compliant hypercalls to be forwarded
   to userspace, hopefully paving the way for some more features
   being moved to VMMs rather than be implemented in the kernel.
 
 * Large rework of the timer code to allow a VM-wide offset to be
   applied to both virtual and physical counters as well as a
   per-timer, per-vcpu offset that complements the global one.
   This last part allows the NV timer code to be implemented on
   top.
 
 * A small set of fixes to make sure that we don't change anything
   affecting the EL1&0 translation regime just after having having
   taken an exception to EL2 until we have executed a DSB. This
   ensures that speculative walks started in EL1&0 have completed.
 
 * The usual selftest fixes and improvements.
 
 KVM x86 changes for 6.4:
 
 * Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
   and by giving the guest control of CR0.WP when EPT is enabled on VMX
   (VMX-only because SVM doesn't support per-bit controls)
 
 * Add CR0/CR4 helpers to query single bits, and clean up related code
   where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
   as a bool
 
 * Move AMD_PSFD to cpufeatures.h and purge KVM's definition
 
 * Avoid unnecessary writes+flushes when the guest is only adding new PTEs
 
 * Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
   when emulating invalidations
 
 * Clean up the range-based flushing APIs
 
 * Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
   A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
   changed SPTE" overhead associated with writing the entire entry
 
 * Track the number of "tail" entries in a pte_list_desc to avoid having
   to walk (potentially) all descriptors during insertion and deletion,
   which gets quite expensive if the guest is spamming fork()
 
 * Disallow virtualizing legacy LBRs if architectural LBRs are available,
   the two are mutually exclusive in hardware
 
 * Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
   after KVM_RUN, similar to CPUID features
 
 * Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
 
 * Apply PMU filters to emulated events and add test coverage to the
   pmu_event_filter selftest
 
 x86 AMD:
 
 * Add support for virtual NMIs
 
 * Fixes for edge cases related to virtual interrupts
 
 x86 Intel:
 
 * Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
   not being reported due to userspace not opting in via prctl()
 
 * Fix a bug in emulation of ENCLS in compatibility mode
 
 * Allow emulation of NOP and PAUSE for L2
 
 * AMX selftests improvements
 
 * Misc cleanups
 
 MIPS:
 
 * Constify MIPS's internal callbacks (a leftover from the hardware enabling
   rework that landed in 6.3)
 
 Generic:
 
 * Drop unnecessary casts from "void *" throughout kvm_main.c
 
 * Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
   size by 8 bytes on 64-bit kernels by utilizing a padding hole
 
 Documentation:
 
 * Fix goof introduced by the conversion to rST
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
 Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
 WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
 huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
 XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
 j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
 =S2Hq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "s390:

   - More phys_to_virt conversions

   - Improvement of AP management for VSIE (nested virtualization)

  ARM64:

   - Numerous fixes for the pathological lock inversion issue that
     plagued KVM/arm64 since... forever.

   - New framework allowing SMCCC-compliant hypercalls to be forwarded
     to userspace, hopefully paving the way for some more features being
     moved to VMMs rather than be implemented in the kernel.

   - Large rework of the timer code to allow a VM-wide offset to be
     applied to both virtual and physical counters as well as a
     per-timer, per-vcpu offset that complements the global one. This
     last part allows the NV timer code to be implemented on top.

   - A small set of fixes to make sure that we don't change anything
     affecting the EL1&0 translation regime just after having having
     taken an exception to EL2 until we have executed a DSB. This
     ensures that speculative walks started in EL1&0 have completed.

   - The usual selftest fixes and improvements.

  x86:

   - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
     enabled, and by giving the guest control of CR0.WP when EPT is
     enabled on VMX (VMX-only because SVM doesn't support per-bit
     controls)

   - Add CR0/CR4 helpers to query single bits, and clean up related code
     where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
     return as a bool

   - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

   - Avoid unnecessary writes+flushes when the guest is only adding new
     PTEs

   - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
     optimizations when emulating invalidations

   - Clean up the range-based flushing APIs

   - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
     single A/D bit using a LOCK AND instead of XCHG, and skip all of
     the "handle changed SPTE" overhead associated with writing the
     entire entry

   - Track the number of "tail" entries in a pte_list_desc to avoid
     having to walk (potentially) all descriptors during insertion and
     deletion, which gets quite expensive if the guest is spamming
     fork()

   - Disallow virtualizing legacy LBRs if architectural LBRs are
     available, the two are mutually exclusive in hardware

   - Disallow writes to immutable feature MSRs (notably
     PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features

   - Overhaul the vmx_pmu_caps selftest to better validate
     PERF_CAPABILITIES

   - Apply PMU filters to emulated events and add test coverage to the
     pmu_event_filter selftest

   - AMD SVM:
       - Add support for virtual NMIs
       - Fixes for edge cases related to virtual interrupts

   - Intel AMX:
       - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
         XTILE_DATA is not being reported due to userspace not opting in
         via prctl()
       - Fix a bug in emulation of ENCLS in compatibility mode
       - Allow emulation of NOP and PAUSE for L2
       - AMX selftests improvements
       - Misc cleanups

  MIPS:

   - Constify MIPS's internal callbacks (a leftover from the hardware
     enabling rework that landed in 6.3)

  Generic:

   - Drop unnecessary casts from "void *" throughout kvm_main.c

   - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
     struct size by 8 bytes on 64-bit kernels by utilizing a padding
     hole

  Documentation:

   - Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
2023-05-01 12:06:20 -07:00
Linus Torvalds
7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds
6e98b09da9 Networking changes for 6.4.
Core
 ----
 
  - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
    default value allows for better BIG TCP performances.
 
  - Reduce compound page head access for zero-copy data transfers.
 
  - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
 
  - Threaded NAPI improvements, adding defer skb free support and unneeded
    softirq avoidance.
 
  - Address dst_entry reference count scalability issues, via false
    sharing avoidance and optimize refcount tracking.
 
  - Add lockless accesses annotation to sk_err[_soft].
 
  - Optimize again the skb struct layout.
 
  - Extends the skb drop reasons to make it usable by multiple
    subsystems.
 
  - Better const qualifier awareness for socket casts.
 
 BPF
 ---
 
  - Add skb and XDP typed dynptrs which allow BPF programs for more
    ergonomic and less brittle iteration through data and variable-sized
    accesses.
 
  - Add a new BPF netfilter program type and minimal support to hook
    BPF programs to netfilter hooks such as prerouting or forward.
 
  - Add more precise memory usage reporting for all BPF map types.
 
  - Adds support for using {FOU,GUE} encap with an ipip device operating
    in collect_md mode and add a set of BPF kfuncs for controlling encap
    params.
 
  - Allow BPF programs to detect at load time whether a particular kfunc
    exists or not, and also add support for this in light skeleton.
 
  - Bigger batch of BPF verifier improvements to prepare for upcoming BPF
    open-coded iterators allowing for less restrictive looping capabilities.
 
  - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
    programs to NULL-check before passing such pointers into kfunc.
 
  - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
    local storage maps.
 
  - Enable RCU semantics for task BPF kptrs and allow referenced kptr
    tasks to be stored in BPF maps.
 
  - Add support for refcounted local kptrs to the verifier for allowing
    shared ownership, useful for adding a node to both the BPF list and
    rbtree.
 
  - Add BPF verifier support for ST instructions in convert_ctx_access()
    which will help new -mcpu=v4 clang flag to start emitting them.
 
  - Add ARM32 USDT support to libbpf.
 
  - Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations.
 
 Protocols
 ---------
 
  - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
    indicates the provenance of the IP address.
 
  - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
 
  - Add the handshake upcall mechanism, allowing the user-space
    to implement generic TLS handshake on kernel's behalf.
 
  - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
    resilience to nodes failures.
 
  - SCTP: add support for Fair Capacity and Weighted Fair Queueing
    schedulers.
 
  - MPTCP: delay first subflow allocation up to its first usage. This
    will allow for later better LSM interaction.
 
  - xfrm: Remove inner/outer modes from input/output path. These are
    not needed anymore.
 
  - WiFi:
    - reduced neighbor report (RNR) handling for AP mode
    - HW timestamping support
    - support for randomized auth/deauth TA for PASN privacy
    - per-link debugfs for multi-link
    - TC offload support for mac80211 drivers
    - mac80211 mesh fast-xmit and fast-rx support
    - enable Wi-Fi 7 (EHT) mesh support
 
 Netfilter
 ---------
 
  - Add nf_tables 'brouting' support, to force a packet to be routed
    instead of being bridged.
 
  - Update bridge netfilter and ovs conntrack helpers to handle
    IPv6 Jumbo packets properly, i.e. fetch the packet length
    from hop-by-hop extension header. This is needed for BIT TCP
    support.
 
  - The iptables 32bit compat interface isn't compiled in by default
    anymore.
 
  - Move ip(6)tables builtin icmp matches to the udptcp one.
    This has the advantage that icmp/icmpv6 match doesn't load the
    iptables/ip6tables modules anymore when iptables-nft is used.
 
  - Extended netlink error report for netdevice in flowtables and
    netdev/chains. Allow for incrementally add/delete devices to netdev
    basechain. Allow to create netdev chain without device.
 
 Driver API
 ----------
 
  - Remove redundant Device Control Error Reporting Enable, as PCI core
    has already error reporting enabled at enumeration time.
 
  - Move Multicast DB netlink handlers to core, allowing devices other
    then bridge to use them.
 
  - Allow the page_pool to directly recycle the pages from safely
    localized NAPI.
 
  - Implement lockless TX queue stop/wake combo macros, allowing for
    further code de-duplication and sanitization.
 
  - Add YNL support for user headers and struct attrs.
 
  - Add partial YNL specification for devlink.
 
  - Add partial YNL specification for ethtool.
 
  - Add tc-mqprio and tc-taprio support for preemptible traffic classes.
 
  - Add tx push buf len param to ethtool, specifies the maximum number
    of bytes of a transmitted packet a driver can push directly to the
    underlying device.
 
  - Add basic LED support for switch/phy.
 
  - Add NAPI documentation, stop relaying on external links.
 
  - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
    work to make the hardware timestamping layer selectable by user
    space.
 
  - Add transceiver support and improve the error messages for CAN-FD
    controllers.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - AMD/Pensando core device support
    - MediaTek MT7981 SoC
    - MediaTek MT7988 SoC
    - Broadcom BCM53134 embedded switch
    - Texas Instruments CPSW9G ethernet switch
    - Qualcomm EMAC3 DWMAC ethernet
    - StarFive JH7110 SoC
    - NXP CBTX ethernet PHY
 
  - WiFi:
    - Apple M1 Pro/Max devices
    - RealTek rtl8710bu/rtl8188gu
    - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
 
  - Bluetooth:
    - Realtek RTL8821CS, RTL8851B, RTL8852BS
    - Mediatek MT7663, MT7922
    - NXP w8997
    - Actions Semi ATS2851
    - QTI WCN6855
    - Marvell 88W8997
 
  - Can:
    - STMicroelectronics bxcan stm32f429
 
 Drivers
 -------
  - Ethernet NICs:
    - Intel (1G, icg):
      - add tracking and reporting of QBV config errors.
      - add support for configuring max SDU for each Tx queue.
    - Intel (100G, ice):
      - refactor mailbox overflow detection to support Scalable IOV
      - GNSS interface optimization
    - Intel (i40e):
      - support XDP multi-buffer
    - nVidia/Mellanox:
      - add the support for linux bridge multicast offload
      - enable TC offload for egress and engress MACVLAN over bond
      - add support for VxLAN GBP encap/decap flows offload
      - extend packet offload to fully support libreswan
      - support tunnel mode in mlx5 IPsec packet offload
      - extend XDP multi-buffer support
      - support MACsec VLAN offload
      - add support for dynamic msix vectors allocation
      - drop RX page_cache and fully use page_pool
      - implement thermal zone to report NIC temperature
    - Netronome/Corigine:
      - add support for multi-zone conntrack offload
    - Solarflare/Xilinx:
      - support offloading TC VLAN push/pop actions to the MAE
      - support TC decap rules
      - support unicast PTP
 
  - Other NICs:
    - Broadcom (bnxt): enforce software based freq adjustments only
 		on shared PHC NIC
    - RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
    - Micrel (lan8841): add support for PTP_PF_PEROUT
    - Cadence (macb): enable PTP unicast
    - Engleder (tsnep): add XDP socket zero-copy support
    - virtio-net: implement exact header length guest feature
    - veth: add page_pool support for page recycling
    - vxlan: add MDB data path support
    - gve: add XDP support for GQI-QPL format
    - geneve: accept every ethertype
    - macvlan: allow some packets to bypass broadcast queue
    - mana: add support for jumbo frame
 
  - Ethernet high-speed switches:
    - Microchip (sparx5): Add support for TC flower templates.
 
  - Ethernet embedded switches:
    - Broadcom (b54):
      - configure 6318 and 63268 RGMII ports
    - Marvell (mv88e6xxx):
      - faster C45 bus scan
    - Microchip:
      - lan966x:
        - add support for IS1 VCAP
        - better TX/RX from/to CPU performances
      - ksz9477: add ETS Qdisc support
      - ksz8: enhance static MAC table operations and error handling
      - sama7g5: add PTP capability
    - NXP (ocelot):
      - add support for external ports
      - add support for preemptible traffic classes
    - Texas Instruments:
      - add CPSWxG SGMII support for J7200 and J721E
 
  - Intel WiFi (iwlwifi):
    - preparation for Wi-Fi 7 EHT and multi-link support
    - EHT (Wi-Fi 7) sniffer support
    - hardware timestamping support for some devices/firwmares
    - TX beacon protection on newer hardware
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - MU-MIMO parameters support
    - ack signal support for management packets
 
  - RealTek WiFi (rtw88):
    - SDIO bus support
    - better support for some SDIO devices
      (e.g. MAC address from efuse)
 
  - RealTek WiFi (rtw89):
    - HW scan support for 8852b
    - better support for 6 GHz scanning
    - support for various newer firmware APIs
    - framework firmware backwards compatibility
 
  - MediaTek WiFi (mt76):
    - P2P support
    - mesh A-MSDU support
    - EHT (Wi-Fi 7) support
    - coredump support
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
 Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
 Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
 HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
 eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
 pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
 F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
 vP7ugFnttneg
 =ITVa
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
2023-04-26 16:07:23 -07:00
Paolo Bonzini
4f382a79a6 KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that
   plagued KVM/arm64 since... forever.
 
 - New framework allowing SMCCC-compliant hypercalls to be forwarded
   to userspace, hopefully paving the way for some more features
   being moved to VMMs rather than be implemented in the kernel.
 
 - Large rework of the timer code to allow a VM-wide offset to be
   applied to both virtual and physical counters as well as a
   per-timer, per-vcpu offset that complements the global one.
   This last part allows the NV timer code to be implemented on
   top.
 
 - A small set of fixes to make sure that we don't change anything
   affecting the EL1&0 translation regime just after having having
   taken an exception to EL2 until we have executed a DSB. This
   ensures that speculative walks started in EL1&0 have completed.
 
 - The usual selftest fixes and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2
 rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1
 HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL
 vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE
 61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf
 RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je
 +5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e
 0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB
 bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1
 1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT
 8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK
 DHT7RBE0
 =Bhta
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.4

- Numerous fixes for the pathological lock inversion issue that
  plagued KVM/arm64 since... forever.

- New framework allowing SMCCC-compliant hypercalls to be forwarded
  to userspace, hopefully paving the way for some more features
  being moved to VMMs rather than be implemented in the kernel.

- Large rework of the timer code to allow a VM-wide offset to be
  applied to both virtual and physical counters as well as a
  per-timer, per-vcpu offset that complements the global one.
  This last part allows the NV timer code to be implemented on
  top.

- A small set of fixes to make sure that we don't change anything
  affecting the EL1&0 translation regime just after having having
  taken an exception to EL2 until we have executed a DSB. This
  ensures that speculative walks started in EL1&0 have completed.

- The usual selftest fixes and improvements.
2023-04-26 15:46:52 -04:00
Linus Torvalds
53b5e72b9d asm-generic updates for 6.4
These are various cleanups, fixing a number of uapi header files to no
 longer reference CONFIG_* symbols, and one patch that introduces the
 new CONFIG_HAS_IOPORT symbol for architectures that provide working
 inb()/outb() macros, as a preparation for adding driver dependencies
 on those in the following release.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
 Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
 kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
 Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
 Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
 Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
 SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
 XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
 UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
 PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
 cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
 swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
 =H3k4
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are various cleanups, fixing a number of uapi header files to no
  longer reference CONFIG_* symbols, and one patch that introduces the
  new CONFIG_HAS_IOPORT symbol for architectures that provide working
  inb()/outb() macros, as a preparation for adding driver dependencies
  on those in the following release"

* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  Kconfig: introduce HAS_IOPORT option and select it as necessary
  scripts: Update the CONFIG_* ignore list in headers_install.sh
  pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
  Move bp_type_idx to include/linux/hw_breakpoint.h
  Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
  Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
2023-04-25 12:22:11 -07:00
Stefan Roesch
07115fcc15 selftests/mm: add new selftests for KSM
This adds three new tests to the selftests for KSM.  These tests use the
new prctl API's to enable and disable KSM.

1) add new prctl flags to prctl header file in tools dir

   This adds the new prctl flags to the include file prct.h in the
   tools directory.  This makes sure they are available for testing.

2) add KSM prctl merge test to ksm_tests

   This adds the -t option to the ksm_tests program.  The -t flag
   allows to specify if it should use madvise or prctl ksm merging.

3) add two functions for debugging merge outcome for ksm_tests

   This adds two functions to report the metrics in /proc/self/ksm_stat
   and /sys/kernel/debug/mm/ksm. The debug output is enabled with the
   -d option.

4) add KSM prctl test to ksm_functional_tests

   This adds a test to the ksm_functional_test that verifies that the
   prctl system call to enable / disable KSM works.

5) add KSM fork test to ksm_functional_test

   Add fork test to verify that the MMF_VM_MERGE_ANY flag is inherited
   by the child process.

Link: https://lkml.kernel.org/r/20230418051342.1919757-4-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Florian Westphal
d0fe92fb5e tools: bpftool: print netfilter link info
Dump protocol family, hook and priority value:
$ bpftool link
2: netfilter  prog 14
        ip input prio -128
        pids install(3264)
5: netfilter  prog 14
        ip6 forward prio 21
        pids a.out(3387)
9: netfilter  prog 14
        ip prerouting prio 123
        pids a.out(5700)
10: netfilter  prog 14
        ip input prio 21
        pids test2(5701)

v2: Quentin Monnet suggested to also add 'bpftool net' support:

$ bpftool net
xdp:

tc:

flow_dissector:

netfilter:

        ip prerouting prio 21 prog_id 14
        ip input prio -128 prog_id 14
        ip input prio 21 prog_id 14
        ip forward prio 21 prog_id 14
        ip output prio 21 prog_id 14
        ip postrouting prio 21 prog_id 14

'bpftool net' only dumps netfilter link type, links are sorted by protocol
family, hook and priority.

v5: fix bpf ci failure: libbpf needs small update to prog_type_name[]
    and probe_prog_load helper.
v4: don't fail with -EOPNOTSUPP in libbpf probe_prog_load, update
    prog_type_name[] with "netfilter" entry (bpf ci)
v3: fix bpf.h copy, 'reserved' member was removed (Alexei)
    use p_err, not fprintf (Quentin)

Suggested-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/eeeaac99-9053-90c2-aa33-cc1ecb1ae9ca@isovalent.com/
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-6-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21 11:34:49 -07:00
Dave Marchevsky
d54730b50b bpf: Introduce opaque bpf_refcount struct and add btf_record plumbing
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types
meant for use in BPF programs. Similarly to other opaque types like
bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in
user-defined struct types a bpf_refcount can be located, so necessary
btf_record plumbing is added to enable this. bpf_refcount is sized to
hold a refcount_t.

Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in
btf_record as refcount_off in addition to being in the field array.
Caching refcount_off makes sense for this field because further patches
in the series will modify functions that take local kptrs (e.g.
bpf_obj_drop) to change their behavior if the type they're operating on
is refcounted. So enabling fast "is this type refcounted?" checks is
desirable.

No such verifier behavior changes are introduced in this patch, just
logic to recognize 'struct bpf_refcount' in btf_record.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15 17:36:49 -07:00
Jakub Kicinski
c2865b1122 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
 g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
 mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
 =StAc
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f4 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc4 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cde ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a816 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
Andrii Nakryiko
47a71c1f9a bpf: Add log_true_size output field to return necessary log buffer size
Add output-only log_true_size and btf_log_true_size field to
BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return
the size of log buffer necessary to fit in all the log contents at
specified log_level. This is very useful for BPF loader libraries like
libbpf to be able to size log buffer correctly, but could be used by
users directly, if necessary, as well.

This patch plumbs all this through the code, taking into account actual
bpf_attr size provided by user to determine if these new fields are
expected by users. And if they are, set them from kernel on return.

We refactory btf_parse() function to accommodate this, moving attr and
uattr handling inside it. The rest is very straightforward code, which
is split from the logging accounting changes in the previous patch to
make it simpler to review logic vs UAPI changes.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org
2023-04-11 18:05:43 +02:00
Ravi Bangoria
e0999b0e21 tools include UAPI: Sync uapi/linux/perf_event.h with the kernel sources
... to bring PERF_MEM_LVLNUM_UNC definition to userspace

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20230407112459.548-5-ravi.bangoria@amd.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-04-10 19:25:12 -03:00
Herbert Xu
954d1fa1ac macvlan: Add netlink attribute for broadcast cutoff
Make the broadcast cutoff configurable through netlink.  Note
that macvlan is weird because there is no central device for
us to configure (the lowerdev could be anything).  So all the
options are duplicated over what could be thousands of child
devices.

IFLA_MACVLAN_BC_QUEUE_LEN took the approach of taking the maximum
of all child device settings.  This is unnecessary as we could
simply store the option in the port device and take the last
child device that gets updated as the value to use.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-29 09:03:32 +01:00
Kui-Feng Lee
aef56f2e91 bpf: Update the struct_ops of a bpf_link.
By improving the BPF_LINK_UPDATE command of bpf(), it should allow you
to conveniently switch between different struct_ops on a single
bpf_link. This would enable smoother transitions from one struct_ops
to another.

The struct_ops maps passing along with BPF_LINK_UPDATE should have the
BPF_F_LINK flag.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230323032405.3735486-6-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:53:02 -07:00
Kui-Feng Lee
68b04864ca bpf: Create links for BPF struct_ops maps.
Make bpf_link support struct_ops.  Previously, struct_ops were always
used alone without any associated links. Upon updating its value, a
struct_ops would be activated automatically. Yet other BPF program
types required to make a bpf_link with their instances before they
could become active. Now, however, you can create an inactive
struct_ops, and create a link to activate it later.

With bpf_links, struct_ops has a behavior similar to other BPF program
types. You can pin/unpin them from their links and the struct_ops will
be deactivated when its link is removed while previously need someone
to delete the value for it to be deactivated.

bpf_links are responsible for registering their associated
struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
set to create a bpf_link, while a structs without this flag behaves in
the same manner as before and is registered upon updating its value.

The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it
used to craft the links for BPF struct_ops programs, but also to
create links for BPF struct_ops them-self.  Since the links of BPF
struct_ops programs are only used to create trampolines internally,
they are never seen in other contexts. Thus, they can be reused for
struct_ops themself.

To maintain a reference to the map supporting this link, we add
bpf_struct_ops_link as an additional type. The pointer of the map is
RCU and won't be necessary until later in the patchset.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22 22:53:02 -07:00
Jakub Kicinski
1118aa4c70 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/wireless/nl80211.c
  b27f07c50a ("wifi: nl80211: fix puncturing bitmap policy")
  cbbaf2bb82 ("wifi: nl80211: add a command to enable/disable HW timestamping")
https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au

tools/testing/selftests/net/Makefile
  62199e3f16 ("selftests: net: Add VXLAN MDB test")
  13715acf8a ("selftest: Add test for bind() conflicts.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17 16:29:25 -07:00
Linus Torvalds
478a351ce0 Including fixes from netfilter, wifi and ipsec.
Current release - regressions:
 
  - phy: mscc: fix deadlock in phy_ethtool_{get,set}_wol()
 
  - virtio: vsock: don't use skbuff state to account credit
 
  - virtio: vsock: don't drop skbuff on copy failure
 
  - virtio_net: fix page_to_skb() miscalculating the memory size
 
 Current release - new code bugs:
 
  - eth: correct xdp_features after device reconfig
 
  - wifi: nl80211: fix the puncturing bitmap policy
 
  - net/mlx5e: flower:
    - fix raw counter initialization
    - fix missing error code
    - fix cloned flow attribute
 
  - ipa:
    - fix some register validity checks
    - fix a surprising number of bad offsets
    - kill FILT_ROUT_CACHE_CFG IPA register
 
 Previous releases - regressions:
 
  - tcp: fix bind() conflict check for dual-stack wildcard address
 
  - veth: fix use after free in XDP_REDIRECT when skb headroom is small
 
  - ipv4: fix incorrect table ID in IOCTL path
 
  - ipvlan: make skb->skb_iif track skb->dev for l3s mode
 
  - mptcp:
   - fix possible deadlock in subflow_error_report
   - fix UaFs when destroying unaccepted and listening sockets
 
  - dsa: mv88e6xxx: fix max_mtu of 1492 on 6165, 6191, 6220, 6250, 6290
 
 Previous releases - always broken:
 
  - tcp: tcp_make_synack() can be called from process context,
    don't assume preemption is disabled when updating stats
 
  - netfilter: correct length for loading protocol registers
 
  - virtio_net: add checking sq is full inside xdp xmit
 
  - bonding: restore IFF_MASTER/SLAVE flags on bond enslave
    Ethertype change
 
  - phy: nxp-c45-tja11xx: fix MII_BASIC_CONFIG_REV bit number
 
  - eth: i40e: fix crash during reboot when adapter is in recovery mode
 
  - eth: ice: avoid deadlock on rtnl lock when auxiliary device
    plug/unplug meets bonding
 
  - dsa: mt7530:
    - remove now incorrect comment regarding port 5
    - set PLL frequency and trgmii only when trgmii is used
 
  - eth: mtk_eth_soc: reset PCS state when changing interface types
 
 Misc:
 
  - ynl: another license adjustment
 
  - move the TCA_EXT_WARN_MSG attribute for tc action
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQUzTgACgkQMUZtbf5S
 IrvulQ/9GA5GaT52r5T9HaV5slygkHw9ValpfJAddI0MbBjeYfDhkSoTUujIr92W
 VMj+VpRcqS67pqzD2Z77s2EwB445NCOralB9ji8623tkCDevZU3gUKmjtiO5G7fP
 4iAUbibfXjQiDKIeCdcVZ+SXYYdBSQDfFvQskU6/nzKuqjEbhC+GbiMWz7rt2SKe
 q9gHFSK1du2SGa6fIfJTEosa+MX4UTwAhLOReS5vSFhrlOsUCeMGTCzBfDuacQqn
 Iq1MJqW2yLceUar164xkYAAwRdL/ZLVkWaMza7KjM8Qi04MiopuFB2+moFDowrM9
 D9lX6HMX9NUrHTFGjyZVk845PFxPW+Rnhu1/OKINdugOmcCHApYrtkxB6/Z+piS5
 sW3kfkTPsQydA6Dx/RINJE39z6EYabwIQCc68D1HlPuTpOjYWTQdn0CvwxCmOFCr
 saTkd1wOeiwy8BheBSeX1QCkx4MwO6Dg+ObX/eKsYXGGWPMZcbMdbmmvFu7dZHhO
 cH4AGypRMrDa2IoYGqIs5sgkjxAMZZSkeQ1E+EpPw3n4us/QjQYrey5uto8uvErm
 zz7hI1qAwM8dooxsKdPyaARzM//Bq/gmYbqD0Ahts2t6BMX6eX2weneuQ4VJEf94
 8RTtIu9BbBH0ysgBkgqMwCeM4YVtG+/e7p390z4tqPrwOi7bZ5A=
 =5/YI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, wifi and ipsec.

  A little more changes than usual, but it's pretty normal for us that
  the rc3/rc4 PRs are oversized as people start testing in earnest.

  Possibly an extra boost from people deploying the 6.1 LTS but that's
  more of an unscientific hunch.

  Current release - regressions:

   - phy: mscc: fix deadlock in phy_ethtool_{get,set}_wol()

   - virtio: vsock: don't use skbuff state to account credit

   - virtio: vsock: don't drop skbuff on copy failure

   - virtio_net: fix page_to_skb() miscalculating the memory size

  Current release - new code bugs:

   - eth: correct xdp_features after device reconfig

   - wifi: nl80211: fix the puncturing bitmap policy

   - net/mlx5e: flower:
      - fix raw counter initialization
      - fix missing error code
      - fix cloned flow attribute

   - ipa:
      - fix some register validity checks
      - fix a surprising number of bad offsets
      - kill FILT_ROUT_CACHE_CFG IPA register

  Previous releases - regressions:

   - tcp: fix bind() conflict check for dual-stack wildcard address

   - veth: fix use after free in XDP_REDIRECT when skb headroom is small

   - ipv4: fix incorrect table ID in IOCTL path

   - ipvlan: make skb->skb_iif track skb->dev for l3s mode

   - mptcp:
      - fix possible deadlock in subflow_error_report
      - fix UaFs when destroying unaccepted and listening sockets

   - dsa: mv88e6xxx: fix max_mtu of 1492 on 6165, 6191, 6220, 6250, 6290

  Previous releases - always broken:

   - tcp: tcp_make_synack() can be called from process context, don't
     assume preemption is disabled when updating stats

   - netfilter: correct length for loading protocol registers

   - virtio_net: add checking sq is full inside xdp xmit

   - bonding: restore IFF_MASTER/SLAVE flags on bond enslave Ethertype
     change

   - phy: nxp-c45-tja11xx: fix MII_BASIC_CONFIG_REV bit number

   - eth: i40e: fix crash during reboot when adapter is in recovery mode

   - eth: ice: avoid deadlock on rtnl lock when auxiliary device
     plug/unplug meets bonding

   - dsa: mt7530:
      - remove now incorrect comment regarding port 5
      - set PLL frequency and trgmii only when trgmii is used

   - eth: mtk_eth_soc: reset PCS state when changing interface types

  Misc:

   - ynl: another license adjustment

   - move the TCA_EXT_WARN_MSG attribute for tc action"

* tag 'net-6.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits)
  selftests: bonding: add tests for ether type changes
  bonding: restore bond's IFF_SLAVE flag if a non-eth dev enslave fails
  bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change
  net: renesas: rswitch: Fix GWTSDIE register handling
  net: renesas: rswitch: Fix the output value of quote from rswitch_rx()
  ethernet: sun: add check for the mdesc_grab()
  net: ipa: fix some register validity checks
  net: ipa: kill FILT_ROUT_CACHE_CFG IPA register
  net: ipa: add two missing declarations
  net: ipa: reg: include <linux/bug.h>
  net: xdp: don't call notifiers during driver init
  net/sched: act_api: add specific EXT_WARN_MSG for tc action
  Revert "net/sched: act_api: move TCA_EXT_WARN_MSG to the correct hierarchy"
  net: dsa: microchip: fix RGMII delay configuration on KSZ8765/KSZ8794/KSZ8795
  ynl: make the tooling check the license
  ynl: broaden the license even more
  tools: ynl: make definitions optional again
  hsr: ratelimit only when errors are printed
  qed/qed_mng_tlv: correctly zero out ->min instead of ->hour
  selftests: net: devlink_port_split.py: skip test if no suitable device available
  ...
2023-03-17 13:31:16 -07:00
Jakub Kicinski
4e16b6a748 ynl: broaden the license even more
I relicensed Netlink spec code to GPL-2.0 OR BSD-3-Clause but
we still put a slightly different license on the uAPI header
than the rest of the code. Use the Linux-syscall-note on all
the specs and all generated code. It's moot for kernel code,
but should not hurt. This way the licenses match everywhere.

Cc: Chuck Lever <chuck.lever@oracle.com>
Fixes: 37d9df224d ("ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause")
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-16 21:20:32 -07:00
Thomas Huth
c5edd753a0 KVM: x86: Remove the KVM_GET_NR_MMU_PAGES ioctl
The KVM_GET_NR_MMU_PAGES ioctl is quite questionable on 64-bit hosts
since it fails to return the full 64 bits of the value that can be
set with the corresponding KVM_SET_NR_MMU_PAGES call. Its "long" return
value is truncated into an "int" in the kvm_arch_vm_ioctl() function.

Since this ioctl also never has been used by userspace applications
(QEMU, Google's internal VMM, kvmtool and CrosVM have been checked),
it's likely the best if we remove this badly designed ioctl before
anybody really tries to use it.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230208140105.655814-4-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-16 10:18:06 -04:00
Ross Zwisler
27d7fdf06f bpf: use canonical ftrace path
The canonical location for the tracefs filesystem is at /sys/kernel/tracing.

But, from Documentation/trace/ftrace.rst:

  Before 4.1, all ftrace tracing control files were within the debugfs
  file system, which is typically located at /sys/kernel/debug/tracing.
  For backward compatibility, when mounting the debugfs file system,
  the tracefs file system will be automatically mounted at:

  /sys/kernel/debug/tracing

Many comments and samples in the bpf code still refer to this older
debugfs path, so let's update them to avoid confusion.  There are a few
spots where the bpf code explicitly checks both tracefs and debugfs
(tools/bpf/bpftool/tracelog.c and tools/lib/api/fs/fs.c) and I've left
those alone so that the tools can continue to work with both paths.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230313205628.1058720-2-zwisler@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-13 21:51:30 -07:00
Lorenzo Bianconi
f85949f982 xdp: add xdp_set_features_flag utility routine
Introduce xdp_set_features_flag utility routine in order to update
dynamically xdp_features according to the dynamic hw configuration via
ethtool (e.g. changing number of hw rx/tx queues).
Add xdp_clear_features_flag() in order to clear all xdp_feature flag.

Reviewed-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10 21:33:47 -08:00
Palmer Dabbelt
d3c7ec7588 Move bp_type_idx to include/linux/hw_breakpoint.h
This has a "#ifdef CONFIG_*" that used to be exposed to userspace.

The names in here are so generic that I don't think it's a good idea
to expose them to userspace (or even the rest of the kernel).  There are
multiple in-kernel users, so it's been moved to a kernel header file.

Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Reviewed-by: Andrew Waterman <waterman@eecs.berkeley.edu>
Reviewed-by: Albert Ou <aou@eecs.berkeley.edu>
Message-Id: <1447119071-19392-10-git-send-email-palmer@dabbelt.com>
[thuth: Remove it also from tools/include/uapi/linux/hw_breakpoint.h]
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-03-10 21:05:16 +01:00
Linus Torvalds
49be4fb281 perf tools fixes for v6.3:
- Add Adrian Hunter to MAINTAINERS as a perf tools reviewer.
 
 - Sync various tools/ copies of kernel headers with the kernel sources, this
   time trying to avoid first merging with upstream to then update but instead
   copy from upstream so that a merge is avoided and the end result after merging
   this pull request is the one expected, tools/perf/check-headers.sh (mostly)
   happy, less warnings while building tools/perf/.
 
 - Fix counting when initial delay configured by setting
   perf_attr.enable_on_exec when starting workloads from the perf command line.
 
 - Don't avoid emitting a PERF_RECORD_MMAP2 in 'perf inject --buildid-all' when
   that record comes with a build-id, otherwise we end up not being able to
   resolve symbols.
 
 - Don't use comma as the CSV output separator the "stat+csv_output" test, as
   comma can appear on some tests as a modifier for an event, use @ instead,
   ditto for the JSON linter test.
 
 - The offcpu test was looking for some bits being set on
   task_struct->prev_state without masking other bits not important for this
   specific 'perf test', fix it.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCZApKjQAKCRCyPKLppCJ+
 JzdfAQDRnwDCxhb4cvx7lVR32L1XMIFW6qLWRBJWoxC2SJi6lgD/SoQgKswkxrJv
 XnBP7jEaIsh3M3ak82MxLKbjSAEvnwk=
 =jup7
 -----END PGP SIGNATURE-----

Merge tag 'perf-tools-fixes-for-v6.3-1-2023-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Add Adrian Hunter to MAINTAINERS as a perf tools reviewer

 - Sync various tools/ copies of kernel headers with the kernel sources,
   this time trying to avoid first merging with upstream to then update
   but instead copy from upstream so that a merge is avoided and the end
   result after merging this pull request is the one expected,
   tools/perf/check-headers.sh (mostly) happy, less warnings while
   building tools/perf/

 - Fix counting when initial delay configured by setting
   perf_attr.enable_on_exec when starting workloads from the perf
   command line

 - Don't avoid emitting a PERF_RECORD_MMAP2 in 'perf inject
   --buildid-all' when that record comes with a build-id, otherwise we
   end up not being able to resolve symbols

 - Don't use comma as the CSV output separator the "stat+csv_output"
   test, as comma can appear on some tests as a modifier for an event,
   use @ instead, ditto for the JSON linter test

 - The offcpu test was looking for some bits being set on
   task_struct->prev_state without masking other bits not important for
   this specific 'perf test', fix it

* tag 'perf-tools-fixes-for-v6.3-1-2023-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf tools: Add Adrian Hunter to MAINTAINERS as a reviewer
  tools headers UAPI: Sync linux/perf_event.h with the kernel sources
  tools headers x86 cpufeatures: Sync with the kernel sources
  tools include UAPI: Sync linux/vhost.h with the kernel sources
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  tools headers kvm: Sync uapi/{asm/linux} kvm.h headers with the kernel sources
  tools include UAPI: Synchronize linux/fcntl.h with the kernel sources
  tools headers: Synchronize {linux,vdso}/bits.h with the kernel sources
  tools headers UAPI: Sync linux/prctl.h with the kernel sources
  tools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench'
  perf stat: Fix counting when initial delay configured
  tools headers svm: Sync svm headers with the kernel sources
  perf test: Avoid counting commas in json linter
  perf tests stat+csv_output: Switch CSV separator to @
  perf inject: Fix --buildid-all not to eat up MMAP2
  tools arch x86: Sync the msr-index.h copy with the kernel sources
  perf test: Fix offcpu test prev_state check
2023-03-10 08:18:46 -08:00
Jakub Kicinski
d0ddf5065f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  d56b0c461d ("bpf, docs: Fix link to netdev-FAQ target")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-09 22:22:11 -08:00
Michael Weiß
5a70f4a630 bpf: Fix a typo for BPF_F_ANY_ALIGNMENT in bpf.h
Fix s/BPF_PROF_LOAD/BPF_PROG_LOAD/ typo in the documentation comment
for BPF_F_ANY_ALIGNMENT in bpf.h.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230309133823.944097-1-michael.weiss@aisec.fraunhofer.de
2023-03-09 20:42:57 +01:00
Andrii Nakryiko
6018e1f407 bpf: implement numbers iterator
Implement the first open-coded iterator type over a range of integers.

It's public API consists of:
  - bpf_iter_num_new() constructor, which accepts [start, end) range
    (that is, start is inclusive, end is exclusive).
  - bpf_iter_num_next() which will keep returning read-only pointer to int
    until the range is exhausted, at which point NULL will be returned.
    If bpf_iter_num_next() is kept calling after this, NULL will be
    persistently returned.
  - bpf_iter_num_destroy() destructor, which needs to be called at some
    point to clean up iterator state. BPF verifier enforces that iterator
    destructor is called at some point before BPF program exits.

Note that `start = end = X` is a valid combination to setup an empty
iterator. bpf_iter_num_new() will return 0 (success) for any such
combination.

If bpf_iter_num_new() detects invalid combination of input arguments, it
returns error, resets iterator state to, effectively, empty iterator, so
any subsequent call to bpf_iter_num_next() will keep returning NULL.

BPF verifier has no knowledge that returned integers are in the
[start, end) value range, as both `start` and `end` are not statically
known and enforced: they are runtime values.

While the implementation is pretty trivial, some care needs to be taken
to avoid overflows and underflows. Subsequent selftests will validate
correctness of [start, end) semantics, especially around extremes
(INT_MIN and INT_MAX).

Similarly to bpf_loop(), we enforce that no more than BPF_MAX_LOOPS can
be specified.

bpf_iter_num_{new,next,destroy}() is a logical evolution from bounded
BPF loops and bpf_loop() helper and is the basis for implementing
ergonomic BPF loops with no statically known or verified bounds.
Subsequent patches implement bpf_for() macro, demonstrating how this can
be wrapped into something that works and feels like a normal for() loop
in C language.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230308184121.1165081-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-08 16:19:51 -08:00
Jakub Kicinski
37d9df224d ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause
I was intending to make all the Netlink Spec code BSD-3-Clause
to ease the adoption but it appears that:
 - I fumbled the uAPI and used "GPL WITH uAPI note" there
 - it gives people pause as they expect GPL in the kernel
As suggested by Chuck re-license under dual. This gives us benefit
of full BSD freedom while fulfilling the broad "kernel is under GPL"
expectations.

Link: https://lore.kernel.org/all/20230304120108.05dd44c5@kernel.org/
Link: https://lore.kernel.org/r/20230306200457.3903854-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07 13:44:30 -08:00
Jakub Kicinski
36e5e391a2 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZAZsBwAKCRDbK58LschI
 g3W1AQCQnO6pqqX5Q2aYDAZPlZRtV2TRLjuqrQE0dHW/XLAbBgD/bgsAmiKhPSCG
 2mTt6izpTQVlZB0e8KcDIvbYd9CE3Qc=
 =EjJQ
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-03-06

We've added 85 non-merge commits during the last 13 day(s) which contain
a total of 131 files changed, 7102 insertions(+), 1792 deletions(-).

The main changes are:

1) Add skb and XDP typed dynptrs which allow BPF programs for more
   ergonomic and less brittle iteration through data and variable-sized
   accesses, from Joanne Koong.

2) Bigger batch of BPF verifier improvements to prepare for upcoming BPF
   open-coded iterators allowing for less restrictive looping capabilities,
   from Andrii Nakryiko.

3) Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
   programs to NULL-check before passing such pointers into kfunc,
   from Alexei Starovoitov.

4) Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
   local storage maps, from Kumar Kartikeya Dwivedi.

5) Add BPF verifier support for ST instructions in convert_ctx_access()
   which will help new -mcpu=v4 clang flag to start emitting them,
   from Eduard Zingerman.

6) Make uprobe attachment Android APK aware by supporting attachment
   to functions inside ELF objects contained in APKs via function names,
   from Daniel Müller.

7) Add a new flag BPF_F_TIMER_ABS flag for bpf_timer_start() helper
   to start the timer with absolute expiration value instead of relative
   one, from Tero Kristo.

8) Add a new kfunc bpf_cgroup_from_id() to look up cgroups via id,
   from Tejun Heo.

9) Extend libbpf to support users manually attaching kprobes/uprobes
   in the legacy/perf/link mode, from Menglong Dong.

10) Implement workarounds in the mips BPF JIT for DADDI/R4000,
   from Jiaxun Yang.

11) Enable mixing bpf2bpf and tailcalls for the loongarch BPF JIT,
    from Hengqi Chen.

12) Extend BPF instruction set doc with describing the encoding of BPF
    instructions in terms of how bytes are stored under big/little endian,
    from Jose E. Marchesi.

13) Follow-up to enable kfunc support for riscv BPF JIT, from Pu Lehui.

14) Fix bpf_xdp_query() backwards compatibility on old kernels,
    from Yonghong Song.

15) Fix BPF selftest cross compilation with CLANG_CROSS_FLAGS,
    from Florent Revest.

16) Improve bpf_cpumask_ma to only allocate one bpf_mem_cache,
    from Hou Tao.

17) Fix BPF verifier's check_subprogs to not unnecessarily mark
    a subprogram with has_tail_call, from Ilya Leoshkevich.

18) Fix arm syscall regs spec in libbpf's bpf_tracing.h, from Puranjay Mohan.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Add test for legacy/perf kprobe/uprobe attach mode
  selftests/bpf: Split test_attach_probe into multi subtests
  libbpf: Add support to set kprobe/uprobe attach mode
  tools/resolve_btfids: Add /libsubcmd to .gitignore
  bpf: add support for fixed-size memory pointer returns for kfuncs
  bpf: generalize dynptr_get_spi to be usable for iters
  bpf: mark PTR_TO_MEM as non-null register type
  bpf: move kfunc_call_arg_meta higher in the file
  bpf: ensure that r0 is marked scratched after any function call
  bpf: fix visit_insn()'s detection of BPF_FUNC_timer_set_callback helper
  bpf: clean up visit_insn()'s instruction processing
  selftests/bpf: adjust log_fixup's buffer size for proper truncation
  bpf: honor env->test_state_freq flag in is_state_visited()
  selftests/bpf: enhance align selftest's expected log matching
  bpf: improve regsafe() checks for PTR_TO_{MEM,BUF,TP_BUFFER}
  bpf: improve stack slot state printing
  selftests/bpf: Disassembler tests for verifier.c:convert_ctx_access()
  selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
  bpf: allow ctx writes using BPF_ST_MEM instruction
  bpf: Use separate RCU callbacks for freeing selem
  ...
====================

Link: https://lore.kernel.org/r/20230307004346.27578-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06 20:36:39 -08:00
Arnaldo Carvalho de Melo
06a1574b94 tools headers UAPI: Sync linux/perf_event.h with the kernel sources
To pick up the changes in:

  09519ec3b1 ("perf: Add perf_event_attr::config3")

The patches for the tooling side will come later.

This addresses this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
  diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/lkml/ZAZLYmDjWjSItWOq@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-06 17:21:23 -03:00
Arnaldo Carvalho de Melo
14e998ed42 tools include UAPI: Sync linux/vhost.h with the kernel sources
To get the changes in:

  3b688d7a08 ("vhost-vdpa: uAPI to resume the device")

To pick up these changes and support them:

  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
  $ cp ../linux/include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
  $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
  $ diff -u before after
  --- before	2023-03-06 09:26:14.889251817 -0300
  +++ after	2023-03-06 09:26:20.594406270 -0300
  @@ -30,6 +30,7 @@
   	[0x77] = "VDPA_SET_CONFIG_CALL",
   	[0x7C] = "VDPA_SET_GROUP_ASID",
   	[0x7D] = "VDPA_SUSPEND",
  +	[0x7E] = "VDPA_RESUME",
   };
   static const char *vhost_virtio_ioctl_read_cmds[] = {
   	[0x00] = "GET_FEATURES",
  $

For instance, see how those 'cmd' ioctl arguments get translated, now
VDPA_RESUME will be as well:

  # perf trace -a -e ioctl --max-events=10
       0.000 ( 0.011 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                        = 0
      21.353 ( 0.014 ms): pipewire/2261 ioctl(fd: 60, cmd: SNDRV_PCM_HWSYNC, arg: 0x1)                        = 0
      25.766 ( 0.014 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)            = 0
      25.845 ( 0.034 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70)            = 0
      25.916 ( 0.011 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)               = 0
      25.941 ( 0.025 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ATOMIC, arg: 0x7ffe4a22c840)               = 0
      32.915 ( 0.009 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_RMFB, arg: 0x7ffe4a22cf9c)                 = 0
      42.522 ( 0.013 ms): gnome-shell/2196 ioctl(fd: 14, cmd: DRM_I915_IRQ_WAIT, arg: 0x7ffe4a22c740)            = 0
      42.579 ( 0.031 ms): gnome-shel:cs0/2212 ioctl(fd: 14, cmd: DRM_I915_IRQ_EMIT, arg: 0x7fd43915dc70)            = 0
      42.644 ( 0.010 ms): gnome-shell/2196 ioctl(fd: 9, cmd: DRM_MODE_ADDFB2, arg: 0x7ffe4a22c8a0)               = 0
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Link: https://lore.kernel.org/lkml/ZAXdCTecxSNwAoeK@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-06 09:31:26 -03:00
Arnaldo Carvalho de Melo
33c53f9b5a tools headers kvm: Sync uapi/{asm/linux} kvm.h headers with the kernel sources
To pick up the changes in:

  89b0e7de34 ("KVM: arm64: nv: Introduce nested virtualization VCPU feature")
  14329b825f ("KVM: x86/pmu: Introduce masked events to the pmu event filter")
  6213b701a9 ("KVM: x86: Replace 0-length arrays with flexible arrays")
  3fd49805d1 ("KVM: s390: Extend MEM_OP ioctl by storage key checked cmpxchg")
  14329b825f ("KVM: x86/pmu: Introduce masked events to the pmu event filter")

That don't change functionality in tools/perf, as no new ioctl is added
for the 'perf trace' scripts to harvest.

This addresses these perf build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
  diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
  Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/kvm.h' differs from latest version at 'arch/x86/include/uapi/asm/kvm.h'
  diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
  Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
  diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h

Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kees Kook <keescook@chromium.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/lkml/ZAJlg7%2FfWDVGX0F3@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-03 23:24:10 -03:00
Arnaldo Carvalho de Melo
5f800380af tools include UAPI: Synchronize linux/fcntl.h with the kernel sources
To pick up the changes in:

  6fd7353829 ("mm/memfd: add F_SEAL_EXEC")

That doesn't add or change any perf tools functionality, only addresses
these build warnings:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fcntl.h' differs from latest version at 'include/uapi/linux/fcntl.h'
  diff -u tools/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-03 22:34:20 -03:00
Arnaldo Carvalho de Melo
df4b933e0e tools headers UAPI: Sync linux/prctl.h with the kernel sources
To pick new prctl options introduced in:

  b507808ebc ("mm: implement memory-deny-write-execute as a prctl")

That results in:

  $ diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h
  --- tools/include/uapi/linux/prctl.h	2022-06-20 17:54:43.884515663 -0300
  +++ include/uapi/linux/prctl.h	2023-03-03 11:18:51.090923569 -0300
  @@ -281,6 +281,12 @@
   # define PR_SME_VL_LEN_MASK		0xffff
   # define PR_SME_VL_INHERIT		(1 << 17) /* inherit across exec */

  +/* Memory deny write / execute */
  +#define PR_SET_MDWE			65
  +# define PR_MDWE_REFUSE_EXEC_GAIN	1
  +
  +#define PR_GET_MDWE			66
  +
   #define PR_SET_VMA		0x53564d41
   # define PR_SET_VMA_ANON_NAME		0

  $ tools/perf/trace/beauty/prctl_option.sh > before
  $ cp include/uapi/linux/prctl.h tools/include/uapi/linux/prctl.h
  $ tools/perf/trace/beauty/prctl_option.sh > after
  $ diff -u before after
  --- before	2023-03-03 11:47:43.320013146 -0300
  +++ after	2023-03-03 11:47:50.937216229 -0300
  @@ -59,6 +59,8 @@
   	[62] = "SCHED_CORE",
   	[63] = "SME_SET_VL",
   	[64] = "SME_GET_VL",
  +	[65] = "SET_MDWE",
  +	[66] = "GET_MDWE",
   };
   static const char *prctl_set_mm_options[] = {
   	[1] = "START_CODE",
  $

Now users can do:

  # perf trace -e syscalls:sys_enter_prctl --filter "option==SET_MDWE||option==GET_MDWE"
^C#
  # trace -v -e syscalls:sys_enter_prctl --filter "option==SET_MDWE||option==GET_MDWE"
  New filter for syscalls:sys_enter_prctl: (option==65||option==66) && (common_pid != 5519 && common_pid != 3404)
^C#

And when these prctl options appears in a session, they will be
translated to the corresponding string.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZAI%2FAoPXb%2Fsxz1%2Fm@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-03-03 22:34:08 -03:00
Tero Kristo
f71f853049 bpf: Add support for absolute value BPF timers
Add a new flag BPF_F_TIMER_ABS that can be passed to bpf_timer_start()
to start an absolute value timer instead of the default relative value.
This makes the timer expire at an exact point in time, instead of a time
with latencies induced by both the BPF and timer subsystems.

Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Link: https://lore.kernel.org/r/20230302114614.2985072-2-tero.kristo@linux.intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-02 22:41:32 -08:00
Joanne Koong
66e3a13e7c bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr
Two new kfuncs are added, bpf_dynptr_slice and bpf_dynptr_slice_rdwr.
The user must pass in a buffer to store the contents of the data slice
if a direct pointer to the data cannot be obtained.

For skb and xdp type dynptrs, these two APIs are the only way to obtain
a data slice. However, for other types of dynptrs, there is no
difference between bpf_dynptr_slice(_rdwr) and bpf_dynptr_data.

For skb type dynptrs, the data is copied into the user provided buffer
if any of the data is not in the linear portion of the skb. For xdp type
dynptrs, the data is copied into the user provided buffer if the data is
between xdp frags.

If the skb is cloned and a call to bpf_dynptr_data_rdwr is made, then
the skb will be uncloned (see bpf_unclone_prologue()).

Please note that any bpf_dynptr_write() automatically invalidates any prior
data slices of the skb dynptr. This is because the skb may be cloned or
may need to pull its paged buffer into the head. As such, any
bpf_dynptr_write() will automatically have its prior data slices
invalidated, even if the write is to data in the skb head of an uncloned
skb. Please note as well that any other helper calls that change the
underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
slices of the skb dynptr as well, for the same reasons.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-10-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
05421aecd4 bpf: Add xdp dynptrs
Add xdp dynptrs, which are dynptrs whose underlying pointer points
to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For reads and writes on the dynptr, this includes reading/writing
from/to and across fragments. Data slices through the bpf_dynptr_data
API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() should be used.

For examples of how xdp dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-9-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Joanne Koong
b5964b968a bpf: Add skb dynptrs
Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only (bpf_dynptr_write() will return an error)

For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
interfaces, reading and writing from/to data in the head as well as from/to
non-linear paged buffers is supported. Data slices through the
bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and
bpf_dynptr_slice_rdwr() (added in subsequent commit) should be used.

For examples of how skb dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230301154953.641654-8-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01 09:55:24 -08:00
Tariq Toukan
1862de92c8 netdev-genl: fix repeated typo oflloading -> offloading
Fix a repeated copy/paste typo.

Fixes: d3d854fd6a ("netdev-genl: create a simple family for netdev stuff")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-24 11:01:16 +00:00
Martin KaFai Lau
31de4105f0 bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
The bpf_fib_lookup() also looks up the neigh table.
This was done before bpf_redirect_neigh() was added.

In the use case that does not manage the neigh table
and requires bpf_fib_lookup() to lookup a fib to
decide if it needs to redirect or not, the bpf prog can
depend only on using bpf_redirect_neigh() to lookup the
neigh. It also keeps the neigh entries fresh and connected.

This patch adds a bpf_fib_lookup flag, SKIP_NEIGH, to avoid
the double neigh lookup when the bpf prog always call
bpf_redirect_neigh() to do the neigh lookup. The params->smac
output is skipped together when SKIP_NEIGH is set because
bpf_redirect_neigh() will figure out the smac also.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217205515.3583372-1-martin.lau@linux.dev
2023-02-17 22:12:04 +01:00
Dave Marchevsky
9c395c1b99 bpf: Add basic bpf_rb_{root,node} support
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.

structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached rb_node, respectively.

btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.

btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.

After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 19:31:13 -08:00
Florian Lehner
17c9b4e1a7 bpf: fix typo in header for bpf_perf_prog_read_value
Fix a simple typo in the documentation for bpf_perf_prog_read_value.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230203121439.25884-1-dev@der-flo.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-02-03 22:11:21 -08:00
Jakub Kicinski
d3d854fd6a netdev-genl: create a simple family for netdev stuff
Add a Netlink spec-compatible family for netdevs.
This is a very simple implementation without much
thought going into it.

It allows us to reap all the benefits of Netlink specs,
one can use the generic client to issue the commands:

  $ ./cli.py --spec netdev.yaml --dump dev_get
  [{'ifindex': 1, 'xdp-features': set()},
   {'ifindex': 2, 'xdp-features': {'basic', 'ndo-xmit', 'redirect'}},
   {'ifindex': 3, 'xdp-features': {'rx-sg'}}]

the generic python library does not have flags-by-name
support, yet, but we also don't have to carry strings
in the messages, as user space can get the names from
the spec.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/327ad9c9868becbe1e601b580c962549c8cd81f2.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-02 20:48:23 -08:00
Tiezhu Yang
e2bd974298 tools/bpf: Use tab instead of white spaces to sync bpf.h
Just silence the following build warning:

Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Link: https://lore.kernel.org/r/1675319486-27744-2-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-02 20:38:32 -08:00
Jakub Kicinski
2d104c390f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
 gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
 QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
 =yYqA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28 00:00:14 -08:00
Stanislav Fomichev
2b3486bc2d bpf: Introduce device-bound XDP programs
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.

netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-23 09:38:10 -08:00
Jakub Kicinski
b3c588cd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a308 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f09 ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed35587 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20 12:28:23 -08:00
Arnaldo Carvalho de Melo
8026a31df6 tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:

  b0305c1e0e ("KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi")

That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
  diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lore.kernel.org/lkml/Y7Loj5slB908QSXf@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-01-17 15:48:43 -03:00
Ziyang Xuan
d219df60a7 bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()
Add ipip6 and ip6ip decap support for bpf_skb_adjust_room().
Main use case is for using cls_bpf on ingress hook to decapsulate
IPv4 over IPv6 and IPv6 over IPv4 tunnel packets.

Add two new flags BPF_F_ADJ_ROOM_DECAP_L3_IPV{4,6} to indicate the
new IP header version after decapsulating the outer IP header.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/b268ec7f0ff9431f4f43b1b40ab856ebb28cb4e1.1673574419.git.william.xuanziyang@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-15 12:56:17 -08:00
Jakub Kicinski
4aea86b403 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05 15:34:11 -08:00
Jakub Kicinski
d75858ef10 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY7X/4wAKCRDbK58LschI
 g7gzAQCjKsLtAWg1OplW+B7pvEPwkQ8g3O1+PYWlToCUACTlzQD+PEMrqGnxB573
 oQAk6I2yOTwLgvlHkrm+TIdKSouI4gs=
 =2hUY
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-04

We've added 45 non-merge commits during the last 21 day(s) which contain
a total of 50 files changed, 1454 insertions(+), 375 deletions(-).

The main changes are:

1) Fixes, improvements and refactoring of parts of BPF verifier's
   state equivalence checks, from Andrii Nakryiko.

2) Fix a few corner cases in libbpf's BTF-to-C converter in particular
   around padding handling and enums, also from Andrii Nakryiko.

3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better
  support decap on GRE tunnel devices not operating in collect metadata,
  from Christian Ehrig.

4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks,
   from Dave Marchevsky.

5) Remove the need for trace_printk_lock for bpf_trace_printk
   and bpf_trace_vprintk helpers, from Jiri Olsa.

6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps,
   from Maryam Tahhan.

7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du.

8) Bigger batch of improvements to BPF tracing code samples,
   from Daniel T. Lee.

9) Add LoongArch support to libbpf's bpf_tracing helper header,
   from Hengqi Chen.

10) Fix a libbpf compiler warning in perf_event_open_probe on arm32,
    from Khem Raj.

11) Optimize bpf_local_storage_elem by removing 56 bytes of padding,
    from Martin KaFai Lau.

12) Use pkg-config to locate libelf for resolve_btfids build,
    from Shen Jiamin.

13) Various libbpf improvements around API documentation and errno
    handling, from Xin Liu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
  libbpf: Return -ENODATA for missing btf section
  libbpf: Add LoongArch support to bpf_tracing.h
  libbpf: Restore errno after pr_warn.
  libbpf: Added the description of some API functions
  libbpf: Fix invalid return address register in s390
  samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs
  samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro
  samples/bpf: Change _kern suffix to .bpf with syscall tracing program
  samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program
  samples/bpf: Use kyscall instead of kprobe in syscall tracing program
  bpf: rename list_head -> graph_root in field info types
  libbpf: fix errno is overwritten after being closed.
  bpf: fix regs_exact() logic in regsafe() to remap IDs correctly
  bpf: perform byte-by-byte comparison only when necessary in regsafe()
  bpf: reject non-exact register type matches in regsafe()
  bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule
  bpf: reorganize struct bpf_reg_state fields
  bpf: teach refsafe() to take into account ID remapping
  bpf: Remove unused field initialization in bpf's ctl_table
  selftests/bpf: Add jit probe_mem corner case tests to s390x denylist
  ...
====================

Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-04 20:21:25 -08:00
Arnaldo Carvalho de Melo
b235e5b51f tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:

  86bdf3ebcf ("KVM: Support dirty ring in conjunction with bitmap")

That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the the 'perf trace' ioctl syscall argument
beautifiers.

This is also by now used by tools/testing/selftests/kvm/, a simple test
build didn't succeed, but for another reason:

  lib/kvm_util.c: In function ‘vm_enable_dirty_ring’:
  lib/kvm_util.c:125:30: error: ‘KVM_CAP_DIRTY_LOG_RING_ACQ_REL’ undeclared (first use in this function); did you mean ‘KVM_CAP_DIRTY_LOG_RING’?
    125 |         if (vm_check_cap(vm, KVM_CAP_DIRTY_LOG_RING_ACQ_REL))
        |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        |                              KVM_CAP_DIRTY_LOG_RING

I'll send a separate patch for that.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
  diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lore.kernel.org/lkml/Y6H3b1Q4Msjy5Yz3@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-12-20 15:15:57 -03:00
Christian Ehrig
e26aa600ba bpf: Add flag BPF_F_NO_TUNNEL_KEY to bpf_skb_set_tunnel_key()
This patch allows to remove TUNNEL_KEY from the tunnel flags bitmap
when using bpf_skb_set_tunnel_key by providing a BPF_F_NO_TUNNEL_KEY
flag. On egress, the resulting tunnel header will not contain a tunnel
key if the protocol and implementation supports it.

At the moment bpf_tunnel_key wants a user to specify a numeric tunnel
key. This will wrap the inner packet into a tunnel header with the key
bit and value set accordingly. This is problematic when using a tunnel
protocol that supports optional tunnel keys and a receiving tunnel
device that is not expecting packets with the key bit set. The receiver
won't decapsulate and drop the packet.

RFC 2890 and RFC 2784 GRE tunnels are examples where this flag is
useful. It allows for generating packets, that can be decapsulated by
a GRE tunnel device not operating in collect metadata mode or not
expecting the key bit set.

Signed-off-by: Christian Ehrig <cehrig@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221218051734.31411-1-cehrig@cloudflare.com
2022-12-19 23:53:15 +01:00
Arnaldo Carvalho de Melo
43a3ce77ae tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
To pick the changes from:

  f8b435f93b ("fscrypt: remove unused Speck definitions")
  e0cefada13 ("fscrypt: Add SM4 XTS/CTS symmetric algorithm support")

That don't result in any changes in tooling, just causes this to be
rebuilt:

  CC      /tmp/build/perf-urgent/trace/beauty/sync_file_range.o
  LD      /tmp/build/perf-urgent/trace/beauty/perf-in.o

addressing this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fscrypt.h' differs from latest version at 'include/uapi/linux/fscrypt.h'
  diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Link: https://lore.kernel.org/lkml/Y6CHSS6Rn9YOqpAd@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2022-12-19 12:46:36 -03:00
Linus Torvalds
8fa590bf34 ARM64:
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 * Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
   which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d:
   "Fix a number of issues with MTE, such as races on the tags being
   initialised vs the PG_mte_tagged flag as well as the lack of support
   for VM_SHARED when KVM is involved.  Patches from Catalin Marinas and
   Peter Collingbourne").
 
 * Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 * Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 * Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 * Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 s390:
 
 * Second batch of the lazy destroy patches
 
 * First batch of KVM changes for kernel virtual != physical address support
 
 * Removal of a unused function
 
 x86:
 
 * Allow compiling out SMM support
 
 * Cleanup and documentation of SMM state save area format
 
 * Preserve interrupt shadow in SMM state save area
 
 * Respond to generic signals during slow page faults
 
 * Fixes and optimizations for the non-executable huge page errata fix.
 
 * Reprogram all performance counters on PMU filter change
 
 * Cleanups to Hyper-V emulation and tests
 
 * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
   running on top of a L1 Hyper-V hypervisor)
 
 * Advertise several new Intel features
 
 * x86 Xen-for-KVM:
 
 ** Allow the Xen runstate information to cross a page boundary
 
 ** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
 
 ** Add support for 32-bit guests in SCHEDOP_poll
 
 * Notable x86 fixes and cleanups:
 
 ** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
 
 ** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
    years back when eliminating unnecessary barriers when switching between
    vmcs01 and vmcs02.
 
 ** Clean up vmread_error_trampoline() to make it more obvious that params
    must be passed on the stack, even for x86-64.
 
 ** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
    of the current guest CPUID.
 
 ** Fudge around a race with TSC refinement that results in KVM incorrectly
    thinking a guest needs TSC scaling when running on a CPU with a
    constant TSC, but no hardware-enumerated TSC frequency.
 
 ** Advertise (on AMD) that the SMM_CTL MSR is not supported
 
 ** Remove unnecessary exports
 
 Generic:
 
 * Support for responding to signals during page faults; introduces
   new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
 
 Selftests:
 
 * Fix an inverted check in the access tracking perf test, and restore
   support for asserting that there aren't too many idle pages when
   running on bare metal.
 
 * Fix build errors that occur in certain setups (unsure exactly what is
   unique about the problematic setup) due to glibc overriding
   static_assert() to a variant that requires a custom message.
 
 * Introduce actual atomics for clear/set_bit() in selftests
 
 * Add support for pinning vCPUs in dirty_log_perf_test.
 
 * Rename the so called "perf_util" framework to "memstress".
 
 * Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the memstress tests.
 
 * Add a common ucall implementation; code dedup and pre-work for running
   SEV (and beyond) guests in selftests.
 
 * Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).
 
 * A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
   breakpoints, stage-2 faults and access tracking.
 
 * x86-specific selftest changes:
 
 ** Clean up x86's page table management.
 
 ** Clean up and enhance the "smaller maxphyaddr" test, and add a related
    test to cover generic emulation failure.
 
 ** Clean up the nEPT support checks.
 
 ** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
 
 ** Fix an ordering issue in the AMX test introduced by recent conversions
    to use kvm_cpu_has(), and harden the code to guard against similar bugs
    in the future.  Anything that tiggers caching of KVM's supported CPUID,
    kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
    the caching occurs before the test opts in via prctl().
 
 Documentation:
 
 * Remove deleted ioctls from documentation
 
 * Clean up the docs for the x86 MSR filter.
 
 * Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
 KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
 mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
 yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
 Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
 MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
 =QAYX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87a97d: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     no-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of a unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
     guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight psuedo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that tiggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
2022-12-15 11:12:21 -08:00
Kumar Kartikeya Dwivedi
2706053173 bpf: Rework process_dynptr_func
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type
for use in callback state, because in case of user ringbuf helpers,
there is no dynptr on the stack that is passed into the callback. To
reflect such a state, a special register type was created.

However, some checks have been bypassed incorrectly during the addition
of this feature. First, for arg_type with MEM_UNINIT flag which
initialize a dynptr, they must be rejected for such register type.
Secondly, in the future, there are plans to add dynptr helpers that
operate on the dynptr itself and may change its offset and other
properties.

In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed
to such helpers, however the current code simply returns 0.

The rejection for helpers that release the dynptr is already handled.

For fixing this, we take a step back and rework existing code in a way
that will allow fitting in all classes of helpers and have a coherent
model for dealing with the variety of use cases in which dynptr is used.

First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together
with a DYNPTR_TYPE_* constant that denotes the only type it accepts.

Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this
fact. To make the distinction clear, use MEM_RDONLY flag to indicate
that the helper only operates on the memory pointed to by the dynptr,
not the dynptr itself. In C parlance, it would be equivalent to taking
the dynptr as a point to const argument.

When either of these flags are not present, the helper is allowed to
mutate both the dynptr itself and also the memory it points to.
Currently, the read only status of the memory is not tracked in the
dynptr, but it would be trivial to add this support inside dynptr state
of the register.

With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to
better reflect its usage, it can no longer be passed to helpers that
initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr.

A note to reviewers is that in code that does mark_stack_slots_dynptr,
and unmark_stack_slots_dynptr, we implicitly rely on the fact that
PTR_TO_STACK reg is the only case that can reach that code path, as one
cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In
both cases such helpers won't be setting that flag.

The next patch will add a couple of selftest cases to make sure this
doesn't break.

Fixes: 2057156738 ("bpf: Add bpf_user_ringbuf_drain() helper")
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-08 18:25:31 -08:00
Eyal Birger
4f4ac4d910 tools: add IFLA_XFRM_COLLECT_METADATA to uapi/linux/if_link.h
Needed for XFRM metadata tests.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20221203084659.1837829-4-eyal.birger@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-05 21:58:28 -08:00
Javier Martinez Canillas
30ee198ce4 KVM: Reference to kvm_userspace_memory_region in doc and comments
There are still references to the removed kvm_memory_region data structure
but the doc and comments should mention struct kvm_userspace_memory_region
instead, since that is what's used by the ioctl that replaced the old one
and this data structure support the same set of flags.

Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Message-Id: <20221202105011.185147-4-javierm@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-02 12:54:40 -05:00