Minor improvements to the documentation for BPF helpers:
* Fix formatting for the description of "bpf_socket" for
bpf_getsockopt() and bpf_setsockopt(), thus suppressing two warnings
from rst2man about "Unexpected indentation".
* Fix formatting for return values for bpf_sk_assign() and seq_file
helpers.
* Fix and harmonise formatting, in particular for function/struct names.
* Remove blank lines before "Return:" sections.
* Replace tabs found in the middle of text lines.
* Fix typos.
* Add a note to the footer (in the Python script) about "bpftool feature
probe", including for listing features available to unprivileged
users, and add a reference to the bpftool man page.
Thanks to Florian for reporting two typos (duplicated words).
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200511161536.29853-4-quentin@isovalent.com
Bring minor improvements to bpftool documentation. Fix or harmonise
formatting, update map types (including in interactive help), improve
the description for "map create", fix a build warning due to a missing
line after the double-colon for the "bpftool prog profile" example, and
complete/harmonise/sort the list of related bpftool man pages in
footers.
v2:
- Remove (instead of changing) mark-up on "value" in bpftool-map.rst,
when it does not refer to something passed on the command line.
- Fix an additional typo ("hexadeximal") in the same file.
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200511161536.29853-3-quentin@isovalent.com
Replace the use of kernel-only integer typedefs (u8, u32, etc.) by their
user space counterpart (__u8, __u32, etc.).
Similarly to what libbpf does, poison the typedefs to avoid introducing
them again in the future.
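For reference, the poisoning can be as simple as a GCC pragma near the
top of the relevant headers; the exact identifier list below is an
assumption:

/* Make sure we do not use kernel-only integer typedefs */
#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64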
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200511161536.29853-2-quentin@isovalent.com
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
        int stuff;
        struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is erroneously applied to zero-length
arrays, and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help us get rid of those sorts of issues completely.
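As an illustration, the allocation pattern after conversion might look
like the following sketch, using the struct foo example above
(alloc_foo is hypothetical):

#include <stdlib.h>

/* sizeof(struct foo) excludes the flexible array, so the element
 * storage is added explicitly; sizeof(p->array) would now be a
 * compile-time error rather than silently evaluating to zero. */
struct foo *alloc_foo(size_t count)
{
        return malloc(sizeof(struct foo) + count * sizeof(struct boo));
}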
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200507185057.GA13981@embeddedor
runqslower doesn't specify include path for uapi/bpf.h. This causes the
following warning:
In file included from runqslower.c:10:
.../tools/testing/selftests/bpf/tools/include/bpf/bpf.h:234:38:
warning: 'enum bpf_stats_type' declared inside parameter list will not
be visible outside of this definition or declaration
234 | LIBBPF_API int bpf_enable_stats(enum bpf_stats_type type);
Fix this by adding -I tools/include/uapi to the Makefile.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Yonghong Song says:
====================
Motivation:
The current ways to dump kernel data structures are mostly:
1. /proc system
2. various specific tools like "ss" which require kernel support.
3. drgn
The drawback of the first two is that whenever you want to dump more, you
need to change the kernel. For example, Martin wants to dump socket local
storage with "ss". A kernel change is needed for it to work ([1]).
This is also the direct motivation for this work.
drgn ([2]) solves this problem nicely and no kernel change is needed.
But since drgn is not able to verify the validity of a particular pointer value,
it might present wrong results in rare cases.
In this patch set, we introduce the bpf iterator. Initial kernel changes are
still needed for the kernel data of interest, but a later data structure change
will not require kernel changes any more. The bpf program itself can adapt
to new data structure changes. This will give certain flexibility with
guaranteed correctness.
In this patch set, kernel seq_ops is used to facilitate iterating through
kernel data, similar to current /proc and many other lossless kernel
dumping facilities. In the future, different iterators can be
implemented to trade off losslessness for other criteria, e.g., no
repeated object visits.
User Interface:
1. Similar to prog/map/link, the iterator can be pinned into a
path within a bpffs mount point.
2. The bpftool command can pin an iterator to a file
bpftool iter pin <bpf_prog.o> <path>
3. Use `cat <path>` to dump the contents.
Use `rm -f <path>` to remove the pinned iterator.
4. An anonymous iterator can be created as well.
Please see patches #19 and #20 for bpf programs and bpf iterator
output examples.
Note that certain iterators are namespace aware. For example,
the task and task_file targets only iterate through the current pid
namespace, and ipv6_route and netlink iterate through the current net namespace.
Please see individual patches for implementation details.
Performance:
The bpf iterator provides in-kernel aggregation abilities
for kernel data. This can greatly improve performance
compared to, e.g., iterating over all process directories under /proc.
For example, I did an experiment on my VM with an application forking
different numbers of tasks, with each forked process opening various
numbers of files. The following are the results, with latency in microseconds:
# of forked tasks    # of open files    # of bpf_prog calls    latency (us)
              100                100                  11503            7586
             1000               1000                1013203          709513
            10000                100                1130203          764519
The number of bpf_prog calls may be more than forked tasks multiplied by
open files since there are other tasks running on the system.
The bpf program is a do-nothing program. One million bpf calls take
less than one second.
Although the initial motivation came from Martin's sk_local_storage,
this patch set didn't implement tcp6 sockets and sk_local_storage.
/proc/net/tcp6 involves three types of sockets: timewait,
request and tcp6 sockets. Some kind of type casting or other
mechanism is needed to handle all these socket types in one
bpf program. This will be addressed in future work.
Currently, we do not support kernel data generated from modules. This
requires some BTF work.
More work is needed for more iterators, e.g., tcp, udp, bpf_map elements, etc.
Changelog:
v3 -> v4:
- in bpf_seq_read(), if start() failed with an error, return that
error to user space (Andrii)
- in bpf_seq_printf(), if reading kernel memory failed for
%s and %p{i,I}{4,6}, set buffer to empty string or address 0.
Documented this behavior in uapi header (Andrii)
- fix a few error handling issues for bpftool (Andrii)
- A few other minor fixes and cosmetic changes.
v2 -> v3:
- add bpf_iter_unreg_target() to unregister a target, used in the
error path of the __init functions.
- handle err != 0 before handling overflow (Andrii)
- reference count "task" for task_file target (Andrii)
- remove some redundancy for bpf_map/task/task_file targets
- add bpf_iter_unreg_target() in ip6_route_cleanup()
- handle "%%" format in bpf_seq_printf() (Andrii)
- implement auto-attach for bpf_iter in libbpf (Andrii)
- add macros offsetof and container_of in bpf_helpers.h (Andrii)
- add tests for auto-attach and program-return-1 cases
- some other minor fixes
v1 -> v2:
- removed target_feature, using callback functions instead
- checking target to ensure program specified btf_id supported (Martin)
- link_create change with new changes from Andrii
- better handling of btf_iter vs. seq_file private data (Martin, Andrii)
- implemented bpf_seq_read() (Andrii, Alexei)
- percpu buffer for bpf_seq_printf() (Andrii)
- better syntax for BPF_SEQ_PRINTF macro (Andrii)
- bpftool fixes (Quentin)
- a lot of other fixes
RFC v2 -> v1:
- rename bpfdump to bpf_iter
- use bpffs instead of a new file system
- use bpf_link to streamline and simplify iterator creation.
References:
[1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
[2]: https://github.com/osandov/drgn
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The added test includes the following subtests:
- test verifier change for btf_id_or_null
- test load/create_iter/read for
ipv6_route/netlink/bpf_map/task/task_file
- test anon bpf iterator
- test anon bpf iterator reading one char at a time
- test file bpf iterator
- test overflow (single bpf program output not overflow)
- test overflow (single bpf program output overflows)
- test bpf prog returning 1
The ipv6_route test exercises the following verifier change:
- access fields in the variable length array of the structure.
The netlink program load exercises the following verifier change:
- put a btf_id ptr value on the stack, accessible to
  tracing/iter programs.
The anon bpf iterator also tests link auto attach through skeleton.
$ test_progs -n 2
#2/1 btf_id_or_null:OK
#2/2 ipv6_route:OK
#2/3 netlink:OK
#2/4 bpf_map:OK
#2/5 task:OK
#2/6 task_file:OK
#2/7 anon:OK
#2/8 anon-read-one-char:OK
#2/9 file:OK
#2/10 overflow:OK
#2/11 overflow-e2big:OK
#2/12 prog-ret-1:OK
#2 bpf_iter:OK
Summary: 1/12 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175923.2477637-1-yhs@fb.com
The implementation is arbitrary, just to show how the bpf programs
can be written for bpf_map/task/task_file. They can be customized
for specific needs.
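For illustration, a bpf_map dumper might look like the following
sketch, assuming the bpf_iter__bpf_map context and BPF_SEQ_PRINTF
macro from this series and the usual SEC() macro from bpf_helpers.h;
map->id is an assumption about struct bpf_map:

SEC("iter/bpf_map")
int dump_bpf_map(struct bpf_iter__bpf_map *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct bpf_map *map = ctx->map;

        /* a NULL map means all objects have been traversed */
        if (!map)
                return 0;

        /* print the header once, on the first object */
        if (ctx->meta->seq_num == 0)
                BPF_SEQ_PRINTF(seq, "id refcnt usercnt locked_vm\n");

        BPF_SEQ_PRINTF(seq, "%u\n", map->id);
        return 0;
}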
For example, for bpf_map, the iterator prints out:
$ cat /sys/fs/bpf/my_bpf_map
      id   refcnt  usercnt  locked_vm
       3        2        0         20
       6        2        0         20
       9        2        0         20
      12        2        0         20
      13        2        0         20
      16        2        0         20
      19        2        0         20
%%% END %%%
For task, the iterator prints out:
$ cat /sys/fs/bpf/my_task
    tgid      gid
       1        1
       2        2
    ....
    1944     1944
    1948     1948
    1949     1949
    1953     1953
=== END ===
For task/file, the iterator prints out:
$ cat /sys/fs/bpf/my_task_file
    tgid      gid       fd      file
       1        1        0 ffffffff95c97600
       1        1        1 ffffffff95c97600
       1        1        2 ffffffff95c97600
    ....
    1895     1895      255 ffffffff95c8fe00
    1932     1932        0 ffffffff95c8fe00
    1932     1932        1 ffffffff95c8fe00
    1932     1932        2 ffffffff95c8fe00
    1932     1932        3 ffffffff95c185c0
This is able to print out all open files (fd and file->f_op), so the user
can compare f_op against a particular kernel file_operations instance to
find out what it is.
For example, from /proc/kallsyms, we can find
ffffffff95c185c0 r eventfd_fops
so we will know tgid 1932 fd 3 is an eventfd file descriptor.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175922.2477576-1-yhs@fb.com
Currently, only one command is supported:
  bpftool iter pin <bpf_prog.o> <path>
It will pin the trace/iter bpf program in
the object file <bpf_prog.o> to <path>,
where <path> should be on a bpffs mount.
For example,
$ bpftool iter pin ./bpf_iter_ipv6_route.o \
/sys/fs/bpf/my_route
The user can then do a `cat` to print out the results:
$ cat /sys/fs/bpf/my_route
fe800000000000000000000000000000 40 00000000000000000000000000000000 ...
00000000000000000000000000000000 00 00000000000000000000000000000000 ...
00000000000000000000000000000001 80 00000000000000000000000000000000 ...
fe800000000000008c0162fffebdfd57 80 00000000000000000000000000000000 ...
ff000000000000000000000000000000 08 00000000000000000000000000000000 ...
00000000000000000000000000000000 00 00000000000000000000000000000000 ...
The implementation of the ipv6_route iterator is in one of the
subsequent patches.
This patch also added BPF_LINK_TYPE_ITER to link query.
In the future, we may add additional parameters to the pin command
by parameterizing the bpf iterator. For example, a map_id or pid
may be added to let the bpf program traverse only a single map or task,
similar to kernel seq_file single_open().
We may also add an introspection command for targets/iterators by
leveraging bpf_iter itself.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200509175920.2477247-1-yhs@fb.com
These two helpers will be used later in bpf_iter bpf program
bpf_iter_netlink.c. Put them in bpf_helpers.h since they could
be useful in other cases.
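For reference, the two macros are commonly defined along these lines;
the exact definitions in bpf_helpers.h may differ slightly:

#ifndef offsetof
#define offsetof(TYPE, MEMBER) ((unsigned long)&((TYPE *)0)->MEMBER)
#endif

#ifndef container_of
#define container_of(ptr, type, member)                         \
        ({                                                      \
                void *__mptr = (void *)(ptr);                   \
                ((type *)(__mptr - offsetof(type, member)));    \
        })
#endif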
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175919.2477104-1-yhs@fb.com
Two new libbpf APIs are added to support bpf_iter:
- bpf_program__attach_iter
  Given a bpf program and additional parameters (none for now),
  returns a bpf_link.
- bpf_iter_create
  syscall level API to create a bpf iterator.
The macro BPF_SEQ_PRINTF is also introduced. The format
looks like:
BPF_SEQ_PRINTF(seq, "task id %d\n", pid);
This macro can help bpf program writers with a
nicer bpf_seq_printf syntax, similar to the kernel one.
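A hedged usage sketch of the two APIs together (the skeleton and
program names are hypothetical; error handling is elided):

struct bpf_link *link;
char buf[4096];
int iter_fd;
ssize_t n;

link = bpf_program__attach_iter(skel->progs.dump_task, NULL);
iter_fd = bpf_iter_create(bpf_link__fd(link));

/* read() drives the iterator; EOF means all objects were visited */
while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, n);

close(iter_fd);
bpf_link__destroy(link);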
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175917.2476936-1-yhs@fb.com
Two helpers, bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.
bpf_seq_printf supports common format string flag/width/type
fields so that at least I can get identical results for the
netlink and ipv6_route targets.
For bpf_seq_printf and bpf_seq_write, the return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if the object collection stays the same. Note that if the object
collection is changed, then depending on how collection traversal
is done, an object may not be visited even if it is still in the collection.
For bpf_seq_printf, the formats %s and %p{i,I}{4,6} need to
read kernel memory. Reading kernel memory may fail in
the following two cases:
- invalid kernel address, or
- valid kernel address but requiring a major fault
If reading kernel memory fails, the %s string will be
an empty string and %p{i,I}{4,6} will be all 0s.
Not returning error to bpf program is consistent with
what bpf_trace_printk() does for now.
bpf_seq_printf may return -EBUSY, meaning that the internal percpu
buffer for memory copies of strings or other pointees is
not available. The bpf program can return 1 to indicate that it
wants the same object to be repeated. Right now, this should not
happen on non-RT kernels since migrate_disable(), which guards
the bpf prog call, calls preempt_disable().
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
Add bpf_reg_type PTR_TO_BTF_ID_OR_NULL support.
For a tracing/iter program, the bpf program context
definition, e.g., for the previous bpf_map target, looks like:
struct bpf_iter__bpf_map {
        struct bpf_iter_meta *meta;
        struct bpf_map *map;
};
The kernel guarantees that meta is not NULL, but the map
pointer may be NULL. A NULL map indicates that all
objects have been traversed, so the bpf program can take
proper action, e.g., do final aggregation and/or send a
final report to user space.
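A minimal sketch of that final-action pattern inside a program body,
using the BPF_SEQ_PRINTF macro introduced later in this series:

if (!ctx->map) {
        /* all maps traversed: emit a trailer as the final report */
        BPF_SEQ_PRINTF(ctx->meta->seq, "=== END ===\n");
        return 0;
}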
Add btf_id_or_null_non0_off to the prog->aux structure to
indicate that if the context access offset is not 0, the register
type is set to PTR_TO_BTF_ID_OR_NULL instead of PTR_TO_BTF_ID.
This bit is set for tracing/iter programs.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175912.2476576-1-yhs@fb.com
Only the tasks belonging to the "current" pid namespace
are enumerated.
For the task/file target, the bpf program will have access to:
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175911.2476407-1-yhs@fb.com
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
as /proc/net/{netlink,ipv6_route}.
The net namespace for these targets is the current net
namespace at file open time, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open time.
Since modules are not supported for now, ipv6_route is
supported only if IPv6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once modules
are properly supported for bpf_iter.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
The macro DEFINE_BPF_ITER_FUNC is implemented so a target
can define an init function to capture the BTF type
which represents the target.
The bpf_iter_meta is a structure holding metadata common
to all targets in the bpf program.
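As a sketch, the bpf_map target might use the macro like this (the
exact signature is an assumption based on this series):

DEFINE_BPF_ITER_FUNC(bpf_map, struct bpf_iter_meta *meta, struct bpf_map *map)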
Additional marker functions are called before or after the
bpf_seq_read() show()/next()/stop() callback functions
to help calculate a precise seq_num and to decide whether to
call the bpf_prog inside stop().
Two functions, bpf_iter_get_info() and bpf_iter_run_prog(),
are implemented so a target can get needed information from
the bpf_iter infrastructure and can run the program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175907.2475956-1-yhs@fb.com
To produce a file based bpf iterator, the fd must
correspond to a link_fd associated with a
trace/iter program. When the pinned file is
opened, a seq_file will be generated.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com
A new bpf command BPF_ITER_CREATE is added.
The anonymous bpf iterator is seq_file based.
The seq_file private data are referenced by targets.
The bpf_iter infrastructure allocates additional space
at seq_file->private, before the space used by targets,
to store some metadata, e.g.:
  prog:       the prog to run
  session_id: a unique id for each opened seq_file
  seq_num:    how many times bpf programs have been queried in this session
  done_stop:  an internal state to decide whether the bpf program
              should be called in seq_ops->stop() or not
The seq_num will start from 0 for valid objects.
The bpf program may see the same seq_num more than once if:
- a seq_file buffer overflow happens and the same object
  is retried by bpf_seq_read(), or
- the bpf program explicitly requests a retry of the
  same object
Since modules are not supported for bpf_iter, all target
registration happens at __init time, so there is no
need to change bpf_iter_unreg_target(); it is used
mostly in the error path of the init functions, at which time
no bpf iterators have been created yet.
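A hedged sketch of invoking the new command from user space (the
iter_create attribute layout is assumed from this patch):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* returns an anonymous iterator fd, or -1 on error */
static int iter_create(int link_fd)
{
        union bpf_attr attr = {};

        attr.iter_create.link_fd = link_fd;
        return syscall(__NR_bpf, BPF_ITER_CREATE, &attr, sizeof(attr));
}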
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
The bpf iterator uses a seq_file to provide a lossless
way to transfer data to user space. But we want to call the
bpf program after all objects have been traversed, and the
bpf program may write additional data to the
seq_file buffer. The current seq_read() does not work
for this use case.
Besides allowing the stop() function to write to the buffer,
bpf_seq_read() also fixes the buffer size to one page.
If any single call of show() or stop() emits more than
one page of data and causes an overflow, the -E2BIG error code
will be returned to user space.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175904.2475468-1-yhs@fb.com
Added BPF_LINK_UPDATE support for tracing/iter programs.
This way, a file based bpf iterator, which holds a reference
to the link, can have its bpf program updated without
creating new files.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175902.2475262-1-yhs@fb.com
Given a bpf program, the steps to create an anonymous bpf iterator are:
- create a bpf_iter_link, which combines the bpf program and the target.
  In the future, more information could be recorded in the link.
  A link_fd will be returned to user space.
- create an anonymous bpf iterator with the given link_fd.
The bpf_iter_link can be pinned to a bpffs mount file system to
create a file based bpf iterator as well.
The benefits of using bpf_iter_link:
- using bpf link simplifies design and implementation, as bpf link
  is already used for other tracing bpf programs.
- for a file based bpf iterator, bpf_iter_link provides a standard
  way to replace the underlying bpf program.
- for both anonymous and file based iterators, the bpf link query
  capability can be leveraged.
The patch added support of tracing/iter programs for BPF_LINK_CREATE.
A new link type BPF_LINK_TYPE_ITER is added to facilitate link
querying. Currently, only prog_id is needed, so no
additional in-kernel show_fdinfo() and fill_link_info() hooks
are needed for the BPF_LINK_TYPE_ITER link.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
A bpf_iter program is a tracing program with attach type
BPF_TRACE_ITER. The load attribute
  attach_btf_id
is used by the verifier against a particular kernel function,
which represents a target, e.g., __bpf_iter__bpf_map
for the bpf_map target, which is implemented later.
The program return value must be 0 or 1 for now:
  0 : successful, except for potential seq_file buffer overflow,
      which is handled by the seq_file reader.
  1 : request to restart the same object
In the future, other return values may be used for filtering or
terminating the iterator.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
The target can call bpf_iter_reg_target() to register itself.
The needed information:
  target:           target name
  seq_ops:          the seq_file operations for the target
  init_seq_private: target callback to initialize seq_priv
                    during file open
  fini_seq_private: target callback to clean up seq_priv
                    during file release
  seq_priv_size:    the private_data size needed by the
                    seq_file operations
The target name represents a target which provides a seq_ops
for iterating objects.
The target can provide two callback functions, init_seq_private
and fini_seq_private, called during file open/release time.
For example, for /proc/net/{tcp6, ipv6_route, netlink, ...}, the net
namespace needs to be set up properly during file open and
released properly during file release.
Function bpf_iter_unreg_target() is also implemented to unregister
a particular target.
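Taken together, registering a hypothetical "foo" target might look
like the following sketch (the struct bpf_iter_reg layout is assumed
from the fields listed above):

static struct bpf_iter_reg foo_reg_info = {
        .target           = "foo",
        .seq_ops          = &foo_seq_ops,
        .init_seq_private = foo_init_seq_private,
        .fini_seq_private = foo_fini_seq_private,
        .seq_priv_size    = sizeof(struct foo_iter_priv),
};

static int __init foo_iter_init(void)
{
        return bpf_iter_reg_target(&foo_reg_info);
}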
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
We want to have a tighter control on what ports we bind to in
the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means
connect() becomes slightly more expensive. The expensive part
comes from the fact that we now need to call inet_csk_get_port()
that verifies that the port is not used and allocates an entry
in the hash table for it.
Since we can't rely on "snum || !bind_address_no_port" to prevent
us from calling the POST_BIND hook anymore, let's add another bind flag
to indicate that the call site is a BPF program.
v5:
* fix wrong AF_INET (should be AF_INET6) in the bpf program for v6
v3:
* More bpf_bind documentation refinements (Martin KaFai Lau)
* Add UDP tests as well (Martin KaFai Lau)
* Don't start the thread, just do socket+bind+listen (Martin KaFai Lau)
v2:
* Update documentation (Andrey Ignatov)
* Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com
The intent is to add an additional bind parameter in the next commit.
Instead of adding another argument, let's convert all existing
flag arguments into an extendable bit field.
No functional changes.
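The shape of the conversion, sketched (the BIND_* names follow this
series; the function signature shown is illustrative):

#define BIND_FORCE_ADDRESS_NO_PORT  (1 << 0)
#define BIND_WITH_LOCK              (1 << 1)

/* before: int __inet_bind(..., bool force_bind_address_no_port, bool with_lock);
 * after:  int __inet_bind(..., u32 flags);
 * callers pass e.g. BIND_WITH_LOCK instead of a bare 'true', and new
 * flags can be added without touching every call site. */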
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-4-sdf@google.com
1. Move pkt_v4 and pkt_v6 into network_helpers and adjust the users.
2. Copy-paste spin_lock_thread into two tests that use it.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-3-sdf@google.com
Move the following routines, which let us start a background listener
thread and connect to a server by fd, to test_progs:
* start_server - socket+bind+listen
* connect_to_fd - connect to the server identified by fd
These will be used in the next commit.
Also, extend these helpers to support AF_INET6 and accept the family
as an argument.
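A hedged usage sketch of the helpers (the exact signatures are
assumptions; error handling elided):

int srv_fd, cli_fd;

srv_fd = start_server(AF_INET6, SOCK_STREAM);  /* socket+bind+listen */
cli_fd = connect_to_fd(AF_INET6, SOCK_STREAM, srv_fd);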
v5:
* drop pthread.h (Martin KaFai Lau)
* add SO_SNDTIMEO (Martin KaFai Lau)
v4:
* export extra helper to start server without a thread (Martin KaFai Lau)
* tcp_rtt is no longer starting background thread (Martin KaFai Lau)
v2:
* put helpers into network_helpers.c (Andrii Nakryiko)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-2-sdf@google.com
The '==' expression itself is bool, no need to convert it to bool again.
This fixes the following coccicheck warning:
arch/x86/net/bpf_jit_comp32.c:1478:50-55: WARNING: conversion to bool not needed here
arch/x86/net/bpf_jit_comp32.c:1479:50-55: WARNING: conversion to bool not needed here
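The shape of the fix, sketched with hypothetical operands:

-	bool hi = (cond == val) ? true : false; /* redundant conversion */
+	bool hi = cond == val;                  /* '==' already yields bool */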
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200506140352.37154-1-yanaijie@huawei.com
Luke Nelson says:
====================
This patch series introduces a set of optimizations to the BPF JIT
on RV64. The optimizations are related to the verifier zero-extension
optimization and BPF_JMP BPF_K.
We tested the optimizations on a QEMU riscv64 virt machine, using
lib/test_bpf and test_verifier, and formally verified their correctness
using Serval.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch optimizes BPF_JSET BPF_K by using a RISC-V andi instruction
when the BPF immediate fits in 12 bits, instead of first loading the
immediate to a temporary register.
Examples of generated code with and without this optimization:
BPF_JMP_IMM(BPF_JSET, R1, 2, 1) without optimization:
20: li t1,2
24: and t1,a0,t1
28: bnez t1,0x30
BPF_JMP_IMM(BPF_JSET, R1, 2, 1) with optimization:
20: andi t1,a0,2
24: bnez t1,0x2c
BPF_JMP32_IMM(BPF_JSET, R1, 2, 1) without optimization:
20: li t1,2
24: mv t2,a0
28: slli t2,t2,0x20
2c: srli t2,t2,0x20
30: slli t1,t1,0x20
34: srli t1,t1,0x20
38: and t1,t2,t1
3c: bnez t1,0x44
BPF_JMP32_IMM(BPF_JSET, R1, 2, 1) with optimization:
20: andi t1,a0,2
24: bnez t1,0x2c
In these examples, because the upper 32 bits of the sign-extended
immediate are 0, BPF_JMP BPF_JSET and BPF_JMP32 BPF_JSET are equivalent
and therefore the JIT produces identical code for them.
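A sketch of the emission logic (helper names follow the RV64 JIT's
existing conventions, but treat them as assumptions):

if (is_12b_int(imm)) {
        emit(rv_andi(RV_REG_T1, rd, imm), ctx);
} else {
        emit_imm(RV_REG_T1, imm, ctx);
        emit(rv_and(RV_REG_T1, rd, RV_REG_T1), ctx);
}
/* ... then branch on RV_REG_T1 != zero */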
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Björn Töpel <bjorn.topel@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20200506000320.28965-5-luke.r.nels@gmail.com
This patch adds an optimization to BPF_JMP (32- and 64-bit) BPF_K for
when the BPF immediate is zero.
When the immediate is zero, the code can directly use the RISC-V zero
register instead of loading a zero immediate to a temporary register
first.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Björn Töpel <bjorn.topel@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20200506000320.28965-4-luke.r.nels@gmail.com
This patch adds two optimizations for BPF_ALU BPF_END BPF_FROM_LE in
the RV64 BPF JIT.
First, it enables the verifier zero-extension optimization to avoid zero
extension when imm == 32. Second, it avoids generating code for imm ==
64, since it is equivalent to a no-op.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Björn Töpel <bjorn.topel@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20200506000320.28965-3-luke.r.nels@gmail.com
Commit 66d0d5a854 ("riscv: bpf: eliminate zero extension code-gen")
added support for the verifier zero-extension optimization on RV64 and
commit 46dd3d7d28 ("bpf, riscv: Enable zext optimization for more
RV64G ALU ops") enabled it for more instruction cases.
However, BPF_LSH BPF_X and BPF_{LSH,RSH,ARSH} BPF_K are still missing
the optimization.
This patch enables the zero-extension optimization for these remaining
cases.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Björn Töpel <bjorn.topel@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20200506000320.28965-2-luke.r.nels@gmail.com
The newly added bpf_stats_handler function has the wrong #ifdef
check around it, leading to an unused-function warning when
CONFIG_SYSCTL is disabled:
kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function]
static int bpf_stats_handler(struct ctl_table *table, int write,
Fix the check to match the reference.
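The shape of the fix, sketched (the exact condition is assumed to
mirror the guard around the table entry that references the handler):

-#ifdef CONFIG_BPF_SYSCALL
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL)
 static int bpf_stats_handler(struct ctl_table *table, int write, ...)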
Fixes: d46edd671a ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de
Remove the unnecessary address member in struct xdp_umem as it is
only used during umem registration. No need to carry this around
as it is used neither during run-time nor when unregistering the umem.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-3-git-send-email-magnus.karlsson@intel.com
Change two variable names so that it is clearer what they
represent. The first one is xsk_list, which in fact only contains the
list of AF_XDP sockets with a Tx component. Change this to xsk_tx_list
for improved clarity. The second variable is size in the ring
structure. One might think that this is the size of the ring, but it
is in fact the size of the umem, copied into the ring structure to
improve performance. Rename this variable to umem_size to avoid any
confusion.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-2-git-send-email-magnus.karlsson@intel.com
gcc-10 warns about accesses to zero-length arrays:
kernel/bpf/core.c: In function 'bpf_patch_insn_single':
cc1: warning: writing 8 bytes into a region of size 0 [-Wstringop-overflow=]
In file included from kernel/bpf/core.c:21:
include/linux/filter.h:550:20: note: at offset 0 to object 'insnsi' with size 0 declared here
550 | struct bpf_insn insnsi[0];
| ^~~~~~
In this case, we really want to have two flexible-array members,
but that is not possible. Removing the union to make insnsi a
flexible-array member while leaving insns as a zero-length array
fixes the warning, as nothing writes to the other one in that way.
This trick only works on linux-3.18 or higher, as older versions
had additional members in the union.
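Sketched, the change to struct bpf_prog described above looks like
this (context trimmed):

-	union {
-		struct sock_filter	insns[0];
-		struct bpf_insn		insnsi[0];
-	};
+	/* both are zero-sized tails, so they still overlay at the end */
+	struct sock_filter	insns[0];
+	struct bpf_insn		insnsi[];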
Fixes: 60a3b2253c ("net: bpf: make eBPF interpreter images read-only")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200430213101.135134-6-arnd@arndb.de
This patch adds an optimization that uses the asr immediate instruction
for BPF_ALU BPF_ARSH BPF_K, rather than loading the immediate to
a temporary register. This is similar to existing code for handling
BPF_ALU BPF_{LSH,RSH} BPF_K. This optimization saves two instructions
and is more consistent with LSH and RSH.
Example of the code generated for BPF_ALU32_IMM(BPF_ARSH, BPF_REG_0, 5)
before the optimization:
2c: mov r8, #5
30: mov r9, #0
34: asr r0, r0, r8
and after optimization:
2c: asr r0, r0, #5
Tested on QEMU using lib/test_bpf and test_verifier.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200501020210.32294-3-luke.r.nels@gmail.com
This patch optimizes the code generated by emit_a32_arsh_r64, which
handles the BPF_ALU64 BPF_ARSH BPF_X instruction.
The original code uses a conditional B followed by an unconditional ORR.
The optimization saves one instruction by removing the B instruction
and using a conditional ORR (with an inverted condition).
Example of the code generated for BPF_ALU64_REG(BPF_ARSH, BPF_REG_0,
BPF_REG_1), before optimization:
34: rsb ip, r2, #32
38: subs r9, r2, #32
3c: lsr lr, r0, r2
40: orr lr, lr, r1, lsl ip
44: bmi 0x4c
48: orr lr, lr, r1, asr r9
4c: asr ip, r1, r2
50: mov r0, lr
54: mov r1, ip
and after optimization:
34: rsb ip, r2, #32
38: subs r9, r2, #32
3c: lsr lr, r0, r2
40: orr lr, lr, r1, lsl ip
44: orrpl lr, lr, r1, asr r9
48: asr ip, r1, r2
4c: mov r0, lr
50: mov r1, ip
Tested on QEMU using lib/test_bpf and test_verifier.
Co-developed-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200501020210.32294-2-luke.r.nels@gmail.com
Karsten Graul says:
====================
net/smc: add and delete link processing
These patches add the 'add link' and 'delete link' processing as
SMC server and client. This processing allows to establish and
remove links of a link group dynamically.
v2: Fix mess up with unused static functions. Merge patch 8 into patch 4.
Postpone patch 13 to next series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As SMC server, when a second link is deleted, trigger the setup of an
asymmetric link. Do this by enqueueing a local ADD_LINK message which
is processed by the LLC layer as if it were received from the peer. Do the
same when a new IB port becomes active and a new link could be created.
smc_llc_srv_add_link_local() enqueues a local ADD_LINK message.
And smc_llc_srv_delete_link_local() is used the same way to enqueue a
local DELETE_LINK message. This is used when an IB port is no longer
active.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add smc_llc_process_srv_delete_link() to process a DELETE_LINK request
as SMC server. When the request is to delete ALL links then terminate
the whole link group. If not, find the link to delete by its link_id,
send the DELETE_LINK request LLC message and wait for the response.
Whether or not a response was received, clear the deleted link and update
the link group state.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add smc_llc_process_cli_delete_link() to process a DELETE_LINK request
as SMC client. When the request is to delete ALL links then terminate
the whole link group. If not, find the link to delete by its link_id,
send the DELETE_LINK response LLC message and then clear the deleted
link. Finally determine and update the link group state.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a work that is scheduled when a new DELETE_LINK LLC request is
received. The work will call either the SMC client or SMC server
DELETE_LINK processing.
And use the LLC flow framework to process incoming DELETE_LINK LLC
messages, scheduling the llc_del_link_work for those events.
With these changes smc_lgr_forget() is only called by one function and
can be migrated into smc_lgr_cleanup_early().
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link group moves from an asymmetric to a symmetric state, the
dangling asymmetric link can be deleted. Add smc_llc_find_asym_link() to
find the respective link and add smc_llc_delete_asym_link() to delete
it.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>