drivers/net/ethernet/freescale/fec.h
7d650df99d ("net: fec: add pm_qos support on imx6q platform")
40c79ce13b ("net: fec: add stop mode support for imx8 platform")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The unsolicited NA message with all-nodes multicast dest address should
be valid, as this also means the link could reach the target.
Also rename bond_validate_ns() to bond_validate_na().
Reported-by: LiLiang <liali@redhat.com>
Fixes: 5e1eeef69c ("bonding: NS target should accept link local address")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ns_ip6_target was set, the ipv6_dev_get_saddr() will be called to get
available source address and send IPv6 neighbor solicit message.
If the target is global address, ipv6_dev_get_saddr() will get any
available src address. But if the target is link local address,
ipv6_dev_get_saddr() will only get available address from our interface,
i.e. the corresponding bond interface.
But before bond interface up, all the address is tentative, while
ipv6_dev_get_saddr() will ignore tentative address. This makes we can't
find available link local src address, then bond_ns_send() will not be
called and no NS message was sent. Finally bond interface will keep in
down state.
Fix this by sending NS with unspecified address if there is no available
source address.
Reported-by: LiLiang <liali@redhat.com>
Fixes: 5e1eeef69c ("bonding: NS target should accept link local address")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The value is only ever set once in bond_3ad_initialize and only ever
read otherwise. There seems to be no reason to set the variable via
bond_3ad_initialize when setting the global variable will do. Change
ad_ticks_per_sec to a const to enforce its read-only usage.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is caused by the global variable ad_ticks_per_sec being zero as
demonstrated by the reproducer script discussed below. This causes
all timer values in __ad_timer_to_ticks to be zero, resulting
in the periodic timer to never fire.
To reproduce:
Run the script in
`tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which
puts bonding into a state where it never transmits LACPDUs.
line 44: ip link add fbond type bond mode 4 miimon 200 \
xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast
setting bond param: ad_actor_sys_prio
given:
params.ad_actor_system = 0
call stack:
bond_option_ad_actor_sys_prio()
-> bond_3ad_update_ad_actor_settings()
-> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
-> ad.system.sys_mac_addr = bond->dev->dev_addr; because
params.ad_actor_system == 0
results:
ad.system.sys_mac_addr = bond->dev->dev_addr
line 48: ip link set fbond address 52:54:00:3B:7C:A6
setting bond MAC addr
call stack:
bond->dev->dev_addr = new_mac
line 52: ip link set fbond type bond ad_actor_sys_prio 65535
setting bond param: ad_actor_sys_prio
given:
params.ad_actor_system = 0
call stack:
bond_option_ad_actor_sys_prio()
-> bond_3ad_update_ad_actor_settings()
-> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
-> ad.system.sys_mac_addr = bond->dev->dev_addr; because
params.ad_actor_system == 0
results:
ad.system.sys_mac_addr = bond->dev->dev_addr
line 60: ip link set veth1-bond down master fbond
given:
params.ad_actor_system = 0
params.mode = BOND_MODE_8023AD
ad.system.sys_mac_addr == bond->dev->dev_addr
call stack:
bond_enslave
-> bond_3ad_initialize(); because first slave
-> if ad.system.sys_mac_addr != bond->dev->dev_addr
return
results:
Nothing is run in bond_3ad_initialize() because dev_addr equals
sys_mac_addr leaving the global ad_ticks_per_sec zero as it is
never initialized anywhere else.
The if check around the contents of bond_3ad_initialize() is no longer
needed due to commit 5ee14e6d33 ("bonding: 3ad: apply ad_actor settings
changes immediately") which sets ad.system.sys_mac_addr if any one of
the bonding parameters whos set function calls
bond_3ad_update_ad_actor_settings(). This is because if
ad.system.sys_mac_addr is zero it will be set to the current bond mac
address, this causes the if check to never be true.
Fixes: 5ee14e6d33 ("bonding: 3ad: apply ad_actor settings changes immediately")
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit d5410ac7b0 ("net:bonding:support balance-alb interface
with vlan to bridge") introduced a reference count leak by not releasing
the reference acquired by ip_dev_find(). Remedy this by insuring the
reference is released.
Fixes: d5410ac7b0 ("net:bonding:support balance-alb interface with vlan to bridge")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/26758.1660194413@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, tls_device_down synchronizes with tls_device_resync_rx using
RCU, however, the pointer to netdev is stored using WRITE_ONCE and
loaded using READ_ONCE.
Although such approach is technically correct (rcu_dereference is
essentially a READ_ONCE, and rcu_assign_pointer uses WRITE_ONCE to store
NULL), using special RCU helpers for pointers is more valid, as it
includes additional checks and might change the implementation
transparently to the callers.
Mark the netdev pointer as __rcu and use the correct RCU helpers to
access it. For non-concurrent access pass the right conditions that
guarantee safe access (locks taken, refcount value). Also use the
correct helper in mlx5e, where even READ_ONCE was missing.
The transition to RCU exposes existing issues, fixed by this commit:
1. bond_tls_device_xmit could read netdev twice, and it could become
NULL the second time, after the NULL check passed.
2. Drivers shouldn't stop processing the last packet if tls_device_down
just set netdev to NULL, before tls_dev_del was called. This prevents a
possible packet drop when transitioning to the fallback software mode.
Fixes: 89df6a8104 ("net/bonding: Implement TLS TX device offload")
Fixes: c55dcdd435 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/r/20220810081602.1435800-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In my test, balance-alb bonding with two slaves eth0 and eth1,
and then Bond0.150 is created with vlan id attached bond0.
After adding bond0.150 into one linux bridge, I noted that Bond0,
bond0.150 and bridge were assigned to the same MAC as eth0.
Once bond0.150 receives a packet whose dest IP is bridge's
and dest MAC is eth1's, the linux bridge will not match
eth1's MAC entry in FDB, and not handle it as expected.
The patch fix the issue, and diagram as below:
eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac)
|
bond0.150(mac:eth0_mac)
|
bridge(ip:br_ip, mac:eth0_mac)--other port
Suggested-by: Hu Yadi <huyd12@chinatelecom.cn>
Signed-off-by: Sun Shouxin <sunshouxin@chinatelecom.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver piggybacks on time stamps kept by the network stack
for the purpose of the netdev TX watchdog, and this is problematic
because it does not work with NETIF_F_LLTX devices.
It is hard to say why the driver looks at dev_trans_start() of the
slave->dev, considering that this is updated even by non-ARP/NS probes
sent by us, and even by traffic not sent by us at all (for example PTP
on physical slave devices). ARP monitoring in active-backup mode appears
to still work even if we track only the last TX time of actual ARP
probes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
9c5de246c1 ("net: sparx5: mdb add/del handle non-sparx5 devices")
fbb89d02e3 ("net: sparx5: Allow mdb entries to both CPU and ports")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
commit 0622cab034 ("bonding: fix 802.3ad aggregator reselection"),
resolve case, when there is several aggregation groups in the same bond.
bond_3ad_unbind_slave will invalidate (clear) aggregator when
__agg_active_ports return zero. So, ad_clear_agg can be executed even, when
num_of_ports!=0. Than bond_3ad_unbind_slave can be executed again for,
previously cleared aggregator. NOTE: at this time bond_3ad_unbind_slave
will not update slave ports list, because lag_ports==NULL. So, here we
got slave ports, pointing to freed aggregator memory.
Fix with checking actual number of ports in group (as was before
commit 0622cab034 ("bonding: fix 802.3ad aggregator reselection") ),
before ad_clear_agg().
The KASAN logs are as follows:
[ 767.617392] ==================================================================
[ 767.630776] BUG: KASAN: use-after-free in bond_3ad_state_machine_handler+0x13dc/0x1470
[ 767.638764] Read of size 2 at addr ffff00011ba9d430 by task kworker/u8:7/767
[ 767.647361] CPU: 3 PID: 767 Comm: kworker/u8:7 Tainted: G O 5.15.11 #15
[ 767.655329] Hardware name: DNI AmazonGo1 A7040 board (DT)
[ 767.660760] Workqueue: lacp_1 bond_3ad_state_machine_handler
[ 767.666468] Call trace:
[ 767.668930] dump_backtrace+0x0/0x2d0
[ 767.672625] show_stack+0x24/0x30
[ 767.675965] dump_stack_lvl+0x68/0x84
[ 767.679659] print_address_description.constprop.0+0x74/0x2b8
[ 767.685451] kasan_report+0x1f0/0x260
[ 767.689148] __asan_load2+0x94/0xd0
[ 767.692667] bond_3ad_state_machine_handler+0x13dc/0x1470
Fixes: 0622cab034 ("bonding: fix 802.3ad aggregator reselection")
Co-developed-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: Maksym Glubokiy <maksym.glubokiy@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20220629012914.361-1-yevhen.orlov@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add per port priority support for bonding active slave re-selection during
failover. A higher number means higher priority in selection. The primary
slave still has the highest priority. This option also follows the
primary_reselect rules.
This option could only be configured via netlink.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding ARP monitor fails to decrement send_peer_notif, the
number of peer notifications (gratuitous ARP or ND) to be sent. This
results in a continuous series of notifications.
Correct this by decrementing the counter for each notification.
Reported-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Fixes: b0929915e0 ("bonding: Fix RTNL: assertion failed at net/core/rtnetlink.c for ab arp monitor")
Link: https://lore.kernel.org/netdev/b2fd4147-8f50-bebd-963a-1a3e8d1d9715@redhat.com/
Tested-by: Jonathan Toppins <jtoppins@redhat.com>
Reviewed-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/9400.1655407960@famine
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Setting RLB_NULL_INDEX is not needed as this is done in bond_alb_initialize
which is called by bond_open.
Also reduce the number of rtnl_unlock calls by just using the standard
goto cleanup path.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for reporting errors via extack in both bond_newlink
and bond_changelink.
Instead of having to look in the kernel log for why an option was not
correct just report the error to the user via the extack variable.
What is currently reported today:
ip link add bond0 type bond
ip link set bond0 up
ip link set bond0 type bond mode 4
RTNETLINK answers: Device or resource busy
After this change:
ip link add bond0 type bond
ip link set bond0 up
ip link set bond0 type bond mode 4
Error: unable to set option because the bond is up.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guard ns_targets in struct bond_params by CONFIG_IPV6, which could save
256 bytes if IPv6 not configed. Also add this protection for function
bond_is_ip6_target_ok() and bond_get_targets_ip6().
Remove the IS_ENABLED() check for bond_opts[] as this will make
BOND_OPT_NS_TARGETS uninitialized if CONFIG_IPV6 not enabled. Add
a dummy bond_option_ns_ip6_targets_set() for this situation.
Fixes: 4e24be018e ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/20220531063727.224043-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When adding bond new parameter ns_targets. I forgot to print this
in bond master proc info. After updating, the bond master info will look
like:
ARP IP target/s (n.n.n.n form): 192.168.1.254
NS IPv6 target/s (XX::XX form): 2022::1, 2022::2
Fixes: 4e24be018e ("bonding: add new parameter ns_targets")
Reported-by: Li Liang <liali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220530062639.37179-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When setting bond NS target, we use bond_is_ip6_target_ok() to check
if the address valid. The link local address was wrongly rejected in
bond_changelink(), as most time the user just set the ARP/NS target to
gateway, while the IPv6 gateway is always a link local address when user
set up interface via SLAAC.
So remove the link local addr check when setting bond NS target.
Fixes: 129e3c1bab ("bonding: add new option ns_ip6_target")
Reported-by: Li Liang <liali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing the rcu_read_lock in bond_ethtool_get_ts_info() as
discussed [1], I didn't notice it could be called via setsockopt,
which doesn't hold rcu lock, as syzbot pointed:
stack backtrace:
CPU: 0 PID: 3599 Comm: syz-executor317 Not tainted 5.18.0-rc5-syzkaller-01392-g01f4685797a5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
bond_option_active_slave_get_rcu include/net/bonding.h:353 [inline]
bond_ethtool_get_ts_info+0x32c/0x3a0 drivers/net/bonding/bond_main.c:5595
__ethtool_get_ts_info+0x173/0x240 net/ethtool/common.c:554
ethtool_get_phc_vclocks+0x99/0x110 net/ethtool/common.c:568
sock_timestamping_bind_phc net/core/sock.c:869 [inline]
sock_set_timestamping+0x3a3/0x7e0 net/core/sock.c:916
sock_setsockopt+0x543/0x2ec0 net/core/sock.c:1221
__sys_setsockopt+0x55e/0x6a0 net/socket.c:2223
__do_sys_setsockopt net/socket.c:2238 [inline]
__se_sys_setsockopt net/socket.c:2235 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2235
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f8902c8eb39
Fix it by adding rcu_read_lock and take a ref on the real_dev.
Since dev_hold() and dev_put() can take NULL these days, we can
skip checking if real_dev exist.
[1] https://lore.kernel.org/netdev/27565.1642742439@famine/
Reported-by: syzbot+92beb3d46aab498710fa@syzkaller.appspotmail.com
Fixes: aa6034678e ("bonding: use rcu_dereference_rtnl when get bonding active slave")
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220519020148.1058344-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drivers should call the TSO setting helper, GSO is controllable
by user space.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces some sk_ipv6only tests with ipv6_only_sock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b5f862180d was introduced to discard lowest hash bit for layer3+4 hashing
but it also removes last bit from non layer3+4 hashing
Below script shows layer2+3 hashing will result in same slave to be used with above commit.
$ cat hash.py
#/usr/bin/python3.6
h_dests=[0xa0, 0xa1]
h_source=0xe3
hproto=0x8
saddr=0x1e7aa8c0
daddr=0x17aa8c0
for h_dest in h_dests:
hash = (h_dest ^ h_source ^ hproto ^ saddr ^ daddr)
hash ^= hash >> 16
hash ^= hash >> 8
print(hash)
print("with last bit removed")
for h_dest in h_dests:
hash = (h_dest ^ h_source ^ hproto ^ saddr ^ daddr)
hash ^= hash >> 16
hash ^= hash >> 8
hash = hash >> 1
print(hash)
Output:
$ python3.6 hash.py
522133332
522133333 <-------------- will result in both slaves being used
with last bit removed
261066666
261066666 <-------------- only single slave used
Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before adding yet another possibly contended atomic_long_t,
it is time to add per-cpu storage for existing ones:
dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler
Because many devices do not have to increment such counters,
allocate the per-cpu storage on demand, so that dev_get_stats()
does not have to spend considerable time folding zero counters.
Note that some drivers have abused these counters which
were supposed to be only used by core networking stack.
v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
Change in netdev_core_stats_alloc() (Jakub)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: jeffreyji <jeffreyji@google.com>
Reviewed-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch add a new bonding option ns_ip6_target, which correspond
to the arp_ip_target. With this we set IPv6 targets and send IPv6 NS
request to determine the health of the link.
For other related options like the validation, we still use
arp_validate, and will change to ns_validate later.
Note: the sysfs configuration support was removed based on
https://lore.kernel.org/netdev/8863.1645071997@famine
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new bonding parameter ns_targets to store IPv6 address.
Add required bond_ns_send/rcv functions first before adding
IPv6 address option setting.
Add two functions bond_send/rcv_validate so we can send/recv
ARP and NS at the same time.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function bond_handle_vlan() is split from bond_arp_send() for later
IPv6 usage.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __bond_release_one(), bond_set_carrier() is only called when bond
device has no slave. Therefore, if we remove the up slave from a master
with two slaves and keep the down slave, the master will remain up.
Fix this by moving bond_set_carrier() out of if (!bond_has_slaves(bond))
statement.
Reproducer:
$ insmod bonding.ko mode=0 miimon=100 max_bonds=2
$ ifconfig bond0 up
$ ifenslave bond0 eth0 eth1
$ ifconfig eth0 down
$ ifenslave -d bond0 eth1
$ cat /proc/net/bonding/bond0
Fixes: ff59c4563a ("[PATCH] bonding: support carrier state for master")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/1645021088-38370-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cleanup_net() is competing with other rtnl users.
Batching bond_net_exit() factorizes all rtnl acquistions
to a single one, giving chance for cleanup_net()
to progress much faster, holding rtnl a bit longer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When 803.2ad mode enables a participating port, it should update
the slave-array. I have observed that the member links are participating
and are part of the active aggregator while the traffic is egressing via
only one member link (in a case where two links are participating). Via
kprobes I discovered that slave-arr has only one link added while
the other participating link wasn't part of the slave-arr.
I couldn't see what caused that situation but the simple code-walk
through provided me hints that the enable_port wasn't always associated
with the slave-array update.
Fixes: ee63771474 ("bonding: Simplify the xmit function for modes that use xmit_hash")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20220207222901.1795287-1-maheshb@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since ipv6 neighbor solicitation and advertisement messages
isn't handled gracefully in bond6 driver, we can see packet
drop due to inconsistency between mac address in the option
message and source MAC .
Another examples is ipv6 neighbor solicitation and advertisement
messages from VM via tap attached to host bridge, the src mac
might be changed through balance-alb mode, but it is not synced
with Link-layer address in the option message.
The patch implements bond6's tx handle for ipv6 neighbor
solicitation and advertisement messages.
Suggested-by: Hu Yadi <huyd12@chinatelecom.cn>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Sun Shouxin <sunshouxin@chinatelecom.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current release - new code bugs:
- tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
- tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n
- nf_tables: set last expression in register tracking area
- nft_connlimit: fix memleak if nf_ct_netns_get() fails
- mptcp: fix removing ids bitmap setting
- bonding: use rcu_dereference_rtnl when getting active slave
- fix three cases of sleep in atomic context in drivers: lan966x, gve
- handful of build fixes for esoteric drivers after netdev->dev_addr
was made const
Previous releases - regressions:
- revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke
Linux compatibility with USGv6 tests
- procfs: show net device bound packet types
- ipv4: fix ip option filtering for locally generated fragments
- phy: broadcom: hook up soft_reset for BCM54616S
Previous releases - always broken:
- ipv4: raw: lock the socket in raw_bind()
- ipv4: decrease the use of shared IPID generator to decrease the
chance of attackers guessing the values
- procfs: fix cross-netns information leakage in /proc/net/ptype
- ethtool: fix link extended state for big endian
- bridge: vlan: fix single net device option dumping
- ping: fix the sk_bound_dev_if match in ping_lookup
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHy4mMACgkQMUZtbf5S
IrvOZg/9HyOFAJrCYBlgA3zskHBdqYOGn9M3LCIevBcrCzQeigT+U1uWCINfBn+H
DmsljeYKTicHZ38+HjdNXmzdnMqHtU+iJl4Ep1mcDNywygofW8JcS2Nf0n6Y+hK6
nzyEa23DBt9UAiLmGXUTIoJwEhDRbuL/eH1/ZkkPLG7GtShtEDAKHg+dJBgHbYgJ
0MQs3Q4s6AQ1PYOC0Z0zByhpSrAo2c4X/tr6g2ExNxU0vnydUbjIME0a5clFULr+
ziVeOo4e83FINPaZiYAXEDbMGUC0z+rp1RoGsgRCdTnixi5BclkmEeGRaChYJHTZ
T7tfIC2H0vZHu5/pAXFqwEHiRbminLv9jLkvA1/J67jbnpoNWTLD2jkuIWFlaY/Z
xDm7LnVBB1CdLmXYo2ItSC/8ws9GANpJOq/vFvm+uOYZNKUVctfQ5viA3+hOSULC
6BJHC0m5UminHZPVge9s1XZClarHK4jMMTH9Du2sHLsl3fedNxbgvcVPFdHswLdF
uYiUGMSrIXuQjXw6SNmR4/voJgzikvYhT+jwMn4vTeWoFQFi5eNUch0MPfUImlXG
e3T6WJHrOY3yJFyWQQ9GGLStchD72+iGq2uWLfOIyu9NRKCNBj4Kkm3bUvfqYp5b
d5sP/nl93o3um4WskxB/fDLyhSXWjprgM9mKI45ilPhUC8bWQyo=
=mwR3
-----END PGP SIGNATURE-----
Merge tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and can.
Current release - new code bugs:
- tcp: add a missing sk_defer_free_flush() in tcp_splice_read()
- tcp: add a stub for sk_defer_free_flush(), fix CONFIG_INET=n
- nf_tables: set last expression in register tracking area
- nft_connlimit: fix memleak if nf_ct_netns_get() fails
- mptcp: fix removing ids bitmap setting
- bonding: use rcu_dereference_rtnl when getting active slave
- fix three cases of sleep in atomic context in drivers: lan966x, gve
- handful of build fixes for esoteric drivers after netdev->dev_addr
was made const
Previous releases - regressions:
- revert "ipv6: Honor all IPv6 PIO Valid Lifetime values", it broke
Linux compatibility with USGv6 tests
- procfs: show net device bound packet types
- ipv4: fix ip option filtering for locally generated fragments
- phy: broadcom: hook up soft_reset for BCM54616S
Previous releases - always broken:
- ipv4: raw: lock the socket in raw_bind()
- ipv4: decrease the use of shared IPID generator to decrease the
chance of attackers guessing the values
- procfs: fix cross-netns information leakage in /proc/net/ptype
- ethtool: fix link extended state for big endian
- bridge: vlan: fix single net device option dumping
- ping: fix the sk_bound_dev_if match in ping_lookup"
* tag 'net-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
net: bridge: vlan: fix memory leak in __allowed_ingress
net: socket: rename SKB_DROP_REASON_SOCKET_FILTER
ipv4: remove sparse error in ip_neigh_gw4()
ipv4: avoid using shared IP generator for connected sockets
ipv4: tcp: send zero IPID in SYNACK messages
ipv4: raw: lock the socket in raw_bind()
MAINTAINERS: add missing IPv4/IPv6 header paths
MAINTAINERS: add more files to eth PHY
net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout()
net: bridge: vlan: fix single net device option dumping
net: stmmac: skip only stmmac_ptp_register when resume from suspend
net: stmmac: configure PTP clock source prior to PTP initialization
Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"
connector/cn_proc: Use task_is_in_init_pid_ns()
pid: Introduce helper task_is_in_init_pid_ns()
gve: Fix GFP flags when allocing pages
net: lan966x: Fix sleep in atomic context when updating MAC table
net: lan966x: Fix sleep in atomic context when injecting frames
ethernet: seeq/ether3: don't write directly to netdev->dev_addr
ethernet: 8390/etherh: don't write directly to netdev->dev_addr
...
bond_option_active_slave_get_rcu() should not be used in rtnl_mutex as it
use rcu_dereference(). Replace to rcu_dereference_rtnl() so we also can use
this function in rtnl protected context.
With this update, we can rmeove the rcu_read_lock/unlock in
bonding .ndo_eth_ioctl and .get_ts_info.
Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Fixes: 94dd016ae5 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove PDE_DATA() completely and replace it with pde_data().
[akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
[akpm@linux-foundation.org: now fix it properly]
Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wrong hash sends single stream to multiple output interfaces.
The offset calculation was relative to skb->head, fix it to be relative
to skb->data.
Fixes: a815bde56b ("net, bonding: Refactor bond_xmit_hash for use with
xdp_buff")
Reviewed-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Moshe Tal <moshet@nvidia.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In Linux bonding scenario, one packet is copied to several copies and sent
by all slave device of bond0 in mode 3(broadcast mode). The mode 3 xmit
function bond_xmit_broadcast() only ueses the last slave device's tx result
as the final result. In this case, if the last slave device is down, then
it always return NET_XMIT_DROP, even though the other slave devices xmit
success. It may cause the tx statistics error, and cause the application
(e.g. scp) consider the network is unreachable.
For example, use the following command to configure server A.
echo 3 > /sys/class/net/bond0/bonding/mode
ifconfig bond0 up
ifenslave bond0 eth0 eth1
ifconfig bond0 192.168.1.125
ifconfig eth0 up
ifconfig eth1 down
The slave device eth0 and eth1 are connected to server B(192.168.1.107).
Run the ping 192.168.1.107 -c 3 -i 0.2 command, the following information
is displayed.
PING 192.168.1.107 (192.168.1.107) 56(84) bytes of data.
64 bytes from 192.168.1.107: icmp_seq=1 ttl=64 time=0.077 ms
64 bytes from 192.168.1.107: icmp_seq=2 ttl=64 time=0.056 ms
64 bytes from 192.168.1.107: icmp_seq=3 ttl=64 time=0.051 ms
192.168.1.107 ping statistics
0 packets transmitted, 3 received
Actually, the slave device eth0 of the bond successfully sends three
ICMP packets, but the result shows that 0 packets are transmitted.
Also if we use scp command to get remote files, the command end with the
following printings.
ssh_exchange_identification: read: Connection timed out
So this patch modifies the bond_xmit_broadcast to return NET_XMIT_SUCCESS
if one slave device in the bond sends packets successfully. If all slave
devices send packets fail, the discarded packets stats is increased. The
skb is released when there is no slave device in the bond or the last slave
device is down.
Fixes: ae46f184bc ("bonding: propagate transmit status")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If the userspace program is distributed in binary form (distro package),
there is no way to know on which kernel versions it will run.
Let's only check if the flag was set when do SIOCSHWTSTAMP. And return
hwtstamp_config with flag HWTSTAMP_FLAG_BONDED_PHC_INDEX to notify
userspace whether the new feature is supported or not.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 085d610008 ("Bonding: force user to add HWTSTAMP_FLAG_BONDED_PHC_INDEX when get/set HWTSTAMP")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
When 802.3ad bond mode is configured the ad_actor_system option is set to
"00:00:00:00:00:00". But when trying to set the all-zeroes MAC as actors'
system address it was failing with EINVAL.
An all-zeroes ethernet address is valid, only multicast addresses are not
valid values.
Fixes: 171a42c38c ("bonding: add netlink support for sys prio, actor sys mac, and port key")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20211221111345.2462-1-ffmancera@riseup.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When there is a failover, the PHC index of bond active interface will be
changed. This may break the user space program if the author didn't aware.
By setting this flag, the user should aware that the PHC index get/set
by syscall is not stable. And the user space is able to deal with it.
Without this flag, the kernel will reject the request forwarding to
bonding.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 94dd016ae5 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
KCSAN reported a data-race [1] around tx_rebalance_counter
which can be accessed from different contexts, without
the protection of a lock/mutex.
[1]
BUG: KCSAN: data-race in bond_alb_init_slave / bond_alb_monitor
write to 0xffff888157e8ca24 of 4 bytes by task 7075 on cpu 0:
bond_alb_init_slave+0x713/0x860 drivers/net/bonding/bond_alb.c:1613
bond_enslave+0xd94/0x3010 drivers/net/bonding/bond_main.c:1949
do_set_master net/core/rtnetlink.c:2521 [inline]
__rtnl_newlink net/core/rtnetlink.c:3475 [inline]
rtnl_newlink+0x1298/0x13b0 net/core/rtnetlink.c:3506
rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2491
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5589
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x5fc/0x6c0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x6e1/0x7d0 net/netlink/af_netlink.c:1916
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2409
___sys_sendmsg net/socket.c:2463 [inline]
__sys_sendmsg+0x195/0x230 net/socket.c:2492
__do_sys_sendmsg net/socket.c:2501 [inline]
__se_sys_sendmsg net/socket.c:2499 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2499
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888157e8ca24 of 4 bytes by task 1082 on cpu 1:
bond_alb_monitor+0x8f/0xc00 drivers/net/bonding/bond_alb.c:1511
process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
worker_thread+0x616/0xa70 kernel/workqueue.c:2445
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x00000001 -> 0x00000064
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 1082 Comm: kworker/u4:3 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bond1 bond_alb_monitor
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have VLAN PTP support(via get_ts_info) on kernel, and bond support(by
getting active interface via netlink message) on userspace tool linuxptp.
But there are always some users who want to use PTP with VLAN over bond,
which is not able to do with the current implementation.
This patch passed get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device
with bond mode active-backup/tlb/alb. With this users could get kernel native
bond or VLAN over bond PTP support.
Test with ptp4l and it works with VLAN over bond after this patch:
]# ptp4l -m -i bond0.23
ptp4l[53377.141]: selected /dev/ptp4 as PTP clock
ptp4l[53377.142]: port 1: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[53377.143]: port 0: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[53377.143]: port 0: INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[53384.127]: port 1: LISTENING to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
ptp4l[53384.127]: selected local clock e41d2d.fffe.123db0 as best master
ptp4l[53384.127]: port 1: assuming the grand master role
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we use hard code number to verify if we are in the
arp_interval timeslice. But some user may want to reduce/extend
the verify timeslice. With the similar team option 'missed_max'
the uers could change that number based on their own environment.
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev->gso_max_segs is written under RTNL protection, or when the device is
not yet visible, but is read locklessly.
Add netif_set_gso_max_segs() helper.
Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
where we can to better document what is going on.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I do fuzz test for bonding device interface, I got the following
use-after-free Calltrace:
==================================================================
BUG: KASAN: use-after-free in bond_enslave+0x1521/0x24f0
Read of size 8 at addr ffff88825bc11c00 by task ifenslave/7365
CPU: 5 PID: 7365 Comm: ifenslave Tainted: G E 5.15.0-rc1+ #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
Call Trace:
dump_stack_lvl+0x6c/0x8b
print_address_description.constprop.0+0x48/0x70
kasan_report.cold+0x82/0xdb
__asan_load8+0x69/0x90
bond_enslave+0x1521/0x24f0
bond_do_ioctl+0x3e0/0x450
dev_ifsioc+0x2ba/0x970
dev_ioctl+0x112/0x710
sock_do_ioctl+0x118/0x1b0
sock_ioctl+0x2e0/0x490
__x64_sys_ioctl+0x118/0x150
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f19159cf577
Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 78
RSP: 002b:00007ffeb3083c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffeb3084bca RCX: 00007f19159cf577
RDX: 00007ffeb3083ce0 RSI: 0000000000008990 RDI: 0000000000000003
RBP: 00007ffeb3084bc4 R08: 0000000000000040 R09: 0000000000000000
R10: 00007ffeb3084bc0 R11: 0000000000000246 R12: 00007ffeb3083ce0
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeb3083cb0
Allocated by task 7365:
kasan_save_stack+0x23/0x50
__kasan_kmalloc+0x83/0xa0
kmem_cache_alloc_trace+0x22e/0x470
bond_enslave+0x2e1/0x24f0
bond_do_ioctl+0x3e0/0x450
dev_ifsioc+0x2ba/0x970
dev_ioctl+0x112/0x710
sock_do_ioctl+0x118/0x1b0
sock_ioctl+0x2e0/0x490
__x64_sys_ioctl+0x118/0x150
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 7365:
kasan_save_stack+0x23/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x24/0x40
__kasan_slab_free+0xf2/0x130
kfree+0xd1/0x5c0
slave_kobj_release+0x61/0x90
kobject_put+0x102/0x180
bond_sysfs_slave_add+0x7a/0xa0
bond_enslave+0x11b6/0x24f0
bond_do_ioctl+0x3e0/0x450
dev_ifsioc+0x2ba/0x970
dev_ioctl+0x112/0x710
sock_do_ioctl+0x118/0x1b0
sock_ioctl+0x2e0/0x490
__x64_sys_ioctl+0x118/0x150
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Last potentially related work creation:
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xb7/0xd0
insert_work+0x43/0x190
__queue_work+0x2e3/0x970
delayed_work_timer_fn+0x3e/0x50
call_timer_fn+0x148/0x470
run_timer_softirq+0x8a8/0xc50
__do_softirq+0x107/0x55f
Second to last potentially related work creation:
kasan_save_stack+0x23/0x50
kasan_record_aux_stack+0xb7/0xd0
insert_work+0x43/0x190
__queue_work+0x2e3/0x970
__queue_delayed_work+0x130/0x180
queue_delayed_work_on+0xa7/0xb0
bond_enslave+0xe25/0x24f0
bond_do_ioctl+0x3e0/0x450
dev_ifsioc+0x2ba/0x970
dev_ioctl+0x112/0x710
sock_do_ioctl+0x118/0x1b0
sock_ioctl+0x2e0/0x490
__x64_sys_ioctl+0x118/0x150
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88825bc11c00
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 0 bytes inside of
1024-byte region [ffff88825bc11c00, ffff88825bc12000)
The buggy address belongs to the page:
page:ffffea00096f0400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x25bc10
head:ffffea00096f0400 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
raw: 057ff00000010200 ffffea0009a71c08 ffff888240001968 ffff88810004dbc0
raw: 0000000000000000 00000000000a000a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88825bc11b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88825bc11b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88825bc11c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88825bc11c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88825bc11d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Put new_slave in bond_sysfs_slave_add() will cause use-after-free problems
when new_slave is accessed in the subsequent error handling process. Since
new_slave will be put in the subsequent error handling process, remove the
unnecessary put to fix it.
In addition, when sysfs_create_file() fails, if some files have been crea-
ted successfully, we need to call sysfs_remove_file() to remove them.
Since there are sysfs_create_files() & sysfs_remove_files() can be used,
use these two functions instead.
Fixes: 7afcaec496 (bonding: use kobject_put instead of _del after kobject_add)
Signed-off-by: Huang Guobin <huangguobin4@huawei.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 406f42fa0d ("net-next: When a bond have a massive amount
of VLANs...") introduced a rbtree for faster Ethernet address look
up. To maintain netdev->dev_addr in this tree we need to make all
the writes to it go through appropriate helpers.
Make sure local references to netdev->dev_addr are constant.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use dev_addr_set() instead of writing directly to netdev->dev_addr
in various misc and old drivers.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new helper to detect if a net device name is in use was added. Use it
here as the return reference from __dev_get_by_name was discarded.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameter bond_params is a relatively large 192 byte sized
struct so pass it by reference rather than by value to reduce
copying.
Addresses-Coverity: ("Big parameter passed by value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With nested bonding devices the nested bond device's ndo_bpf was
called without a program causing it to decrement the static key
without a prior increment leading to negative count.
Fix the issue by 1) only calling slave's ndo_bpf when there's a
program to be loaded and 2) only decrement the count when a program
is unloaded.
Fixes: 9e2ee5c7e7 ("net, bonding: Add XDP support to the bonding driver")
Reported-by: syzbot+30622fb04ddd72a4d167@syzkaller.appspotmail.com
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On configs where there is no confirgured direct route to the target of
the ARP probes, these probes are still sent and may be replied to
properly, so no need to repeatedly complain about the missing route.
Signed-off-by: David Decotigny <ddecotig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a more user friendly netlink error message when a device can't be
enslaved because it has IFF_MASTER, by not referring directly to a
kernel internal flag.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new vlan+srcmac xmit policy is not implementable with XDP since
in many cases the 802.1Q payload is not present in the packet. This
can be for example due to hardware offload or in the case of veth
due to use of skbuffs internally.
This also fixes the NULL deref with the vlan+srcmac xmit policy
reported by Jonathan Toppins by additionally checking the skb
pointer.
Fixes: a815bde56b ("net, bonding: Refactor bond_xmit_hash for use with xdp_buff")
Reported-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/20210812145241.12449-1-joamaki@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There seems to be no reason to have different error messages between
netlink and printk. It also cleans up the function slightly.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.
Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.
The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.
Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].
-----------------------| CPU |--| rxpck/s |--| txpck/s |----
without patch (1 dev):
XDP_DROP: 3.15% 48.6Mpps
XDP_TX: 3.12% 18.3Mpps 18.3Mpps
XDP_DROP (RSS): 9.47% 116.5Mpps
XDP_TX (RSS): 9.67% 25.3Mpps 24.2Mpps
-----------------------
with patch, bond (1 dev):
XDP_DROP: 3.14% 46.7Mpps
XDP_TX: 3.15% 13.9Mpps 13.9Mpps
XDP_DROP (RSS): 10.33% 117.2Mpps
XDP_TX (RSS): 10.64% 25.1Mpps 24.0Mpps
-----------------------
with patch, bond (2 devs):
XDP_DROP: 6.27% 92.7Mpps
XDP_TX: 6.26% 17.6Mpps 17.5Mpps
XDP_DROP (RSS): 11.38% 117.2Mpps
XDP_TX (RSS): 14.30% 28.7Mpps 27.4Mpps
--------------------------------------------------------------
RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.
[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
[4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
In preparation for adding XDP support to the bonding driver
refactor the packet hashing functions to be able to work with
any linear data buffer without an skb.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com
commit <47595e32869f> ("<MAINTAINERS: Mark some staging directories>")
indicated the ipx network layer as obsolete in Jan 2018,
updated in the MAINTAINERS file
now, after being exposed for 3 years to refactoring,
so to delete the ipx net layer related code for good.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option lacp_active, which is similar with team's runner.active.
This option specifies whether to send LACPDU frames periodically. If set
on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames acts as "speak when spoken to".
Note, the LACPDU state frames still will be sent when init or unbind port.
v2: remove module parameter
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some time ago, I reported a calltrace issue
"did not find a suitable aggregator", please see[1].
After a period of analysis and reproduction, I find
that this problem is caused by concurrency.
Before the problem occurs, the bond structure is like follows:
bond0 - slaver0(eth0) - agg0.lag_ports -> port0 - port1
\
port0
\
slaver1(eth1) - agg1.lag_ports -> NULL
\
port1
If we run 'ifenslave bond0 -d eth1', the process is like below:
excuting __bond_release_one()
|
bond_upper_dev_unlink()[step1]
| | |
| | bond_3ad_lacpdu_recv()
| | ->bond_3ad_rx_indication()
| | spin_lock_bh()
| | ->ad_rx_machine()
| | ->__record_pdu()[step2]
| | spin_unlock_bh()
| | |
| bond_3ad_state_machine_handler()
| spin_lock_bh()
| ->ad_port_selection_logic()
| ->try to find free aggregator[step3]
| ->try to find suitable aggregator[step4]
| ->did not find a suitable aggregator[step5]
| spin_unlock_bh()
| |
| |
bond_3ad_unbind_slave() |
spin_lock_bh()
spin_unlock_bh()
step1: already removed slaver1(eth1) from list, but port1 remains
step2: receive a lacpdu and update port0
step3: port0 will be removed from agg0.lag_ports. The struct is
"agg0.lag_ports -> port1" now, and agg0 is not free. At the
same time, slaver1/agg1 has been removed from the list by step1.
So we can't find a free aggregator now.
step4: can't find suitable aggregator because of step2
step5: cause a calltrace since port->aggregator is NULL
To solve this concurrency problem, put bond_upper_dev_unlink()
after bond_3ad_unbind_slave(). In this way, we can invalid the port
first and skip this port in bond_3ad_state_machine_handler(). This
eliminates the situation that the slaver has been removed from the
list but the port is still valid.
[1]https://lore.kernel.org/netdev/10374.1611947473@famine/
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All other user triggered operations are gone from ndo_ioctl, so move
the SIOCBOND family into a custom operation as well.
The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now,
but there are still a few definitions in obsolete wireless drivers as well
as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR
helpers from inside the kernel.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
This is a purely cosmetic change intended to help readers find
their way through the implementation.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bonding driver supports two command codes for each operation: one
in the SIOCDEVPRIVATE range and another one with the same definition
but a unique command code.
Only the second set currently works in compat mode, as the ifr_data
expansion overwrites part of the ifr_slave field.
Move the private ones into ndo_siocdevprivate and change the
implementation to call the other function. This makes both version
work correctly.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 9a5605505d (" bonding: Add struct bond_ipesc to manage SA") is causing
following build error when XFRM is not selected in kernel config.
lld: error: undefined symbol: xfrm_dev_state_flush
>>> referenced by bond_main.c:3453 (drivers/net/bonding/bond_main.c:3453)
>>> net/bonding/bond_main.o:(bond_netdev_event) in archive drivers/built-in.a
Fixes: 9a5605505d (" bonding: Add struct bond_ipesc to manage SA")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Taehee Yoo <ap420073@gmail.com>
CC: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_ipsec_offload_ok() is called to check whether the interface supports
ipsec offload or not.
bonding interface support ipsec offload only in active-backup mode.
So, if a bond interface is not in active-backup mode, it should return
false but it returns true.
Fixes: a3b658cfb6 ("bonding: allow xfrm offload setup post-module-load")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To dereference bond->curr_active_slave, it uses rcu_dereference().
But it and the caller doesn't acquire RCU so a warning occurs.
So add rcu_read_lock().
Splat looks like:
WARNING: suspicious RCU usage
5.13.0-rc6+ #1179 Not tainted
drivers/net/bonding/bond_main.c:571 suspicious
rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ping/974:
#0: ffff888109e7db70 (sk_lock-AF_INET){+.+.}-{0:0},
at: raw_sendmsg+0x1303/0x2cb0
stack backtrace:
CPU: 2 PID: 974 Comm: ping Not tainted 5.13.0-rc6+ #1179
Call Trace:
dump_stack+0xa4/0xe5
bond_ipsec_offload_ok+0x1f4/0x260 [bonding]
xfrm_output+0x179/0x890
xfrm4_output+0xfa/0x410
? __xfrm4_output+0x4b0/0x4b0
? __ip_make_skb+0xecc/0x2030
? xfrm4_udp_encap_rcv+0x800/0x800
? ip_local_out+0x21/0x3a0
ip_send_skb+0x37/0xa0
raw_sendmsg+0x1bfd/0x2cb0
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bonding has been supporting ipsec offload.
When SA is added, bonding just passes SA to its own active real interface.
But it doesn't manage SA.
So, when events(add/del real interface, active real interface change, etc)
occur, bonding can't handle that well because It doesn't manage SA.
So some problems(panic, UAF, refcnt leak)occur.
In order to make it stable, it should manage SA.
That's the reason why struct bond_ipsec is added.
When a new SA is added to bonding interface, it is stored in the
bond_ipsec list. And the SA is passed to a current active real interface.
If events occur, it uses bond_ipsec data to handle these events.
bond->ipsec_list is protected by bond->ipsec_lock.
If a current active real interface is changed, the following logic works.
1. delete all SAs from old active real interface
2. Add all SAs to the new active real interface.
3. If a new active real interface doesn't support ipsec offload or SA's
option, it sets real_dev to NULL.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bonding interface can be nested and it supports ipsec offload.
So, it allows setting the nested bonding + ipsec scenario.
But code does not support this scenario.
So, it should be disallowed.
interface graph:
bond2
|
bond1
|
eth0
The nested bonding + ipsec offload may not a real usecase.
So, disallowing this scenario is fine.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To dereference bond->curr_active_slave, it uses rcu_dereference().
But it and the caller doesn't acquire RCU so a warning occurs.
So add rcu_read_lock().
Test commands:
ip netns add A
ip netns exec A bash
modprobe netdevsim
echo "1 1" > /sys/bus/netdevsim/new_device
ip link add bond0 type bond
ip link set eth0 master bond0
ip link set eth0 up
ip link set bond0 up
ip x s add proto esp dst 14.1.1.1 src 15.1.1.1 spi 0x07 mode \
transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))' \
0x44434241343332312423222114131211f4f3f2f1 128 sel src 14.0.0.52/24 \
dst 14.0.0.70/24 proto tcp offload dev bond0 dir in
ip x s f
Splat looks like:
=============================
WARNING: suspicious RCU usage
5.13.0-rc3+ #1168 Not tainted
-----------------------------
drivers/net/bonding/bond_main.c:448 suspicious rcu_dereference_check()
usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by ip/705:
#0: ffff888106701780 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{3:3},
at: xfrm_netlink_rcv+0x59/0x80 [xfrm_user]
#1: ffff8880075b0098 (&x->lock){+.-.}-{2:2},
at: xfrm_state_delete+0x16/0x30
stack backtrace:
CPU: 6 PID: 705 Comm: ip Not tainted 5.13.0-rc3+ #1168
Call Trace:
dump_stack+0xa4/0xe5
bond_ipsec_del_sa+0x16a/0x1c0 [bonding]
__xfrm_state_delete+0x51f/0x730
xfrm_state_delete+0x1e/0x30
xfrm_state_flush+0x22f/0x390
xfrm_flush_sa+0xd8/0x260 [xfrm_user]
? xfrm_flush_policy+0x290/0x290 [xfrm_user]
xfrm_user_rcv_msg+0x331/0x660 [xfrm_user]
? rcu_read_lock_sched_held+0x91/0xc0
? xfrm_user_state_lookup.constprop.39+0x320/0x320 [xfrm_user]
? find_held_lock+0x3a/0x1c0
? mutex_lock_io_nested+0x1210/0x1210
? sched_clock_cpu+0x18/0x170
netlink_rcv_skb+0x121/0x350
[ ... ]
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To dereference bond->curr_active_slave, it uses rcu_dereference().
But it and the caller doesn't acquire RCU so a warning occurs.
So add rcu_read_lock().
Test commands:
ip link add dummy0 type dummy
ip link add bond0 type bond
ip link set dummy0 master bond0
ip link set dummy0 up
ip link set bond0 up
ip x s add proto esp dst 14.1.1.1 src 15.1.1.1 spi 0x07 \
mode transport \
reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))' \
0x44434241343332312423222114131211f4f3f2f1 128 sel \
src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp offload \
dev bond0 dir in
Splat looks like:
=============================
WARNING: suspicious RCU usage
5.13.0-rc3+ #1168 Not tainted
-----------------------------
drivers/net/bonding/bond_main.c:411 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/684:
#0: ffffffff9a2757c0 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{3:3},
at: xfrm_netlink_rcv+0x59/0x80 [xfrm_user]
55.191733][ T684] stack backtrace:
CPU: 0 PID: 684 Comm: ip Not tainted 5.13.0-rc3+ #1168
Call Trace:
dump_stack+0xa4/0xe5
bond_ipsec_add_sa+0x18c/0x1f0 [bonding]
xfrm_dev_state_add+0x2a9/0x770
? memcpy+0x38/0x60
xfrm_add_sa+0x2278/0x3b10 [xfrm_user]
? xfrm_get_policy+0xaa0/0xaa0 [xfrm_user]
? register_lock_class+0x1750/0x1750
xfrm_user_rcv_msg+0x331/0x660 [xfrm_user]
? rcu_read_lock_sched_held+0x91/0xc0
? xfrm_user_state_lookup.constprop.39+0x320/0x320 [xfrm_user]
? find_held_lock+0x3a/0x1c0
? mutex_lock_io_nested+0x1210/0x1210
? sched_clock_cpu+0x18/0x170
netlink_rcv_skb+0x121/0x350
? xfrm_user_state_lookup.constprop.39+0x320/0x320 [xfrm_user]
? netlink_ack+0x9d0/0x9d0
? netlink_deliver_tap+0x17c/0xa50
xfrm_netlink_rcv+0x68/0x80 [xfrm_user]
netlink_unicast+0x41c/0x610
? netlink_attachskb+0x710/0x710
netlink_sendmsg+0x6b9/0xb70
[ ... ]
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The commit 3c9ef511b9 ("bonding: avoid adding slave device with
IFF_MASTER flag") fix a crash when add slave device with IFF_MASTER,
but it rejects the scenario of nested bonding device.
As Eric Dumazet described: since there indeed is a usage scenario about
nesting bonding, we should not break it.
So we add a new judgment condition to allow nesting of bonding device.
Fixes: 3c9ef511b9 ("bonding: avoid adding slave device with IFF_MASTER flag")
Suggested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Di Zhu <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following steps will definitely cause the kernel to crash:
ip link add vrf1 type vrf table 1
modprobe bonding.ko max_bonds=1
echo "+vrf1" >/sys/class/net/bond0/bonding/slaves
rmmod bonding
The root cause is that: When the VRF is added to the slave device,
it will fail, and some cleaning work will be done. because VRF device
has IFF_MASTER flag, cleanup process will not clear the IFF_BONDING flag.
Then, when we unload the bonding module, unregister_netdevice_notifier()
will treat the VRF device as a bond master device and treat netdev_priv()
as struct bonding{} which actually is struct net_vrf{}.
By analyzing the processing logic of bond_enslave(), it seems that
it is not allowed to add the slave device with the IFF_MASTER flag, so
we need to add a code check for this situation.
Signed-off-by: Di Zhu <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The round-robin rr_tx_counter was shared across CPUs leading to
significant cache thrashing at high packet rates. This patch switches
the round-robin packet counter to use a per-cpu variable to decide
the destination slave.
On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh
(-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after
this patch.
"perf top -e cache_misses" before:
12.31% [bonding] [k] bond_xmit_roundrobin_slave_get
10.59% [sch_fq_codel] [k] fq_codel_dequeue
9.34% [kernel] [k] skb_release_data
after:
15.42% [sch_fq_codel] [k] fq_codel_dequeue
10.06% [kernel] [k] __memset
9.12% [kernel] [k] skb_release_data
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable ret is being initialized with a value that is never read,
it is being updated later on. The assignment is redundant and can be
removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Silence this warning by using strscpy_pad() directly:
drivers/net/bonding/bond_main.c:4877:3: warning: 'strncpy' specified bound 16 equals destination size [-Wstringop-truncation]
4877 | strncpy(params->primary, primary, IFNAMSIZ);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Additionally replace other strncpy() uses, as it is considered deprecated:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202102150705.fdR6obB0-lkp@intel.com
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/becase/because/
s/reqeusts/requests/
s/funcions/functions/
s/addreses/addresses/
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Code indent should use tabs where possible, so
use tabs instead of space for code indent.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Braces {} are not necessary for single statement blocks,
so remove these braces {}.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect code indent for conditional statements.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add some blank lines after declarations as required.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If bond_kobj_init() or later kzalloc() in bond_alloc_slave() fail,
then we call kobject_put() on the slave->kobj. This in turn calls
the release function slave_kobj_release() which will always try to
cancel_delayed_work_sync(&slave->notify_work), which shouldn't be
done on an uninitialized work struct.
Always initialize the work struct earlier to avoid problems here.
Syzbot bisected this down to a completely pointless commit, some
fault injection may have been at work here that caused the alloc
failure in the first place, which may interact badly with bisect.
Reported-by: syzbot+bfda097c12a00c8cae67@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond works in mode 4, and performs down/up operations on the bond
that is normally negotiated. The probability of bond-> slave_arr is NULL
Test commands:
ifconfig bond1 down
ifconfig bond1 up
The conflict occurs in the following process:
__dev_open (CPU A)
--bond_open
--queue_delayed_work(bond->wq,&bond->ad_work,0);
--bond_update_slave_arr
--bond_3ad_get_active_agg_info
ad_work(CPU B)
--bond_3ad_state_machine_handler
--ad_agg_selection_logic
ad_work runs on cpu B. In the function ad_agg_selection_logic, all
agg->is_active will be cleared. Before the new active aggregator is
selected on CPU B, bond_3ad_get_active_agg_info failed on CPU A,
bond->slave_arr will be set to NULL. The best aggregator in
ad_agg_selection_logic has not changed, no need to update slave arr.
The conflict occurred in that ad_agg_selection_logic clears
agg->is_active under mode_lock, but bond_open -> bond_update_slave_arr
is inspecting agg->is_active outside the lock.
Also, bond_update_slave_arr is normal for potential sleep when
allocating memory, so replace the WARN_ON with a call to might_sleep.
Signed-off-by: jinyiting <jinyiting@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove repeated word "that".
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following make W=1 kernel build warning:
drivers/net/bonding/bond_main.c:982: warning: expecting prototype for change_active_interface(). Prototype was for bond_change_active_slave() instead
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the incorrect interface name is stored in the slaves/active_slave
option of the bonding sysfs, the kernel does not record the log that
interface does not exist.
This patch adds a log for -ENODEV error, which will facilitate users to
figure out such issue.
Signed-off-by: Jianlin Lv <Jianlin.Lv@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 2055a99da8.
This change rejects legitimate configurations.
A slave doesn't need to exist nor implement ndo_slave_setup.
Signed-off-by: David S. Miller <davem@davemloft.net>
When slave is NULL or slave_ops->ndo_neigh_setup is NULL, no error
return code of bond_neigh_init() is assigned.
To fix this bug, ret is assigned with -EINVAL in these cases.
Fixes: 9e99bfefdb ("bonding: fix bond_neigh_init()")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond driver needs to be patched to support new ethtool speeds.
Currently it emits a single warning [1] when it encounters an unknown
speed. As evident by the two previous patches, this is not explicit
enough. Instead, promote it to an error.
[1]
bond10: (slave swp1): unknown ethtool speed (200000) for port 1 (set it to 0)
v2:
* Use pr_err_once() instead of WARN_ONCE()
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to use 3ad mode with 400G devices we need to extend
the supported speeds.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to use 3ad mode with 200G devices we need to extend
the supported speeds.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.
Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.
Unlike bond_eth_hash(), this hash function is using the full source MAC
address instead of just the last byte, as there are so few components to
the hash, and in the no-vlan case, we would be returning just the last
byte of the source MAC as the hash value. It's entirely possible to have
two NICs in a bond with the same last byte of their MAC, but not the same
MAC, so this adjustment should guarantee distinct hashes in all cases.
This has been rudimetarily tested to provide similar results to the
proprietary solution it is aiming to replace. A patch for iproute2 is also
posted, to properly support the new mode there as well.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Thomas Davis <tadavis@lbl.gov>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Implement TLS TX device offload for bonding interfaces.
This allows kTLS sockets running on a bond to benefit from the
device offload on capable lower devices.
To allow a simple and fast maintenance of the TLS context in SW and
lower devices, we bind the TLS socket to a specific lower dev.
To achieve a behavior similar to SW kTLS, we support only balance-xor
and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced
in bond_sk_check(), done in a previous patch.
For the above configuration, the SW implementation keeps picking the
same exact lower dev for all the socket's SKBs. The device offload
behaves similarly, making the decision once at the connection creation.
Per socket, the TLS module should work directly with the lowest netdev
in chain, to call the tls_dev_ops operations.
As the bond interface is being bypassed by the TLS module, interacting
directly against the lower devs, there is no way for the bond interface
to disable its device offload capabilities, as long as the mode/policy
config allows it.
Hence, the feature flag is not directly controllable, but just reflects
the current offload status based on the logic under bond_sk_check().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for more cases that call netdev_update_features().
While here, move the features logic to the stage where struct bond
is already updated, and pass it as the only parameter to function
bond_set_xfrm_features().
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add ndo_sk_get_lower_dev() implementation for bond interfaces.
Support only for the cases where the socket's and SKBs' hash
yields identical value for the whole connection lifetime.
Here we restrict it to L3+4 sockets only, with
xmit_hash_policy==LAYER34 and bond modes xor/802.3ad.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hash logic on L3 will be used in a downstream patch for one more use
case.
Take it to a function for a better code reuse.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When creating a static bond (e.g. balance-xor), all ports will always
be enabled. This is set, and the corresponding notification is sent
out, before the port is linked to the bond upper.
In the offloaded case, this ordering is hard to deal with.
The lower will first see a notification that it can not associate with
any bond. Then the bond is joined. After that point no more
notifications are sent, so all ports remain disabled.
This change simply sends an extra notification once the port has been
linked to the upper to synchronize the initial state.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't try to adjust XFRM support flags if the bond device isn't yet
registered. Bad things can currently happen when netdev_change_features()
is called without having wanted_features fully filled in yet. This code
runs both on post-module-load mode changes, as well as at module init
time, and when run at module init time, it is before register_netdevice()
has been called and filled in wanted_features. The empty wanted_features
led to features also getting emptied out, which was definitely not the
intended behavior, so prevent that from happening.
Originally, I'd hoped to stop adjusting wanted_features at all in the
bonding driver, as it's documented as being something only the network
core should touch, but we actually do need to do this to properly update
both the features and wanted_features fields when changing the bond type,
or we get to a situation where ethtool sees:
esp-hw-offload: off [requested on]
I do think we should be using netdev_update_features instead of
netdev_change_features here though, so we only send notifiers when the
features actually changed.
Fixes: a3b658cfb6 ("bonding: allow xfrm offload setup post-module-load")
Reported-by: Ivan Vecera <ivecera@redhat.com>
Suggested-by: Ivan Vecera <ivecera@redhat.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Link: https://lore.kernel.org/r/20201205172229.576587-1-jarod@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can remove one of the ifdef blocks here, and instead of setting both
the xfrm hw_features and features flags, then unsetting the feature
flags if not in AB, wait to set the features flags if we're actually in AB
mode.
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Link: https://lore.kernel.org/r/20201205174003.578267-1-jarod@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.
Conflicts:
drivers/net/can/usb/gs_usb.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.
Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.
Fix all the places which made use of this implicit include.
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Virtual netdevs should use NETIF_F_GSO_SOFTWARE to forward GSO skbs
as-is and let the final drivers deal with them when supported.
Also remove NETIF_F_GSO_UDP_L4 from bonding and team drivers as it's
now included in the "software" list.
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.
In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ARP monitor is used for link detection, ARP replies are
validated for all slaves (arp_validate=3) and fail_over_mac is set to
active, two slaves of an active-backup bond may get stuck in a state
where both of them are active and pass packets that they receive to
the bond. This state makes IPv6 duplicate address detection fail. The
state is reached thus:
1. The current active slave goes down because the ARP target
is not reachable.
2. The current ARP slave is chosen and made active.
3. A new slave is enslaved. This new slave becomes the current active
slave and can reach the ARP target.
As a result, the current ARP slave stays active after the enslave
action has finished and the log is littered with "PROBE BAD" messages:
> bond0: PROBE: c_arp ens10 && cas ens11 BAD
The workaround is to remove the slave with "going back" status from
the bond and re-enslave it. This issue was encountered when DPDK PMD
interfaces were being enslaved to an active-backup bond.
I would be possible to fix the issue in bond_enslave() or
bond_change_active_slave() but the ARP monitor was fixed instead to
keep most of the actions changing the current ARP slave in the ARP
monitor code. The current ARP slave is set as inactive and backup
during the commit phase. A new state, BOND_LINK_FAIL, has been
introduced for slaves in the context of the ARP monitor. This allows
administrators to see how slaves are rotated for sending ARP requests
and attempts are made to find a new active slave.
Fixes: b2220cad58 ("bonding: refactor ARP active-backup monitor")
Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we tear down a network namespace, we unregister all
the netdevices within it. So we may queue a slave device
and a bonding device together in the same unregister queue.
If the only slave device is non-ethernet, it would
automatically unregister the bonding device as well. Thus,
we may end up unregistering the bonding device twice.
Workaround this special case by checking reg_state.
Fixes: 9b5e383c11 ("net: Introduce unregister_netdevice_many()")
Reported-by: syzbot+af23e7f3e0a7e10c8b67@syzkaller.appspotmail.com
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/net/bonding/bond_alb.c:1222: warning: Function parameter or member 'bond' not described in 'alb_set_mac_address'
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/net/bonding/bond_main.c:329: warning: Function parameter or member 'proto' not described in 'bond_vlan_rx_add_vid'
drivers/net/bonding/bond_main.c:362: warning: Function parameter or member 'proto' not described in 'bond_vlan_rx_kill_vid'
drivers/net/bonding/bond_main.c:964: warning: Function parameter or member 'new_active' not described in 'bond_change_active_slave'
drivers/net/bonding/bond_main.c:964: warning: Excess function parameter 'new' description in 'bond_change_active_slave'
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Thomas Davis <tadavis@lbl.gov>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renames and missing descriptions.
Fixes the following W=1 kernel build warning(s):
drivers/net/bonding/bond_3ad.c:140: warning: Function parameter or member 'port' not described in '__get_first_agg'
drivers/net/bonding/bond_3ad.c:140: warning: Excess function parameter 'bond' description in '__get_first_agg'
drivers/net/bonding/bond_3ad.c:1655: warning: Function parameter or member 'agg' not described in 'ad_agg_selection_logic'
drivers/net/bonding/bond_3ad.c:1655: warning: Excess function parameter 'aggregator' description in 'ad_agg_selection_logic'
drivers/net/bonding/bond_3ad.c:1817: warning: Function parameter or member 'port' not described in 'ad_initialize_port'
drivers/net/bonding/bond_3ad.c:1817: warning: Excess function parameter 'aggregator' description in 'ad_initialize_port'
drivers/net/bonding/bond_3ad.c:1976: warning: Function parameter or member 'timeout' not described in 'bond_3ad_initiate_agg_selection'
drivers/net/bonding/bond_3ad.c:2274: warning: Function parameter or member 'work' not described in 'bond_3ad_state_machine_handler'
drivers/net/bonding/bond_3ad.c:2274: warning: Excess function parameter 'bond' description in 'bond_3ad_state_machine_handler'
drivers/net/bonding/bond_3ad.c:2508: warning: Function parameter or member 'link' not described in 'bond_3ad_handle_link_change'
drivers/net/bonding/bond_3ad.c:2508: warning: Excess function parameter 'status' description in 'bond_3ad_handle_link_change'
drivers/net/bonding/bond_3ad.c:2566: warning: Function parameter or member 'bond' not described in 'bond_3ad_set_carrier'
drivers/net/bonding/bond_3ad.c:2677: warning: Function parameter or member 'bond' not described in 'bond_3ad_update_lacp_rate'
drivers/net/bonding/bond_3ad.c:1655: warning: Function parameter or member 'agg' not described in 'ad_agg_selection_logic'
drivers/net/bonding/bond_3ad.c:1655: warning: Excess function parameter 'aggregator' description in 'ad_agg_selection_logic'
drivers/net/bonding/bond_3ad.c:1817: warning: Function parameter or member 'port' not described in 'ad_initialize_port'
drivers/net/bonding/bond_3ad.c:1817: warning: Excess function parameter 'aggregator' description in 'ad_initialize_port'
drivers/net/bonding/bond_3ad.c:1976: warning: Function parameter or member 'timeout' not described in 'bond_3ad_initiate_agg_selection'
drivers/net/bonding/bond_3ad.c:2274: warning: Function parameter or member 'work' not described in 'bond_3ad_state_machine_handler'
drivers/net/bonding/bond_3ad.c:2274: warning: Excess function parameter 'bond' description in 'bond_3ad_state_machine_handler'
drivers/net/bonding/bond_3ad.c:2508: warning: Function parameter or member 'link' not described in 'bond_3ad_handle_link_change'
drivers/net/bonding/bond_3ad.c:2508: warning: Excess function parameter 'status' description in 'bond_3ad_handle_link_change'
drivers/net/bonding/bond_3ad.c:2566: warning: Function parameter or member 'bond' not described in 'bond_3ad_set_carrier'
drivers/net/bonding/bond_3ad.c:2677: warning: Function parameter or member 'bond' not described in 'bond_3ad_update_lacp_rate'
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Broadcast mode bonds transmit a copy of all traffic simultaneously out of
all interfaces, so the "speed" of the bond isn't really the aggregate of
all interfaces, but rather, the speed of the slowest active interface.
Also, the type of the speed field is u32, not unsigned long, so adjust
that accordingly, as required to make min() function here without
complaining about mismatching types.
Fixes: bb5b052f75 ("bond: add support to read speed and duplex via ethtool")
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
Very similar to commit 544f287b84
("bonding: check error value of register_netdevice() immediately"),
we should immediately check the return value of register_netdevice()
before doing anything else.
Fixes: 005db31d5f ("bonding: set carrier off for devices created through netlink")
Reported-and-tested-by: syzbot+bbc3a11c4da63c1b74d6@syzkaller.appspotmail.com
Cc: Beniamino Galvani <bgalvani@redhat.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bond_ipsec_* helpers don't need RTNL, and can potentially get called
without it being held, so switch from rtnl_dereference() to
rcu_dereference() to access bond struct data.
Lightly tested with xfrm bonding, no problems found, should address the
syzkaller bug referenced below.
Reported-by: syzbot+582c98032903dcc04816@syzkaller.appspotmail.com
CC: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's possible that device removal happens when the bond is in non-AB mode,
and addition happens in AB mode, so bond_ipsec_del_sa() never gets called,
which leaves security associations in an odd state if bond_ipsec_add_sa()
then gets called after switching the bond into AB. Just call add and
delete universally for all modes to keep things consistent.
However, it's also possible that this code gets called when the system is
shutting down, and the xfrm subsystem has already been disconnected from
the bond device, so we need to do some error-checking and bail, lest we
hit a null ptr deref.
Fixes: a3b658cfb6 ("bonding: allow xfrm offload setup post-module-load")
CC: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the moment, bonding xfrm crypto offload can only be set up if the bonding
module is loaded with active-backup mode already set. We need to be able to
make this work with bonds set to AB after the bonding driver has already
been loaded.
So what's done here is:
1) move #define BOND_XFRM_FEATURES to net/bonding.h so it can be used
by both bond_main.c and bond_options.c
2) set BOND_XFRM_FEATURES in bond_dev->hw_features universally, rather than
only when loading in AB mode
3) wire up xfrmdev_ops universally too
4) disable BOND_XFRM_FEATURES in bond_dev->features if not AB
5) exit early (non-AB case) from bond_ipsec_offload_ok, to prevent a
performance hit from traversing into the underlying drivers
5) toggle BOND_XFRM_FEATURES in bond_dev->wanted_features and call
netdev_change_features() from bond_option_mode_set()
In my local testing, I can change bonding modes back and forth on the fly,
have hardware offload work when I'm in AB, and see no performance penalty
to non-AB software encryption, despite having xfrm bits all wired up for
all modes now.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Reported-by: Huy Nguyen <huyn@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clang warns:
drivers/net/bonding/bond_main.c:4657:23: warning: equality comparison
with extraneous parentheses [-Wparentheses-equality]
if ((BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP))
~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/bonding/bond_main.c:4681:23: warning: equality comparison
with extraneous parentheses [-Wparentheses-equality]
if ((BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP))
~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
This warning occurs when a comparision has two sets of parentheses,
which is usually the convention for doing an assignment within an
if statement. Since equality comparisons do not need a second set of
parentheses, remove them to fix the warning.
Fixes: 18cb261afd ("bonding: support hardware encryption offload to slaves")
Link: https://github.com/ClangBuiltLinux/linux/issues/1066
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: kernelci.org bot <bot@kernelci.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than requiring every hw crypto capable NIC driver to do a check for
slave_dev being set, set real_dev in the xfrm layer and xso init time, and
then override it in the bonding driver as needed. Then NIC drivers can
always use real_dev, and at the same time, we eliminate the use of a
variable name that probably shouldn't have been used in the first place,
particularly given recent current events.
CC: Boris Pismenny <borisp@mellanox.com>
CC: Saeed Mahameed <saeedm@mellanox.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, this support is limited to active-backup mode, as I'm not sure
about the feasilibity of mapping an xfrm_state's offload handle to
multiple hardware devices simultaneously, and we rely on being able to
pass some hints to both the xfrm and NIC driver about whether or not
they're operating on a slave device.
I've tested this atop an Intel x520 device (ixgbe) using libreswan in
transport mode, succesfully achieving ~4.3Gbps throughput with netperf
(more or less identical to throughput on a bare NIC in this system),
as well as successful failover and recovery mid-netperf.
v2: just use CONFIG_XFRM_OFFLOAD for wrapping, isolate more code with it
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dynamic key update for addr_list_lock still causes troubles,
for example the following race condition still exists:
CPU 0: CPU 1:
(RCU read lock) (RTNL lock)
dev_mc_seq_show() netdev_update_lockdep_key()
-> lockdep_unregister_key()
-> netif_addr_lock_bh()
because lockdep doesn't provide an API to update it atomically.
Therefore, we have to move it back to static keys and use subclass
for nest locking like before.
In commit 1a33e10e4a ("net: partially revert dynamic lockdep key
changes"), I already reverted most parts of commit ab92d68fc2
("net: core: add generic lockdep keys").
This patch reverts the rest and also part of commit f3b0a18bb6
("net: remove unnecessary variables and callback"). After this
patch, addr_list_lock changes back to using static keys and
subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
not have to change back to ->ndo_get_lock_subclass().
And hopefully this reduces some syzbot lockdep noises too.
Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
Cc: Taehee Yoo <ap420073@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object. Previous
commit "b8eb718348b8" fixed a similar problem.
Fixes: 07699f9a7c ("bonding: add sysfs /slave dir for bond slave devices.")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This merge includes updates to bonding driver needed for the rdma stack,
to avoid conflicts with the RDMA branch.
Maor Gottlieb Says:
====================
Bonding: Add support to get xmit slave
The following series adds support to get the LAG master xmit slave by
introducing new .ndo - ndo_get_xmit_slave. Every LAG module can
implement it and it first implemented in the bond driver.
This is follow-up to the RFC discussion [1].
The main motivation for doing this is for drivers that offload part
of the LAG functionality. For example, Mellanox Connect-X hardware
implements RoCE LAG which selects the TX affinity when the resources
are created and port is remapped when it goes down.
The first part of this patchset introduces the new .ndo and add the
support to the bonding module.
The second part adds support to get the RoCE LAG xmit slave by building
skb of the RoCE packet based on the AH attributes and call to the new
.ndo.
The third part change the mlx5 driver driver to set the QP's affinity
port according to the slave which found by the .ndo.
====================
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Currently, bonding always returns NETDEV_TX_OK to its caller.
It is worth trying to be more accurate : TCP for instance
can have different recovery strategies if it can have more
precise status, if packet was dropped by slave qdisc.
This is especially important when host is under stress.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit b3e80d44f5
("bonding: fix lockdep warning in bond_get_stats()") the dynamic
key is no longer necessary, as we compute nest level at run-time.
So, we can just remove it to save some lockdep key entries.
Test commands:
ip link add bond0 type bond
ip link add bond1 type bond
ip link set bond0 master bond1
ip link set bond0 nomaster
ip link set bond1 master bond0
Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverts the folowing commits:
commit 064ff66e2b
"bonding: add missing netdev_update_lockdep_key()"
commit 53d374979e
"net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"
commit 1f26c0d3d2
"net: fix kernel-doc warning in <linux/netdevice.h>"
commit ab92d68fc2
"net: core: add generic lockdep keys"
but keeps the addr_list_lock_key because we still lock
addr_list_lock nestedly on stack devices, unlikely xmit_lock
this is safe because we don't take addr_list_lock on any fast
path.
Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add implementation of ndo_get_xmit_slave. Find the slave by using the
helper function according to the bond mode. If the caller set all_slaves
to true, then it assumes that all slaves are available to transmit.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Keep all slaves in array so it could be used to get the xmit slave
assume all the slaves are active.
The logic to add slave to the array is like the usable slaves, except
that we also add slaves that currently can't transmit - not up or active.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add helper function to get the xmit slave in active-backup mode.
It's only one line function that return the curr_active_slave,
but it will used both in the xmit flow and by the new .ndo to get
the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add helper function to get the xmit slave when bond is in round
robin mode. Change bond_xmit_slave_id to bond_get_slave_by_id, then
the logic for find the next slave for transmit could be used
both by the xmit flow and the .ndo to get the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Both xor and 802.3ad modes use bond_xmit_hash to get the xmit slave.
Export the logic to helper function so it could be used in the
following patches by the .ndo to get the xmit slave.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>