mirror of
git://git.yoctoproject.org/linux-yocto.git
synced 2025-10-22 23:13:01 +02:00
A crash in conntrack was reported while trying to unlink the conntrack entry from the hash bucket list:

    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
    #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
    #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
    #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in a partially initialised state:

- ct hlist pointer is garbage; it looks like the ct hash value (hence the crash).
- ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected.
- ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like a normal udp conntrack entry. If we ignore ct->status and pretend it is 0, the entry matches those that are newly allocated but not yet inserted into the hash:

- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED, __nf_conntrack_find_get() would have skipped the entry.

The theory is that we hit the following race:

    cpu x                   cpu y                   cpu z
    found entry E           found entry E
    E is expired            <preemption>
    nf_ct_delete()
    return E to rcu slab
                                                    init_conntrack
                                                    E is re-inited,
                                                    ct->status set to 0
                                                    reply tuplehash hnnode.pprev
                                                    stores hash value.

cpu y found E right before it was deleted on cpu x. E is now re-inited on cpu z. cpu y was preempted before checking for expiry and/or confirm bit.

                                                    ->refcnt set to 1
                                                    E now owned by skb
                                                    ->timeout set to 30000

If cpu y were to resume now, it would observe E as expired but would skip E due to the missing CONFIRMED bit.

                                                    nf_conntrack_confirm gets called
                                                    sets: ct->status |= CONFIRMED
                                                    This is wrong: E is not yet added
                                                    to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:

                            <resumes>
                            nf_ct_expired()
                             -> yes (ct->timeout is 30s)
                            confirmed bit set.

cpu y will try to delete E from the hashtable:

                            nf_ct_delete() -> set DYING bit
                            __nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash: cpu z still holds the table bucket lock(s) so y blocks:

                            wait for spinlock held by z

                                                    CONFIRMED is set but there is no
                                                    guarantee ct will be added to hash:
                                                    "chaintoolong" or "clash resolution"
                                                    logic both skip the insert step.
                                                    reply hnnode.pprev still stores the
                                                    hash value.

                                                    unlocks spinlock
                                                    return NF_DROP
                            <unblocks, then
                             crashes on hlist_nulls_del_rcu pprev>

In case cpu z does insert the entry into the hashtable, cpu y will unlink E again right away but no crash occurs.

Without the 'cpu y' race, the 'garbage' hlist is of no consequence: ct refcnt remains at 1, eventually the skb will be free'd and E gets destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen before the hlist add or the timeout fixup, so switch to set_bit and a before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right before the CONFIRMED bit was set: such an event cannot be distinguished from the above "E is the old incarnation" case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:

1. Check if entry has expired, if not skip to next entry.
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an expiry check. After this patch, once the confirmed bit check passes, ct->timeout has been altered to reflect the absolute 'best before' date instead of a relative time. Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
   4 - entry is still observed as expired
   5 - meanwhile, ct->timeout is corrected to absolute value on other CPU and confirm bit gets set
   6 - confirm bit is seen
   7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either confirmed bit unset (entry gets skipped) or expiry re-check failure for re-inited conntrack objects.

This change cannot be backported to releases before 5.19. Without commit |
||
---|---|---|
ipset | ||
ipvs | ||
core.c | ||
Kconfig | ||
Makefile | ||
nf_bpf_link.c | ||
nf_conncount.c | ||
nf_conntrack_acct.c | ||
nf_conntrack_amanda.c | ||
nf_conntrack_bpf.c | ||
nf_conntrack_broadcast.c | ||
nf_conntrack_core.c | ||
nf_conntrack_ecache.c | ||
nf_conntrack_expect.c | ||
nf_conntrack_extend.c | ||
nf_conntrack_ftp.c | ||
nf_conntrack_h323_asn1.c | ||
nf_conntrack_h323_main.c | ||
nf_conntrack_h323_types.c | ||
nf_conntrack_helper.c | ||
nf_conntrack_irc.c | ||
nf_conntrack_labels.c | ||
nf_conntrack_netbios_ns.c | ||
nf_conntrack_netlink.c | ||
nf_conntrack_ovs.c | ||
nf_conntrack_pptp.c | ||
nf_conntrack_proto_dccp.c | ||
nf_conntrack_proto_generic.c | ||
nf_conntrack_proto_gre.c | ||
nf_conntrack_proto_icmp.c | ||
nf_conntrack_proto_icmpv6.c | ||
nf_conntrack_proto_sctp.c | ||
nf_conntrack_proto_tcp.c | ||
nf_conntrack_proto_udp.c | ||
nf_conntrack_proto.c | ||
nf_conntrack_sane.c | ||
nf_conntrack_seqadj.c | ||
nf_conntrack_sip.c | ||
nf_conntrack_snmp.c | ||
nf_conntrack_standalone.c | ||
nf_conntrack_tftp.c | ||
nf_conntrack_timeout.c | ||
nf_conntrack_timestamp.c | ||
nf_dup_netdev.c | ||
nf_flow_table_bpf.c | ||
nf_flow_table_core.c | ||
nf_flow_table_inet.c | ||
nf_flow_table_ip.c | ||
nf_flow_table_offload.c | ||
nf_flow_table_procfs.c | ||
nf_flow_table_xdp.c | ||
nf_hooks_lwtunnel.c | ||
nf_internals.h | ||
nf_log_syslog.c | ||
nf_log.c | ||
nf_nat_amanda.c | ||
nf_nat_bpf.c | ||
nf_nat_core.c | ||
nf_nat_ftp.c | ||
nf_nat_helper.c | ||
nf_nat_irc.c | ||
nf_nat_masquerade.c | ||
nf_nat_ovs.c | ||
nf_nat_proto.c | ||
nf_nat_redirect.c | ||
nf_nat_sip.c | ||
nf_nat_tftp.c | ||
nf_queue.c | ||
nf_sockopt.c | ||
nf_synproxy_core.c | ||
nf_tables_api.c | ||
nf_tables_core.c | ||
nf_tables_offload.c | ||
nf_tables_trace.c | ||
nfnetlink_acct.c | ||
nfnetlink_cthelper.c | ||
nfnetlink_cttimeout.c | ||
nfnetlink_hook.c | ||
nfnetlink_log.c | ||
nfnetlink_osf.c | ||
nfnetlink_queue.c | ||
nfnetlink.c | ||
nft_bitwise.c | ||
nft_byteorder.c | ||
nft_chain_filter.c | ||
nft_chain_nat.c | ||
nft_chain_route.c | ||
nft_cmp.c | ||
nft_compat.c | ||
nft_connlimit.c | ||
nft_counter.c | ||
nft_ct_fast.c | ||
nft_ct.c | ||
nft_dup_netdev.c | ||
nft_dynset.c | ||
nft_exthdr.c | ||
nft_fib_inet.c | ||
nft_fib_netdev.c | ||
nft_fib.c | ||
nft_flow_offload.c | ||
nft_fwd_netdev.c | ||
nft_hash.c | ||
nft_immediate.c | ||
nft_inner.c | ||
nft_last.c | ||
nft_limit.c | ||
nft_log.c | ||
nft_lookup.c | ||
nft_masq.c | ||
nft_meta.c | ||
nft_nat.c | ||
nft_numgen.c | ||
nft_objref.c | ||
nft_osf.c | ||
nft_payload.c | ||
nft_queue.c | ||
nft_quota.c | ||
nft_range.c | ||
nft_redir.c | ||
nft_reject_inet.c | ||
nft_reject_netdev.c | ||
nft_reject.c | ||
nft_rt.c | ||
nft_set_bitmap.c | ||
nft_set_hash.c | ||
nft_set_pipapo_avx2.c | ||
nft_set_pipapo_avx2.h | ||
nft_set_pipapo.c | ||
nft_set_pipapo.h | ||
nft_set_rbtree.c | ||
nft_socket.c | ||
nft_synproxy.c | ||
nft_tproxy.c | ||
nft_tunnel.c | ||
nft_xfrm.c | ||
utils.c | ||
x_tables.c | ||
xt_addrtype.c | ||
xt_AUDIT.c | ||
xt_bpf.c | ||
xt_cgroup.c | ||
xt_CHECKSUM.c | ||
xt_CLASSIFY.c | ||
xt_cluster.c | ||
xt_comment.c | ||
xt_connbytes.c | ||
xt_connlabel.c | ||
xt_connlimit.c | ||
xt_connmark.c | ||
xt_CONNSECMARK.c | ||
xt_conntrack.c | ||
xt_cpu.c | ||
xt_CT.c | ||
xt_dccp.c | ||
xt_devgroup.c | ||
xt_dscp.c | ||
xt_DSCP.c | ||
xt_ecn.c | ||
xt_esp.c | ||
xt_hashlimit.c | ||
xt_helper.c | ||
xt_hl.c | ||
xt_HL.c | ||
xt_HMARK.c | ||
xt_IDLETIMER.c | ||
xt_ipcomp.c | ||
xt_iprange.c | ||
xt_ipvs.c | ||
xt_l2tp.c | ||
xt_LED.c | ||
xt_length.c | ||
xt_limit.c | ||
xt_LOG.c | ||
xt_mac.c | ||
xt_mark.c | ||
xt_MASQUERADE.c | ||
xt_multiport.c | ||
xt_nat.c | ||
xt_NETMAP.c | ||
xt_nfacct.c | ||
xt_NFLOG.c | ||
xt_NFQUEUE.c | ||
xt_osf.c | ||
xt_owner.c | ||
xt_physdev.c | ||
xt_pkttype.c | ||
xt_policy.c | ||
xt_quota.c | ||
xt_rateest.c | ||
xt_RATEEST.c | ||
xt_realm.c | ||
xt_recent.c | ||
xt_REDIRECT.c | ||
xt_repldata.h | ||
xt_sctp.c | ||
xt_SECMARK.c | ||
xt_set.c | ||
xt_socket.c | ||
xt_state.c | ||
xt_statistic.c | ||
xt_string.c | ||
xt_tcpmss.c | ||
xt_TCPMSS.c | ||
xt_TCPOPTSTRIP.c | ||
xt_tcpudp.c | ||
xt_TEE.c | ||
xt_time.c | ||
xt_TPROXY.c | ||
xt_TRACE.c | ||
xt_u32.c |
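
The ordering fix described in the commit message at the top of this listing can be illustrated with a small user-space model. This is a minimal sketch, not the kernel code: `struct entry`, `CONFIRMED`, `entry_init`, `entry_publish`, `entry_should_gc` and `publish_demo` are hypothetical names, and a C11 release fetch-or stands in, as an assumed analogue, for the set_bit plus before_atomic barrier the message mentions. It shows two points: the confirmed flag is set only after the timeout fixup and the table insert, and the gc re-check tests the flag before it looks at the timeout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CONFIRMED 0x1u /* hypothetical stand-in for IPS_CONFIRMED */

/* Hypothetical, simplified stand-in for struct nf_conn. */
struct entry {
    atomic_uint status;       /* flag bits, 0 after (re)initialisation */
    _Atomic uint32_t timeout; /* relative until published, absolute after */
    atomic_bool in_table;     /* models linkage into the hash bucket */
};

/* Wrap-tolerant "deadline has passed" test on a 32-bit tick counter. */
static bool deadline_passed(uint32_t deadline, uint32_t now)
{
    return (int32_t)(deadline - now) <= 0;
}

/* (Re)initialise an entry: flags cleared, timeout stored as a relative value. */
static void entry_init(struct entry *e, uint32_t relative_timeout)
{
    atomic_store_explicit(&e->status, 0u, memory_order_relaxed);
    atomic_store_explicit(&e->timeout, relative_timeout, memory_order_relaxed);
    atomic_store_explicit(&e->in_table, false, memory_order_relaxed);
}

/*
 * Publish an entry. insert_ok == false models the paths that skip the
 * insert step. Returns true only if the entry was inserted and confirmed.
 */
static bool entry_publish(struct entry *e, uint32_t now, bool insert_ok)
{
    uint32_t rel;

    if (!insert_ok)
        return false; /* never inserted, so never marked CONFIRMED */

    rel = atomic_load_explicit(&e->timeout, memory_order_relaxed);
    atomic_store_explicit(&e->timeout, now + rel, memory_order_relaxed);
    atomic_store_explicit(&e->in_table, true, memory_order_relaxed);

    /* Release: the timeout fixup and the insert are ordered before the flag. */
    atomic_fetch_or_explicit(&e->status, CONFIRMED, memory_order_release);
    return true;
}

/* gc re-check: test the confirmed flag first, only then the timeout. */
static bool entry_should_gc(struct entry *e, uint32_t now)
{
    unsigned int status =
        atomic_load_explicit(&e->status, memory_order_acquire);

    if (!(status & CONFIRMED))
        return false; /* not (or not yet) in the table: skip it */

    return deadline_passed(
        atomic_load_explicit(&e->timeout, memory_order_relaxed), now);
}

/* Single-threaded walk through the states from the commit message. */
void publish_demo(void)
{
    struct entry e;
    const uint32_t now = 1000000u;

    entry_init(&e, 30000u);
    /* A relative timeout read as an absolute deadline looks long expired... */
    assert(deadline_passed(30000u, now));
    /* ...but the unconfirmed entry is skipped by the re-check. */
    assert(!entry_should_gc(&e, now));

    /* Insert skipped: the flag stays clear, the entry is still skipped. */
    assert(!entry_publish(&e, now, false));
    assert(!entry_should_gc(&e, now));

    /* Inserted: the timeout is absolute by the time the flag is visible. */
    assert(entry_publish(&e, now, true));
    assert(!entry_should_gc(&e, now));
    assert(entry_should_gc(&e, now + 30001u));
}
```

Calling `publish_demo()` from a `main` function runs the walk-through. The demo is single-threaded on purpose, so it only checks the state transitions; the memory-order arguments show where the ordering would matter under real concurrency.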